Return data for Fama–French model

Recently somebody asked me where to look for monthly return data for the US market. The data was needed for Fama-French factor model. It seems to me that the U.S. market factor data is available from Kenneth R. French’s website (2019). The data provided here is mainly in a TXT or CSV format and can be processed and analysed using different programs.

Some of the Fama-French data is also available through other sources, for instance, the WRDS platform (Wharton Research Data Services). The data that they provide can be found as is labled as: “Fama French & Liquidity Factors” or “Fama-French Portfolios and Factors”. Access and downloading through this platform does, however, require a personal or organizational subscription.

I do not think all data from the original Kenneth French website is available but it does have additional related data: Pastor-Stambaugh Liquidity series, Sadka Liquidity measures, and China Factors (Liu, Stambaugh and Yuan).
At the Vrije Universiteit Amsterdam we have access to it (screenshot).

S&P Indexes in Compustat no more

Recently the company behind the S&P Indices (S&P Dow Jones) has decided to withdraw their constituent data from the Capital IQ Compustat databases. This means that it is still possible to get the prices for the overall indices, but it no longer is possible to get detailed data on individual companies as they move in and out of the lists over the years. Lists like S&P 500 are heavily used in much research as a benchmark list for market developments.

I know of only one database (available to me) at the university that still allows for the generation of constituent lists at the Vrije Universiteit Amsterdam and this is the database FactSet. There is one drawback compared to the lists that used to be available through Compustat, however: where Compustat made it possible to generate a multiyear list over time, FactSet only allows you to generate a daily list.
If you need multiyear lists from FactSet you will have to download multiple lists for years and then combine them yourself.

I have created an example movie on how to download an index list and put it on Youtube:

Using FactSet requires FactSet software as well as a FactSet account. Information on this for personnel and students at the university is available here.

N.B.: The Compustat database Execucomp has annual data on executives for companies that are (or have been) part of the S&P 1500 list. A download of company codes (GVKEY, CUSIP, Ticker) could perhaps serve as a rough list on which companies are part of the index over a time period. It is only a rough estimate, as the data is based on annual SEC filings and the index changes daily. Execucomp includes data going back to 1992.

Quarterly IBES data (in WRDS)

Understanding the forecast data in the I/B/E/S/ database (Institutional Brokers Estimate System database from Refinitiv) was not as straightforward as I thought it would be. What threw me off was the way I initially looked at the data. Below you find an example download I did for 1 company (Wendy’s Co.) for 1 year. I downloaded the quarterly forecast EPS data for 2011 from the Summary Statistics, Unadjusted part database from the Unadjusted Summary collection of I/B/E/S/ databases.

The three columns that are important where it concerns understanding the data, are marked green.

  • Forecast Period End Date (SAS Format) = this date indicates on what date the calendar quarter ends for which the forecast was made
  • Forecast Period Indicator (FPI) = 6 indicates quarter 1, 7 means quarter 2 , 8 is quarter 3 and 9 is actually quarter 4
  • I/B/E/S/ Statistical Period (SAS Format) = The date when the calculated summary statistics were finalized, which could only be done when the final data on the forecast quarter had been collected. That is why monthly this date is sometimes more recent = past the actual forecast quarter.

What was confusing to me was how I needed to interpret the Forecast Period Indicator. I could not make sense of it based on the information in the I/B/E/S/ manual. After talking to the owners of the database it finally made sense:

Where the FPI indicates 6 what is meant is: the current quarter. That means that the forecasts for these rows that have the number 6 were done within the same quarter as the Forecast Period quarter (indicated in column P). Where the FPI says 7, that means that the forecast data on the calendar Period quarter is from the previous quarter. For the FPI 8 the forecast data is from two quarters back and where the FPI says 9 it is 3 quarters before the quarter indicated as the Forecast period quarter.

To make it easier to understand I have below calculated the end date of the actual quarter in which the forecasts were done based on the FPI in column G: and the Forecast quarter dates in column P. The Excel formula I used for this (column Q) is:


In column R I decided to use the Excel formula =EOMONTH() to get the actual End date of the month and Column S has a formatted date to format the date the way the date in column P is formatted. I used Excel formulat =TEXT(R2;”mm/dd/yyyy”) for this.

The result looks as follows:

The extra columns (yellow) indicate the quarter in which the forecasts were made by analysts for the calendar quarter indicated in column P.

FactSet IPO Offer Method field data

The database has a dataset on IPO and IPO-related information. Below are the current definitions of offer method typings in the field for Offer method available in FactSet for IPO’s.

1) Bookbuild Issue
Unlike the U.S. and some other larger countries where the offer price is determined based on market conditions, a bookbuild issue is where the bookrunner collects bids from potential investors on how much they are willing to pay. After the bookbuilding process has ended, the bookrunner determines the price at which the shares will be sold.

2) Best Efforts
An underwriting arrangement in which an underwriter agrees only to use its best efforts to sell the securities as the issuer’s agent. The underwriter does not purchase the securities itself and has no obligation to purchase any amount not purchased by investors.

3) Firm Commitment
An underwriting in which an investment banking firm commits to buy and sell an entire issue of stock and assumes all financial responsibility for any unsold shares

4) Direct Listing
Company raises capital in a private placement to accredited investors as the initial step in financing. Amount of capital raised varies depending on sufficient capital needed for company’s growth plans and registration and listing costs before it achieves an exchange listing (typically, within six to twelve months after the private placement offering is completed).

5) Placing
Placing is the method by which a company lists its shares on a stock exchange and raises capital by selling shares to institutional investors. These shares are not generally marketed to retail investors at the time of the issue unless the correct approvals are included in the offer documentation.

6) Offer for Subscription
Offering where a minimum level of subscription for the shares, and if this is not reached the offer is withdrawn.

The second and third methods are both considered bookbuild issues, but there is a Bookbuild issue type separately indicated. So, how do those three compare, because these methods seem to overlap in method typing?
According to FactSet helpdesk the offering type is a field that they are looking to change. One of the reasons is that they agree that the offer methods are not distinct. Also, in a given deal, there can be multiple offering types used. FactSet is going to change this in the future. At the moment they are considering offering 3 new separate fields where information is made available on the offering type. The fields already exist at the backend but are currently not available through the FactSet database. The new fields will describe:

  • how the deal was priced
  • what was the involvement from the underwriters
  • and how were the shares sold/listed on the exchange?

Using the existing structure of the database FactSet needs to choose which offering method is the best of each IPO based on data collected in those three fields. The information on the Offer method may therefore deviate considerably from what is indicated in the company prospectus.

N.B.: For example, compare the Offer method in FactSet with the actual prospectus for Basic-Fit NV. The Prospectus indicates a combination of offer methods. The filings for companies in FactSet are quickly available through the Company / Security tab by looking up the company using the name or ticker BFIT

FactSet and ESG data

At the university I get a lot of questions on the availability of microeconomic level data on subjects like corporate responsibility with regard to Social, Environmental, Governance or Economic topics. Through FactSet it is possible to get access to this type of data but it very much depends on what FactSet has available as well as what sources / datasets can be accessed through FactSet.

Examples of datasets that may be accessible through FactSet are: MSCI ESG fund metrics (Moody), and Sustainalytics (Morningstar). I am not sure if the Asset4 ESG dataset from Refinitiv (Formerly Thomson) can also be made available. In all cases it does, however, require an extra license from the company that produces and maintains these datasets.

What is available then, through the default basic license for FactSet, is the following:

  • The lastest ESG data from Sustainalytics (no history): screenshot
  • 6 years biannual ESG data from FTSE: screenshot
  • Data on Green bonds (see previous post)
  • Limited data on ETF funds with investment strategy ESG: screenshot
  • Limited Governance data on board members, board structure, committee membership, etc. (no history): screenshot

N.B.: I got the tip on the availability of FTSE ESG data in FactSet from Dries Laurs, Msc.

FactSet and historical ownership of companies

The database FactSet has a specific dataset that allows you to view and download historical shareholder data for companies. The availability of data depends on the coverage by FactSet of course. There are two levels available: detailed ownership and aggregate data.

The main interface of FactSet allows you to view the ownership data in two ways:

  1. Through the Company tab
  2. Using the Ownership tab.

1) Then company tab has a menu on the left side which includes the option Ownership.
Usually when you click the Company tab you get the last company that you looked up. If you have not yet done this before you can use the search box at the top left corner to search by name or code (ticker or other codes). When the company details appear you can select the ownership option from the menu. Several options are available. This includes equity as well as debt ownership. The data is available on the level of the instrument as well as the level of the company. Options to change the presentation are available at the top.

2)The Ownership tab works in the same way as the company tab.
Search for a company by name or code and you get the result presented. There are two levels: the top level concerns the ownership of the company itself. Below this there is a section covering the ownership by the company itself (of other companies / subsidiaries).

One of the ownership options to get shareholder data is Historical Analysis. This allows you to get multi-year data on ownership as it changes for the company for the following 5 shareholder types: Institutions, Institutions/Insiders, Mutual FUnds/ETFs, Insiders/Stakeholders, and Beneficial owners. The overview can be changed for the number of years as well as the frequency by clicking the button Dates at the top.

If you wish to download data this can be done using the green arrow button at the top right corner.

In addition to the main interface it is also possible to download some ownership data through the FactSet Excel Add-in. There appear to be only two options to download shareholder data: current data of the most recent nature (depends on company filings), and aggregate historical ownership data. You can find an example here:

Step 1: Use screening to make a selection and download this to Excel

Step 2: In Excel organize the data to get the codes in the first row of the sheet.


Step 3: Use the side bar to download Aggregate ownership data.
Important: It is usually necessary in Excel to first indicate the time period if you wish to download security level data. On company level the time period of available data is more limited.


1) If the company or ownership tabs/apps (or both) are unavailable you can add them using the following option and clicking the green Plus next to the tab/app in the list.

2) If you wish to download detailed historical ownership data for lists of companies you will have to create a template in Excel to do this or contact the FactSet helpdesk to get them to do it for you.
3) For a step by step approach for downloading data, watch the movies on FactSet here.

Stata solution to reshape FactSet data

In two previous posts I showed examples on how to reshape data from wide to long format. Both post were about Datastream, but one was about regular downloads and one about large datasets. In Stata reshaping data usually works fine but may sometimes not work very well. In some versions of Stata or in some cases the reshape command seems to work only in part and not all data is reshaped. The solution to the problem was changing the usual reshape command slightly.

1) In the example I first prepared the FactSet data as per usual. The bonds data in the example first looks as follows:

First I used Copy+Paste > transpose to change the rows and columns. Then I used the Excel formula =”Y”&TEXT(C1;”yyyymmdd”) to change the dates for Stata. Finally I copy+pasted as values to get rid of the Excel formulas. The data now looks as follows:

2) Normally, after importing this data in Stata , I would use the command:
reshape long Y, i(Code) j(origdate). In some Stata versions this is fine, but in other editions this does not work very well. The way to get this to work is now adding the word string to the command as follows:

reshape long Y, i(Code) j(origdate) string

N.B.: Afterwards I also need to do some work on the dates etc. to be able to get the data in a format to be able to work with it. Reminder: for really large downloads from FactSet you may have to reshape the data in batches because Stata versions may be limited by the number of variables it can work with. Afterwards the datasets can be merged by appending them. You can download an example do file here with the commands and an excel file to work on with this.


Using FactSet to find and select Green bonds

The previous post was about Green Bonds, on how the data is available through SDC Platinum, and how to get the time series data (like yield) through FactSet. The latter database also has data on bonds, including Green bonds. Usually I look for different categories of bonds using a type variable. The main categories for selecting bonds in FactSet in the start screen do not cover a type variable indicating it’s purpose(/type). In FactSet looking for these bonds works as follows:

1) Start up the FactSet software use the Screening option at the top to go to the Universal screening screen. In this screen click the tab Debt and then click the option “Starter Screen – Corporate/Gov/AGency

2) In the next screen you can use the (main) Select criteria to narrow down to certain bonds, including choosing nations as well as industries and a time frame. The bottom half of the screen shows the selection/Results and updates this immediately depending on the selected criteria. In the field Add columns you now type the variable name FI_Issue_FLG and make a preliminary choice from the suggestions.

3) When you have selected an option a popup appears where the Item option allows you to select different types of bonds, including Green bonds. Select this option and click the button Add New to add an extra column.

4) This extra column can be used to filter the selected bonds and select only bonds of this type. To call up the filter click the arrow in the name of the column Green Bond Flag. Next select the option filter from the options that appear.

5) The Filter popup allows you to select values. In this case the Green Bond variable is either yes or no which equals to 1 to 0. Drag the line at min to the right until it matches the line / column on the right at max.

6) Click the button done when you are finished and finalize the list of Green bonds. The result will look as follows:

N.B.: Multiple columns can be added providing different information on the bonds. This screen is intended to make a selection of bonds. If you wish to download time series data on the bonds you download the list to Excel (use the green arrow pointing downwards) and use the FactSet add-in to do this. For bonds downloading the series data is different from regular series. Check the previous post on how to do this step by step using FactSet.


Using SDC & FactSet to download (green) bonds data

Sustainability has been a hot topic for many years now and one of the related topics concerns the rise of Green Bonds over the last 10 years. One of the databases that has data on the issuance of these bonds (and other types) is the Global New Issue database in SDC Platinum (more recently called Eikon). Usually I look for different categories of bonds using a type variable. In this case the category is defined not by the regular type but by it’s purpose. Looking up these Green bonds can therefore be done by using the variable “Use of Proceeds“. This variable is available under the “all items” search tab.

After selecting the bonds you can create reports and download specific data (variables) on the bonds itself (menu option: Report > New custom ). When finished selecting the variables it is possible to call up and download the data.

If I need to get time series data on these Green bonds (or any other bonds) I need to use a second database/source for this. Options include Datastream/Eikon or FactSet. As I currently have access only to the latter I will explain how this works in FactSet. First off you need to get the data for the series and the dates that tell you how long the bonds run. In this case the easiest option to use is the ISIN code. For the time period I use the “Issue Date” and the “Final Maturity” date.

1) To Start working with the data I first filter the data on the availability of the ISIN code and also remove remarks like “perp.” or “perpetual” where it is stated in the column with the maturity date. The end result of the data we need looks as follows:

2) I now first need to take care of the fact that each bond series in my selection has a different maturity date which may lie in the past or the future. As I download the data now, in my case the latest data would be for now() or yesterday. I therefore need to create a date where the maturity date was perpetual and give it the date now() and replace the maturity date where the date lies in the future with now(). Using the function =now() in a specific cell I can use this as a point of reference. It is of course also possible to choose a specific date. Using a nested conditional IF() statement in Excel I can now generate a new end date for each series:

3) The next step is finding out how long each bond series runs as they have different time periods. Using the new generated date I deduct the issue date from this and divide this by 30 to get an approximate number of months. I then sort the table to get the currently longest running series at the top. The result looks as follows:

4) I need to get the longest running series at the top because of the way I should download the data in bulk in Excel from FactSet. For bulk downloads I need to copy formulas onwards and using shorter series as the beginning (and copying onwards) ensures incomplete data.

5) The final substep to prepare is to copy and paste the newly calculated column with dates (from step 2) and make them values instead of formulas. The series and dates are now ready to be used for a bulk download of the Yield from FactSet. The result looks as follows:

N.B.: If you wish to know how to do the bulk download step by step, please check out the previous post at this blog on Bulk downloads or Event studies.


Using Stata to clean data

Recently I got a research question on CEO related data. The main source that has this type of data at the university where I work is Compustat’s Execucomp database. This database has data on executives for companies that are (or have been) part of the S&P 1500. The source for the data seems to be mainly the proxy filings (or DEF 14A filings) for these companies.
The maximum amount of executives that you get data on is 9, although the board of a huge conglomerate can exceed this number in size.

Downloading data for companies from Execucomp is not a big deal and is straitforward using the Wharton interface. Usually you download all data for all executives from the database using the WRDS search option “search entire database” for a certain time frame.
When the research question concerns the CEO only, I recommend including the variables that indicate when some executive is (or was) the CEO. The fields I recommend are:

  • Executive ID Number (execid): the unique ID for an executive is more uniwue then the name
  • Exec_FullName (exec_fullname): this allows for easy fact checking
  • Current CEO (pceo): indicates who the current CEO is
  • Date Became CEO (becameceo): Date when an executive became CEO
  • Date left as CEO (leftofc): Date when an executive left as CEO
  • Year (year): Fiscal year.

For the data cleaning it is possible to use many programs, including Excel, SPSS and Stata. For this example I used the Stata program as this was the quickest way:

1) For the research question it was important to have a dataset with clear data on who is the CEO in each Fiscal Year. The first cleaning step was to delete all non-CEO executives from te download. This can be done easily using the command: drop if mi( becameceo)

2) An extra check can now be done on the data but may not be necessary as we probably lose this data anyway. The check has to do with the reliability of the dataset. We usually assume that the variable Date left as CEO (when this field has an end date) lies after the date in the variable Date Became CEO. Dates are actually numbers and therefore tO check data accuracy we can use the command: assert (leftofc-becameceo)<0

If the assertiation does not check out, Stata will say: assertion is false r(9);
To view the problematic observations/data, we can use: bro if (leftofc-becameceo)<0

Some of these CEO’s probably have been CEO at the firm several times with one or more hiatus. That is why (I think) these problematic observations can occur. If you have the time, it is possible to manually look up the filings and figure out what the dates should be. The quick way is just to delete these few observations. To remove these observations just change the command to:
drop if (leftofc-becameceo)<0

3) The next check phase is, to find out if the data for the CEO’s for each fiscal year concerns a fiscal year in which they actually were CEO (tenure). This involves a check based on the start date and the end date of the tenure and we need to compare these to the fiscal year. I usually do this step by step using the following commands:
gen cbc=year(becameceo) // this generates a variable cbc with the year of the date from becameceo
gen clc=year(leftofc) // this generates a variable clc with the year of the date from leftofc

Now I can remove data for CEO’s from the dataset that is from fiscal years when the person that is or was the CEO was not yet the CEO, or had already left:
drop if (clc-year)<0
drop if (cbc-year)>0

4) Now the only observations that can still cause problems (if I want a single CEO for each fiscal year) are those fiscal years where a CEO change took place. As a quick rule of thumb I chose to delete those CEO’s from the dataset where they had the lesser tenure (in months) for that specific year of the change. To do this I had to generate a variable containing the number of months for each CEO for that year. I used the following commands to do this:

gen monthsyearend=mdy(12, 31, year(becameceo)) if cbc==year
format monthsyearend %td
gen months=(monthsyearend-becameceo)/30

gen monthsyearbegin=mdy(1, 1, year(leftofc)) if clc==year
format monthsyearbegin %td
replace months=(leftofc-monthsyearbegin)/30 if clc==year

To remove the CEO observations with the lesser tenure I now only had to give the command:
drop if months<6

5) Now I still need to do an extra check to see if years remain with more than one CEO. To do this I generate an extra variable that counts duplicates and (to make it more easy to spot) sort the data:
bysort gvkey year: egen dubbel=count(execid)
sort gvkey year

To view any duplicates I can now use the command:
bro if dubbel>1

In most cases the duplicates involve companies that have one or more fiscal year with more than one CEO. Although unusual, this happens. You can manually check these companies and decide what to do. As the number of companies / observations was not big, I chose to delete them from the dataset:
drop if dubbel>1

In the end I was left with 5200 observations with single CEO data for each fiscal year. Abovementioned commands are just examples but this shows how easy it is to clean Execucomp data using Stata. You can download the script (do file) with the commands here and adapt them to use them on your own dataset. I hope this shows how powerful Stata is in cleaning datasets.

N.B: When I used the number of months for the tenure of CEO’s for companies I based this assumption on calendar years. To be more exact, I would first have had to find out what the specific fiscal year ends are for each company for each year and then use these to compare the months of the tenure. As this example was meant to be a simple one, I decided not to do this.