Using Stata to clean data

Recently I got a research question on CEO related data. The main source that has this type of data at the university where I work is Compustat’s Execucomp database. This database has data on executives for companies that are (or have been) part of the S&P 1500. The source for the data seems to be mainly the proxy filings (or DEF 14A filings) for these companies.
The maximum amount of executives that you get data on is 9, although the board of a huge conglomerate can exceed this number in size.

Downloading data for companies from Execucomp is not a big deal and is straitforward using the Wharton interface. Usually you download all data for all executives from the database using the WRDS search option “search entire database” for a certain time frame.
When the research question concerns the CEO only, I recommend including the variables that indicate when some executive is (or was) the CEO. The fields I recommend are:

  • Executive ID Number (execid): the unique ID for an executive is more uniwue then the name
  • Exec_FullName (exec_fullname): this allows for easy fact checking
  • Current CEO (pceo): indicates who the current CEO is
  • Date Became CEO (becameceo): Date when an executive became CEO
  • Date left as CEO (leftofc): Date when an executive left as CEO
  • Year (year): Fiscal year.

For the data cleaning it is possible to use many programs, including Excel, SPSS and Stata. For this example I used the Stata program as this was the quickest way:

1) For the research question it was important to have a dataset with clear data on who is the CEO in each Fiscal Year. The first cleaning step was to delete all non-CEO executives from te download. This can be done easily using the command: drop if mi( becameceo)

2) An extra check can now be done on the data but may not be necessary as we probably lose this data anyway. The check has to do with the reliability of the dataset. We usually assume that the variable Date left as CEO (when this field has an end date) lies after the date in the variable Date Became CEO. Dates are actually numbers and therefore tO check data accuracy we can use the command: assert (leftofc-becameceo)<0

If the assertiation does not check out, Stata will say: assertion is false r(9);
To view the problematic observations/data, we can use: bro if (leftofc-becameceo)<0

Some of these CEO’s probably have been CEO at the firm several times with one or more hiatus. That is why (I think) these problematic observations can occur. If you have the time, it is possible to manually look up the filings and figure out what the dates should be. The quick way is just to delete these few observations. To remove these observations just change the command to:
drop if (leftofc-becameceo)<0

3) The next check phase is, to find out if the data for the CEO’s for each fiscal year concerns a fiscal year in which they actually were CEO (tenure). This involves a check based on the start date and the end date of the tenure and we need to compare these to the fiscal year. I usually do this step by step using the following commands:
gen cbc=year(becameceo) // this generates a variable cbc with the year of the date from becameceo
gen clc=year(leftofc) // this generates a variable clc with the year of the date from leftofc

Now I can remove data for CEO’s from the dataset that is from fiscal years when the person that is or was the CEO was not yet the CEO, or had already left:
drop if (clc-year)<0
drop if (cbc-year)>0

4) Now the only observations that can still cause problems (if I want a single CEO for each fiscal year) are those fiscal years where a CEO change took place. As a quick rule of thumb I chose to delete those CEO’s from the dataset where they had the lesser tenure (in months) for that specific year of the change. To do this I had to generate a variable containing the number of months for each CEO for that year. I used the following commands to do this:

gen monthsyearend=mdy(12, 31, year(becameceo)) if cbc==year
format monthsyearend %td
gen months=(monthsyearend-becameceo)/30

gen monthsyearbegin=mdy(1, 1, year(leftofc)) if clc==year
format monthsyearbegin %td
replace months=(leftofc-monthsyearbegin)/30 if clc==year

To remove the CEO observations with the lesser tenure I now only had to give the command:
drop if months<6

5) Now I still need to do an extra check to see if years remain with more than one CEO. To do this I generate an extra variable that counts duplicates and (to make it more easy to spot) sort the data:
bysort gvkey year: egen dubbel=count(execid)
sort gvkey year

To view any duplicates I can now use the command:
bro if dubbel>1

In most cases the duplicates involve companies that have one or more fiscal year with more than one CEO. Although unusual, this happens. You can manually check these companies and decide what to do. As the number of companies / observations was not big, I chose to delete them from the dataset:
drop if dubbel>1

In the end I was left with 5200 observations with single CEO data for each fiscal year. Abovementioned commands are just examples but this shows how easy it is to clean Execucomp data using Stata. You can download the script (do file) with the commands here and adapt them to use them on your own dataset. I hope this shows how powerful Stata is in cleaning datasets.

N.B: When I used the number of months for the tenure of CEO’s for companies I based this assumption on calendar years. To be more exact, I would first have had to find out what the specific fiscal year ends are for each company for each year and then use these to compare the months of the tenure. As this example was meant to be a simple one, I decided not to do this.


FactSet instruction movies

To facilitate the introduction and use of FactSet I have created some movies on Youtube that will give a quick idea of how some searches and downloads work using the FactSet software. I link them here too to make sure they are used as much as possible:

Movie 1: Creating a list of (listed) companies
This movie show how to use the FactSet interface to create your own selection of companies (in this case listed ones).

Movie 2: Bulk downloads for lists of companies
Using the Excel add-in for FactSet it is easy to do bulk downloads. This shows how this may be done.

Movie 3: This shows how to get historical index lists

Movie 4: Using the FactSet Excel add-in to do an event study
The example for this movie includes making a selection of Merger & Acquisition events and then using this list to download data surrounding the varying specific event windows


FactSet static data items

Some of the data in the FactSet database is not time series data. It is static data where only recent/current information is available on some topics/companies. This includes data on board size, board diversity, committees, shareholder data, geographical segmentation, etc.

Although the data is not available as time series data, the current data may still be sufficient in some smaller studies or for student papers. When downloading data of this type it is important to note that downloading this data in bulk causes missing data if done in a similar way as you would download time series data (see earlier post).

The example below shows how you would normally download data in bulk format: you download the data for the first company in a list and then change the FactSet formulas to be able to copy the formulas onwards for the rest in your list. In the example I have two items that I want current information on: names of members of the board of directors (BOD), and I want to see which companies are related to the companies in my list. See example:

After copying the formulas and refreshing the download using the FactSet add-in, the result would look as follows:

On the surface the download looks okay. Unfortunately we have missing data as the number of members of the BOD varies depending on the type and size of a company. Also, the number of related companies varies a lot depending on the company in our list. The missing data is caused by the amount of available data for the first company: this causes a maximum amount of data which is incorrect because we copy the formulas onward.

The best way to download the full/right amount of data in each case (for each variable) is doing downloads through the add-in in batches of 25 companies at a time. This is a limit set by the FactSet add-in in Excel.

Step 1: select the first 25 companies/codes of your list

Step 2: click the red button in the FactSet Add-in to import the list

Step 3: confirm the selection of identifiers by clicking OK in the pop-up

Step 4: search for a variable and download the data. Below you find an example screenshot for two variables:


FactSet financial data

Recently I got a question on a variable that was downloaded from FactSet and this data did not seem to match the original reported data for this company. The item was pretty basic: Total Assets. In this case the Total assets for American Biltrite Inc (ABLT) that was reported in the (10-K) SEC filings for financial year 2006 was 331,672 (Million US dollar).

The FactSet download shows for 2006 a different number: 357,395. At first I thought this had to do with a restatement or different accounting standards or filings used. The data download looks like this:

The reason for this difference has to do with how Total Assets are presented by FactSet. In the database it is an aggregate: Total Assets + (Net deferred Tax Assets – Deferred Income taxes).
If you look in the Company snapshot in FactSet you can find this calculation by clicking on the number for Total Assets:

In the example for ABLT it would be: 331,672 + (27,541 – 1,818) = 357,395.

There are probably sound reasons for changing the original reported total assets. If you want to know more you can check up stuff online (investopdia) or in the theoretical books on income statements and balance sheet items.

If you wish to recreate the original numbers for SEC Total Assets you can download the variable in FactSet for Deferred Tax Assets (= FF_DFD_TAX_DB) and calculate it yourself.



Currency conversion in FactSet

Some databases offer currency conversion as an option which makes it (for instance) easier to combine and compare data on companies or series that report in different currencies. FactSet has such a currency conversion as an option as well. Below is a brief explanation I got from the helpdesk on how this works.

1) FactSet uses WM/Reuters exchange rate series as the source to recalculate data

2) Annual income Statements and P&L statements are recalculated using the average of the daily exchange rate of a specific currency for the last “X” months until the end of that period. This depends on the frequency of the dataset used. In this case FactSet uses a single overall daily average exchange rate for the specified time period based on calendar or fiscal year (which is usually 12 months).

3) The recalculation for quarterly income Statements and P&L statements works the same way, only this time the overall average daily exchange rates used are based on the 3 months prior to the end of each specified time period based on calendar or fiscal quarters.

4) Balance sheet items use currency conversion as of the period end dates. Again, this is a single specific average daily exchange rate but this time for the specific period end date.

5) Where the data has a monthly frequency, the recalculation is done using the specific average daily exchange rate at the month-end dates of the specified time period.

6) Time series with daily data are recalculated using the average daily exchange rates on the days.

To sum it up I created a picture with an overview here:


FactSet and historical index lists

Somebody asked me recently: is it possible to call up historical lists of an index, for instance: the AEX for an older year? My answer was yes, although I do not think this option is available for all indexes as some of them are proprietary and may require an extra license. I am also not sure how far back index compositions are available.

This is how it works:

  • Start up FactSet and click the Universal Screening option
  • In the next screen select the option Starter Screen to get to the Basic Start Screen
  • Remove the example S&P 500 index and in the Add criteria box search for the index: AEX
  • Select the option that is presented for Netherlands AEX (Index). The selection of this list will automatically generate a list of companies that are currently part of the AEX index
  • In the criteria screen now click on the selected index: Netherlands AEX (Index). Clicking the option will make it turn yellow and show the following formula in the criteria search box: FG_CONSTITUENTS(181010,0,CLOSE)
  • Clicking on this formula calls up a popup that shows more options that can be changed. Replace the zero in the formula with the date of choice to get something like: FG_CONSTITUENTS(181010,1/1/2005,CLOSE)
  • Changing the formula will generate a new list for the index with the selection of companies in the index of that time

You can now choose to download the list or change it first (by adding columns with static data) and then download the list as you need it


FactSet and large lists of series

FactSet is an application that has similar datasets compared to Datastream and Bloomberg although there are also many differences. If you need access a subscription is required. The software can be downloaded from their website.

The main FactSet application can be used to quickly search and finc out if certain information is available or where it can be found. Quick comparisons can be done using the Charting option. If you need to download more data on more than 25 series/companies the Excel add-in for FactSet is what you need to use as the final step to actually download data. Below follows a step by step approach on how to download data:

1) A list of companies/series
There are two ways to work with lists in FactSet:

  • You can use the Screening tool (through which you can use universal screening or Idea screening) in the FactSet menu to create your own list and download codes to Excel
  • If you have a list of codes (ISIN, Sedol, etc.) you can use these in Excel.

2) Preparing the list
Below you see an example list that I downloaded from FactSet.

To use the list with FactSet I create a new sheet in the Excel Worksheet and copy the codes there transposed. This can be done using the right-click mouse button:

The limit for using the list depends on the version of Microsoft Excel. A list of 200 companies should not be a problem. Prepare the list by adding two dates indicating the start and end of the period for which you wish to download data. You can indicate a Frequency too but this is not necessary:

3) Downloading the data for 1 company

  • Make sure to click a cell below the rows with data, for instance: cell A7
  • If it has not been done already: now click the FactSet Tab / Add-in and click the option “Show or Hide Sidebar“. This option is the first button on the left side in the FactSet add-in
  • In the FactSet Sidebar click the middle option “Insert
  • Click the Select “cell reference buttonto the right of the “identifier lookup button”

  • Click the first cell with a company code (B1). When you do this FactSet will immediately lookup the code and present the name of the corresponding company / series
  • In the Data Item section search for a data variable you need. The example is: Total Assets. Type this in and press the Enter button on your keyboard. The search result shows available options (depends on whether the Sources are filtered)
  • Click the top option: “Total Assets > Factset Fundamentals – FF_ASSETS
  • A new section appears showing the options at Inputs for FF_ASSETS. In this section again use the Cell Reference buttons to select the time frame: cells B3 and B4 in the example
  • A temporary result is shown (in green). Now click the big grey button option “Insert” in the sidebar to finalize

4) Downloading the data for the rest of the series/codes
To download the data for the other codes we need to use the $ sign to fix cell or row references.

  • First change the formula in Cell B7 from:
    =FDS(“NA”;B1;”FG_COMPANY_NAME()”)  to:
  • Change the formula in B8 from
  • Change the third cell in B9 but use the key-combination SHIFT+CTRL+Enter to finalize the change! You change the formula in the cell from:
    =FDS(“NA”;B1;”FF_ASSETS(ANN_R,”&B3&”,”&B4&”)”) to:
  • Select the Cells from B7 to B26 and use the little black square in the bottom left corner of the seclected cells to B26 to copy the formulas to the right for the other codes
  • The screen will not immediately show data but the text #Calc
  • To get the actiual data now click the button Refresh in the FactSet Add-in. The result should look like this: