Stata solution to reshape FactSet data

In two previous posts I showed examples on how to reshape data from wide to long format. Both post were about Datastream, but one was about regular downloads and one about large datasets. In Stata reshaping data usually works fine but may sometimes not work very well. In some versions of Stata or in some cases the reshape command seems to work only in part and not all data is reshaped. The solution to the problem was changing the usual reshape command slightly.

1) In the example I first prepared the FactSet data as per usual. The bonds data in the example first looks as follows:

First I used Copy+Paste > transpose to change the rows and columns. Then I used the Excel formula =”Y”&TEXT(C1;”yyyymmdd”) to change the dates for Stata. Finally I copy+pasted as values to get rid of the Excel formulas. The data now looks as follows:

2) Normally, after importing this data in Stata , I would use the command:
reshape long Y, i(Code) j(origdate). In some Stata versions this is fine, but in other editions this does not work very well. The way to get this to work is now adding the word string to the command as follows:

reshape long Y, i(Code) j(origdate) string

N.B.: Afterwards I also need to do some work on the dates etc. to be able to get the data in a format to be able to work with it. Reminder: for really large downloads from FactSet you may have to reshape the data in batches because Stata versions may be limited by the number of variables it can work with. Afterwards the datasets can be merged by appending them. You can download an example do file here with the commands and an excel file to work on with this.


Using FactSet to find and select Green bonds

The previous post was about Green Bonds, on how the data is available through SDC Platinum, and how to get the time series data (like yield) through FactSet. The latter database also has data on bonds, including Green bonds. Usually I look for different categories of bonds using a type variable. The main categories for selecting bonds in FactSet in the start screen do not cover a type variable indicating it’s purpose(/type). In FactSet looking for these bonds works as follows:

1) Start up the FactSet software use the Screening option at the top to go to the Universal screening screen. In this screen click the tab Debt and then click the option “Starter Screen – Corporate/Gov/AGency

2) In the next screen you can use the (main) Select criteria to narrow down to certain bonds, including choosing nations as well as industries and a time frame. The bottom half of the screen shows the selection/Results and updates this immediately depending on the selected criteria. In the field Add columns you now type the variable name FI_Issue_FLG and make a preliminary choice from the suggestions.

3) When you have selected an option a popup appears where the Item option allows you to select different types of bonds, including Green bonds. Select this option and click the button Add New to add an extra column.

4) This extra column can be used to filter the selected bonds and select only bonds of this type. To call up the filter click the arrow in the name of the column Green Bond Flag. Next select the option filter from the options that appear.

5) The Filter popup allows you to select values. In this case the Green Bond variable is either yes or no which equals to 1 to 0. Drag the line at min to the right until it matches the line / column on the right at max.

6) Click the button done when you are finished and finalize the list of Green bonds. The result will look as follows:

N.B.: Multiple columns can be added providing different information on the bonds. This screen is intended to make a selection of bonds. If you wish to download time series data on the bonds you download the list to Excel (use the green arrow pointing downwards) and use the FactSet add-in to do this. For bonds downloading the series data is different from regular series. Check the previous post on how to do this step by step using FactSet.


Using SDC & FactSet to download (green) bonds data

Sustainability has been a hot topic for many years now and one of the related topics concerns the rise of Green Bonds over the last 10 years. One of the databases that has data on the issuance of these bonds (and other types) is the Global New Issue database in SDC Platinum (more recently called Eikon). Usually I look for different categories of bonds using a type variable. In this case the category is defined not by the regular type but by it’s purpose. Looking up these Green bonds can therefore be done by using the variable “Use of Proceeds“. This variable is available under the “all items” search tab.

After selecting the bonds you can create reports and download specific data (variables) on the bonds itself (menu option: Report > New custom ). When finished selecting the variables it is possible to call up and download the data.

If I need to get time series data on these Green bonds (or any other bonds) I need to use a second database/source for this. Options include Datastream/Eikon or FactSet. As I currently have access only to the latter I will explain how this works in FactSet. First off you need to get the data for the series and the dates that tell you how long the bonds run. In this case the easiest option to use is the ISIN code. For the time period I use the “Issue Date” and the “Final Maturity” date.

1) To Start working with the data I first filter the data on the availability of the ISIN code and also remove remarks like “perp.” or “perpetual” where it is stated in the column with the maturity date. The end result of the data we need looks as follows:

2) I now first need to take care of the fact that each bond series in my selection has a different maturity date which may lie in the past or the future. As I download the data now, in my case the latest data would be for now() or yesterday. I therefore need to create a date where the maturity date was perpetual and give it the date now() and replace the maturity date where the date lies in the future with now(). Using the function =now() in a specific cell I can use this as a point of reference. It is of course also possible to choose a specific date. Using a nested conditional IF() statement in Excel I can now generate a new end date for each series:

3) The next step is finding out how long each bond series runs as they have different time periods. Using the new generated date I deduct the issue date from this and divide this by 30 to get an approximate number of months. I then sort the table to get the currently longest running series at the top. The result looks as follows:

4) I need to get the longest running series at the top because of the way I should download the data in bulk in Excel from FactSet. For bulk downloads I need to copy formulas onwards and using shorter series as the beginning (and copying onwards) ensures incomplete data.

5) The final substep to prepare is to copy and paste the newly calculated column with dates (from step 2) and make them values instead of formulas. The series and dates are now ready to be used for a bulk download of the Yield from FactSet. The result looks as follows:

N.B.: If you wish to know how to do the bulk download step by step, please check out the previous post at this blog on Bulk downloads or Event studies.


Using Stata to clean data

Recently I got a research question on CEO related data. The main source that has this type of data at the university where I work is Compustat’s Execucomp database. This database has data on executives for companies that are (or have been) part of the S&P 1500. The source for the data seems to be mainly the proxy filings (or DEF 14A filings) for these companies.
The maximum amount of executives that you get data on is 9, although the board of a huge conglomerate can exceed this number in size.

Downloading data for companies from Execucomp is not a big deal and is straitforward using the Wharton interface. Usually you download all data for all executives from the database using the WRDS search option “search entire database” for a certain time frame.
When the research question concerns the CEO only, I recommend including the variables that indicate when some executive is (or was) the CEO. The fields I recommend are:

  • Executive ID Number (execid): the unique ID for an executive is more uniwue then the name
  • Exec_FullName (exec_fullname): this allows for easy fact checking
  • Current CEO (pceo): indicates who the current CEO is
  • Date Became CEO (becameceo): Date when an executive became CEO
  • Date left as CEO (leftofc): Date when an executive left as CEO
  • Year (year): Fiscal year.

For the data cleaning it is possible to use many programs, including Excel, SPSS and Stata. For this example I used the Stata program as this was the quickest way:

1) For the research question it was important to have a dataset with clear data on who is the CEO in each Fiscal Year. The first cleaning step was to delete all non-CEO executives from te download. This can be done easily using the command: drop if mi( becameceo)

2) An extra check can now be done on the data but may not be necessary as we probably lose this data anyway. The check has to do with the reliability of the dataset. We usually assume that the variable Date left as CEO (when this field has an end date) lies after the date in the variable Date Became CEO. Dates are actually numbers and therefore tO check data accuracy we can use the command: assert (leftofc-becameceo)<0

If the assertiation does not check out, Stata will say: assertion is false r(9);
To view the problematic observations/data, we can use: bro if (leftofc-becameceo)<0

Some of these CEO’s probably have been CEO at the firm several times with one or more hiatus. That is why (I think) these problematic observations can occur. If you have the time, it is possible to manually look up the filings and figure out what the dates should be. The quick way is just to delete these few observations. To remove these observations just change the command to:
drop if (leftofc-becameceo)<0

3) The next check phase is, to find out if the data for the CEO’s for each fiscal year concerns a fiscal year in which they actually were CEO (tenure). This involves a check based on the start date and the end date of the tenure and we need to compare these to the fiscal year. I usually do this step by step using the following commands:
gen cbc=year(becameceo) // this generates a variable cbc with the year of the date from becameceo
gen clc=year(leftofc) // this generates a variable clc with the year of the date from leftofc

Now I can remove data for CEO’s from the dataset that is from fiscal years when the person that is or was the CEO was not yet the CEO, or had already left:
drop if (clc-year)<0
drop if (cbc-year)>0

4) Now the only observations that can still cause problems (if I want a single CEO for each fiscal year) are those fiscal years where a CEO change took place. As a quick rule of thumb I chose to delete those CEO’s from the dataset where they had the lesser tenure (in months) for that specific year of the change. To do this I had to generate a variable containing the number of months for each CEO for that year. I used the following commands to do this:

gen monthsyearend=mdy(12, 31, year(becameceo)) if cbc==year
format monthsyearend %td
gen months=(monthsyearend-becameceo)/30

gen monthsyearbegin=mdy(1, 1, year(leftofc)) if clc==year
format monthsyearbegin %td
replace months=(leftofc-monthsyearbegin)/30 if clc==year

To remove the CEO observations with the lesser tenure I now only had to give the command:
drop if months<6

5) Now I still need to do an extra check to see if years remain with more than one CEO. To do this I generate an extra variable that counts duplicates and (to make it more easy to spot) sort the data:
bysort gvkey year: egen dubbel=count(execid)
sort gvkey year

To view any duplicates I can now use the command:
bro if dubbel>1

In most cases the duplicates involve companies that have one or more fiscal year with more than one CEO. Although unusual, this happens. You can manually check these companies and decide what to do. As the number of companies / observations was not big, I chose to delete them from the dataset:
drop if dubbel>1

In the end I was left with 5200 observations with single CEO data for each fiscal year. Abovementioned commands are just examples but this shows how easy it is to clean Execucomp data using Stata. You can download the script (do file) with the commands here and adapt them to use them on your own dataset. I hope this shows how powerful Stata is in cleaning datasets.

N.B: When I used the number of months for the tenure of CEO’s for companies I based this assumption on calendar years. To be more exact, I would first have had to find out what the specific fiscal year ends are for each company for each year and then use these to compare the months of the tenure. As this example was meant to be a simple one, I decided not to do this.


FactSet instruction movies

To facilitate the introduction and use of FactSet I have created some movies on Youtube that will give a quick idea of how some searches and downloads work using the FactSet software. I link them here too to make sure they are used as much as possible:

Movie 1: Creating a list of (listed) companies
This movie show how to use the FactSet interface to create your own selection of companies (in this case listed ones).

Movie 2: Bulk downloads for lists of companies
Using the Excel add-in for FactSet it is easy to do bulk downloads. This shows how this may be done.

Movie 3: This shows how to get historical index lists

Movie 4: Using the FactSet Excel add-in to do an event study
The example for this movie includes making a selection of Merger & Acquisition events and then using this list to download data surrounding the varying specific event windows


FactSet static data items

Some of the data in the FactSet database is not time series data. It is static data where only recent/current information is available on some topics/companies. This includes data on board size, board diversity, committees, shareholder data, geographical segmentation, etc.

Although the data is not available as time series data, the current data may still be sufficient in some smaller studies or for student papers. When downloading data of this type it is important to note that downloading this data in bulk causes missing data if done in a similar way as you would download time series data (see earlier post).

The example below shows how you would normally download data in bulk format: you download the data for the first company in a list and then change the FactSet formulas to be able to copy the formulas onwards for the rest in your list. In the example I have two items that I want current information on: names of members of the board of directors (BOD), and I want to see which companies are related to the companies in my list. See example:

After copying the formulas and refreshing the download using the FactSet add-in, the result would look as follows:

On the surface the download looks okay. Unfortunately we have missing data as the number of members of the BOD varies depending on the type and size of a company. Also, the number of related companies varies a lot depending on the company in our list. The missing data is caused by the amount of available data for the first company: this causes a maximum amount of data which is incorrect because we copy the formulas onward.

The best way to download the full/right amount of data in each case (for each variable) is doing downloads through the add-in in batches of 25 companies at a time. This is a limit set by the FactSet add-in in Excel.

Step 1: select the first 25 companies/codes of your list

Step 2: click the red button in the FactSet Add-in to import the list

Step 3: confirm the selection of identifiers by clicking OK in the pop-up

Step 4: search for a variable and download the data. Below you find an example screenshot for two variables:


FactSet financial data

Recently I got a question on a variable that was downloaded from FactSet and this data did not seem to match the original reported data for this company. The item was pretty basic: Total Assets. In this case the Total assets for American Biltrite Inc (ABLT) that was reported in the (10-K) SEC filings for financial year 2006 was 331,672 (Million US dollar).

The FactSet download shows for 2006 a different number: 357,395. At first I thought this had to do with a restatement or different accounting standards or filings used. The data download looks like this:

The reason for this difference has to do with how Total Assets are presented by FactSet. In the database it is an aggregate: Total Assets + (Net deferred Tax Assets – Deferred Income taxes).
If you look in the Company snapshot in FactSet you can find this calculation by clicking on the number for Total Assets:

In the example for ABLT it would be: 331,672 + (27,541 – 1,818) = 357,395.

There are probably sound reasons for changing the original reported total assets. If you want to know more you can check up stuff online (investopdia) or in the theoretical books on income statements and balance sheet items.

If you wish to recreate the original numbers for SEC Total Assets you can download the variable in FactSet for Deferred Tax Assets (= FF_DFD_TAX_DB) and calculate it yourself.