Recently I got a research question on CEO related data. The main source that has this type of data at the university where I work is Compustat’s Execucomp database. This database has data on executives for companies that are (or have been) part of the S&P 1500. The source for the data seems to be mainly the proxy filings (or DEF 14A filings) for these companies.
The maximum amount of executives that you get data on is 9, although the board of a huge conglomerate can exceed this number in size.
Downloading data for companies from Execucomp is not a big deal and is straitforward using the Wharton interface. Usually you download all data for all executives from the database using the WRDS search option “search entire database” for a certain time frame.
When the research question concerns the CEO only, I recommend including the variables that indicate when some executive is (or was) the CEO. The fields I recommend are:
- Executive ID Number (execid): the unique ID for an executive is more uniwue then the name
- Exec_FullName (exec_fullname): this allows for easy fact checking
- Current CEO (pceo): indicates who the current CEO is
- Date Became CEO (becameceo): Date when an executive became CEO
- Date left as CEO (leftofc): Date when an executive left as CEO
- Year (year): Fiscal year.
For the data cleaning it is possible to use many programs, including Excel, SPSS and Stata. For this example I used the Stata program as this was the quickest way:
1) For the research question it was important to have a dataset with clear data on who is the CEO in each Fiscal Year. The first cleaning step was to delete all non-CEO executives from te download. This can be done easily using the command: drop if mi( becameceo)

2) An extra check can now be done on the data but may not be necessary as we probably lose this data anyway. The check has to do with the reliability of the dataset. We usually assume that the variable Date left as CEO (when this field has an end date) lies after the date in the variable Date Became CEO. Dates are actually numbers and therefore tO check data accuracy we can use the command: assert (leftofc-becameceo)<0
If the assertiation does not check out, Stata will say: assertion is false r(9);
To view the problematic observations/data, we can use: bro if (leftofc-becameceo)<0

Some of these CEO’s probably have been CEO at the firm several times with one or more hiatus. That is why (I think) these problematic observations can occur. If you have the time, it is possible to manually look up the filings and figure out what the dates should be. The quick way is just to delete these few observations. To remove these observations just change the command to:
drop if (leftofc-becameceo)<0
3) The next check phase is, to find out if the data for the CEO’s for each fiscal year concerns a fiscal year in which they actually were CEO (tenure). This involves a check based on the start date and the end date of the tenure and we need to compare these to the fiscal year. I usually do this step by step using the following commands:
gen cbc=year(becameceo) // this generates a variable cbc with the year of the date from becameceo
gen clc=year(leftofc) // this generates a variable clc with the year of the date from leftofc
Now I can remove data for CEO’s from the dataset that is from fiscal years when the person that is or was the CEO was not yet the CEO, or had already left:
drop if (clc-year)<0
drop if (cbc-year)>0
4) Now the only observations that can still cause problems (if I want a single CEO for each fiscal year) are those fiscal years where a CEO change took place. As a quick rule of thumb I chose to delete those CEO’s from the dataset where they had the lesser tenure (in months) for that specific year of the change. To do this I had to generate a variable containing the number of months for each CEO for that year. I used the following commands to do this:
gen monthsyearend=mdy(12, 31, year(becameceo)) if cbc==year
format monthsyearend %td
gen months=(monthsyearend-becameceo)/30
gen monthsyearbegin=mdy(1, 1, year(leftofc)) if clc==year
format monthsyearbegin %td
replace months=(leftofc-monthsyearbegin)/30 if clc==year
To remove the CEO observations with the lesser tenure I now only had to give the command:
drop if months<6
5) Now I still need to do an extra check to see if years remain with more than one CEO. To do this I generate an extra variable that counts duplicates and (to make it more easy to spot) sort the data:
bysort gvkey year: egen dubbel=count(execid)
sort gvkey year
To view any duplicates I can now use the command:
bro if dubbel>1
In most cases the duplicates involve companies that have one or more fiscal year with more than one CEO. Although unusual, this happens. You can manually check these companies and decide what to do. As the number of companies / observations was not big, I chose to delete them from the dataset:
drop if dubbel>1
In the end I was left with 5200 observations with single CEO data for each fiscal year. Abovementioned commands are just examples but this shows how easy it is to clean Execucomp data using Stata. You can download the script (do file) with the commands here and adapt them to use them on your own dataset. I hope this shows how powerful Stata is in cleaning datasets.
N.B: When I used the number of months for the tenure of CEO’s for companies I based this assumption on calendar years. To be more exact, I would first have had to find out what the specific fiscal year ends are for each company for each year and then use these to compare the months of the tenure. As this example was meant to be a simple one, I decided not to do this.
Email
Filed under: Compustat, Data management, Stata, Uncategorized | Comments Off on Using Stata to clean data