Excel and duplicates in a dataset

In the past I posted items on duplicate data from datasets (from Compustat databases) and how to find this out using programs like stata. For smaller datasets the program Microsoft Excel can also be used to investigate your datasets. AT some point, when merging datasets from multiple sources it may happen that you get duplicate data. Using Excel functions like a pivot table (draaitabel) and vlookup (vertikaal zoeken) duplicate data can be detected as follows:

1) first you need to tell Excel that your dataset is a specific delineated table using the menu option: Insert > Table

2) When you insert this option you need to indicate the range and if there are any column headers

3) The same menu tab insert (invoegen) has the option to create a summary pivot table (samenvatten met draaitabel). Select this option at the top left corner:

4) Create the PivotTable in a new (or empty) sheet:

5) The empty Pivot Table will be shown as follows:

6) In this example I drag the fields I want to check for duplicates (records/observaions) down to rows (Rijen). At the box values I drag a random field (in this case indfmt). On the left side of the screen the result will be presented:

7) In this example I click the field gvkey and choose the option to change the field settings

8) At subtotals & filters mark the second option No(ne)

9) In the tab Format & Print (Indeling & Afdrukken) mark the option: item labels in table format (Itemlabels in tabelvorm weergeven). Also make sure to mark the option to repeat item labels (Itemlabels herhalen):

10) The result chould look as follows:

11) Copy the list to a new sheet and use the concatenate function (Tekst.samenvoegen) to create a combination of GVKEY and year:

12) Create a similar link list with the Concatenate function in the original Compustat datasheet

13) Now use the VLOOKUP function in Excel in the Compustat datasheet to look up the link in the sheet with the result of the Pivottable. Make sure that the VLOOKUP option says False (Onwaar) at the 4th option as follows:

14) Finally, using the filter option you can find out if there are duplicatesd by selecting everything for which the count is higher than 1. The tricky thing will be deciding what to do with the result.

N.B.: If a dataset has over 100.000 observations the process described above will take some time as Microsoft Excel will require significant processing power from the computer. For larger datasets I reccomend using Stata.


Using Stata to count a sequence

Not long ago a student asked me how to calculate the tenure for auditing firms that were attached to firms within a certain time frame for up to a specific year. The student was interested in the effect of auditor tenure on companies. The subject of auditor rotation has become interesting as a research subject in the field of accountancy as in some countries firms are required to change their Auditor every few years to ensure auditor independence.

Using the Stata program this can be done using a script to identify “spells”. My colleague Matthijs de Zwaan helped me with this and created an example script based on an article in the Stata Journal.
The script I made was based on the example and can be adapted to count other kinds of sequences in datasets.  I also added a few lines of code to show how a dummy could be created to identify if the auditor in the example Excel file was a Big Four accounting firm or not. The script can be downloaded here and looks as follows:

The end result (after removing suplicate data) gives the tenure of the last auditor (latenure) for each firm at the year 2008:

N.B.: In the .do file the location for all files is the I: drive. You may need to change the drive letter in the original script to (for instance) c: or H: to get it to run. Make sure both the script file and Excel dataset are in the same location.


Using Stata to count segments

At the end of March I got asked the question how to use Compustat North America segments data and get aggregated counts on business segments or geographic segments. The variable business segments was to be used as an indicator of diversity: how many different types of activity a company included in it’s activities. The Geographic segments was to be used as an indicator on how widespread these activities were geographically for each company.

Specific important commands that are needed:

generate year=year(datadate) > using this command you get a year which can be used to count instances of segments. This is only needed if no available year can be used (like fiscal year / fyear).

drop > using this command you delete all variables that are non-essential from the dataset

order gvkey year > this command sorts the dataset first on the gvkey (= global company key which uniqely identifies a company in any Compustat database) and then by year

duplicates drop > this command deletes any possible duplicate annual data. This is important as the count only involves unique segments

by gvkey year: egen segmentcount = count(sid) > this command generates a new variable (segmentcount) and gives it the value of the count of the segment id codes (SID) for each company and individual year.

To later combine the business segments count dataset with the geographical count dataset a unique ID (UID) is created to later merge the datasets again into a single dataset.

Overall the script (.do file) I created does three things:
1) It creates a new dataset with business counts
2) A dataset with Geographical counts is made
3) It merges both newly created datasets into a single dataset

Example script screenshot:

The example dataset with .do script file can be downloaded here.

Example result screenshot:

N.B.: In the .do file the location for all files is the U: drive. You may need to change the drive letter in the original script to (for instance) c: or H: to get it to run. Make sure both the script file and Stata dataset are in the same location.


Working with Compustat Execucomp tenure data

Not too long ago I had a question from someone who was having trouble working with data from Compustat Execucomp. He wanted the yearly tenure for a specific group of people with the function of Chief Financial Officer. The research spanned a period of 2009-2014 (post-crisis). The data that was downloaded looked something like this:

Step 1: Data cleaning
One of the first steps to take in this case is to make sure to have the right kind of data to work with. In this case the columns H and I needed to be checked and cleaned. In column I you see the date when a person left as CFO working for the company. In this situation we see items like n/a where the data is unavailable and this means that the person still continues to work as CFO for the company. We first need to replace such values with the value 2014 for the last year of our research as we are looking for the tenure within the time frame 2009-2014. Any other years after 2014 can in this case also be replaced with 2014. You can use the search and replace function in Excel to do this step by step. Afterwards you can use the Filter option in Excel to check for weird data or outliers. In principle you have to check both columns with start year and left year to be sure there are no outliers (weird values).

In column H you see the year when a person joined the company. I am assuming that this was also the startyear for each person when he came to work as CFO at the company (I have not personally checked this). You see in the screenshot that not every year is seen as a numerical value: Excel shows little green triangle dots in the cells where it thinks the data is text. To ensure that a year is seen as a numerical value you can add a new column and use a trick to create numerical values in this column: devide cell by 1. See screenshot column J for the original data and column K for the new years. In the top left corner you see the “formula” you can copy downwards for all years.

Step 2: Calculate tenure for each year
In the example for this blog I only calculated the tenure for the final years of the research time frame (2013 and 2014). You can figure out the formulas for the other years. First I started calculating the CFO tenure for 2014. In this example I assume that if the startyear matches the lastyear someone has worked in this capacity for less then a full year making the tenure less then 1 and thus zero. In this case I get the right number of years of the tenure by substracting the startyear from the lastyear (= research year 2014). See example:

Now for the tenure of the previous year (2013) the If statement comes in handy to figure out the tenure for this year. The full formula is:

K3 = start year tenure within the research window (or before)
L3 = last year for the research window (2014)
P3 = tenure for the final year of the research window (= 2014)

The formula in essence does the following: if the previous year (in this case 2013) matches the start year (or is smaller), then the tenure is that of 2014 minus 1. If not then put the word False there. This last condition prevents outliers from causing problems. Screenshot:

The same formula can also be used for the previous years. All you need to do is change the formula for the right numbers. 2012 example: =IF(K3<=(L3-2);(P3-2);FALSE)

Step 3: Figure out the relevant years
This step is essentially not necessary as the filter option of excel is already available to make an annual selection by year but you then have to add the tenure year manually for each year after copying the relevant tenure data by year (to a new sheet).
The formulas in step two will provide a tenure of  0 (or more) as long as the end year for the tenure (within the time frame 2009-2014) is equal to (or higher) then the start year for the tenure (within the time frame or earlier). To know the tenure by year we create columns to show which tenure applies to what year. That allows us to use a filter in Excel to more easily get the relevant data where there is a tenure of more then zero. I created the columns N and O to get the tenure years for 2014 and 2013. The formula I used for 2014 is: =IF((P4>0);RIGHT($N$1;4);FALSE)
where: P4 = calculated number of tenure years for 2014. I have put the year in the name in the first cell at the top of the column (first) as the last 4 digits making it possible to use $N$1.

For 2013 all you need to change is the cell P4 into Q4 and the header $N$1 into $O$1. You also need to put the year in the name of the variable at the top of the column). Subsequent years work the same way.

Step 4: Filter the data for the relevant years
As the final step you can now use the standard filter option to copy the relevant data by year to a new sheet.

I would then also remove irrelevant data for other years which do not apply to the specific year I have filtered for. The end result would look something like this:


Stata & missing or duplicate data

When you work with large datasets or big data it may happen that after working with it for some time you need to take a good look at what has happened to the data. Especially if you work with combinations of datasets and/or work on it with more people. Another instance is: when you have received the dataset from a researcher or organization and need to remove superfluous data that may not be relevant to your own research.

1) Investigate the data
There are a few simple commands in Stata that provide a good overview:

  • desc or describe = this command provides a brief summary of the entire dataset
  • summ or summarize = another fine command that gives a quick overview of all the variables with information on: number of observations, the mean, standard deviation, and the lowest and highest values (min & max)
  • tab or tabulate = a good way to cross-reference several items and see whether there are any obvious outliers or patterns in the data

These and many more commands or combinations of commands allow you to watch and judge the data.

2) Missing data

  • Using the summ command it was easy to see that some fields had no data. In this case it may be a good idea to delete them as they serve no purpose here. You can delete a variable/field by typing drop variable. For example: drop CIKNew. A range of variables next to each other can also be dropped with a single command. For this example: drop indfmt – conm. There are many more options to delete entire variables/fields from a dataset.
  • Another way to clean data can be applied if you require only those observations/records that (for crucial variables) do not have missing values/data. Deleting observations can be done using the missing value command: drop if mi(variable). For example: drop if mi(Totaldebt). The Stata result screen will show the result of this action: number of observations deleted.
  • Deleting missing values is, however not always straightforward. Stata shows missing values as dots if you view a dataset with the browse command. In some datasets, however, missing values may sometimes (partially) be represented by another value in some observations. If this is the case it is a good idea to replace some of these values first to allow for easier editing/deletion. If in your dataset the number zero indicates the same thing as a missing value (in some records) you can use mvdecode to replace them with a dot (= how Stata usually represents missing values). The command would look like: mvdecode variable, mv(0=.). Afterwards you can the remove all missing values the usual way with drop.

3) Removing duplicate data
When you are using multiple datasets and have combined them you could have some duplicate observations. Using data from some specific databases may also get you unintentional duplicate data. In Compustat you run the risk of duplicates if, for instance, you only need data for industrial type companies but, when doing the search in the Fundamentals Annual database you forget to unmark the option FS at the screening options at Step 2 in WRDS. Some companies have more than one statement in Compustat for the same fiscal years and will get you both FS and IND type/format statements.
The Stata command to remove duplicates should be chosen carefully. I usually combine a unique ID code with a specific event year or date. For instance: duplicates drop CIK year, force


  • duplicates drop removes duplicates
  • in this example duplicates are identified by the combination of the variable CIK (ID code = Central Index Key) with the variable year
  • duplicates will be removed without warning by including the las bit:
    , force

Personally I think removing duplicates without first checking may not always be the smart thing to do. If you are working with a large dataset it may be a good idea to first tag possible duplicates and then have a look before removing these. The command to tag the duplicates is: duplicates tag, gen(newvariable). This command checks the whole dataset with all variables for all observations for duplicates and stores the result as a number in the new variable with the name newvariable.

Another version of removing duplicates may have to do with the number of necessary observations by entity in a dataset. In some cases an analysis requires a minimum number of observations/records to be relevant. If there are too few observations you may again remove them only, in this case it can be done using the count function on the entity (for example a company identifier like ISIN, CIK, or GVKEY). You do this as follows:

  • Sort the dataset on the ID that will be counted. Example command: sort CIK
  • Now count the number of ID’s in the dataset and store them in a variable. Example command: by CIK: egen cnt = count(year). What this does is count the times each CIK ID occurs by counting the years and stores the count/number of years in the new variable cnt.
  • We can now remove observations of entities for which the count (of years) is below the number stored in the variable cnt. Example command: drop if cnt<10. This means that we need a minimum of 10 observations for an entity.

N.B.: A few final remarks on handling missing data concern the way you work with the data. When you are performing such cleaning actions as described above it is a good idea to first make a copy of your database before you do all this and save the actions as there is no undo like in many programs. You can also experiment a bit with a copy and you should definitely save the actions that you choose the finalize in a Do-file and when yiou continue from there again start with a copy. To keep track of your versions of the database you can fut a date in the name of each version. When you work with much data over a long time it is also a good idea to save space and memory by compressing the database with the command: compress. Some variables will then be changed to save space.


Stata & date formats

The past few weeks I have been learning about and working with Stata. This program can handle a lot of data and uses commands to edit data or analyse it. There are many commands available and one command is very handy when it comes to changing date formats.
When you work with data and intend to do an event study using the Request Table option from Datastream you may have to change the date format for the event dates to something that Datastream understands (also depends on the date format used by the computer).

In Excel you can use the command =TEXT(A1,”dd-mm-yy”). The Dutch version looks similar: =TEKST(A1;”dd-mm-jj”)
Afterwards you still have to copy and the paste as values to get workable dates for Datastream.

Now if you are working with a dataset in Stata you can use a command to do a similar thing: format [variablename] %tdDD-NN-YY

The elements mean:
format = change the format of a variable
variablename = indicates the date/variable item you wish to change
%tdDD-NN-YY = indicates that the timedate variable should look like 17-02-99. DD = day with two digits, NN = month with two digits and YY = Year with two digits.

Example of dataset before the change:

Now I use the command to change the look of variable dateff: format dateff %tdDD-NN-YY. The result looks as follows:

If you look closely you see that “under water” the variable still looks as it is originally presented: the first date 20-08-07 still looks like 20aug2007. On the right side you also see that dates in Stata are actually variables of the type Integer = a number that can be written without a fractional component (Wikipedia).
Interesting sidenote: in Stata dates are actually integer numbers, just like in Excel. Where in Excel number 1 represents 1-1-1900 in Stata the number 1 represents 1-1-1960. Negative integer numbers in Stata represent older dates. When you work with older dates it is wiser to use 4-digit year formats to avoid confusion.

There are a lot more options to change the dateformat. Another example that may be useful for using with the Event Tool in the CRSP database is:

format variablename %tdCCYYNNDD

In this instance dates will be formatted with 4 digits for a year, for example: 19871022

If you want to know all the options for formatting dates you can use the Help command: help datetime display formats.

Important: If you export a formatted file from Stata to an Excel file you will lose the formatting for the dates you just did. If you wish to keep the dates with the (changed) formatting you need to export/save the file as a (delimited) text file using the command: export delimited filename.txt and then open the text file in Excel using the text wizard. Also remember to open the formatted date column as text to keep the formatted date in tact!

This example applies to Stata editions 13 and 14.


Changing Datastream data & Stata

The past few weeks I have been learning about and working with Stata. This program can handle a lot of data and uses commands to edit data or analyse it. A sequence of commands can be saved in .do files and then rerun as a script. There are many commands available and one of them is very handy when it comes to changing data from columns to rows. It is similar to the transpose option that Microsoft Excel offers for quick changes. It is the reshape command.

In this blog post I will use the reshape command to change Datastream data as an example. Similar work can be done for other downloads from databases like Amadeus or Bankscope.
For downloads it can take a bit of work to change the data and rework it before Stata can be used to merge it with other data. In the case of Datastream this has everything to do with the fact that Datastream does not repeat an ID for each year of data you download (unless using a Request Table search). Example screenshot original download from Datastream (transposed):

The following changes need to be made in Excel:

1) Get the ID back from the download without the extra Datastream codes. This can be done using the function MID(Cell,Start,Number) (= in Dutch versions of Excel: the Deel() function). MID() allows you to get part of the contents back from a Cell. Example:

2) Next you need to create column year headers that start with a text character and then the year of the column (this step is important for Stata later). It can be done smartly using a specific cell for the header and then combining it with the original year above each column as follows with the formula =($A$1&(B$1)). The dollar signs fix certain cells or a row (or column). Example:

3) Now save the Datastream Excel file without these formulas. If necessary use copy > paste special > Values

4) Next we start up Stata and get the file. In this case we prepared the Datastream file nicely and can therefore use the command: import excel using DS-Prepared.xlsx, firstrow
The firstrow option tells Stata that this row has the variable names and lables. Example:

5) When we look at the data with the browse command you see that the data looks basically the same as how it appeared in Excel:

Only two things are important here:

  • The data is not yet formatted as we need it: the numerical data are now strings (red) and need to be changed later
  • To later merge data the company ID’s need to be combined with the years

6) We now transpose the data using the command Reshape. This command has many options and allows you to rework tables. If you need more explanation on how it works you can type the command help reshape and Stata will provide much information on how to use the command. For this example the command is:

reshape long Y, i(CompanyCode) j(year)

In a nutshell this tells Stata that the years for the columns need to be repeated for the observations/records and can be found in the names of the variables. The ID for each company also needs te be repeated for the years of all observations. Example result:

When you use the command browse again you see the result of the data change:

7) We now need to create a unique ID combination to later merge the data. To do this we need to turn the years back from numerical values into strings using the tostring command: tostring year, replace. The final step is creating the unique ID combination UID with the command: gen UID = CompanyCode+year

8) Now there are a few steps left to finalize the Datastream data but not all of it is necessary. One thing is necessary: for Stata the numerical data is still a (text) string and now need to be changed. Use the destring command to as follows: destring Y, replace force

Destring also removes any values that were not numbers (but text) and replaces them with an empty cell (.) because of the option force (this can be dangerous if the content of a cell is a combination of a numerical value and text).

Optional stuff:

  • Use the drop command to remove unnecessary columns
  • Remove observations without data using, for example: drop if mi(Y)
  • The command order can be used to reorder data and put the UID variable in the first column
  • The command rename can be used to rename variables

The end result could look as follows:

To make stuff easier you could create a .do file that contains all the commands in sequence and this allows you to reuse it in similar situations for similar downloads. Here you can download an example .do file that shows examples of all the commands.

This example applies to Stata editions 13 and 14.