Excel and duplicates in a dataset

In the past I posted items on duplicate data from datasets (from Compustat databases) and how to find this out using programs like stata. For smaller datasets the program Microsoft Excel can also be used to investigate your datasets. AT some point, when merging datasets from multiple sources it may happen that you get duplicate data. Using Excel functions like a pivot table (draaitabel) and vlookup (vertikaal zoeken) duplicate data can be detected as follows:

1) first you need to tell Excel that your dataset is a specific delineated table using the menu option: Insert > Table

2) When you insert this option you need to indicate the range and if there are any column headers

3) The same menu tab insert (invoegen) has the option to create a summary pivot table (samenvatten met draaitabel). Select this option at the top left corner:

4) Create the PivotTable in a new (or empty) sheet:

5) The empty Pivot Table will be shown as follows:

6) In this example I drag the fields I want to check for duplicates (records/observaions) down to rows (Rijen). At the box values I drag a random field (in this case indfmt). On the left side of the screen the result will be presented:

7) In this example I click the field gvkey and choose the option to change the field settings

8) At subtotals & filters mark the second option No(ne)

9) In the tab Format & Print (Indeling & Afdrukken) mark the option: item labels in table format (Itemlabels in tabelvorm weergeven). Also make sure to mark the option to repeat item labels (Itemlabels herhalen):

10) The result chould look as follows:

11) Copy the list to a new sheet and use the concatenate function (Tekst.samenvoegen) to create a combination of GVKEY and year:

12) Create a similar link list with the Concatenate function in the original Compustat datasheet

13) Now use the VLOOKUP function in Excel in the Compustat datasheet to look up the link in the sheet with the result of the Pivottable. Make sure that the VLOOKUP option says False (Onwaar) at the 4th option as follows:

14) Finally, using the filter option you can find out if there are duplicatesd by selecting everything for which the count is higher than 1. The tricky thing will be deciding what to do with the result.

N.B.: If a dataset has over 100.000 observations the process described above will take some time as Microsoft Excel will require significant processing power from the computer. For larger datasets I reccomend using Stata.


Thomson Eikon & Event windows

Not too long ago I was working with the successor of Thomson 1 / Thomson One which is called Eikon or Thomson Eikon. This product is web-based and is similar in look and style as Bloomberg. I found out that in most cases for me the Excel add-in was the most useful part and allowed me to download larger amounts of data.
The specific Add-in did, however, miss a key feature which I use heavily in Datastream: there is no Request Table tool to allow event studies with changing dates and time windows. Using the add-in I was able to figure out how to still do something like this using the commands which Thomson inputs in your Excel worksheet to call up the data.

With a bit of help from the Eikon Helpdesk I was able to come up with a template that kind of does the same thing as a Request Table Excel tool from Datastream. You can download the example here.

Explanation of the worksheet:

0) Before using the worksheet template, make sure that the connection to Eikon on your computer is not yet live. Otherwise the sheet will immediately start updating as the Add-in and Eikon code is usually set up this way.

1) The first sheet is called Events
As the name already suggests, this is where you put the codes for the series (column A). In columns B and C the Start and End dates for the series go. If you need an exact number of trading days (of data) you can use the formulas from a previous post.
If the exact number of days need not be the exact same column D shows the number of calendar days between the start and end days. This is done by deducting the end date from the start date. This works fine as dates are also numbers in Excel. Cell E2 shows the highest number in the entire column D using formula =MAX(D:D)

2) The second sheet is Data
The example sheet is used to download the Eikon data into and is based on generated Eikon code(s).

Cell D1 in sheet Data is linked to Cell E2 from the Events sheet and will show the same value with the formula: =Events!D2
This number is important as this indicates the (maximum) number of rows for each new window to start downloading data without overwriting the data from previous (downloaded) series. This may cause empty rows between series/data, but these empty rows can be deleted afterwards

The formulas in Column C are used to calculate the row/cell numbers for each separate series to start downloading: =((B4-$B$3)*$D$1)

This is also based on the generated number of events in columb B: =B3+1.

The Helper Column is column A. This column is needed to get the download destinations to be included later in the Cells with the Thomson Eikon codes. There is just a technical reason for this, otherwise the Eikon code does not accept the generated destination cells. The formula is: =ADDRESS($C4;6) where the $C4 is the number of rows it takes from column C and 6 indicates the column number where the download should start.

The download code from eikon is as follows: =TR(Events!$A2;”TR.PriceClose”;”Frq=D SDate=#1 EDate=#2 Curn=EUR CH=Fd;date;IN Transpose=Y NULL=NA CODE=ISIN”;INDIRECT($A3);Events!$B2;Events!$C2)

  • This formula takes the series data from the Events tab in column A where it says Events!$A2
  • The start & end dates are included at the end where it says: Events!$B2;Events!$C2
  • The data download cell destinations for the events are indicated where it says: INDIRECT($A3)

You just copy everything downwards (including the Eikon formula) depending on the number of events in the dataset.

3) When you are done copying the events in to the sheet Events and copying the necessary formulas in the Data sheet, you can then go live by signing in into Eikon. You will see that the data download starts automatically.

N.B.: The example download sheet downloads the closing price for series with the Eikon datatype code TR.PriceClose. If you need another datatype this part needs to be changed in the formula.


Working with Compustat Execucomp tenure data

Not too long ago I had a question from someone who was having trouble working with data from Compustat Execucomp. He wanted the yearly tenure for a specific group of people with the function of Chief Financial Officer. The research spanned a period of 2009-2014 (post-crisis). The data that was downloaded looked something like this:

Step 1: Data cleaning
One of the first steps to take in this case is to make sure to have the right kind of data to work with. In this case the columns H and I needed to be checked and cleaned. In column I you see the date when a person left as CFO working for the company. In this situation we see items like n/a where the data is unavailable and this means that the person still continues to work as CFO for the company. We first need to replace such values with the value 2014 for the last year of our research as we are looking for the tenure within the time frame 2009-2014. Any other years after 2014 can in this case also be replaced with 2014. You can use the search and replace function in Excel to do this step by step. Afterwards you can use the Filter option in Excel to check for weird data or outliers. In principle you have to check both columns with start year and left year to be sure there are no outliers (weird values).

In column H you see the year when a person joined the company. I am assuming that this was also the startyear for each person when he came to work as CFO at the company (I have not personally checked this). You see in the screenshot that not every year is seen as a numerical value: Excel shows little green triangle dots in the cells where it thinks the data is text. To ensure that a year is seen as a numerical value you can add a new column and use a trick to create numerical values in this column: devide cell by 1. See screenshot column J for the original data and column K for the new years. In the top left corner you see the “formula” you can copy downwards for all years.

Step 2: Calculate tenure for each year
In the example for this blog I only calculated the tenure for the final years of the research time frame (2013 and 2014). You can figure out the formulas for the other years. First I started calculating the CFO tenure for 2014. In this example I assume that if the startyear matches the lastyear someone has worked in this capacity for less then a full year making the tenure less then 1 and thus zero. In this case I get the right number of years of the tenure by substracting the startyear from the lastyear (= research year 2014). See example:

Now for the tenure of the previous year (2013) the If statement comes in handy to figure out the tenure for this year. The full formula is:

K3 = start year tenure within the research window (or before)
L3 = last year for the research window (2014)
P3 = tenure for the final year of the research window (= 2014)

The formula in essence does the following: if the previous year (in this case 2013) matches the start year (or is smaller), then the tenure is that of 2014 minus 1. If not then put the word False there. This last condition prevents outliers from causing problems. Screenshot:

The same formula can also be used for the previous years. All you need to do is change the formula for the right numbers. 2012 example: =IF(K3<=(L3-2);(P3-2);FALSE)

Step 3: Figure out the relevant years
This step is essentially not necessary as the filter option of excel is already available to make an annual selection by year but you then have to add the tenure year manually for each year after copying the relevant tenure data by year (to a new sheet).
The formulas in step two will provide a tenure of  0 (or more) as long as the end year for the tenure (within the time frame 2009-2014) is equal to (or higher) then the start year for the tenure (within the time frame or earlier). To know the tenure by year we create columns to show which tenure applies to what year. That allows us to use a filter in Excel to more easily get the relevant data where there is a tenure of more then zero. I created the columns N and O to get the tenure years for 2014 and 2013. The formula I used for 2014 is: =IF((P4>0);RIGHT($N$1;4);FALSE)
where: P4 = calculated number of tenure years for 2014. I have put the year in the name in the first cell at the top of the column (first) as the last 4 digits making it possible to use $N$1.

For 2013 all you need to change is the cell P4 into Q4 and the header $N$1 into $O$1. You also need to put the year in the name of the variable at the top of the column). Subsequent years work the same way.

Step 4: Filter the data for the relevant years
As the final step you can now use the standard filter option to copy the relevant data by year to a new sheet.

I would then also remove irrelevant data for other years which do not apply to the specific year I have filtered for. The end result would look something like this:


Abnormal Return & market choice

Today I had a student who wanted the abnormal return for a couple of lists of stocks but wanted to choose different markets depending on the stock. To get the Abnormal Return from datastream you can use the following formula:


I posted about the formula earlier and the post explains that the LI stands for the comparison Local Market Index. In this case that is not the index that was needed for some of the lists that the student wanted to use, because, LI usually is the index of the main exchange where a company is listed. See the previous post on how to find out which index for each equity this is.

To change the formula, you can exchange the market by replacing the X(LI) with a Datastream mnemnonic for an index. Example for the Dutch midkap index AMSMKAP:


If you also need to change to multiple indices/markets for a list the same method can be used to generate formulas like I did for generating a list of BETA formulas. You can download an example here for a list of Abnormal Return formulas. Using such a list an example Request table sheet in Datastream would look like this:

The same thing also applies to other Datastream formulas, like the Alpha formula:


You can download an example Excel sheet here as well.


Restoring unique codes

In the past I blogged about the fact that if you work with codes like SEDOL (or any other types), to get information on listed companies, it is important to keep them in tact. Excel, for instance, has a habit of removing zeros when they are the first digits of a code.
Some databases (like SDC Platinum) will not give you a good search result if codes are incomplete when you search using a list of these abbreviated codes. Some databases are more forgiving when you search on codes and the zeros in front are gone.
The second, more important reason, however, to keep the codes in tact (including zero digits in front) is, that when you combine datasets, you will probably use codes to match them. Example:

For whatever reason, when you use Excel, you may have inadvertently have codes in a dataset that are no longer in tact. In the earlier blog post I solved this problem using a nested IF() function for Sedol codes. This example could be adapted for longer codes.
Recently, a student of the Vrije Universiteit, pointed out a more elegant solution: using the REPT() and LEN() functions. The advantage of this solution is, that it can very easily be adapted for codes of a different length. The solution function is:


Explanation: The function will repeat/add the character Zero (= 0) to the characters in a Cell and the number of zeroes depends on the X minus the Length (= len() ) of the number of characters already in the Cell. Added to this will be the characters of the original Cell.

Example for Sedol codes (always 7 digits long):

If the abbreviated Sedol code is: 68G78
The Function would be (Sedol code is 7 digits long): = REPT(“0”;7-LEN(Cell))&Cell
The result would be: 0068G78

Below you see an example for abbreviated 9-digit CUSIP codes:



Using Excel to download the BETA

In a few earlier posts I explained how to use the formula to get the BETA from Datastream using the following example formula:


The formula can be used and in Excel you can generate a list of formula’s. I explained this in a follow-up post. The important thing to remember is that the list you can generate this way can only be used in a Request Table type search.

To make it easier I have now posted an example Excel sheet that allows you to more easily create a list of equities (with the markets/indices). You can download it here.

Please remember that this excel worksheet is just an example! I randomly used equities and indices. When using the formula it is necessary to choose the right list of equities and indices/markets that are right for your research!


Creating Dummy items with data using Excel

Not long ago I got the question on how to create dummy numbers for a dataset. The student who asked me this wanted to use these dummies in a regression analysis. In some models you may only want to have a single value of 1 or 0 but in this case the student wanted to create four different categories that are equal/similar in size. So, I used Excel to create the 4 groups. I am not sure if this makes sense from a mathematical point of view but I leave that up to the students.

First I created a small dataset with monthly price data for 11 randomly chosen Dutch stocks (fake example). To make the test even more challenging I made sure that there were also missing values because when you are working with larger datasets this might happen and you cannot always check the dataset for it. Below you see the example:

The example that I worked on needed to be divided in 4 groups and the groupings are different depending on the number of available stocks as well as the different price values. In Excel we can use the QUARTILE() function to let Excel calculate from a certain range where the groups should end. When the formula is used you need to indicate which range to use and the value of the Quartile: =QUARTILE(RANGE;VALUE). For the four Quartiles this could be

In the example screenshot these are Cells M2 to P2:

As the next step I needed to create a conditional statement for 2 scenarios:

  • What to do when there is no data
  • Which value (of 1,2,3,4) to show depending on the calculated Quartile.

In Excel I can do this with three functions:
IF(Condition;Then;Else) = function to tell Excel what to do when
AND(Condition1;Condition2; …) = function to use more then 1 condition
LEN(Cell) = function to allow for missing values

When I started out I begin by telling Excel what to do when data is missing, i.e. when LEN(Cell)=0. If I have no data in a Cell I indicate that the text should be presented: NODATA. I put these between Quotes to indicate to Excel that this is Text information.
The following steps indicate what value (of 1,2,3,4) should be shown depending on the comparison of the Price in a cell with the calculated 4 quartiles. In order for me to copy these formulas (downwards and to the right) I locked the columns M,N,O and P with the dollar sign before the letter of these four columns. The full formula looks as follows:


Here you see a screenshot of the result using the formula:

As a final step I did a quick check to see how many stocks would get values 1,2,3 or 4 using the COUNTIF() function. I need to create a column for each group, and therefore use COUNTIF(RANGE;VALUE) and get in the example:

It looks like you get a (roughly) equal number of Stocks with values 1-4. This depends of course on how many stocks you have and how many missing values for each stock there are. See example screenshot:

This is a small example of the functions that you can use in Excel and you can download the Excel sheet here. I also did a larger test with Daily Frequency data from 2011 to now for 23 stocks and I got a similar good result. You can download this larger example Excel Sheet here.

N.B.: Sometimes the big conditional formula does not wotk as it should and you get values 1-4 when cells are empty. Be sure to check whether these seeming visually empty cells are actually empty. To solve this problem you can save the datasheet first as Text (Tab delimited) file, and the open it again in Excel using the Text wizard. When you retry the formulas from this example everything should then work fine. Be sure to always save a copy of the original download/dataset in case of errors.