Excel and duplicates in a dataset

In the past I posted items on duplicate data in datasets (from the Compustat databases) and how to detect it using programs like Stata. For smaller datasets, Microsoft Excel can also be used to investigate your data. At some point, when merging datasets from multiple sources, you may end up with duplicate data. Using Excel features like a pivot table (draaitabel) and VLOOKUP (verticaal zoeken), duplicate data can be detected as follows:

1) First you need to tell Excel that your dataset is a table with a defined range, using the menu option: Insert > Table

2) After choosing this option you need to indicate the range and whether there are any column headers

3) The same Insert tab (Invoegen) has the option to summarize with a pivot table (Samenvatten met draaitabel). Select this option at the top left corner:

4) Create the PivotTable in a new (or empty) sheet:

5) The empty Pivot Table will be shown as follows:

6) In this example I drag the fields I want to check for duplicate records/observations down to Rows (Rijen). Into the Values box I drag a random field (in this case indfmt). On the left side of the screen the result will be presented:

7) In this example I click the field gvkey and choose the option to change the field settings

8) On the Subtotals & Filters tab, mark the second option: None

9) On the Layout & Print tab (Indeling & Afdrukken), mark the option to show item labels in tabular form (Itemlabels in tabelvorm weergeven). Also make sure to mark the option to repeat item labels (Itemlabels herhalen):

10) The result should look as follows:

11) Copy the list to a new sheet and use the CONCATENATE function (TEKST.SAMENVOEGEN) to create a combination of GVKEY and year:
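For example, assuming the GVKEY values are in column A and the years in column B of the new sheet, the link can be created with:

=CONCATENATE(A2;B2)

or, equivalently, =A2&B2. Copy the formula down for all rows.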

12) Create a similar link column with the CONCATENATE function in the original Compustat datasheet

13) Now use the VLOOKUP function in the Compustat datasheet to look up the link in the sheet with the result of the PivotTable. Make sure that the 4th argument of VLOOKUP says False (Onwaar), as follows:
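A hypothetical example, assuming the link is in column A of the Compustat datasheet and the PivotTable result (link plus count) sits in columns A and B of a sheet called Pivot:

=VLOOKUP(A2;Pivot!A:B;2;FALSE)

Setting the fourth argument to FALSE (Onwaar) forces an exact match on the link.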

14) Finally, using the filter option you can find out if there are duplicates by selecting everything for which the count is higher than 1. The tricky part will be deciding what to do with the result.

N.B.: If a dataset has over 100,000 observations the process described above will take some time, as Microsoft Excel will require significant processing power from the computer. For larger datasets I recommend using Stata.
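In Stata the same check is only a few commands; a minimal sketch, assuming the identifying variables are called gvkey and fyear:

duplicates report gvkey fyear
duplicates tag gvkey fyear, generate(dup)
list gvkey fyear if dup > 0

The first command summarizes how many copies exist; the tag option marks every observation that occurs more than once so you can inspect (or drop) them.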


Using Stata to count a sequence

Not long ago a student asked me how to calculate the tenure of the audit firms attached to companies within a certain time frame, up to a specific year. The student was interested in the effect of auditor tenure on companies. Auditor rotation has become an interesting research subject in the field of accountancy, as in some countries firms are required to change their auditor every few years to ensure auditor independence.

Using Stata this can be done with a script that identifies “spells”. My colleague Matthijs de Zwaan helped me with this and created an example script based on an article in the Stata Journal.
The script I made was based on that example and can be adapted to count other kinds of sequences in datasets. I also added a few lines of code to show how a dummy can be created to identify whether the auditor in the example Excel file was a Big Four accounting firm or not. The script can be downloaded here and looks as follows:
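A minimal sketch of the spell approach (the variable names gvkey, fyear, auditor, and the Big Four labels are illustrative; the downloadable script uses the actual names from the example file):

bysort gvkey (fyear): gen newspell = (_n == 1) | (auditor != auditor[_n-1])
bysort gvkey (fyear): gen spell = sum(newspell)
bysort gvkey spell (fyear): gen tenure = _n
gen big4 = inlist(auditor, "Deloitte", "EY", "KPMG", "PwC")

Each change of auditor within a company starts a new spell; tenure then counts the years within the current spell, and the last observation per company in a given year gives the tenure of the auditor at that point.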

The end result (after removing duplicate data) gives the tenure of the last auditor (latenure) for each firm in the year 2008:

N.B.: In the .do file the location for all files is the I: drive. You may need to change the drive letter in the original script to (for instance) c: or H: to get it to run. Make sure both the script file and Excel dataset are in the same location.


Using Stata to count segments

At the end of March I was asked how to use Compustat North America segments data to get aggregated counts of business segments or geographic segments. The business segments count was to be used as an indicator of diversification: how many different types of activity a company includes in its operations. The geographic segments count was to be used as an indicator of how geographically widespread these activities are for each company.

Specific important commands that are needed:

generate year=year(datadate) > this command extracts a year which can be used to count instances of segments. It is only needed if no usable year variable is available (such as the fiscal year, fyear).

drop > this command deletes all non-essential variables from the dataset

sort gvkey year > this command sorts the dataset first on gvkey (= global company key, which uniquely identifies a company in any Compustat database) and then by year

duplicates drop > this command deletes any duplicate annual data. This is important as the count must involve unique segments only

by gvkey year: egen segmentcount = count(sid) > this command generates a new variable (segmentcount) containing the count of the segment ID codes (SID) for each company and individual year.

To later combine the business segments count dataset with the geographic count dataset, a unique ID (UID) is created, on which the two datasets are merged back into a single dataset (see the sketch below).

Overall the script (.do file) I created does three things:
1) It creates a new dataset with business segment counts
2) It creates a dataset with geographic segment counts
3) It merges both newly created datasets into a single dataset
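A minimal sketch of these steps (the file names and the variable names buscount and uid are assumptions; the downloadable .do file contains the full version):

use bus_segments_raw, clear
generate year = year(datadate)
keep gvkey year sid
duplicates drop
bysort gvkey year: egen buscount = count(sid)
egen uid = concat(gvkey year), punct("_")
bysort gvkey year: keep if _n == 1
save buscounts, replace

* build geocounts.dta the same way from the geographic segments file

use buscounts, clear
merge 1:1 uid using geocounts

The keep if _n == 1 line reduces the data to one observation per company-year, so the final merge can be a clean 1:1 match on the UID.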

Example script screenshot:

The example dataset with .do script file can be downloaded here.

Example result screenshot:

N.B.: In the .do file the location for all files is the U: drive. You may need to change the drive letter in the original script to (for instance) c: or H: to get it to run. Make sure both the script file and Stata dataset are in the same location.


Datastream Navigator update 4.6

Thomson recently announced that the Navigator tool in Datastream will undergo some significant changes. The changes concern the addition of some important search filters that make it easier to find and select the right series and datatypes (variables).
Some of these filter options are older ones that have returned from a previous edition of the Datastream Navigator.

The main changes are listed in the document that Thomson has released, which you can also find here.


Thomson content update February 2017

At the end of January a change was made that has impacted the content of SDC Platinum, Thomson One and the newer Eikon product. Thomson Reuters uses league tables that capture market/deal activity for the mergers and acquisitions, equity issuance, debt issuance, syndicated loans and US municipal bond markets. The rankings in the league tables are (in part) based on imputed investment banking fee variables.

“Investment banking league tables rank investment banks on their level of involvement in specific types of transactions. An individual league table might, for example, rank the investment advisors of merger and acquisition deals (usually by aggregate deal volume) and another might rank the underwriters of equity or fixed income offerings.” (Source: SC Johnson College of Business website)

The Thomson Reuters historical imputed fee data and history was created and added to the Thomson database(s) from 1998 onwards. Freeman & Company has been an exclusive content provider of investment banking imputed deal fees to Thomson Reuters since 2005. Freeman’s fee engine is powered by Thomson Reuters Deals data.

Starting in February 2017, new imputed fee data is created and added according to Thomson’s proprietary StarMine model. The older Freeman imputed fee data will remain available in the databases. Only in a limited number of cases will some of the older data be replaced with newer imputed fee data: historical deals for which fee-related fields are updated will also be given a new Thomson Reuters imputed fee.

The following variables are impacted in SDC Platinum databases:

Mergers & Acquisitions database (Deals database):

  • IMPUTED_ACQ_TOTAL_FEE
  • IMPUTED_ACQ_TOTAL_FEE_AUD
  • IMPUTED_ACQ_TOTAL_FEE_EURO
  • IMPUTED_ACQ_TOTAL_FEE_STG
  • IMPUTED_ACQ_TOTAL_FEE_YEN
  • IMPUTED_TARGET_TOTAL_FEE
  • IMPUTED_TARGET_TOTAL_FEE_AUD
  • IMPUTED_TARGET_TOTAL_FEE_EURO
  • IMPUTED_TARGET_TOTAL_FEE_STG
  • IMPUTED_TARGET_TOTAL_FEE_YEN
  • IMPUTED_INV_FEES
  • IMPUTED_INV_FEES_AUD
  • IMPUTED_INV_FEES_EURO
  • IMPUTED_INV_FEES_STG
  • IMPUTED_INV_FEES_YEN

Global New Issues database (Equity, Bonds & Syndicated Loans):

  • IMPUTED_TOTAL_FEES
  • IMPUTED_TOTAL_FEES_AUD
  • IMPUTED_TOTAL_FEES_EURO
  • IMPUTED_TOTAL_FEES_STG
  • IMPUTED_TOTAL_FEES_YEN

An update of the SDC Platinum software is required to keep everything up to date and available as usual. The newer Thomson products Thomson One and Eikon are also impacted, but as these products are web-based, no software updates on the client side are necessary; the data in the newer products has already been updated. More background information can be downloaded here.


Thomson Eikon & Event windows

Not too long ago I was working with the successor of Thomson 1 / Thomson One, which is called Eikon or Thomson Eikon. This product is web-based and similar in look and feel to Bloomberg. I found that in most cases the Excel add-in was the most useful part for me, as it allowed me to download larger amounts of data.
The add-in does, however, lack a key feature which I use heavily in Datastream: there is no Request Table tool to allow event studies with changing dates and time windows. Using the add-in I was able to figure out how to still do something similar using the commands that Thomson inserts into your Excel worksheet to call up the data.

With a bit of help from the Eikon helpdesk I was able to come up with a template that roughly does the same thing as the Request Table tool from Datastream. You can download the example here.

Explanation of the worksheet:

0) Before using the worksheet template, make sure that the connection to Eikon on your computer is not yet live. Otherwise the sheet will immediately start updating, as the add-in and Eikon code are usually set up this way.

1) The first sheet is called Events
As the name already suggests, this is where you put the codes for the series (column A). The start and end dates for the series go in columns B and C. If you need an exact number of trading days (of data) you can use the formulas from a previous post.
If the number of days does not need to be exactly the same, column D shows the number of calendar days between the start and end dates. This is done by deducting the start date from the end date, which works fine as dates are also numbers in Excel. Cell E2 shows the highest number in the entire column D using the formula =MAX(D:D)
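For example, with the start date in cell B2 and the end date in cell C2 (an assumption about the exact layout of the template), the day count in D2 is simply:

=C2-B2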

2) The second sheet is Data
This sheet is used to download the Eikon data into and is based on generated Eikon code(s).

Cell D1 in the sheet Data is linked to cell E2 from the Events sheet and will show the same value with the formula: =Events!E2
This number is important as it indicates the (maximum) number of rows after which each new window starts downloading data, without overwriting the data from previous (downloaded) series. This may cause empty rows between series/data, but these empty rows can be deleted afterwards.


The formulas in column C are used to calculate the row/cell numbers at which each separate series starts downloading: =((B4-$B$3)*$D$1)

This is also based on the event numbers generated in column B: =B3+1.


The helper column is column A. This column is needed to get the download destinations included later in the cells with the Thomson Eikon codes; without it, the Eikon code does not accept the generated destination cells. The formula is: =ADDRESS($C4;6), where $C4 is the row number taken from column C and 6 indicates the column number where the download should start.
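A worked example with illustrative numbers: if cell $D$1 contains 30 and column B numbers the events 1, 2, 3, and so on, then the formula in column C produces offsets of 30, 60, 90, ... for the consecutive events, and =ADDRESS(60;6) returns the text $F$60, so that event starts downloading in cell F60 (column 6 = F) without overwriting the previous series.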


The download code from Eikon is as follows: =TR(Events!$A2;"TR.PriceClose";"Frq=D SDate=#1 EDate=#2 Curn=EUR CH=Fd;date;IN Transpose=Y NULL=NA CODE=ISIN";INDIRECT($A3);Events!$B2;Events!$C2)

  • This formula takes the series data from the Events tab in column A where it says Events!$A2
  • The start & end dates are included at the end where it says: Events!$B2;Events!$C2
  • The data download cell destinations for the events are indicated where it says: INDIRECT($A3)

You just copy everything downwards (including the Eikon formula) depending on the number of events in the dataset.

3) When you are done copying the events into the sheet Events and the necessary formulas into the Data sheet, you can go live by signing in to Eikon. You will see that the data download starts automatically.

N.B.: The example download sheet downloads the closing price for series with the Eikon datatype code TR.PriceClose. If you need another datatype this part needs to be changed in the formula.


Stata & large downloads from Datastream

In the past I demonstrated how to change a dataset from a wide presentation (= data in columns) to a long presentation (= data in rows) using the reshape command in Stata. This works fine for smaller datasets and when there is no choice in how the data is downloaded. For larger datasets, however, downloading the data in wide format may become a problem, and the Stata software may also have difficulty changing the presentation.

If you download, for instance, daily price data from Datastream for many years (using ISIN codes) for x number of listed companies, the wide (transposed) download format is not a good option. Download everything without transposing instead and you get everything in long format, as follows:

Changing the data to a proper long format for Stata goes as follows:

1) Make a copy of the original data and add a new row at the top.
2) Above the dates in column A, type a general name like “company” in cell A1.
3) In the same row, give each individual company its own number at the top.
4) Insert a new row above the data and type the name Date in column A, above the actual dates.
5) In the cells of this same row, use an Excel formula to generate fake variable names. Example: =$A$1&B1
The copied Sheet should look as follows:

6) Now also make a copy of the first 3 rows and paste the data transposed (using the right-click option) in a new sheet. Example:

7) Use the Excel function =LEFT() to get the ISIN codes (12 characters) in a separate column, or use Search and Replace to get rid of the variable indicator (P)
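For example, assuming a pasted code like NL0000009165(P) sits in cell B1 (the exact cells depend on your sheet), the ISIN can be extracted with:

=LEFT(B1;12)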


8) Copy and paste this data as values to a new sheet. It should look as follows:

9) Make a new copy of the first changed sheet and delete the first few rows. Make sure to copy & paste as values (right-click) to remove any formulas. The sheet should now look as follows:

Now start up Stata and use the sheets in the Excel workbook step by step:

10) First you import the sheet (from step 9) with the price data from Excel using the import excel command (or the menu option File > Import > Excel spreadsheet).

11) Use the reshape command to rearrange the data to long format: reshape long company, i(Date) j(id). After reshaping, rename the variables so that the column with the company numbers is called company and the column with the prices is called price. Next, save this as a Stata dataset.

12) Use clear all to start fresh for the second sheet. Now import the sheet with the company ISIN codes from step 8.

13) Use the merge command to merge this data with the Stata database we created at step 11). The command would look as follows:
merge 1:m company ” c:\ … \filname.dta”, sheet(…), firstrow


14) Now do some data curation: change the prices to numbers (using the destring command) and generate a newly formatted date. Make sure to save this file.
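A minimal sketch of steps 10 to 14 in a single .do file (the file name, sheet names, and date format are assumptions; adapt them to your own download):

clear all
import excel using "c:\temp\download.xlsx", sheet("Step9") firstrow
reshape long company, i(Date) j(id)
rename company price
rename id company
save "c:\temp\prices.dta", replace

clear all
import excel using "c:\temp\download.xlsx", sheet("Step8") firstrow
merge 1:m company using "c:\temp\prices.dta"

destring price, replace force
generate date = daily(Date, "DMY")
format date %td
save "c:\temp\prices_long.dta", replace

The force option of destring turns non-numeric entries (such as NA) into missing values, and the daily() function assumes the dates were imported as text in day-month-year order.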


As an example I provide here a zip-file with an example download in Excel (with the steps on separate sheets) and a .do file that can be run step by step. If you want to use this, be sure to extract both files to the folder c:\temp on your computer. You can also extract them to a different folder; in that case you need to change the file locations in the .do file before running it in Stata.

Important to remember with this example: Stata may still run into problems if you have daily data for many years for hundreds of companies (or more). If there are problems I recommend splitting the download into batches of 100 to 200 companies and then appending the resulting datasets to create a single Stata database.
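Appending the batches is straightforward (the file names batch1 to batch3 are illustrative):

use "c:\temp\batch1.dta", clear
append using "c:\temp\batch2.dta"
append using "c:\temp\batch3.dta"
save "c:\temp\all_prices.dta", replace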

N.B.: A student pointed me to an example of this, which he found at Princeton University Library Data and Statistical Services. I do not claim any credit for the original idea, but the Stata example provided on this blog was made by me.
