Excel and filling in blank cells

Recently I got a question on how to arrange data in Excel for an analysis where some data was missing. Below you see an example download from the database Joint Ventures & Alliances from SDC Platinum. Not every row has data, but each deal starts on a new row with the date variable “Alliance Date Announced”. Each alliance also has a unique Deal number (3rd column):

Some programs that you may use to analyse the data need the data repeated for each row (observation) related to that specific instance, without any blanks (unless a variable for a row/observation genuinely contains no data, of course). In the following example the data looks like the previous picture, incomplete:

You want it to look as follows:

Excel has an option that allows you to automatically fill in all the blanks. This works as follows:

Step 1: Make sure that you click on a cell inside the data table

Step 2: In the Excel ribbon (Home tab) click the option: Find & Select > Go To Special

Step 3: In the pop-up mark the option: Blanks

Step 4: Excel now finds all the empty cells in the active table and highlights them

Step 5: Use your keyboard to type the “=” character
Step 6: Now press the Up arrow key on your keyboard
Step 7: Use the key combination: CTRL + Enter

N.B. 1: Please remember that after the final row Excel tends to add another row. In this case you need to delete it again, as it would only add a duplicate observation to the table.
N.B. 2: Dates may be filled in as numbers. This is fine, because in Excel dates really are numbers (number 1 is 1-1-1900). Just use formatting to make sure these numbers are presented as dates again.
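The same fill-down can also be done programmatically. Below is a minimal sketch in Python with pandas, using made-up sample data in the shape of the SDC download; `ffill()` repeats the last non-blank value downwards, just like the Go To Special > Blanks trick above:

```python
import pandas as pd

# Hypothetical mini version of the SDC download: each deal starts on a row
# with a date and deal number; follow-up rows are blank (NaN) there.
df = pd.DataFrame({
    "Alliance Date Announced": ["01/02/2006", None, "01/03/2006", None],
    "Deal Number": ["1234", None, "5678", None],
    "Participant": ["Firm A", "Firm B", "Firm C", "Firm D"],
})

# Forward-fill only the columns that describe the deal, not the
# participant-level columns.
cols = ["Alliance Date Announced", "Deal Number"]
df[cols] = df[cols].ffill()
print(df)
```

Note that, unlike the Excel trick, this never adds an extra row after the final observation.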

Warning: When you use this option to complete the data for each observation, make sure that the correct data is duplicated. In the example output from SDC Platinum (below) you see columns that relate to events but whose data is spread over multiple rows. It may be a good idea to copy parts of the data to separate sheets (for instance using unique event/deal numbers), duplicate the data there and later merge the data tables again. See columns G, I and J.


Is it possible to use RIC codes in Datastream?

Recently a student asked whether RIC codes could be used in other databases to search for more information on companies. In a nutshell, the answer is: up to a point, and with some manual searches many codes can indeed be turned into codes for other databases if you can use Datastream.

RIC codes are Reuters Instrument Codes (Wikipedia). These codes usually originate from Thomson Reuters databases like Thomson One, Eikon or Thomson Reuters News Analytics. The RIC codes for companies usually have a root code and a suffix for the market of a listing. If a company has multiple listings the root stays the same but the suffix indicates a different market. A theoretical example: the IBM company has the root code IBM, and for a listing at the London Stock Exchange .L would be added to get the RIC code IBM.L. If the listing is at the New York Stock Exchange, .N would be added to create IBM.N.

In this case the student got the list from a supervisor and was asked to get some data for these companies. The trouble was: most databases cannot work with RIC codes, as these are proprietary and are not included as codes. If your university/organisation does not have a license for Thomson databases (like the ones mentioned above) but does have a Datastream license, the codes can still be (somewhat) converted into codes for this database. Although Datastream is a Thomson Reuters database and RIC codes (or Thomson One codes) are recognised to some extent, Datastream sometimes does not provide output or does not return the RIC codes.

Here follows, step by step, how I managed to convert the original RIC codes into codes that Datastream could (mostly) recognise and return data for:

1) Create a new list of codes where the root codes are separated from the market codes. In Excel this can be done with the Text to Columns function on the Data tab. Important to remember here: not all market codes start with a dot. Some codes start with the “equal to” character =. In Excel this character usually signifies the start of a function. Be sure to use the Text to Columns option, separate everything out in one or more rounds, and have every column separated as Text.

2) Create another column with the root codes separated from the market codes in a similar way, again using the Text to Columns option. This time you do not separate the RIC code into two Text columns: the first column, which carries the root code, should now be left at “General“:

Some company RIC codes start with numbers that have leading zeros. These zeros are part of the code and need to stay there: when converting RIC codes to codes for Datastream these zeros sometimes still need to be present. In essence we get two columns, so you can choose either the original RIC root codes or the abbreviated ones when necessary.

3) Now create a new column where you recreate the original full suffix code, including the dot or “equal to” (=) character. The end result should look as follows:
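The root/suffix split from steps 1–3 can also be sketched outside Excel. Here is a minimal Python version; the sample codes are made up for illustration, and keeping everything as strings preserves any leading zeros:

```python
def split_ric(ric: str):
    """Split a RIC into its root and market suffix.

    Market suffixes normally start with a dot (IBM.N), but as noted
    above, some start with the 'equal to' character (=).
    """
    for sep in (".", "="):
        if sep in ric:
            root, _, market = ric.partition(sep)
            return root, sep + market
    return ric, ""  # no market suffix present

print(split_ric("IBM.N"))    # ('IBM', '.N')
print(split_ric("0005.HK"))  # ('0005', '.HK')
```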

4) The next step is finding out which RIC market codes represent which exchanges. After a few searches I found an excellent site from SIRCA that provided me with a list. I copied and saved this as a text file and imported the text file with each column set to Text. I did not copy the list into Excel directly from the website, as this affected some market codes: specifically the ones that start with the “equal to” character =.
The end result looked as follows:

5) As a next step I matched the market codes from the original list against the list of market codes, using the Excel function =VLOOKUP() three times. The result looks as follows:

6) The final preparation step is to create two more columns, in case I need to change codes from a specific market/exchange and need to add a suffix code, a prefix code, or both. In the final column I can then use the Excel function =CONCATENATE() to create alternative codes for the RIC codes and have Excel automatically add prefixes or suffixes as required for a list of codes.
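The concatenation step can be sketched as follows in Python. The prefix/suffix rules per market are hypothetical placeholders here, since the real mappings have to be found by trial and error (see step 7 below):

```python
# Hypothetical mapping: market suffix -> (prefix, suffix) used to build
# a candidate Datastream code. The real rules differ per exchange and
# must be established by trial and error.
rules = {
    ".N": ("U:", ""),   # assumption: a U: prefix for NYSE listings
    ".L": ("", ""),     # assumption: London roots pass through unchanged
    ".T": ("J:", ""),   # assumption: a J: prefix for Tokyo listings
}

def candidate_code(root: str, market: str) -> str:
    """Combine a RIC root with the prefix/suffix for its market."""
    prefix, suffix = rules.get(market, ("", ""))
    return prefix + root + suffix

print(candidate_code("IBM", ".N"))  # 'U:IBM'
```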

7) The last step involves trial-and-error searches to find out, for certain markets, which prefix or suffix codes can be added to create either Datastream Mnemonic codes or Local codes that Datastream can actually work with. Example:

Warning 1: Remember that some exchanges use numbers as codes for companies and may require the codes either with or without the leading zeros. Choose the relevant Ticker column A or B (which we created) when the Mnemonic or Local code requires it.

Warning 2: Most exchanges use common prefixes or suffixes but not all of them: for some exchanges you may need to look up all codes!

8) Using the list of newly created codes we can now do a Static search in Datastream to find out which codes are still current and which are not: the original RIC code list may contain codes that have changed, and some codes may have been deleted (even though the company data is still there). First copy the new codes to a new sheet (or column) and paste them as values to get rid of the Excel formulas. Then upload them using the Datastream option: “Create List (From Range)“:

9) Next, do a Static Request search and use the list to get back some basic data that shows which companies are associated with these codes. In the example search I requested the following datatypes: DSCD (Datastream code), NAME and ISIN code. The list of codes that we uploaded in step 8 can be found/used by clicking the List Picker button below the orange Find Series button.

If the original list of RIC codes is a recent one, chances are you will get many valid codes back, which can then be used to get actual data from Datastream or another database. The result looks something like this:

N.B.: When you have access to databases like Amadeus, Orbis, Orbis Europe or Compustat, you can also try uploading the RIC root company codes as tickers and see whether they are recognized there as well. This method is more quick & dirty (and tricky), as you do not know whether the right listings and ticker codes will be available for the companies.


Excel and duplicates in a dataset

In the past I posted items on duplicate data in datasets (from Compustat databases) and how to detect it using programs like Stata. For smaller datasets the program Microsoft Excel can also be used to investigate your datasets. At some point, when merging datasets from multiple sources, it may happen that you get duplicate data. Using Excel features like a pivot table (draaitabel) and VLOOKUP (verticaal zoeken), duplicate data can be detected as follows:

1) First you need to tell Excel that your dataset is a specific, delimited table, using the menu option: Insert > Table

2) When you insert the table you need to indicate the range and whether there are any column headers

3) The same Insert tab (Invoegen) has the option to create a summary PivotTable (Samenvatten met draaitabel). Select this option at the top left corner:

4) Create the PivotTable in a new (or empty) sheet:

5) The empty Pivot Table will be shown as follows:

6) In this example I drag the fields I want to check for duplicate records/observations down to Rows (Rijen). Into the Values box I drag a random field (in this case indfmt). On the left side of the screen the result will be presented:

7) In this example I click the field gvkey and choose the option to change the field settings

8) On the Subtotals & Filters tab, mark the second option: None

9) On the Layout & Print tab (Indeling & Afdrukken) mark the option: Show item labels in tabular form (Itemlabels in tabelvorm weergeven). Also make sure to mark the option to repeat item labels (Itemlabels herhalen):

10) The result should look as follows:

11) Copy the list to a new sheet and use the CONCATENATE function (Tekst.samenvoegen) to create a combination of GVKEY and year:

12) Create a similar link list with the Concatenate function in the original Compustat datasheet

13) Now use the VLOOKUP function in the Compustat datasheet to look up the link in the sheet with the PivotTable result. Make sure that the 4th argument of VLOOKUP is set to False (Onwaar), as follows:

14) Finally, using the filter option you can find out if there are duplicates by selecting everything for which the count is higher than 1. The tricky part will be deciding what to do with the result.

N.B.: If a dataset has over 100,000 observations the process described above will take some time, as Microsoft Excel will require significant processing power from the computer. For larger datasets I recommend using Stata.
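For larger files, the PivotTable + VLOOKUP routine above boils down to a grouped count on the GVKEY–year combination. A rough equivalent in Python with pandas (the sample data is invented) would be:

```python
import pandas as pd

# Invented mini Compustat extract: gvkey 1001 appears twice in 2010.
df = pd.DataFrame({
    "gvkey": [1001, 1001, 1002, 1003],
    "fyear": [2010, 2010, 2010, 2011],
    "indfmt": ["INDL", "INDL", "INDL", "INDL"],
})

# Flag every row whose gvkey-year combination occurs more than once,
# i.e. the rows the "count > 1" filter would show in Excel.
dupes = df[df.duplicated(subset=["gvkey", "fyear"], keep=False)]
print(dupes)
```

Deciding what to do with the flagged rows remains, as above, the tricky part.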


Using Stata to count a sequence

Not long ago a student asked me how to calculate the tenure of the auditing firms attached to companies within a certain time frame, up to a specific year. The student was interested in the effect of auditor tenure on companies. Auditor rotation has become an interesting research subject in the field of accountancy, as in some countries firms are required to change their auditor every few years to ensure auditor independence.

Using the Stata program this can be done with a script that identifies “spells”. My colleague Matthijs de Zwaan helped me with this and created an example script based on an article in the Stata Journal.
The script I made is based on that example and can be adapted to count other kinds of sequences in datasets. I also added a few lines of code to show how a dummy can be created that indicates whether the auditor in the example Excel file was a Big Four accounting firm or not. The script can be downloaded here and looks as follows:

The end result (after removing duplicate data) gives the tenure of the last auditor (latenure) for each firm in the year 2008:
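For readers without Stata, the spell/tenure logic can be sketched in Python with pandas. The firm–auditor data below is invented, and the column names are assumptions, not those of the actual example file:

```python
import pandas as pd

df = pd.DataFrame({
    "firm":    ["A", "A", "A", "A", "A"],
    "year":    [2004, 2005, 2006, 2007, 2008],
    "auditor": ["KPMG", "KPMG", "EY", "EY", "EY"],
})
df = df.sort_values(["firm", "year"]).reset_index(drop=True)

# A new spell starts whenever the auditor changes within a firm.
changed = df.groupby("firm")["auditor"].shift() != df["auditor"]
df["spell"] = changed.groupby(df["firm"]).cumsum()

# Tenure = number of consecutive years within the current spell.
df["tenure"] = df.groupby(["firm", "spell"]).cumcount() + 1

# Big Four dummy, analogous to the extra lines in the .do file.
big4 = {"KPMG", "EY", "Deloitte", "PwC"}
df["big4"] = df["auditor"].isin(big4).astype(int)
print(df)
```

Taking the last row per firm then gives the tenure of the current auditor in the final year (here: 3 years for firm A in 2008).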

N.B.: In the .do file the location for all files is the I: drive. You may need to change the drive letter in the original script to (for instance) c: or H: to get it to run. Make sure both the script file and Excel dataset are in the same location.


Using Stata to count segments

At the end of March I was asked how to use Compustat North America segments data to get aggregated counts of business segments or geographic segments. The business segments count was to be used as an indicator of diversity: how many different types of activity a company included in its activities. The geographic segments count was to be used as an indicator of how widespread these activities were geographically for each company.

Specific important commands that are needed:

generate year=year(datadate) > with this command you get a year variable which can be used to count instances of segments. This is only needed if no year variable is already available (like fiscal year / fyear).

drop > with this command you delete all non-essential variables from the dataset

sort gvkey year > this command sorts the observations first on gvkey (= Global Company Key, which uniquely identifies a company in any Compustat database) and then by year. (Note: the similar command order gvkey year only changes the order of the variables/columns, not of the observations.)

duplicates drop > this command deletes any duplicate annual observations. This is important, as the count should only involve unique segments

by gvkey year: egen segmentcount = count(sid) > this command generates a new variable (segmentcount) and gives it the value of the count of the segment id codes (SID) for each company and individual year.
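Put together, the commands above amount to “count the unique segment IDs per company-year”. A rough pandas sketch with invented sample data (each comment names the Stata command it mirrors):

```python
import pandas as pd

# Invented segments extract: gvkey 1001 has 2 segments in 2010, 1 in 2011.
seg = pd.DataFrame({
    "gvkey": [1001, 1001, 1001, 1002],
    "datadate": pd.to_datetime(
        ["2010-12-31", "2010-12-31", "2011-12-31", "2010-12-31"]),
    "sid": [1, 2, 1, 1],
})

seg["year"] = seg["datadate"].dt.year          # generate year=year(datadate)
seg = seg.drop_duplicates()                    # duplicates drop
seg = seg.sort_values(["gvkey", "year"])       # sort gvkey year
# by gvkey year: egen segmentcount = count(sid)
seg["segmentcount"] = seg.groupby(["gvkey", "year"])["sid"].transform("count")
print(seg)
```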

To combine the business segments count dataset with the geographic count dataset, a unique ID (UID) is created, which is used later to merge the datasets back into a single dataset.

Overall the script (.do file) I created does three things:
1) It creates a new dataset with business segment counts
2) It creates a dataset with geographic segment counts
3) It merges both newly created datasets into a single dataset
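Step 3, the merge on a unique ID, could look like this in pandas (the column names are assumptions based on the description above):

```python
import pandas as pd

# Invented outputs of steps 1 and 2: one count dataset each.
bus = pd.DataFrame({"gvkey": [1001, 1002], "year": [2010, 2010],
                    "bus_segments": [3, 1]})
geo = pd.DataFrame({"gvkey": [1001, 1002], "year": [2010, 2010],
                    "geo_segments": [2, 4]})

# Build the UID from gvkey + year, then merge the two count datasets on it.
for d in (bus, geo):
    d["uid"] = d["gvkey"].astype(str) + "_" + d["year"].astype(str)

merged = bus.merge(geo[["uid", "geo_segments"]], on="uid", how="outer")
print(merged)
```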

Example script screenshot:

The example dataset with .do script file can be downloaded here.

Example result screenshot:

N.B.: In the .do file the location for all files is the U: drive. You may need to change the drive letter in the original script to (for instance) c: or H: to get it to run. Make sure both the script file and Stata dataset are in the same location.


Datastream Navigator update 4.6

Recently Thomson announced that the Navigator tool in Datastream will undergo some significant changes. The changes concern the addition of some important search filters that make it easier to find and select the right series and datatypes (variables).
Some of these filter options are older ones that have returned from a previous edition of the Datastream Navigator.

The main changes are listed in the document that they have released, which you can also find here.


Thomson content update February 2017

At the end of January a change was made that has impacted the content of SDC Platinum, Thomson One and the newer Eikon product. Thomson Reuters uses league tables that capture market/deal activity for the mergers and acquisitions, equity issuance, debt issuance, syndicated loans and US municipal bond markets. The rankings in the league tables are (in part) based on imputed investment banking fee variables.

“Investment banking league tables rank investment banks on their level of involvement in specific types of transactions. An individual league table might, for example, rank the investment advisors of merger and acquisition deals (usually by aggregate deal volume) and another might rank the underwriters of equity or fixed income offerings.” (Source: SC Johnson College of Business website)

The Thomson Reuters historical imputed fee data & history was created and added to the Thomson database(s) from 1998 onwards. Freeman & Company has been an exclusive content provider of investment banking imputed deal fees to Thomson Reuters since 2005. Freeman’s fee engine is powered by Thomson Reuters Deals data.

Starting in February 2017, new imputed fee data is created and added according to Thomson’s proprietary StarMine model. The older Freeman imputed fee data will remain available in the databases. Only in a limited number of cases may some of the older data be replaced with newer imputed fee data: historical deals where fee-related fields are updated will also be given a new Thomson Reuters imputed fee.

The following variables are impacted in SDC Platinum databases:

Mergers & Acquisitions database (Deals database):


Global New Issues database (Equity, Bonds & Syndicated Loans):


An update of the SDC Platinum software is required to keep everything up to date and available as usual. The newer Thomson products Thomson One and Eikon are also impacted, but as these products are web-based no software updates on the client side are necessary. The data in the newer products has already been updated. More background information can be downloaded here.