Changing Datastream data & Stata

The past few weeks I have been learning about and working with Stata. This program can handle a lot of data and uses commands to edit data or analyse it. A sequence of commands can be saved in .do files and then rerun as a script. There are many commands available and one of them is very handy when it comes to changing data from columns to rows. It is similar to the transpose option that Microsoft Excel offers for quick changes. It is the reshape command.

In this blog post I will use the reshape command to change Datastream data as an example. Similar work can be done for other downloads from databases like Amadeus or Bankscope.
For downloads it can take a bit of work to change the data and rework it before Stata can be used to merge it with other data. In the case of Datastream this has everything to do with the fact that Datastream does not repeat an ID for each year of data you download (unless using a Request Table search). Example screenshot original download from Datastream (transposed):

The following changes need to be made in Excel:

1) Get the ID back from the download without the extra Datastream codes. This can be done using the function MID(Cell,Start,Number) (= in Dutch versions of Excel: the Deel() function). MID() allows you to get part of the contents back from a Cell. Example:

2) Next you need to create column year headers that start with a text character and then the year of the column (this step is important for Stata later). It can be done smartly using a specific cell for the header and then combining it with the original year above each column as follows with the formula =($A$1&(B$1)). The dollar signs fix certain cells or a row (or column). Example:

3) Now save the Datastream Excel file without these formulas. If necessary use copy > paste special > Values

4) Next we start up Stata and get the file. In this case we prepared the Datastream file nicely and can therefore use the command: import excel using DS-Prepared.xlsx, firstrow
The firstrow option tells Stata that this row has the variable names and lables. Example:

5) When we look at the data with the browse command you see that the data looks basically the same as how it appeared in Excel:

Only two things are important here:

  • The data is not yet formatted as we need it: the numerical data are now strings (red) and need to be changed later
  • To later merge data the company ID’s need to be combined with the years

6) We now transpose the data using the command Reshape. This command has many options and allows you to rework tables. If you need more explanation on how it works you can type the command help reshape and Stata will provide much information on how to use the command. For this example the command is:

reshape long Y, i(CompanyCode) j(year)

In a nutshell this tells Stata that the years for the columns need to be repeated for the observations/records and can be found in the names of the variables. The ID for each company also needs te be repeated for the years of all observations. Example result:

When you use the command browse again you see the result of the data change:

7) We now need to create a unique ID combination to later merge the data. To do this we need to turn the years back from numerical values into strings using the tostring command: tostring year, replace. The final step is creating the unique ID combination UID with the command: gen UID = CompanyCode+year

8) Now there are a few steps left to finalize the Datastream data but not all of it is necessary. One thing is necessary: for Stata the numerical data is still a (text) string and now need to be changed. Use the destring command to as follows: destring Y, replace force

Destring also removes any values that were not numbers (but text) and replaces them with an empty cell (.) because of the option force (this can be dangerous if the content of a cell is a combination of a numerical value and text).

Optional stuff:

  • Use the drop command to remove unnecessary columns
  • Remove observations without data using, for example: drop if mi(Y)
  • The command order can be used to reorder data and put the UID variable in the first column
  • The command rename can be used to rename variables

The end result could look as follows:

To make stuff easier you could create a .do file that contains all the commands in sequence and this allows you to reuse it in similar situations for similar downloads. Here you can download an example .do file that shows examples of all the commands.

This example applies to Stata editions 13 and 14.


WRDS platform: major changes

This week some major changes were implemented in the Wharton Research Data Services (WRDS) platform. These changes mainly concern the search options for databases (step 2) and the selection options (step 3). In essence: no options were changed for the databases. Only the way you used to make a selection has changed and the way you make a selection of variables. These are the most recent (major) changes since July 2015.

The selection screen is now much more compact and straightforward with all the options immediately available and more intuitive:

The selection screen has undergone a much bigger change. Instead of scrolling downwards through the list of boxes with lists of variables, you can now scroll sideways through the lists of variables:

The other two steps in WRDS are usually: Step 1 (Selecting a time period), and Step 4 (Output options). No major changes seem to have happened there. If I notice any more major changes in the interface I will post them here.


Multi-year shareholder lists & Amadeus

In an earlier post I told something about the 2 most usual ways to get shareholder data in a list presentation in the database Amadeus. The first option was through the company full shareholder presentation reports. The second option was by changing the report option to get the shareholders at Year’s End. The third option to get multi-year shareholder data is through a specific list option. Shareholders data can be included in the list the usual way by clicking the Columns option in the middle of the screen (or the Add button) and then selecting the items you want. This will get you a list that looks as follows and only shows the “current” shareholders: If you want to get a multi-year overview you need to arrange this when selecting the data-items for the list presentation. After clicking the Columns option you can scroll to the shareholder items. Step 1: In this example I chose to include the shareholder name as the first item. When you mark the item it is included in the variable list on the right side. Step 2: Click the Modify button that is presented next to the variable name Step 3: A new pop-up screen appears. In this screen there are many options you can choose. Click the button at the top, called Archived Data. Step 4: In the screen that now appears it is possible to choose certain Months + Years that you want to see the archived data: up to 5 years are possible (not including the current option: Last available data). In this example I chose the last 5 years and December: Step 5: After clicking the OK button to conform the selection you will go back to the screen where you can select additional shareholder data items like % Direct Ownership and % Total ownership. When you select a multi-year item Amadeus will automatically include the required number of columns that was selected (at step 4): Step 6: After clicking OK to finish the selection you will get the following list presentation that now includes multi-year shareholder data: N.B.: When you need multi-year shareholder data and you select cetain months you need to be careful when inerpreting the result. In the example above I requested shareholder data for December of each year. This means that Amadeus will show the latest available shareholder data since the last selected month. This does not necessarily mean that the presented shareholder data is from the December month! If there were several reported months (for 1 year) with data for a shareholder, the lastest months’ data for that year will be presented: for instance if there is data for March, June and July, then in this case the data for July will be presented. I therefore recommend that you always include the items Information Source and Information Date. That way, you can always judge how recent the data is for each selected month and which entity reported the information.


New BETA platform: Wharton

The WRDS platform (Wharton Research Data Services) is a web-based interface that allows you to search through many licensed databases and makes it easier to download data from them. The learning curve on how to use the Wharton platform is minimal as all databases can be searched in a similar way. The main effort that you need to expend is: learning the details of individual databases and how the data is made available. Databases that can be searched through Wharton include: Compustat Global & North America, Execucomp, CRSP, Amadeus & Audit Analytics.

The past few weeks it has been made possible to view the new changed BETA version of the WRDS platform which looks much different from the current platform. The old interface looks as follows:

The new BETA WRDS interface looks very different:

Overall, the changes are mostly cosmetic. when you click through to a specific database you still get the same search options that are available in the previous / current version: 1) Select time frame, 2) Make a selection, 3) Choose variables, and 4) Select the output format and date format.

N.B.: The new BETA version can be viewed but as yet I do not know when the current version will be replaced. I expect it to be sometime during the summer or autumn of 2015.


Historical shareholder lists & Amadeus

A database like Amadeus (from Bureau van Dijk) has shareholder data that can go back to 2002. In a previous post I mentioned how you could find out how the types of shareholders could be identified using the codes. If your license / subscription includes the ownership module the information is available for European companies of the industrial type. Amadeus does not include financial type company data like Banks or Insurance companies. If you need these types of companies or a Global geographical coverage you could go for a database like Orbis or Bankscope.

The only way to view the full historical data for the current and previous shareholders for companies, is using the Ownership Data option that appears in the menu (on the right side of the screen) when you call up the basic Snapshot on a company from your results list by clicking on the name.

There are two presentation options for the Shareholder data:

1) History – all data
The first option presents a long list of all recorded data in the database. At the top the current shareholders are listed and below this the previous shareholders:

2) History – at each closing date
This option presents the recorded shareholders in columns by year. Allthough this makes it easier to view the history of ownership, if many changes occur in each year this makes interpreting the data not always easier:

Information on shareholders includes data like: type, percentage direct ownership, total percentage ownership, country, etc. Also included is information on Source Identity and Date of Info. Bureau van Dijk uses many sources for the Shareholder type data and the main sources include:

  • Annual Reports (RS or RM)
  • Private correspondence (PC)
  • SEC Filings (PX or NQ)
  • Stock exchanges (SE)
  • Company web-sites (WW)
  • Telephone calls (TC)
  • Press News (PN)
  • Additional sources (OS)
  • Information providers (IP)

The choice for the data listed as ownership (for each month) depends on the sources. The Ownership Guide also explains how the sources are used and also has a full list of the Information Providers used. This allows you to find out which Source Ident. codes concern what source for each month.


Industry and economic classification systems

Many people who do research on companies and markets use a system like the SIC codes or NAICS codes. Both systems allow you to determine the activities for companies. The SIC code system is the oldest of the two and can be downloaded as a variable for companies in many databases. A company can get assigned one or more SIC codes. Sometimes a primary SIC code is given, which indicates the main activity. In essence, the Standard Industrial Classification (SIC) is a system for classifying industries by a four-digit code. The system was originally developed in the United States in 1937, and it is used by US government agencies to classify industry activities.  Basic information is available on Wikipedia.
If your research covers a long time frame it may be necessary to determine whether changes that occur over time in the SIC system have an effect. Luckily, an older version of the SIC code system is available. The U.S. Department of Labor, Occupational Safety & Health Administration website allows the user to search the 1987 version SIC codes.
A more recent version of the SIC system is available on the SEC website of the Division of Corporation Finance: Standard Industrial Classification (SIC) 2011 Code List.

The SIC system is also used by agencies in other countries. In the United ingdom they have developed their own version of the SIC codes. This United Kingdom Standard Industrial Classification of Economic Activities (UK SIC) is used to classify business establishments and other standard units by the type of economic activity in which they are engaged. The new version of these codes (SIC 2007) was adopted by the UK as from 1st January 2008. Older versions of the UK SIC system are also available online, specifically, the UK Standard Industrial Classifications 2007, 2003, and 1992 versions.

The North American Industry Classification System (NAICS) is the system used by US Federal statistical agencies for classifying businesses for the purpose of collecting, analyzing, and publishing statistical data related to the U.S. business economy. The NAICS system was developed under the auspices of the Office of Management and Budget (OMB), and adopted in 1997 to replace the Standard Industrial Classification (SIC) system. From 2002 the system is increasingly used. Different versions are available through the US Census website.

The NACE-code (Nomenclature générale des Activités économiques) is a code which is largely used in the European Union and its member states use it to classify commercial and non-commercial economic activities. It is mainly developed as a useful instrument when collecting data and publishing economic statistical overviews. Many companies and organizations exhibit a diverse range of activities. Companies only get assigned a single NACE code, however: the code that indicates the primary acrivity which contributes the most to the total added value of a company. The first version of the NACE was created sometime around 1970. The first revision was published in 1990 and was called NACE Rev. 1. The second major revision (NACE Rev. 1.1) took place in 2002. The second revision was intended to synchronize the system with the  “International Standard Industrial Classification of all economic activities” ISIC of the United Nations. The different NACE versions can be found on the Eurostat website. The Dutch national SBI code system (which replaced the original BIK code system) is based on the NACE system.

The International Standard Industrial Classification of all economic activities, abbreviated as ISIC, is a standard used by the United Nations Statistics Division (UNSD). The ISIC is used to classify economic activities so that entities can be classified according to the activity they carry out.
The ISIC classification combines the statistical units according to their character, technology, organisation and financing of production. The ISIC is used widely, both nationally and internationally, in classifying economic activity data in the fields of population, production, employment, gross domestic product and other economic activities. It is a basic tool for studying economic phenomena, fostering international comparability of data and for promoting the development of sound national statistical systems.
The current and older ISIC versions are available on the statistical website of the United Nations.

When you are using a database, always check the help information to find out what version of an industry code system is being used. If a code system is available but no information on what version it is, you should find out by contacting the owner of the database. Some databases collect data but gather only historical (industry code) information from other sources without using their own version of such as system. The SDC Platinum databases have historical SIC and NAICS codes as they were indicated by the sources that Thomson Reuters uses to collect data.

N.B.: In addition to the aforementioned industry codes there are many more code systems that were developed for statistical purposes in specific countries. The database Amadeus has an easy tool in the help section that allows you to translate specific codes to other codes. This tool is available in the Bureau Van Dijk version of Amadeus:


Directors, Board members & Committees in Europe

One of the few databases that cover information on directors and board members as well as some information on senior management staff for European companies is Amadeus. The Amadeus database covers listed and unlisted production companies. No financial type companies are covered here. Part of the information can be called up through separate report-presentations. When you have created a list and click on one of the company names you get the default company snapshot presented. Only when the snapshot is presented, is it possible to make a selection for the specific snapshot/report on board members, directors and management staff data.

There are two versions of the management report:

  • Report (Current Directors / Managers / Contacts)
  • Report (Previous Directors / Managers / Contacts)    .

See example screenshot:

Saving these reports with all the data is possible but the limit for downloading the data is: 50 reports at once. If you need reports for more companies you will need to export repeatedly in batches of 50 (1-50, 51-100, etc.). The data for each company will be presented in a separate tab/sheet in the Excel download file. Much depends on the company, whether board/directors/managers data is available. The origin of the data is usually the annual report, the company website or data from other news sources. The column with the name/header Body, indicates in what company bodies the person is active. Examples are:

  • AdmDep (Administration Department)
  • AudC (Audit Committee)
  • BoD (Board of Directors)
  • CoGoC (Corporate Governance Committee)
  • HR (Human Resources dept.)
  • NomC (Nomination Committee)
  • OthBC (Other Board Committee)
  • R&D (Research & Development)
  • RemC (Remunaration Committee)
  • SenMan (Senior Management)
  • SupB (Supervisory Board).

Important: The Report presentation does not allow you to export all data (including small biographies). You will have to change the list presentation to do this. Click on the option Columns (above the list) to elect and unselect variables/data for the download. Example screenshot:

NB: The database REACH is a database that covers all registered companies at the Dutch Chamber of Commerce (listed and unlisted). Depending on your subscription a module on Board members & Management may be included. If available this allows you to download similar board and Management data (as described above) for Dutch companies.