Changing Datastream data & Stata

The past few weeks I have been learning about and working with Stata. This program can handle a lot of data and uses commands to edit data or analyse it. A sequence of commands can be saved in .do files and then rerun as a script. There are many commands available and one of them is very handy when it comes to changing data from columns to rows. It is similar to the transpose option that Microsoft Excel offers for quick changes. It is the reshape command.

In this blog post I will use the reshape command to change Datastream data as an example. Similar work can be done for other downloads from databases like Amadeus or Bankscope.
For downloads it can take a bit of work to change the data and rework it before Stata can be used to merge it with other data. In the case of Datastream this has everything to do with the fact that Datastream does not repeat an ID for each year of data you download (unless using a Request Table search). Example screenshot original download from Datastream (transposed):

The following changes need to be made in Excel:

1) Get the ID back from the download without the extra Datastream codes. This can be done using the function MID(Cell,Start,Number) (= in Dutch versions of Excel: the Deel() function). MID() allows you to get part of the contents back from a Cell. Example:

2) Next you need to create column year headers that start with a text character and then the year of the column (this step is important for Stata later). It can be done smartly using a specific cell for the header and then combining it with the original year above each column as follows with the formula =($A$1&(B$1)). The dollar signs fix certain cells or a row (or column). Example:

3) Now save the Datastream Excel file without these formulas. If necessary use copy > paste special > Values

4) Next we start up Stata and get the file. In this case we prepared the Datastream file nicely and can therefore use the command: import excel using DS-Prepared.xlsx, firstrow
The firstrow option tells Stata that this row has the variable names and lables. Example:

5) When we look at the data with the browse command you see that the data looks basically the same as how it appeared in Excel:

Only two things are important here:

  • The data is not yet formatted as we need it: the numerical data are now strings (red) and need to be changed later
  • To later merge data the company ID’s need to be combined with the years

6) We now transpose the data using the command Reshape. This command has many options and allows you to rework tables. If you need more explanation on how it works you can type the command help reshape and Stata will provide much information on how to use the command. For this example the command is:

reshape long Y, i(CompanyCode) j(year)

In a nutshell this tells Stata that the years for the columns need to be repeated for the observations/records and can be found in the names of the variables. The ID for each company also needs te be repeated for the years of all observations. Example result:

When you use the command browse again you see the result of the data change:

7) We now need to create a unique ID combination to later merge the data. To do this we need to turn the years back from numerical values into strings using the tostring command: tostring year, replace. The final step is creating the unique ID combination UID with the command: gen UID = CompanyCode+year

8) Now there are a few steps left to finalize the Datastream data but not all of it is necessary. One thing is necessary: for Stata the numerical data is still a (text) string and now need to be changed. Use the destring command to as follows: destring Y, replace force

Destring also removes any values that were not numbers (but text) and replaces them with an empty cell (.) because of the option force (this can be dangerous if the content of a cell is a combination of a numerical value and text).

Optional stuff:

  • Use the drop command to remove unnecessary columns
  • Remove observations without data using, for example: drop if mi(Y)
  • The command order can be used to reorder data and put the UID variable in the first column
  • The command rename can be used to rename variables

The end result could look as follows:

To make stuff easier you could create a .do file that contains all the commands in sequence and this allows you to reuse it in similar situations for similar downloads. Here you can download an example .do file that shows examples of all the commands.

This example applies to Stata editions 13 and 14.


%d bloggers like this: