Small error in S&P 500 Total Return data

Recently a long time series was needed to use in some exercises and in this case the choice was made to use the the default closing Price (PI) and the Total Return (RI) data for the S&P 500 index. The source for the data was Datastream from Thomson Reuters. The price data from Datastream for the S&P 500 index goes back to December 31st of 1963. The Total Return data from Datastream for this index goes back to January  1st of 1988.

While working with the data a strange outlier in the data was discovered for January 24th 1990. Statistically speaking, the calculated return made a huge jump there. See the example here:

The data was then compared to data from Yahoo Finance (using the Quandl add-in to do a quick download to excel). Using the Yahoo Finance data to calculate the Total Return again this appeared to give a comparable result to the manually calculated RI using the Price data for the index from Datastream.
After contacting the Thomson Reuters helpdesk it became clear that there was a small error in their spreadsheet that is used to calculate the RI: for January 24th an incorrect price was used for the RI calculation: 324,17 instead of 330,26.

Unfortunately, Thomson Reuters will not change this small error for their calculated RI. If you need to go back and have to use the data from that time you might use a corrected RI for this day which could be: 354,53. This number is more in line with statistical variance for the return over a period of 25 year or more. See also:

This goes to show that it always pays off to check for outliers and (when in doubt) to contact the vendor/source and discuss possibilities.


The Quandl Excel add-in

In the past I have already mentioned the data initiative called Quandl. Through the website it is possible to find and download numercial datasets that people or organizations provide. Many of these datasets are provided free of charge. The data can be downloaded in the following formats: Excel, CSV, text, JSON or XML.

The way the data is handled (by Quandl) makes it easy to download it: when data from different sources is provided in different frequencies or for different time frames, it can still be downloaded in the same sheet at the same time! The frequencies and time frames for selected time series are automatically matched when downloading the data. The data is on all kinds of subjects: from large time series on a macroeconomic level to company data on a microeconomic level and both financial and non-financial data.

The creaters of Quandl also offer an Excel add-in which allows you to directly download the data into Excel. The current version is 2.5. This add-in can only be used when you have an Authetication Token which is provided upon free registration on the Quandl website. The installation of the add-in is simple: just like any Add-in through the Excel Options. Once it is installed (and the Authentication Token is added at the Preferences in the Quandl add-in) you can directly search and download data. Quandl offers an extensive help section with short and clear movies that explain how the Add-in works.

Over 500 sources are already available, offering millions of datasets. Some of these sources are:

People who have registered can also upload their own datasets making it a great tool to share data. One of the people who has done this is the renowned Thomas Piketty (Author of Capital in the 21 century): Link to datasets.


Multiple CDS Series & Datastream

A standard subscription to Datastream includes data from CDS series. In a previous blog item I mentioned that there are two default providers of CDS data available:

 – CDS data from CMA (from 2004 – September 2010)

 – CDS data from Thomson Reuters (from 2007 – now).

There are, of course more suppliers / providers of CDS time series data. In 2010 Mayordomo, Peña and Schwartz did some research and wrote an article where they compare some of these providers: “Are all Credit Default Swap Databases Equal?“. The authors concluded that the provider CMA was the best source of reliable CDS data and is less prone to causing deviations (like fat tails: see also the definition and background).
If your research covers a time period that makes it possible to use just one source, that is also preferable as you do not run the risk of additional deviations because of differing methodologies of the providers in gathering the data.

If the research at your organisation leaves you no way out, and you have to use the standard two Datastream sources with CDS data, you can combine the data from these two providers. In the earlier blog post I mentioned the SPLC function/formula which can be used to combine series. The SPLC formula can also be used to look up several series at once.

1) First you need to identify (or searc for) relevant CDS CMA or Thomson Reuters (TR) series.

2) Next the Excel function VLOOKUP (= Vertikaal zoeken) can be used to match CMA and TR series. You have to be careful here as there seem to be more different TR series (with different currencies) that match CMA series. It may be an idea to select the series of the same currencies before matchen them by Mnemonics (= Datastream codes). See example with a small selection of 5 Year CDS series using the Excel file that matches all series (see original blog item):

3) As the next step you can combine the matched series in Excel with the SPLC formula using the Concatenate function (= tekst samenvoegen). See example:

4) The final step is doing a Request Table search to lookup all combined series in one go. You have to copy the formulas in the series column and Paste them as values! As the SPLC formula includes the datatypes SM it is also not necessary to put a datatype in the column for Datatypes! An example Request table can be downloaded here and would look as follows:

N.B.: The example Request Table spreadsheet (above) has the American date-notation: mm-dd-yy. On your own computer you may have to change this the date-notation to the European version: dd-mm-yy (or dd-mm-jj).


Datastream US equity data quality

I recently came across an article that raised the issue of the quality of Datastream data. Ince & Porter (2006) warn researchers for potential flaws in the coverage by Datastream. The authors compared the US coverage of equity data to that of CRSP over the period 1975 and 2002. They found that certainly in the older years a large number of equities are missing in Datastream. An example is that for the year 1975 20% of common equity issues (found in the CRSP database) were missing from Datastream.

The problem is more important for small caps (= lower 20% of stocks measured by market capitalisation). Datastream also suffers to some extent from survivorship bias. The authors suggest a number of methods that alleviate the problems somewhat. More details are available in the article below:

Ince, O.S. & R.B. Porter, 2006, “Individual equity return data from Thomson Datastream: handle with care!,” Journal of Financial Research 29(4), 463-479.


IPO data and quality

One of the most often used sources for data on Initial Public Offerings (IPO) and follow-ons is the databse Global New Issue(s) from Securities Data Company (= SDC, now part of Thomson Financial/Thomson Reuters). While looking for information on IPO’s I came across two websites that report on IPO data and data issues when dealing with data from SDC.

The first website is created by Jay R. Ritter, Professor of Finance at the Warrington College of the University of Florida. A number of historical data files or refferals are listed/offered that provide additional data or alternative sources. The oldest data available is from the Nineteen seventies.

The second website is created by Alexander Ljungqvist, Professor of the Stern School of Business at the New York University. He has compared data from IPO’s with data gathered from SDC Platinum’s Global New Issue database and reports on problems for shares outstanding, venture backing, and syndicate size.