NYSE TAQ data & Python

Some of the databases at our university are made available through the WRDS (Wharton Research Data Services) platform. The data that is hosted at the WRDS servers can also be used with an API through programming languages like Python. Not just through programs like Spyder and PyCharm but also through a Jupyter Notebook.

Step 1: Set up access to WRDS
Using Python to access databases on WRDS servers takes a few steps which are also described on the WRDS platform.

  • Register at WRDS to request a WRDS account
  • Make sure to install the wrds module in python: this can usually be done at the python or Anaconda prompt using a command like: pip install wrds
  • Create a file called (literally): .pgpass . This is a plain text file that contains 5 elements: wrds-pgdata.wharton.upenn.edu:9737:wrds:accountname:accountpassword where the last two items should be replaced with the actual WRDS account user name and password. Example: wrds-pgdata.wharton.upenn.edu:9737:wrds:pietjepuk:123456
  • I do not know how this works for Mac computers, but for a Windows computer make sure to put this file in your Windows computer user account folder. This will look similar to C:\Users\inlogaccountname\ where inlogaccountname is the name of the account you use to log into your Windows computer.
    I personally use PyCharm for running Python programs and when using the wrds module it may also generate a .pgpass file automatically without first creating it and put it in the right location on the computer.

Step 2: Use the wrds module with Python
In each program where you wish to approach and download data from databases like NYSE it requires you to call the wrds module using: import wrds . Another line that is required, is, actually connecting to the WRDS platform using the account, for example:
conn = wrds.Connection(wrds_username=’accountname’) . The example I use here is for the NYSE Trades and Quotes database and this contains High Frequency data on equities listed at the NYSE exchange. Because of the fact that the data is so huge, it is divided up in several databases/tables each year.

Now before you start pulling data from the wrds servers, you need to check a few things:

a) What databases / libraries do you have access to? The institute that you work at needs to have a subscription (or license) to be able to access anything. Through the front end (web based) it should be easy to find out what you can access or download data from.
b) Test a script and figure out what libraries are available from which NYSE libraries. It is possible to do this using commands like: print(sorted(conn.list_libraries())) where conn is a specified variable to allow you to connect to wrds. If you already know what library you are interested it is a good idea to make a list of the available tables, using for instance: full_list = (sorted(conn.list_tables(library=’taqmsec’)))
c) Use the front end description of tables in WRDS to determine which variables you want data for and include these in the script.

You can download the example here. If anyone wishes to re-use the Python scripts and adapt them, please feel free to do so.

N.B.:
1) The example Python program that you can download here is slightly altered to keep my personal wrds account safe.
2) You need to change the line with the destination directory/folder (sbase) to an existing drive letter and folder that applies in your situation
3) The script allows you to download data from one or more companies from multiples tables. In the example I limited the number of tables to just show a working example. It is possible to call data for many tables consecutively. It is also possible to just hard code the Ticker list in the script.

Nasdaq Data Link & Python

Datasets can sometimes be downloaded and also approached using programming languages like Python and R. Both are open source software and much information is available online (or in print) to help get you started using them (Software carpentries are an example).
I recently got interested in Python programming because at the university where I work increasingly the trend is towards Open Science and Open Software like Python (and R). The Datasets in Nasdaq Data Link seemed like a fun project to try and use for some initial testing and learning. It has a simple API that can be used to download data. Without registration the API can be used 50 times for free daily. After registration (also free), there is no limit of 50.

After installing Anaconda and Python I started learning the basics through the software carpentries using Jupyter NoteBooks and PyCharm (both freely available). In the end I preferred working with PyCharm and continued using this. I started working with an example program using pandas that was demonstrated on Youtube by Giles McMullen-Klein. The original example program was small and can be downloaded here.

The program works fine and I initially made a small version of it to download other data to test it out changing it using a BitCoin USD exchange rate dataset

This seemed to work fine and I decided on a more ambitious project where:

  • The number of series was not determined in the script
  • The specific series needed to be input manually (using Ticker symbols)
  • You get the choice to just look at part of the data in a graph, or all the data.

It took some figuring out and in the end I created a script that did just that for American listed companies. This script can be downloaded here. I was happy with how it worked but not entirely happy with the dataset. I got errors for recent years and some companies. In the end this turned out to be a limitation of the dataset: it covers approx. 3000+ US listed companies but the time series data apparently only runs until (part of) 2018. It was discontinued.

Finally I decided on a new test / variation of the same script and for an existing dataset where new data was still added. I decided on the dataset IMF Cross Country Macroeconomic Statistics. Some of the issues that I wanted to test in this script:

  • Uploading a list/text file where I could change the series and just re-run the script without needing to change the Python script. This is possible with the IMF data as the series have a well organized structure and naming convention. Additionally, it uses the standardized naming convention for countries: ISO.
  • Changing the API request to include a query to limit the download to either a number of years or to download all data
  • Use the download to aggregate data and create a different type of graph

In the end it all worked out and I created the following script that does all that. You can download the example here.

If anyone wishes to re-use the Python scripts and adapt them, please feel free to do so.

N.B.:

  1. Not too long ago (2018) Quandl was acquired by the Nasdaq Inc company. The Quandl platform hosted many datasets that were free and also many datasets from commercial providers (license required). After the acquisition the platform was rebranded and is now called Nasdaq Data Link.
  2. Before using the API from Nasdaq Data Link you need to install the nasdaqdatalink Python module. If you previously installed the Quandl Python module, this also still works (at the moment) for making requests
  3. The example Python programs that you can download here are slightly altered because it uses my personal Authentication Key from Nasdaq. I created a small Python program AK_config.py with just a variable AK containing this key and referenced it / used it from the other programs using: from AK_config import AK.
  4. If you wish to re-use the programs from this post, make sure to change the location folders and drive letters as needed (for input and output). Also, make sure to first import necessary modules or latest versions (as needed) of Python modules like Pandas, Numpy and matplotlib.pyplot

 

 

WRDS Country Indices

Recently I had a look at some relatively new WRDS (Wharton Research Data Services) databases that have been created concerning indices. The databases are available through the main WRDS screen via World Indices by WRDS. The databases contain country total return data based on indices that they have created for over 40 countries. The available frequency for the data is Daily and Monthly and they could possibly serve as alternatives to existing country indices. The daily and monthly data are available as separate databases. Additionally it is possible to find out what companies are part of the indices that were created, using the WRDS World Index Constituents part database.

The data that the WRDS Indices are based on, originates from Capital IQ’s Compustat Global Security Daily part database. The (2016) methodology that the data is based on is made available as part of the documentation. It also has an appendix that lists the correlation of the WRDS index for each country with comparable specific country indices.
The main total return data for countries includes two series: country total return data with dividends or without dividends. I did a simple comparison for the Dutch AEX index based on monthly data from the Compustat Global part database with price index data. I compared it to the monthly return data from the WRDS indexes for the Netherlands and they seem to compare pretty good as a representation of stock market developments over time.

N.B.: I used AEX Index prices from Compustat Global to calculate the monthly return using the simple formula: (PriceB – PriceA) / PriceB. Brighter minds then mine can probably think of a better formula to do a more accurate comparison.

 

 

Availbility of the Euro Stoxx 600

Not too long ago a student asked me if a specific index was available and how to get data on it. In this specific case it concerned the Euro Stoxx 600. Initially, I thought of two options (that I have available) to get this type of data: FactSet and Capital IQ’s Compustat Global database.

 

Option 1: FactSet
This option seemed straightforward. Using the Charting option in the advanced interface (or using the Excel add-in) it is easy to get at the data:

If your research involves specific companies over time that are (or have been) part of the index, then the advanced interface allows you to do this using the Screening option. After selecting the screening option you choose Universal screening and click on the option “Starter Screen”. The default series that is shown is the S&P 500 (Index). To go on from there you remove this index: hover with the mouse over the index and click on the red button to Remove. Use the search bar to find “stoxx 600” and click the index option that is presented (from the list). Initially, the current Euro Stoxx 600 list of companies are presented:

By clicking the series in the Select part of the screen you can get the formula to appear in the search bar. By clicking the formula in the search bar it is possible to call up options that include changing the date, which makes it possible to call up an older (historical) list of constituents for this index:

It does, however, require not just access to FactSet, but also an additional subscription to have access to older constituent lists for Stoxx indexes. Another downside is, that it is not possible to get a list of constituents over a period of time. FactSet only allows you access to lists for specific points in time.

Option 2: Compustat Global
The database is available through the WRDS platform (this requires a separate subscription). The part database Index Prices allows you to search for and downloads index data on an index as a whole. At step two in the search screen it is possible to search for the index code using the option “Code Lookup“:

Looking for stoxx 600 will get you the Dow stoxx 600 code 150376. Using this code it is possible to download price data for this index.

Within the platform it is also possible to browse to the part database “Index Constituents“. The same “Code Lookup” option is available to find the same code: 150376. Using this code it is possible to download historical lists of the companies in this index going back to 2000:

On the face of it, the list from Compustat appears to be the Euro Stoxx 600. Unfortunately, this is not the case. The index was originally introduced in 1998 to represent a proxy of overall (stock) market developments in Europe. It is derived from the STOXX Europe Total Market Index (TMI) and is a subset of the STOXX Global 1800 Index. The joint venture Stoxx Limited provides the Stoxx indexes and was formed as a joint venture. In 2009 Dow Jones exited this joint venture and the Stoxx indexes were now renamed by removing the DJ/Dow Jones prefixes. This means that the Euro Stoxx 600 index available through Compustat may be very similar from 2000 until 2009 but after the demerger will likely deviate from the actual Stoxx 600 indices from Stoxx Ltd. The name in the download still indicates Stoxx Indices, however, which may cause confusion as to which index it is. At this point in time the actual indices from Stoxx Ltd are not available (anymore) through Capital IQ’s Compustat in WRDS.

Audit Analytics update

The database Audit Analytics was recently updated and now includes more options to merge the data from it’s database(s). Previously only the options Ticker (/Best Edgar Ticker) and the CIK were available as codes representing corporate entities.

With the new update, two more codes are now available: CUSIP 9, and ISIN. These two identifiers are available in many more databases. Combining data from other databases with data from Audit Analytics is now much improved. Important: your license to the Audit Analytics databases may not as a default have these codes added. I guess that it may require an expansion of an existing license to get these codes.

Also important, is, the fact that at the moment both new codes are not always available in the WRDS version of the Audit Analytics databases. This may have to do with the fact, that the databases in WRDS are updated on a quarterly basis while the main website is updated daily.

To make it easier to work with multiple databases I updated my overview of databases (that I am familiar with) and which identification codes are available in these sources.

SPAC companies and FactSet

The database FactSet has data on SPAC companies in two separate databases: the M&A database, covering data on Mergers and Acquisitions, and the IPO database, covering initial public offerings.

1) Finding SPACs ipo data
The Idea Screening option allows you to search for this type of data under “IPOs & Follow-Ons”:

The database has the variable “Blank Check Company” that allows you to quickly select companies of that type in the database. There is a specific sub-categorie for Special Purpose Acquisition Company availabe. Since a SPAC is not always clearly distinguishable from the other two types it may be best two select them all and later decide based on additional data:

To further narrow down the result it may be necessary to include items like the “Business Description” in the output. SPAC companies are regularly called Blank Check companies and not specifically indicated as SPAC. Sometimes the Business Description will indicate a different text: “SPAC”, or “Special purpose acq co (SPAC)” or a variation on this. The name of a company often also contains keywords or terms like acq or acquisition or SPAC. A report with the output could look as follows:

2) Finding SPAC M&A data
The Idea Screening option allows you to search for this type of data under “M&A:

In this database you can find SPAC companies involved in M&A’s using the variable: “Blank Check/SPAC Acquirer”. Example:

Afterwards, when creating a custom report, it is a good idea to add variables like FactSet ID numbers / identifiers to later add additional financial data through the Excel Add-in.

3) Using the Excel add-in to find SPAC companies
Initially you could have a list of companies that was either generated using FactSet or through a second database. It is possible to find out if any listed entities are either a certain Blank Check Type company or a SPAC through variables like:

  • “Blank Check Type”
  • “Is a SPAC related entity”

SPAC companies and SDC Platinum

A special purpose acquisition company (SPAC) is a company with no activities that is formed strictly to raise capital through an initial public offering (IPO) for the purpose of acquiring an existing company. Also known as “blank check companies”. A blank check company is a publicly-traded, developmental stage company that has no established business plan. It is used to gather funds as a startup or, more likely, it has the intent to merge or acquire another business entity. SPAC companies have become popular in America the past few years and, recently, also in Europe.

Recent Dutch examples of SPAC IPO’s from the past few years:

  • Dutch Star Companies ONE NV
  • Dutch Star Companies Two BV
  • ESG Core Investments BV
  • European Biotech Acquisition Corp
  • Pegasus Acquisition Company Europe BV.

The database SDC Platinum (also part of Eikon) has data on SPAC companies in two separate databases: the M&A database, covering data on Mergers and Acquisitions, and the Global New Issue database, covering IPO data.

1) Finding SPACs ipo data
The database has the variable “Blank Check Company Flag (y/n)” that allows you to quickly select companies of that type in the database. There is no specific sub-categorie for SPAC availabe. The flag variable is available by searching in the “All Items” tab:

gni-blankcheck_search

To further narrow down the result it may be necessary to include items like the “Business Description” in the output. SPAC companies are regularly called Blank Check companies and not specifically indicated as SPAC. Sometimes the Business Description will indicate a different text: “SPAC”, or “Special purpose acq co (SPAC)” or a variation on this. The name of a company often also contains keywords or terms like acq or acquisition or SPAC:

gni-result_blankcheck

2) Finding SPAC M&A data
In this database you can find SPAC companies involved in M&A’s using two variables: “Acquiror is a Special Purpose Acquisition Vehicle” and, perhaps, to avoid missing anything, the variable “Blank Check Company Flag (y/n)”. Using the Logical Set Operator option you can combine both into a single search result. If it concerns M&A deals that have been completed, then it would be a good idea to add the Deal Status (= Completed or Unconditional). Example:

gni-blankcheck_search

Afterwards, when creating a custom report, it is a good idea to add both variables also to the report, and including the deal synopsis as well as the Business description for the acquiror.

N.B.: To combine data from the IPO database and M&A databases from SDC Platinum, the best variables are either using CUSIP 6 and/or the ticker (or primary acquiror ticker) codes. Additional options (directly or indicartly useful) may be Sedol codes or ISIN codes when available (Additional information).

Return data for Fama–French model

Recently somebody asked me where to look for monthly return data for the US market. The data was needed for Fama-French factor model. It seems to me that the U.S. market factor data is available from Kenneth R. French’s website (2019). The data provided here is mainly in a TXT or CSV format and can be processed and analysed using different programs.

Some of the Fama-French data is also available through other sources, for instance, the WRDS platform (Wharton Research Data Services). The data that they provide can be found as is labled as: “Fama French & Liquidity Factors” or “Fama-French Portfolios and Factors”. Access and downloading through this platform does, however, require a personal or organizational subscription.

I do not think all data from the original Kenneth French website is available but it does have additional related data: Pastor-Stambaugh Liquidity series, Sadka Liquidity measures, and China Factors (Liu, Stambaugh and Yuan).
At the Vrije Universiteit Amsterdam we have access to it (screenshot).

S&P Indexes in Compustat no more

Recently the company behind the S&P Indices (S&P Dow Jones) has decided to withdraw their constituent data from the Capital IQ Compustat databases. This means that it is still possible to get the prices for the overall indices, but it no longer is possible to get detailed data on individual companies as they move in and out of the lists over the years. Lists like S&P 500 are heavily used in much research as a benchmark list for market developments.

I know of only one database (available to me) at the university that still allows for the generation of constituent lists at the Vrije Universiteit Amsterdam and this is the database FactSet. There is one drawback compared to the lists that used to be available through Compustat, however: where Compustat made it possible to generate a multiyear list over time, FactSet only allows you to generate a daily list.
If you need multiyear lists from FactSet you will have to download multiple lists for years and then combine them yourself.

I have created an example movie on how to download an index list and put it on Youtube:

Using FactSet requires FactSet software as well as a FactSet account. Information on this for personnel and students at the university is available here.

N.B.: The Compustat database Execucomp has annual data on executives for companies that are (or have been) part of the S&P 1500 list. A download of company codes (GVKEY, CUSIP, Ticker) could perhaps serve as a rough list on which companies are part of the index over a time period. It is only a rough estimate, as the data is based on annual SEC filings and the index changes daily. Execucomp includes data going back to 1992.

Quarterly IBES data (in WRDS)

Understanding the forecast data in the I/B/E/S/ database (Institutional Brokers Estimate System database from Refinitiv) was not as straightforward as I thought it would be. What threw me off was the way I initially looked at the data. Below you find an example download I did for 1 company (Wendy’s Co.) for 1 year. I downloaded the quarterly forecast EPS data for 2011 from the Summary Statistics, Unadjusted part database from the Unadjusted Summary collection of I/B/E/S/ databases.

The three columns that are important where it concerns understanding the data, are marked green.

  • Forecast Period End Date (SAS Format) = this date indicates on what date the calendar quarter ends for which the forecast was made
  • Forecast Period Indicator (FPI) = 6 indicates quarter 1, 7 means quarter 2 , 8 is quarter 3 and 9 is actually quarter 4
  • I/B/E/S/ Statistical Period (SAS Format) = The date when the calculated summary statistics were finalized, which could only be done when the final data on the forecast quarter had been collected. That is why monthly this date is sometimes more recent = past the actual forecast quarter.

What was confusing to me was how I needed to interpret the Forecast Period Indicator. I could not make sense of it based on the information in the I/B/E/S/ manual. After talking to the owners of the database it finally made sense:

Where the FPI indicates 6 what is meant is: the current quarter. That means that the forecasts for these rows that have the number 6 were done within the same quarter as the Forecast Period quarter (indicated in column P). Where the FPI says 7, that means that the forecast data on the calendar Period quarter is from the previous quarter. For the FPI 8 the forecast data is from two quarters back and where the FPI says 9 it is 3 quarters before the quarter indicated as the Forecast period quarter.

To make it easier to understand I have below calculated the end date of the actual quarter in which the forecasts were done based on the FPI in column G: and the Forecast quarter dates in column P. The Excel formula I used for this (column Q) is:

=IF(G2=6;P2;IF(G2=7;P2-92;IF(G2=8;P2-184;IF(G2=9;P2-275))))

In column R I decided to use the Excel formula =EOMONTH() to get the actual End date of the month and Column S has a formatted date to format the date the way the date in column P is formatted. I used Excel formulat =TEXT(R2;”mm/dd/yyyy”) for this.

The result looks as follows:

The extra columns (yellow) indicate the quarter in which the forecasts were made by analysts for the calendar quarter indicated in column P.