Warning: WRDS and Two-Factor Authentication

WRDS (Wharton Research Data Services) is a company that offers a platform for using many financial and economic databases, for example CRSP, Compustat, NYSE intraday data, and Audit Analytics. Recently it was announced (quite suddenly) that people who use Wharton Research Data Services accounts for the WRDS platform will be required to change the way they get access. WRDS account users will need to use either an app from Duo Security (Cisco) or another way to confirm that they are the rightful users of the account: in essence, Two-Factor Authentication (2FA).

Details are explained on the WRDS platform in the FAQ and on the general support pages. The change starts on 28 November 2022. The only WRDS account type that will not (yet) be impacted by this new policy is the WRDS Class account. At some organizations that have a WRDS license it is also possible to use the WRDS platform through a WRDS Day Pass option; the new policy will not impact Day Pass users.

In 2023 the 2FA policy will also impact WRDS account users who use programs like SSH, SAS, R, and Python to access licensed data directly from the WRDS servers. The specifics of how this will work, and the precise date in 2023, have not yet been disclosed.

NYSE TAQ data & Python

Some of the databases at our university are made available through the WRDS (Wharton Research Data Services) platform. The data hosted on the WRDS servers can also be accessed through an API using programming languages like Python, not just from programs like Spyder and PyCharm but also from a Jupyter Notebook.

Step 1: Set up access to WRDS
Using Python to access databases on the WRDS servers takes a few steps, which are also described on the WRDS platform.

  • Register at WRDS to request a WRDS account
  • Make sure to install the wrds module in Python: this can usually be done at the Python or Anaconda prompt using a command like: pip install wrds
  • Create a file called (literally): .pgpass . This is a plain text file that contains 5 elements separated by colons: wrds-pgdata.wharton.upenn.edu:9737:wrds:accountname:accountpassword where the last two items should be replaced with your actual WRDS account user name and password. Example: wrds-pgdata.wharton.upenn.edu:9737:wrds:pietjepuk:123456
  • I do not know how this works on Mac computers, but on a Windows computer make sure to put this file in your Windows user account folder. This will look similar to C:\Users\inlogaccountname\ where inlogaccountname is the name of the account you use to log into your Windows computer.
    I personally use PyCharm for running Python programs; when you use the wrds module it may also generate a .pgpass file automatically (without you creating it first) and put it in the right location on the computer.

Step 2: Use the wrds module with Python
Each program in which you wish to access and download data from databases like NYSE TAQ requires you to load the wrds module using: import wrds . Another line that is required actually connects to the WRDS platform using your account, for example:
conn = wrds.Connection(wrds_username='accountname') . The example I use here is for the NYSE Trades and Quotes database, which contains high-frequency data on equities listed on the NYSE exchange. Because the data is so huge, it is divided up into several databases/tables for each year.
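Put together, the opening lines of such a script look like the short sketch below; 'accountname' is a placeholder for your own WRDS user name, and on a first run without a .pgpass file the module will prompt for your credentials and may offer to create that file for you.

    import wrds

    # Connect to WRDS. 'accountname' is a placeholder for your own user name;
    # without a .pgpass file the module prompts for credentials interactively.
    conn = wrds.Connection(wrds_username='accountname')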

Now before you start pulling data from the WRDS servers, you need to check a few things:

a) What databases/libraries do you have access to? The institution that you work at needs to have a subscription (or license) to be able to access anything. Through the web-based front end it should be easy to find out what you can access or download data from.
b) Test a script and figure out which tables are available in which NYSE libraries (see the sketch after this list). It is possible to do this using commands like: print(sorted(conn.list_libraries())) where conn is the variable used to connect to WRDS. If you already know which library you are interested in, it is a good idea to make a list of the available tables, using for instance: full_list = sorted(conn.list_tables(library='taqmsec'))
c) Use the front-end description of the tables in WRDS to determine which variables you want data for and include these in the script.
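Assuming the connection from Step 2 is in place, a small query against one of the TAQ millisecond trade tables could look like the sketch below. The table name (ctm_20220103) and the column names (sym_root, time_m, price, size) are assumptions based on common naming in the taqmsec library; verify them with conn.list_tables() and conn.describe_table() before running anything large.

    # Which libraries and TAQ tables can this account reach?
    print(sorted(conn.list_libraries()))
    full_list = sorted(conn.list_tables(library='taqmsec'))
    print(full_list)

    # Hypothetical query: trades for one ticker from one daily trade table.
    # Table and column names are assumptions; check conn.describe_table() first.
    trades = conn.raw_sql(
        """
        SELECT sym_root, time_m, price, size
        FROM taqmsec.ctm_20220103
        WHERE sym_root = 'IBM'
        """
    )
    print(trades.head())

    conn.close()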

You can download the example here. If anyone wishes to re-use the Python scripts and adapt them, please feel free to do so.

N.B.:
1) The example Python program that you can download here is slightly altered to keep my personal wrds account safe.
2) You need to change the line with the destination directory/folder (sbase) to an existing drive letter and folder that applies to your situation
3) The script allows you to download data for one or more companies from multiple tables. In the example I limited the number of tables to just show a working example. It is possible to call data for many tables consecutively. It is also possible to just hard-code the ticker list in the script.


WRDS Country Indices

Recently I had a look at some relatively new WRDS (Wharton Research Data Services) databases that have been created around indices. The databases are available through the main WRDS screen via World Indices by WRDS. They contain country total return data based on indices that WRDS has created for over 40 countries. The data is available at a daily and a monthly frequency, as separate databases, and could possibly serve as an alternative to existing country indices. Additionally, it is possible to find out which companies are part of the created indices using the WRDS World Index Constituents part database.

The data that the WRDS indices are based on originates from Capital IQ's Compustat Global Security Daily part database. The (2016) methodology that the data is based on is made available as part of the documentation; it also has an appendix that lists the correlation of the WRDS index for each country with comparable country-specific indices.
The main country total return data includes two series: total returns with dividends and without dividends. I did a simple comparison for the Dutch AEX index, based on monthly price index data from the Compustat Global part database, against the monthly return data from the WRDS index for the Netherlands, and the two series match up pretty well as a representation of stock market developments over time.

N.B.: I used AEX Index prices from Compustat Global to calculate the monthly return using the simple formula: (PriceB – PriceA) / PriceB. Brighter minds than mine can probably think of a better formula to do a more accurate comparison.
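For reference, the textbook simple return divides by the earlier price, (PriceB – PriceA) / PriceA, and in Python/pandas that calculation is a one-liner. The sketch below uses a made-up column name and made-up index levels purely for illustration.

    import pandas as pd

    # Hypothetical month-end AEX index levels (values are placeholders).
    prices = pd.DataFrame(
        {"aex_close": [700.0, 715.2, 708.9, 730.4]},
        index=pd.to_datetime(["2022-01-31", "2022-02-28", "2022-03-31", "2022-04-29"]),
    )

    # Simple monthly return: (P_t - P_{t-1}) / P_{t-1}
    prices["monthly_return"] = prices["aex_close"].pct_change()
    print(prices)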


Audit Analytics update

The Audit Analytics database was recently updated and now includes more options to merge data from its database(s). Previously only Ticker (/Best Edgar Ticker) and CIK were available as codes representing corporate entities.

With the new update, two more codes are available: CUSIP 9 and ISIN. These two identifiers are available in many more databases, so combining data from other databases with data from Audit Analytics is now much easier. Important: your license to the Audit Analytics databases may not include these codes by default; I guess it may require an expansion of an existing license to get them.

Also important: at the moment both new codes are not always available in the WRDS version of the Audit Analytics databases. This may have to do with the fact that the databases in WRDS are updated on a quarterly basis while the main website is updated daily.

To make it easier to work with multiple databases I updated my overview of databases (that I am familiar with) and which identification codes are available in these sources.

Return data for Fama–French model

Recently somebody asked me where to look for monthly return data for the US market. The data was needed for a Fama-French factor model. The U.S. market factor data is available from Kenneth R. French's website (2019). The data provided there is mainly in TXT or CSV format and can be processed and analysed using different programs.

Some of the Fama-French data is also available through other sources, for instance the WRDS platform (Wharton Research Data Services). There the data is labeled as "Fama French & Liquidity Factors" or "Fama-French Portfolios and Factors". Access and downloading through this platform does, however, require a personal or organizational subscription.

I do not think all data from the original Kenneth French website is available there, but it does have additional related data: the Pastor-Stambaugh liquidity series, the Sadka liquidity measures, and China factors (Liu, Stambaugh and Yuan).
At the Vrije Universiteit Amsterdam we have access to it (screenshot).
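If you have WRDS access and prefer to script the download with the wrds Python module instead of grabbing CSV files, something along the lines of the sketch below should work. The library name (ff) and table name (factors_monthly) are my assumptions, so check conn.list_tables(library='ff') first.

    import wrds

    conn = wrds.Connection(wrds_username='accountname')  # placeholder user name

    # Assumed library/table names for the Fama-French factors on WRDS;
    # verify with conn.list_libraries() / conn.list_tables(library='ff').
    ff_monthly = conn.get_table(library='ff', table='factors_monthly')
    print(ff_monthly.head())

    conn.close()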

Quarterly IBES data (in WRDS)

Understanding the forecast data in the I/B/E/S database (the Institutional Brokers Estimate System database from Refinitiv) was not as straightforward as I thought it would be. What threw me off was the way I initially looked at the data. Below you find an example download I did for one company (Wendy's Co.) for one year: the quarterly EPS forecast data for 2011 from the Summary Statistics, Unadjusted part database in the Unadjusted Summary collection of I/B/E/S databases.

The three columns that are important for understanding the data are marked green.

  • Forecast Period End Date (SAS Format) = this date indicates on what date the calendar quarter ends for which the forecast was made
  • Forecast Period Indicator (FPI) = 6 indicates quarter 1, 7 means quarter 2, 8 is quarter 3 and 9 is quarter 4
  • I/B/E/S Statistical Period (SAS Format) = the date when the calculated summary statistics were finalized, which could only be done once the final data on the forecast quarter had been collected. That is why this (monthly) date is sometimes more recent, i.e. past the actual forecast quarter.

What was confusing to me was how I needed to interpret the Forecast Period Indicator. I could not make sense of it based on the information in the I/B/E/S manual. After talking to the owners of the database it finally made sense:

Where the FPI indicates 6, what is meant is the current quarter: the forecasts in rows with the number 6 were made within the same quarter as the forecast period quarter (indicated in column P). Where the FPI says 7, the forecast data for that forecast period quarter was made in the previous quarter. For FPI 8 the forecast data is from two quarters back, and where the FPI says 9 it is from three quarters before the quarter indicated as the forecast period quarter.

To make it easier to understand, I have calculated below the end date of the actual quarter in which the forecasts were made, based on the FPI in column G and the forecast quarter dates in column P. The Excel formula I used for this (column Q) is:

=IF(G2=6;P2;IF(G2=7;P2-92;IF(G2=8;P2-184;IF(G2=9;P2-275))))

In column R I used the Excel formula =EOMONTH() to get the actual end date of that month, and column S holds a formatted date so that it matches the way the date in column P is formatted. I used the Excel formula =TEXT(R2;"mm/dd/yyyy") for this.
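The same mapping can also be done outside Excel. The pandas sketch below shifts the forecast period end date back by the number of quarters implied by the FPI; the column names fpedats and fpi are just my shorthand for the two WRDS columns, and the dates are placeholders.

    import pandas as pd

    # Tiny illustrative frame: fpedats = forecast period end date, fpi = indicator.
    df = pd.DataFrame({
        "fpedats": pd.to_datetime(["2011-03-31", "2011-03-31", "2011-06-30"]),
        "fpi": [6, 7, 8],
    })

    # FPI 6 = forecast made in the same quarter, 7 = one quarter earlier, etc.
    quarters_back = df["fpi"] - 6

    # Shift each date back by that many quarter-ends.
    df["forecast_made_in"] = [
        date - pd.offsets.QuarterEnd(n) if n > 0 else date
        for date, n in zip(df["fpedats"], quarters_back)
    ]
    print(df)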

The result looks as follows:

The extra columns (yellow) indicate the quarter in which the forecasts were made by analysts for the calendar quarter indicated in column P.

Filing size of 10-K reports

Not too long ago a student asked me if it was possible to find out the size of specific filings that companies make with the SEC. Specifically it concerned the 10-K filings, which are the annual reports with a comprehensive summary of a company's financial health. As I understand it, the file size of these filings was to be used in an exercise and served as a proxy for financial reporting readability.

To find the original filings it is of course possible to use the EDGAR search option to locate the full-text filings and their parts (including the XBRL filings):

Using the Central Index Key (CIK), Ticker, or Company name it is easy to find a specific company. Using the Filing Type search box you can narrow the search down to specific filings, like 10-K. This is often possible going back to 1993/1994 (depending on the company).

A specific 10-K filing overview in Edgar would look as follows:

If you need to do this for several hundred companies and multiple years it would take some time to collect the file size data. If you have access to the Audit Analytics database it is possible to get the 10-K file size for a large number of companies at once through the WRDS platform. Audit Analytics has a part database called Accelerated Filer.


Through this part database it is possible to get filings data (from 2000 to now) using Ticker or CIK lists. In the example output below I included the following variables:

NAME = Company Name
FORM FKEY = Form name (10-K and 10-Q)
FISCAL YE = Fiscal Year End
FILE SIZE = Filing size
MATCHFY_SUMFEES_FY = Fiscal Year

Example output using Excel to filter for just 10-K filings:

Important: if you look at the 2015 10-K form file size according to Audit Analytics for the company ADVANCE AUTO PARTS INC, it is 17 MB. This corresponds (roughly) with the size of the complete submission file according to EDGAR: 17,748,770 bytes. The complete submission file in EDGAR includes not just text and HTML code, but may also include pictures and any other file types (Excel files etc.).
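If you export the Accelerated Filer results to a CSV file instead of Excel, the same 10-K filter is only a few lines of pandas. The file name and exact column names below are assumptions based on the variable list above, so adjust them to match your actual download.

    import pandas as pd

    # Hypothetical export of the Accelerated Filer query results.
    filings = pd.read_csv("accelerated_filer_output.csv")

    # Keep only the 10-K filings and sort by company and fiscal year.
    ten_k = (
        filings[filings["FORM_FKEY"] == "10-K"]
        .sort_values(["NAME", "MATCHFY_SUMFEES_FY"])
    )
    print(ten_k[["NAME", "MATCHFY_SUMFEES_FY", "FILE_SIZE"]].head())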


Total Q and Tobin’s Q

A new Compustat data source has become available to people who have access to Compustat databases through WRDS: Peters and Taylor Total Q. This new source provides data on firms' "Total q" ratio and the replacement cost of firms' intangible capital. Total q is an improved Tobin's q proxy that includes intangible capital in the denominator, i.e., in the replacement cost of firms' capital. Peters and Taylor estimate the replacement cost of firms' intangible capital by accumulating past investments in R&D (Research and Development) and SG&A (Selling, General and Administrative Expenses). The background paper can be downloaded here.

Overview of the variables and names in the database:

datadate = Date
fyear = Fiscal Year
gvkey = GVKEY / Compustat unique company code
K_int = Firm’s intangible capital estimated replacement cost
K_int_Know = Firm’s knowledge capital replacement cost
K_int_offBS = Portion of K_int that doesn’t appear on firm’s balance sheet
K_int_Org = Firm’s organization capital replacement cost
q_tot = Total q


The database mainly offers data for companies that are included in the Compustat North America database; most of these companies are American. The coverage is 1950 to now.
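For those who script their downloads, the variables above can in principle be pulled with the wrds Python module along the lines of the sketch below. The library and table names (totalq, total_q) are guesses on my part, so list the available libraries and tables first to find the exact names under your subscription.

    import wrds

    conn = wrds.Connection(wrds_username='accountname')  # placeholder user name

    # Library/table names below are assumptions; verify them first, e.g.:
    # print(sorted(conn.list_libraries()))
    # print(sorted(conn.list_tables(library='totalq')))
    totalq = conn.get_table(
        library='totalq',
        table='total_q',
        columns=['gvkey', 'datadate', 'fyear', 'q_tot', 'k_int', 'k_int_offbs'],
    )
    print(totalq.head())

    conn.close()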


Stata & missing or duplicate data

When you work with large datasets or big data, it may happen that after working with the data for some time you need to take a good look at what has happened to it, especially if you work with combinations of datasets and/or work on them with several people. Another instance is when you have received a dataset from a researcher or organization and need to remove superfluous data that is not relevant to your own research.

1) Investigate the data
There are a few simple commands in Stata that provide a good overview:

  • desc or describe = this command provides a brief summary of the entire dataset
  • summ or summarize = another fine command that gives a quick overview of all the variables with information on: number of observations, the mean, standard deviation, and the lowest and highest values (min & max)
  • tab or tabulate = a good way to cross-reference several items and see whether there are any obvious outliers or patterns in the data

These and many more commands, or combinations of commands, allow you to inspect and judge the data.

2) Missing data

  • Using the summ command it was easy to see that some fields had no data. In that case it may be a good idea to delete them as they serve no purpose here. You can delete a variable/field by typing drop variable. For example: drop CIKNew. A range of variables next to each other can also be dropped with a single command, for example: drop indfmt-conm. There are many more options to delete entire variables/fields from a dataset.
  • Another way to clean data can be applied if you require only those observations/records that (for crucial variables) do not have missing values/data. Deleting observations can be done using the missing value command: drop if mi(variable). For example: drop if mi(Totaldebt). The Stata result screen will show the result of this action: number of observations deleted.
  • Deleting missing values is, however, not always straightforward. Stata shows missing values as dots if you view a dataset with the browse command. In some datasets, however, missing values may sometimes (partially) be represented by another value in some observations. If this is the case it is a good idea to replace those values first to allow for easier editing/deletion. If in your dataset the number zero indicates the same thing as a missing value (in some records) you can use mvdecode to replace it with a dot (= how Stata usually represents missing values). The command would look like: mvdecode variable, mv(0=.). Afterwards you can then remove all missing values the usual way with drop.

3) Removing duplicate data
When you are using multiple datasets and have combined them, you could end up with duplicate observations. Using data from some specific databases may also get you unintentional duplicates. In Compustat you run the risk of duplicates if, for instance, you only need data for industrial-type companies but, when doing the search in the Fundamentals Annual database, you forget to unmark the option FS in the screening options at Step 2 in WRDS. Some companies have more than one statement in Compustat for the same fiscal years, which will get you both FS and IND type/format statements.
The Stata command to remove duplicates should be chosen carefully. I usually combine a unique ID code with a specific event year or date. For instance: duplicates drop CIK year, force

Explanation:

  • duplicates drop removes duplicates
  • in this example duplicates are identified by the combination of the variable CIK (ID code = Central Index Key) with the variable year
  • duplicates will be removed without warning by including the last bit: , force

Personally I think removing duplicates without first checking may not always be the smart thing to do. If you are working with a large dataset it may be a good idea to first tag possible duplicates and then have a look before removing these. The command to tag the duplicates is: duplicates tag, gen(newvariable). This command checks the whole dataset with all variables for all observations for duplicates and stores the result as a number in the new variable with the name newvariable.

Another version of removing duplicates has to do with the number of necessary observations per entity in a dataset. In some cases an analysis requires a minimum number of observations/records to be relevant. If there are too few observations you may again remove them; in this case it can be done by counting observations per entity (for example a company identifier like ISIN, CIK, or GVKEY). You do this as follows:

  • Sort the dataset on the ID that will be counted. Example command: sort CIK
  • Now count the number of IDs in the dataset and store the result in a variable. Example command: by CIK: egen cnt = count(year). This counts how often each CIK ID occurs by counting the years and stores that count in the new variable cnt.
  • We can now remove observations of entities for which the count of years (stored in cnt) is below a chosen minimum. Example command: drop if cnt<10. This means that we require a minimum of 10 observations for an entity.

N.B.: A few final remarks on handling missing data concern the way you work with the data. When you are performing cleaning actions like the ones described above, it is a good idea to first make a copy of your database, because there is no undo like in many programs. You can also experiment a bit with a copy, and you should definitely save the actions that you choose to finalize in a do-file; when you continue from there, again start with a copy. To keep track of the versions of your database you can put a date in the name of each version. When you work with a lot of data over a long time it is also a good idea to save space and memory by compressing the database with the command: compress. Some variables will then be changed to a smaller storage type to save space.


New WRDS platform & Audit Analytics

Last week some major changes were implemented in the Wharton Research Data Services platform. Overall this seems to improve the way you search the databases.

Today I discovered a problem with the Audit Analytics databases, however. No matter what type of search I tried in several part databases, I could not get any data. I kept getting an error message from WRDS stating that I had not selected variables at step 3 (even though I had). The original message was:

I have also checked other databases (Compustat and Amadeus) and there does not appear to be a problem with these sources. I also tried different browsers (IE 10 and Firefox 44.0.2), but this did not help.
I reported the problem to the WRDS Helpdesk and expected a swift resolution. If you need data from the Audit Analytics database in the meantime, you could try to get the data you need directly through their IVES website.

Update 1: Later this evening (February 29) I tried the same type of searches again in the Audit Analytics databases using Firefox 42, Edge 25 and Chrome 48, and everything seems to work fine again. Maybe the problem was easily found, or small/temporary, and quickly fixed.

Update 2: On Wednesday I was notified what caused the (fixed) problem by the WRDS helpdesk: “[the error] was caused by an effort to preserve column ordering for another client. The process of sorting the columns was in this case impeded by erroneous trailing spaces in the column names.” I am glad that the issue was found and solved quickly.
