Filing size of 10-K reports

Not too long ago a student asked me whether it was possible to find out the size of specific filings that companies make with the SEC. Specifically it concerned the 10-K filings: the annual reports that give a comprehensive summary of a company's financial health. As I understand it, the file size of these filings was to be used in an exercise, serving as a proxy for financial reporting readability.

The EDGAR search option can of course be used to find the original full-text filings and their parts (including the XBRL filings):

Using the Central Index Key (CIK), Ticker, or Company name it is easy to find a specific company. Using the Filing Type search box you can narrow the search down to specific filings, such as the 10-K. Coverage often goes back to 1993/1994 (depending on the company).

A specific 10-K filing overview in EDGAR looks as follows:

If you need to do this for several hundred companies and multiple years, collecting the file size data would take considerable time. If you have access to the Audit Analytics database, you can get the 10-K file size for a large number of companies at once through the WRDS platform. Audit Analytics has a sub-database called Accelerated Filer.

Through this sub-database it is possible to get filings data (from 2000 onward) using Ticker or CIK lists. In the example output below I have included the following variables:

NAME = Company Name
FORM FKEY = Form name (10-K and 10-Q)
FISCAL YE = Fiscal Year End
FILE SIZE = Filing size

Example output using Excel to filter for just 10-K filings:

Important: according to Audit Analytics, the file size of the 2015 10-K form for the company ADVANCE AUTO PARTS INC is 17 MB. This corresponds (roughly) with the size of the Complete submission file according to EDGAR: 17,748,770 bytes. The Complete submission file in EDGAR includes not just text and HTML code, but may also include pictures and other file types (Excel files, etc.).
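For those who prefer Stata over Excel for the filtering step, a sketch like the following could be used on a CSV export of the Audit Analytics data (the file name and the exact variable names are assumptions based on the variable list above):

```
* Hypothetical example: import a WRDS/Audit Analytics CSV export
* (file name and variable names are assumptions, not the actual export layout)
import delimited "auditanalytics_filers.csv", clear

* Keep only the annual reports (10-K), dropping the 10-Q records
keep if form_fkey == "10-K"

* File size is in bytes; show the largest filings first
gsort -file_size
list name fiscal_ye file_size in 1/10
```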


Total Q and Tobin’s Q

A new Compustat data source has become available to people who have access to Compustat databases through WRDS: Peters and Taylor Total Q. This new source provides data on firms' “Total q” ratio and the replacement cost of firms' intangible capital. Total q is an improved Tobin's q proxy that includes intangible capital in the denominator, i.e., in the replacement cost of firms' capital. Peters and Taylor estimate the replacement cost of firms' intangible capital by accumulating past investments in R&D (Research and Development) and SG&A (Selling, General and Administrative Expenses). The background paper can be downloaded here.

Overview of the variables and names in the database:

datadate = Date
fyear = Fiscal Year
gvkey = GVKEY / Compustat unique company code
K_int = Firm’s intangible capital estimated replacement cost
K_int_Know = Firm’s knowledge capital replacement cost
K_int_offBS = Portion of K_int that doesn’t appear on firm’s balance sheet
K_int_Org = Firm’s organization capital replacement cost
q_tot = Total q

The database mainly offers data for companies included in the Compustat North America database, most of which are American. Coverage: 1950 – now.
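As a sketch of how this source could be combined with other Compustat data in Stata (the file names here are placeholders; the key variables follow the list above):

```
* Sketch: merge the Peters and Taylor Total Q file into a Compustat
* Fundamentals Annual extract (both file names are hypothetical)
use "funda_extract.dta", clear

* gvkey + datadate identify a firm-year in both files
merge 1:1 gvkey datadate using "total_q.dta", keepusing(q_tot K_int)

* Keep only firm-years found in both sources
keep if _merge == 3
drop _merge
```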


Stata & missing or duplicate data

When you work with large datasets or big data, it may happen that after working with them for some time you need to take a good look at what has happened to the data, especially if you work with combinations of datasets and/or work on them with several people. Another instance is when you have received a dataset from a researcher or organization and need to remove superfluous data that is not relevant to your own research.

1) Investigate the data
There are a few simple commands in Stata that provide a good overview:

  • desc or describe = this command provides a brief summary of the entire dataset
  • summ or summarize = another fine command that gives a quick overview of all the variables with information on: number of observations, the mean, standard deviation, and the lowest and highest values (min & max)
  • tab or tabulate = a good way to cross-reference several items and see whether there are any obvious outliers or patterns in the data

These and many more commands, or combinations of commands, allow you to inspect and evaluate the data.
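Put together, a first inspection session might look like this (Totaldebt and fyear are just example variable names used elsewhere in this post):

```
* First look at a freshly loaded dataset
describe                      // variables, storage types, and labels
summarize                     // obs, mean, sd, min, max per numeric variable
summarize Totaldebt, detail   // percentiles and extremes for one variable
tabulate fyear                // number of observations per fiscal year
```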

2) Missing data

  • Using the summ command it was easy to see that some fields had no data. In that case it may be a good idea to delete them, as they serve no purpose here. You can delete a variable/field by typing drop variable. For example: drop CIKNew. A range of adjacent variables can also be dropped with a single command. For this example: drop indfmt-conm. There are many more options for deleting entire variables/fields from a dataset.
  • Another way to clean data applies if you require only those observations/records that (for crucial variables) have no missing values. Deleting such observations can be done using the missing value function: drop if mi(variable). For example: drop if mi(Totaldebt). The Stata results screen will show the result of this action: the number of observations deleted.
  • Deleting missing values is, however, not always straightforward. Stata shows missing values as dots if you view a dataset with the browse command. In some datasets, however, missing values may (partially) be represented by another value in some observations. If this is the case it is a good idea to replace these values first to allow for easier editing/deletion. If in your dataset the number zero indicates the same thing as a missing value (in some records), you can use mvdecode to replace it with a dot (how Stata usually represents missing values). The command looks like: mvdecode variable, mv(0=.). Afterwards you can remove all missing values the usual way with drop.
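The steps above can be combined into one short Stata sequence (using the example variable names from this post):

```
* Recode zeros that actually mean "missing" to Stata's missing value (.)
mvdecode Totaldebt, mv(0=.)

* Drop a variable that holds no data at all
drop CIKNew

* Drop observations that lack a value for a crucial variable
drop if mi(Totaldebt)
```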

3) Removing duplicate data
When you have combined multiple datasets you may end up with duplicate observations. Using data from some specific databases may also give you unintentional duplicates. In Compustat you run the risk of duplicates if, for instance, you only need data for industrial type companies but forget to unmark the FS option in the screening options at Step 2 of the search in the Fundamentals Annual database in WRDS. Some companies have more than one statement in Compustat for the same fiscal year, and you will get both the FS and the IND type/format statements.
The Stata command to remove duplicates should be chosen carefully. I usually combine a unique ID code with a specific event year or date. For instance: duplicates drop CIK year, force


  • duplicates drop removes duplicates
  • in this example duplicates are identified by the combination of the variable CIK (ID code = Central Index Key) with the variable year
  • duplicates will be removed without warning by including the last bit: , force

Personally I think removing duplicates without first checking may not always be the smart thing to do. If you are working with a large dataset it may be a good idea to first tag possible duplicates and then have a look before removing these. The command to tag the duplicates is: duplicates tag, gen(newvariable). This command checks the whole dataset with all variables for all observations for duplicates and stores the result as a number in the new variable with the name newvariable.
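A cautious workflow along these lines might look as follows (using the CIK/year combination from the earlier example):

```
* Tag duplicates first, inspect them, and only then drop
duplicates tag CIK year, gen(dupflag)

* Look at the flagged observations before deleting anything
list CIK year if dupflag > 0

* If the duplicates are indeed superfluous, remove them
duplicates drop CIK year, force
```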

Another version of removing duplicates has to do with the number of observations per entity in a dataset. In some cases an analysis requires a minimum number of observations/records to be relevant. If there are too few observations you may again remove them; in this case it can be done using the count function on the entity (for example a company identifier such as ISIN, CIK, or GVKEY). You do this as follows:

  • Sort the dataset on the ID that will be counted. Example command: sort CIK
  • Now count the number of IDs in the dataset and store the result in a variable. Example command: by CIK: egen cnt = count(year). This counts how often each CIK ID occurs by counting the years, and stores the count/number of years in the new variable cnt.
  • We can now remove observations of entities for which the count (of years) stored in cnt is below a chosen threshold. Example command: drop if cnt<10. This means we require a minimum of 10 observations per entity.
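The three steps above as one Stata sequence:

```
* Count how many yearly observations each company (CIK) has
sort CIK
by CIK: egen cnt = count(year)

* Require at least 10 firm-year observations per company
drop if cnt < 10
```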

N.B.: A few final remarks on handling missing data concern the way you work with the data. When you are performing cleaning actions such as those described above, it is a good idea to first make a copy of your database, as there is no undo function like in many other programs. You can also experiment a bit with a copy, and you should definitely save the actions that you choose to finalize in a Do-file; when you continue from there, again start with a copy. To keep track of the versions of your database you can put a date in the name of each version. When you work with a lot of data over a long time it is also a good idea to save space and memory by compressing the database with the command: compress. The storage types of some variables will then be changed to save space.
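As an illustration, a minimal cleaning do-file along these lines could look as follows (all file names are hypothetical):

```
* Sketch of a cleaning do-file that always starts from a saved copy
use "mydata_raw.dta", clear       // work on a copy, never the original

drop if mi(Totaldebt)             // cleaning steps go here
duplicates drop CIK year, force

compress                          // shrink storage types where possible
save "mydata_clean_2016-03-01.dta", replace   // date in the file name
```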


New WRDS platform & Audit Analytics

Last week some major changes were implemented in the Wharton Research Data Services platform. Overall this seems to improve the way you search in the databases.

Today, however, I discovered a problem with the Audit Analytics databases. No matter what type of search I tried in several of its sub-databases, I could not get any data. I kept getting an error message from WRDS stating that I had not selected variables at Step 3 (even though I had). The original message was:

I also checked other databases (Compustat and Amadeus), but there does not appear to be a problem with these sources. I also tried different browsers (IE 10 and Firefox 44.0.2), but this did not help.
I reported the problem to the WRDS Helpdesk and expect a swift resolution. If you need data from the Audit Analytics database in the meantime, you can try to get it directly through their IVES website.

Update 1: Later this evening (February 29) I tried the same type of searches again in the Audit Analytics databases, using Firefox 42, Edge 25, and Chrome 48, and everything seems to work fine again. Maybe the problem was easy to find or small/temporary, and it was quickly fixed.

Update 2: On Wednesday I was notified what caused the (fixed) problem by the WRDS helpdesk: “[the error] was caused by an effort to preserve column ordering for another client. The process of sorting the columns was in this case impeded by erroneous trailing spaces in the column names.” I am glad that the issue was found and solved quickly.


WRDS platform: major changes

This week some major changes were implemented in the Wharton Research Data Services (WRDS) platform. These changes mainly concern the search options for databases (Step 2) and the variable selection options (Step 3). In essence, no database options were changed; only the way you make a selection and the way you select variables have changed. These are the most significant changes since July 2015.

The search screen (Step 2) is now much more compact and straightforward, with all the options immediately available and more intuitive:

The variable selection screen (Step 3) has undergone a much bigger change. Instead of scrolling down through a list of boxes with lists of variables, you now scroll sideways through the lists of variables:

The other two steps in WRDS are usually: Step 1 (Selecting a time period), and Step 4 (Output options). No major changes seem to have happened there. If I notice any more major changes in the interface I will post them here.


Compustat & missing data

When you are using Compustat to download data for a number of companies you will probably get missing data for some companies even though the variables you selected for the output are relatively common items. There are several reasons why this happens and I will discuss some of these here (for active companies).

One of the reasons may be related to the accounting standards that companies use for reporting their financial data, and to the interpretation of the accounting rules in different countries. Some variables are simply not required and may therefore be included in reported financial statements for some years and not for others. Also, if companies switch accounting standards, changes will occur in the reported items. If companies use IFRS, this usually means that more standard variables are reported and are comparable across companies.

In addition to changes in companies and the way they report, it is also possible that the reason for missing data lies in the way the Compustat databases evolve/change over time. Some variables, for instance, are legacy variables that no longer contain data or only have data for the older years in the database. In the WRDS platform there is a file called “List of Entirely Null Variables in Compustat Datasets” in the support section under “Vendor Manuals“.
An example of a variable that no longer has data in the Compustat North America sub-database Fundamentals Annual is audit fees: RMUM — Auditors’ Remuneration.

The final reason why data may be missing can be related to the type of company: financial or industrial. Financial type companies are usually active in the insurance sector, banking sector, etc. If you use screening filters at search Step 2 in WRDS to limit your selection to financial or industrial type companies, you may not get (enough) data. This can be particularly tricky because a company can file two types of financial statements and thus have two records (filings) for each fiscal year in Compustat. One of the statements will show IND (Industrial) and the other FS (Financial) to indicate the filing type.

An example of this type of variance is the variable Capital Expenditures (CAPX). This item represents the funds used for additions to property, plant, and equipment, excluding amounts arising from acquisitions (for example, fixed assets of purchased companies); it includes property and equipment expenditures. Industrial companies with industrial type filings will report the variable; financial type companies often will not. The exception is, of course, a company that files both types of statements.
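To see how this plays out in a downloaded Fundamentals Annual extract, a quick Stata check could look like this (capx and indfmt are the Compustat item mnemonics; the exact variable names in your extract may differ):

```
* Check how missing CAPX relates to the statement format (IND vs FS)
tabulate indfmt, missing

* Count missing Capital Expenditures per filing format
count if mi(capx) & indfmt == "FS"
count if mi(capx) & indfmt == "IND"
```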

N.B.: If you are unsure about missing values or if you wish to find out if the data is really not reported (or available) you could try searching for the same list of companies in a second database and download the same variable there.


New BETA platform: Wharton

The WRDS platform (Wharton Research Data Services) is a web-based interface that allows you to search through many licensed databases and makes it easier to download data from them. The learning curve for the Wharton platform is minimal, as all databases can be searched in a similar way. The main effort you need to expend is learning the details of individual databases and how the data is made available. Databases that can be searched through Wharton include: Compustat Global & North America, Execucomp, CRSP, Amadeus & Audit Analytics.

Over the past few weeks it has been possible to view the new BETA version of the WRDS platform, which looks much different from the current platform. The old interface looks as follows:

The new BETA WRDS interface looks very different:

Overall, the changes are mostly cosmetic. When you click through to a specific database you still get the same search options that are available in the previous/current version: 1) select a time frame, 2) make a selection, 3) choose variables, and 4) select the output format and date format.

N.B.: The new BETA version can be viewed but as yet I do not know when the current version will be replaced. I expect it to be sometime during the summer or autumn of 2015.