Viewing & searching large text files

When doing research you sometimes come across large plain-text files: you download them from a secondary source, or you (or a colleague) create them while measuring phenomena. Handling such files can be difficult, as most regular programs cannot (easily) open large text files:

  • Wordpad, Notepad and Microsoft Word have difficulty opening text files with millions of lines, or impose a limit based on file size. Word 2010, for instance, cannot handle files larger than 500 MB.
  • Microsoft Excel 2010 can also open text files but has a limit of about a million rows and is quite slow when working with or searching through such files.
  • Of the Office programs, Microsoft Access 2010 has the least difficulty with large text files, but it too quickly becomes very slow with files larger than 500 MB or containing millions of lines.

I recently came across a small open source program called Glogg. I tried it out with a couple of text files in the 500 MB – 1.5 GB range and found that it opens these files quickly and completely, compared to other programs. In addition to opening such large text files, it has a nice search feature that highlights the items in the file you are looking for. The search result screen at the bottom allows you to jump to the matching items in the text file at the top. Below you see a screenshot of the Glogg program.

The Glogg program is nothing fancy, but it does allow you to quickly view and search large text files. This can be handy if you want to check whether the content is in order (no odd characters) or to see how the file is structured.

The Glogg program can be downloaded from the website. There is a 32-bit version and a 64-bit version.


The Quandl Excel add-in

In the past I have already mentioned the data initiative called Quandl. Through the website it is possible to find and download numerical datasets that people or organizations provide. Many of these datasets are provided free of charge. The data can be downloaded in the following formats: Excel, CSV, text, JSON or XML.

The way the data is handled (by Quandl) makes it easy to download it: when data from different sources is provided in different frequencies or for different time frames, it can still be downloaded in the same sheet at the same time! The frequencies and time frames for selected time series are automatically matched when downloading the data. The data is on all kinds of subjects: from large time series on a macroeconomic level to company data on a microeconomic level and both financial and non-financial data.

The creators of Quandl also offer an Excel add-in which allows you to download data directly into Excel. The current version is 2.5. The add-in can only be used with an Authentication Token, which is provided upon free registration on the Quandl website. Installation is simple: just like any add-in, through the Excel Options. Once it is installed (and the Authentication Token is added under Preferences in the Quandl add-in) you can directly search and download data. Quandl offers an extensive help section with short and clear videos that explain how the add-in works.
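For scripted access outside Excel, the same datasets can also be retrieved as CSV over Quandl's HTTP API. A minimal sketch in Python; the v1 CSV endpoint and the "auth_token" parameter are assumptions based on how the API was documented at the time, and the helper name is my own:

```python
# Build a Quandl v1 CSV download URL for a dataset code such as "WIKI/AAPL".
# The endpoint and parameter names are assumptions; check Quandl's API docs.
from urllib.parse import urlencode

def quandl_csv_url(dataset_code, auth_token, trim_start=None, trim_end=None):
    """Return the CSV download URL for a Quandl dataset."""
    params = {"auth_token": auth_token}
    if trim_start:
        params["trim_start"] = trim_start  # e.g. "2000-01-01"
    if trim_end:
        params["trim_end"] = trim_end
    return ("https://www.quandl.com/api/v1/datasets/"
            + dataset_code + ".csv?" + urlencode(params))

print(quandl_csv_url("WIKI/AAPL", "YOUR_TOKEN", trim_start="2000-01-01"))
```

The resulting URL can then be opened in a browser or fetched by any HTTP client.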

Over 500 sources are already available, offering millions of datasets. Some of these sources are:

People who have registered can also upload their own datasets, making it a great tool for sharing data. One of the people who has done this is the renowned Thomas Piketty (author of Capital in the Twenty-First Century): Link to datasets.


Bitcoin exchange rates on Datastream

Over the past year the digital currency Bitcoin has increasingly caught the eye of governments as well as the investment world. I have posted two items in the past on the Bitcoin phenomenon:

1) A basic post on the availability of exchange rate data through various exchanges.

2) A specific post on the Bitcoin price and the Bitcoin Price index

Thomson Reuters has now decided to make exchange rate data for Bitcoin to US Dollar available through the Datastream system. The exchange rate data is delivered by the UK-based Bitcoin platform Bitstamp Ltd. It can be argued that data delivered by a single provider may give a biased view of developments; in that case it may be a good idea to check the Coindesk price as well. The data from the Bitstamp Ltd exchange is currently included in the calculation of the Coindesk Bitcoin Price Index.

Datastream mnemonic: BTCTOU$
Start date: 18/08/2011
Frequency of update: end of day

The available Datastream time series datatypes are:

  • (ER) Exchange Rate Middle, in this case the last BTC price on the day
  • (EH) Exchange Rate Intra-day High
  • (EL) Exchange Rate Intra-day Low
  • (VO) Volume of Bitcoins traded.


Industry and economic classification systems

Many people who do research on companies and markets use a system like the SIC codes or NAICS codes. Both systems allow you to determine the activities of companies. The SIC code system is the older of the two and can be downloaded as a variable for companies in many databases. A company can be assigned one or more SIC codes; sometimes a primary SIC code is given, which indicates the main activity. In essence, the Standard Industrial Classification (SIC) is a system for classifying industries by a four-digit code. The system was originally developed in the United States in 1937, and it is used by US government agencies to classify industry activities. Basic information is available on Wikipedia.
If your research covers a long time frame it may be necessary to determine whether changes that occur over time in the SIC system have an effect. Luckily, an older version of the SIC code system is available. The U.S. Department of Labor, Occupational Safety & Health Administration website allows the user to search the 1987 version SIC codes.
A more recent version of the SIC system is available on the SEC website of the Division of Corporation Finance: Standard Industrial Classification (SIC) 2011 Code List.

The SIC system is also used by agencies in other countries. In the United Kingdom they have developed their own version of the SIC codes. This United Kingdom Standard Industrial Classification of Economic Activities (UK SIC) is used to classify business establishments and other standard units by the type of economic activity in which they are engaged. The newest version of these codes (SIC 2007) was adopted by the UK as of 1 January 2008. Older versions of the UK SIC system are also available online, specifically the 2007, 2003 and 1992 versions.

The North American Industry Classification System (NAICS) is the system used by US Federal statistical agencies for classifying businesses for the purpose of collecting, analyzing, and publishing statistical data related to the U.S. business economy. The NAICS system was developed under the auspices of the Office of Management and Budget (OMB), and adopted in 1997 to replace the Standard Industrial Classification (SIC) system. From 2002 onwards the system has been used increasingly. Different versions are available through the US Census website.

The NACE code (Nomenclature générale des Activités économiques) is a code that is largely used in the European Union; its member states use it to classify commercial and non-commercial economic activities. It was developed mainly as an instrument for collecting data and publishing economic statistical overviews. Many companies and organizations exhibit a diverse range of activities. A company is only assigned a single NACE code, however: the code indicating the primary activity, the one that contributes most to the company's total added value. The first version of the NACE was created around 1970. The first revision was published in 1990 and was called NACE Rev. 1. The next revision (NACE Rev. 1.1) took place in 2002 and was intended to synchronize the system with the "International Standard Industrial Classification of all economic activities" (ISIC) of the United Nations. The different NACE versions can be found on the Eurostat website. The Dutch national SBI code system (which replaced the original BIK code system) is based on the NACE system.

The International Standard Industrial Classification of all economic activities, abbreviated as ISIC, is a standard used by the United Nations Statistics Division (UNSD). The ISIC is used to classify economic activities so that entities can be classified according to the activity they carry out.
The ISIC classification groups statistical units according to the character, technology, organisation and financing of their production. The ISIC is used widely, both nationally and internationally, for classifying economic activity data in the fields of population, production, employment, gross domestic product and other economic activities. It is a basic tool for studying economic phenomena, fostering international comparability of data and promoting the development of sound national statistical systems.
The current and older ISIC versions are available on the statistical website of the United Nations.

When you are using a database, always check the help information to find out which version of an industry code system is being used. If a code system is available but there is no information on the version, find out by contacting the owner of the database. Some databases gather only historical (industry code) information from other sources without using their own version of such a system. The SDC Platinum databases, for example, have historical SIC and NAICS codes as they were indicated by the sources that Thomson Reuters uses to collect data.

N.B.: In addition to the aforementioned industry codes there are many more code systems that were developed for statistical purposes in specific countries. The database Amadeus has an easy tool in the help section that allows you to translate specific codes to other codes. This tool is available in the Bureau Van Dijk version of Amadeus:


Return and variance of a series

On occasion I get asked for the return or variance of a time series. It is not always possible to download these from a database, but both can be calculated in Excel using basic functions.

1) Return
There are many versions and definitions of the return: “In finance, return is a profit on an investment. It comprises any change in value, and interest or dividends or other such cash flows which the investor receives from the investment.” Many other definitions and meanings exist and you can find an overview of some of these on Wikipedia. Simplified further, the return is the percentage change in value of a time series. As an example I have calculated the return for the price of Crude Oil WTI, but the same (simplified) principle also applies to stock price series. The formula that I use is:

Return = ((Price on Day2 – Price on Day1) / Price on Day1) * 100

Below you see an example screenshot of this:
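The same formula can be applied to a whole price series with a short script. A minimal sketch in Python; the prices are made-up illustration values, not the actual WTI series:

```python
def simple_returns(prices):
    """Percentage change from one observation to the next:
    ((Price on Day2 - Price on Day1) / Price on Day1) * 100."""
    return [(p2 - p1) / p1 * 100 for p1, p2 in zip(prices, prices[1:])]

prices = [100.0, 102.0, 99.45]   # illustrative daily prices
print(simple_returns(prices))    # one return per consecutive pair of days
```

Note that the result has one observation fewer than the price series, since the first day has no previous price to compare with.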

2) Variance
The variance is a mathematical concept in probability theory and statistics. The variance measures how far a set of numbers is spread out. A variance of zero indicates that all the values are identical. A small variance indicates that the data tend to be very close to the mean (expected value) and hence to each other. A high variance indicates that the data are very spread out around the mean and from each other. More on variance can be found on Wikipedia as well as in many good Mathematics books.

The formula that I have used, is the one for measuring an unbiased estimate of the population variance from a fixed sample of n observations:

s² = Σ [(xᵢ – x̄)²] / (n – 1)

s² = Variance
Σ = Summation, which means the sum of every term in the equation after the summation sign.
xᵢ = Sample observation. This represents every term in the set.
x̄ = The mean. This represents the average of all the numbers in the set.
n = The sample size. You can think of this as the number of terms in the set.

Below you find an example that I created using the daily Crude Oil WTI prices from 1 January 2000 to 1 May 2014. I used a similar step by step approach as shown in WikiHow:

The Excel formulas that I used were:
1) To get the sum of all the prices: =SUM(B6:B178). The result is in cell E2.

2) To get the average I used: =AVERAGE(B6:B178). The result is in cell D2.

3) The price (in B6 to B178) minus the average becomes: price – x̄. For the values in column B it would be: =B6-$D$2. See the results in cells C6 – C178.

4) The square of the price minus the mean would be in column D6 – D178. Example: =(C6)^2

5) The sum of the squared deviations in column D is in cell F2: =SUM(D6:D178)

6) The number of prices n is in cell G2: =COUNT(D6:D178)

7) The variance can now be calculated and is put in cell F6: =F2/($G$2-1)

8) The standard deviation is the square root of the variance. I put this in cell G6: =SQRT(F6)
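The same step-by-step calculation can be written as a short script, mirroring the Excel columns above. A minimal sketch in Python; the values are illustrative, not the actual WTI prices:

```python
import math

def sample_variance(values):
    """Unbiased sample variance: sum of squared deviations / (n - 1)."""
    n = len(values)
    mean = sum(values) / n                            # step 2: the average
    squared_dev = [(x - mean) ** 2 for x in values]   # steps 3-4: (x - mean)^2
    return sum(squared_dev) / (n - 1)                 # steps 5-7

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]    # illustrative data
var = sample_variance(values)
print(var, math.sqrt(var))   # variance and standard deviation (step 8)
```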

You can also download the example Excel-sheet here.


Bitcoin exchange bankrupt

Today there was big news about one of the oldest Bitcoin exchanges, Mt.Gox. A lawyer for Mt.Gox announced that the bitcoin exchange was filing for bankruptcy protection and that it had outstanding debt of about ¥6.5 billion ($63.6 million). The exchange has been in trouble since it stopped bitcoin withdrawals in February, citing technical problems that potentially made fraudulent withdrawals possible.

Investing in cryptocurrencies remains risky, but the bankruptcy of Mt.Gox in no way signifies the end of Bitcoin. More information on other exchanges can be found in an earlier post on this blog. If you want to keep up to date on the main Bitcoin developments you can visit the Coindesk website.


Publication dates of annual reports/earnings

One of the hardest variables to find is the publication date of the annual report / annual earnings. Not many databases carry a variable listing these dates for one or more years. In the past the database Factiva had this variable for 5 years, but since the database was changed this data is no longer available. The only options for getting publication dates are:

  • Websites with calendars that list the publication dates of current and previous annual reports for one or more companies. Company websites often list release dates for their reports.
  • Press releases from companies that announce the publication of the report. Press releases are available on websites of companies and also in commercial databases like LexisNexis and Factiva.com.

One of the few databases that carries a variable similar to the publication date is the Compustat North America database. Unfortunately the specific publication dates for annual reports are not available; the report dates for quarterly earnings are, however. The reported publication date of the 4th-quarter earnings may serve as a proxy for the publication date of the annual earnings result (in an annual report).
The variable “Report Date of Quarterly Earnings” is only available in the Fundamentals Quarterly part database (= covers data from quarterly reports).

To get the data for a large number of companies make sure to include the following variables in the output:

  • DATACQTR – Calendar Data Year and Quarter
  • DATAFQTR – Fiscal Data Year and Quarter
  • FQTR – Fiscal Quarter
  • FYEARQ – Fiscal Year

These variables may be used to filter out the 4th fiscal quarter date (in Excel or SPSS) for each year for one or more companies. The variable that lists the necessary earnings report date is:

  • RDQ – Report Date of Quarterly Earnings
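Once the quarterly output has been exported, the fourth-fiscal-quarter rows can be filtered out with a short script. A minimal sketch in Python; the column names match the Compustat variables above, but the sample records are made up for illustration:

```python
# Keep only rows where FQTR == 4, i.e. the fourth fiscal quarter,
# whose RDQ then serves as a proxy for the annual earnings publication date.
def fourth_quarter_rows(rows):
    return [r for r in rows if r["FQTR"] == 4]

rows = [  # illustrative records, not real Compustat data
    {"tic": "ABC", "FYEARQ": 2013, "FQTR": 3, "RDQ": "2013-10-15"},
    {"tic": "ABC", "FYEARQ": 2013, "FQTR": 4, "RDQ": "2014-01-20"},
    {"tic": "XYZ", "FYEARQ": 2013, "FQTR": 4, "RDQ": "2014-02-03"},
]
for r in fourth_quarter_rows(rows):
    print(r["tic"], r["FYEARQ"], r["RDQ"])
```

The same filter can of course also be applied in Excel (AutoFilter on FQTR) or SPSS, as mentioned above.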

Another variable from the Compustat Fundamentals Quarterly database might (in theory) also serve as a proxy for the annual report publication date: the Final Date. This variable gives the specific date on which the data for that quarter was definitively finalized. However, a PhD student at the Stockholm School of Economics, Peter Aleksziev, investigated this in 2015 and concluded that the Final Date cannot serve as a publication date. Crosschecking the data against the publication dates from IBES (using the variable anndats_act), he found that the yearly date matched the Final Date in less than 1 percent of his dataset of roughly 16,000 dates, and the match rate remained very low even when allowing a 5-day interval. The Preliminary Date (PDATE) from Compustat (Segment files) matched the annual report dates from IBES much more closely.
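A crosscheck of this kind (comparing Compustat's Final Date with the IBES announcement date within a small window) can be sketched as follows; the dates below are made-up examples, not the actual data from that investigation:

```python
from datetime import date

def matches_within(d1, d2, days=5):
    """True if the two dates differ by at most `days` days."""
    return abs((d1 - d2).days) <= days

pairs = [  # (FDATE, IBES anndats_act) pairs - illustrative, not real data
    (date(2013, 3, 31), date(2013, 2, 14)),   # far apart: no match
    (date(2013, 6, 30), date(2013, 6, 28)),   # within 5 days: match
]
share = sum(matches_within(a, b) for a, b in pairs) / len(pairs)
print(share)   # fraction of (near-)matching date pairs
```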

The variable Final Date is also available in the Fundamentals Annual part database (= covers data from annual reports) and there it indicates when the Annual data is finalized:

  • Fundamentals Quarterly variable: FDATEQ = Final Date
  • Fundamentals Annual variable: FDATE = Final Date.

The variable Final Date is also available in the Compustat Global Fundamentals Annual and Quarterly databases.

N.B.1: Compustat North America covers financial data on listed North American companies from (mainly) Canada and the United States of America. Compustat Global covers listed companies from other countries (worldwide, including Europe and Asia).

N.B.2: the IBES database is the Institutional Brokers Estimate System database and is available (when licensed) through platforms like Wharton/WRDS and Datastream.
