1  Financial and economic data

While financial markets generate a vast amount of data (sometimes with new observations every nanosecond), it is generally difficult and costly to get this data. All data providers are commercial with complicated interfaces, sometimes very expensive, and the few free data providers tend to have erratic access and errors in data.

1.1 Considerations

1.1.1 Downloading and importing data

Vendors usually distribute data as a csv or json files. In some cases, we can download this data on the command line or in a browser. However, it is usually better to automatically download it in a software package via application programming interfaces (APIs), a way for two or more computer programs to communicate with each other. That minimises errors because we don’t have to manually download the data, and allows for automatic updates as needed.

1.1.2 Symbols and names and identifiers

There are many categories of financial data one might use, like stocks, bonds, futures, options, commodities and foreign exchange. For each of these we have a large number of individual assets and types of assets. Identifying the asset we need can be quite complex. Names and tickers

All stocks that are trading in the market are associated with a ticker symbol, serving as an identifier for the specific security. These are often specific to the particular exchange or a country of listing. This can often lead to confusion or ambiguity when a company has cross-listings, for instance, the Japanese car manufacturer Toyota is listed as 7203 on the Tokyo Stock Exchange, TYT on the London Stock Exchange, and TM on the New York Stock Exchange.

Some identical securities have different names. Depending on the source, searching for data on the S&P-500 (a major American stock market index) might require the ticker names SPX (Bloomberg), ^GSPC (Yahoo Finance), INX (Google Finance), GSPC.INDX (EOD), and so on.

Furthermore, firms sometimes change names.

When searching for a particular security from the data source, it is good practice to verify that the ticker symbol used indeed corresponds to the correct data series by checking the data description. Tickers, ISIN and PERMNO

Tickers often change over time (for example due to a merger or name change) and be recycled, so the same ticker can refer to two, or more, firms, giving rise to problems if not taken into consideration when querying financial data. It is best not to refer to stocks by the ticker symbol when downloading data, and instead use one of the permanent asset identifiers.

The ISIN is the International Securities Identification Number, which is an internationally recognised 12-characters code that is unique for each stock. Unlike Tickers, the ISIN for a stock is the same regardless of the market.

The PERMNO, is the permanent issue identifier of the CRSP dataset.

1.1.3 Best practice in data downloading

Given the need to ensure that accurate data is used, it may be best to have a separate step in downloading and running the data. Get data from a provider, save it as a CSV or binary file and then have a separate run for processing the data.

There are a few reasons why this is preferable:

  1. Limits for API calls. Many data providers place limits on the amount/speed of data downloads;
  2. Minimise time spent importing data; Especially true for large datasets, since CSV reading is relatively quick;
  3. A vendor may be off-line;
  4. Your network access may be having difficulties;
  5. Data errors, and checking for data omissions. Useful to have a pre-analysis check of data quality using plots and tables.

The main exception is if we are doing real time processing.

It can be useful to download data from a vendor when code is run and compare it to data already saved. Then, making a hash or a digital fingerprint of the old and new dataframe, using the digest command from a library with the same name, can be useful.

1.1.4 Dates

A date has three components, year, month and day. Refer to those as YYYY, MM and DD, respectively. There may also be a time component: hours, minutes, seconds and fractions of a second. As well as a time zone and perhaps a summer time indicator.

Dates are very complicated to work with in software, as most code does not internally use years, months, etc. Instead, dates are floating point numbers relative to some initial date. For example, R’s origin date/time is set as January 1 1970 00:00:00, and all dates are relative to that. Midnight 1 January 2000 is the number 10957.

Dates often present problems when using multiple languages or programs to carry out numerical work. Excel has two different conventions for dates depending on version — the origin year can either be 1900 or 1904 — this requires verification before use. Furthermore, Excel does not allow dates before 1900.

A date can be represented numerically in many different ways. Consider the date 13 September 2018:

Format Example
DD-MM-YYYY 13-09-2018
MM-DD-YYYY 09-13-2018
YYYY-MM-DD 2018-09-13
YYYYMMDD 20180913

The best way is to use the YYYYMMDD convention, for two reasons:

  1. It can be represented as an integer, not as a string, making data handling more convenient;
  2. It sorts naturally (in chronological order).

The R package lubridate is very useful for dates.

1.1.5 Adjusted and unadjusted prices

If you download stock prices, in many cases your data will be unusable because the number of stocks outstanding is often adjusted, usually by stock splits.

For example, Amazon announced a 20 for 1 stock split in 2020. This meant that every Amazon stock became 20 stocks, and similarly, the price dropped by a factor of 19, from about $2000 to $124.

This means that if you would load such price data into R and do analysis you would have a big price drop that has no impact on risk or wealth.

We therefore work with what is called adjusted prices, that is, prices adjusted for stock splits.

1.1.6 Asynchronous prices

Problems arise when data comes from different markets and countries due to:

  1. Holidays;
  2. Time zones.

Public holidays, days when the markets are closed, are often not the same across countries. For example, an independence day or religious holiday, like 4 July in the United States. The exchanges are usually open Monday to Friday, but the Saudi Stock Exchange is open from Sunday to Thursday. Some exchanges close for a lunch break. Some countries have summer time and others not, and summer time often does not happen on the same dates, like in the US vs. Europe.

Name Time Zone Trading Hours Lunch Break
New York Stock Exchange EDT 9:30 a.m. to 4:00 p.m. No
Shanghai Stock Exchange CST 9:30 a.m. to 11:30 a.m. to 3:00 p.m. 11:30 a.m. to 1:00 p.m.
Tokyo Stock Exchange JST 9:00 a.m. to 3:00 p.m. 11:30 a.m. to 12:30 p.m.
London Stock Exchange BST 8:00 to 3:30 p.m. 12:00 p.m. to 12:02 p.m.
Frankfurt Stock Exchange CET 9:00 a.m. to 5:30 p.m. No

The market in the New York might overlap with London but not some European markets, and since Tokyo is 12 hours ahead of London, there is no overlap in trading hours.

This means that any research comparing prices across countries at the daily frequency needs to take these issues into consideration. They can be bypassed by using weekly or monthly data

1.2 Common Sources of Financial Data

The type of data we use here can only be obtained from a commercial vendor, either for free or by paying. Your university might have a subscription to a commercial vendor that you can use for free.

1.2.1 What we usually use EOD Historical Data

Our primary source of financial prices is End of Day Historical Data which provides fundamental data API, live and end of day historical prices for stocks, ETFs and mutual funds from exchanges all around the world. While not free, it is not very expensive and comes with an academic discount.

It has a very useful API interface that allows downloading data directly into R. Suppose you have obtained an API token, this is how you can download daily Apple stock prices.

api.token = "YOUR_API_KEY_HERE"
symbol = "AAPL.US"
ticker.link <- paste("http://nonsecure.eodhistoricaldata.com/api/eod/", symbol, "?api_token=", api.token, "&period=m&order=d", sep="")
data <- read.csv(url(ticker.link))

The data we use in these notes was obtained with permission from EOD. DBnomics

Our main source of economic data is DBnomics which aggregates free data feeds from a number of sources, including the World Bank, IMF, the BIS and OECD.

It comes with both a browser and an API interface. For R we can use rdbnomics.

df1 = rdb(ids = "AMECO/ZUTN/EA19.")

1.2.2 Some other vendors Bloomberg

One of the most ubiquitous data sources in finance is the Bloomberg Terminal. Due to its pervasiveness throughout the industry, there are numerous packages in practically every language to access its APIs. We can download Bloomberg data directly into R.

LSE students have access to a number of Bloomberg terminals in the library and the Masters students common rooms. Wind

The Wind Financial Terminal (WFT) also provides market data like the Bloomberg Terminal, but with a specific focus on the Chinese financial markets. It supports APIs for MATLAB, R, C++ and Python, among others. LSE has access to Wind. WRDS

The Wharton business school at University of Pennsylvania provides service called Wharton Research Data Services (WRDS) that many universities subscribe to. This provides a common interface to a number of databases, including CRSP and TAC high-frequency data. WRDS and many of its databases are available to LSE students and staff. Yahoo Finance

The go-to place for many researchers requiring financial data has been finance.yahoo.com. This data can be automatically downloaded for free into many software packages, including Matlab, R and Python.

There are three problems with Yahoo Finance.

  1. Yahoo occasionally changes how the API works, requiring updates to software;
  2. It often is unavailable for days or weeks;
  3. There are errors in the data. For example, UK prices, quoted in pence by convention, sometimes appear in pounds for one or two days, reverting to pence. On other occasions, numbers are simply wrong. Federal Reserve Economic Data (FRED)

The FRED Economic Reserach is a good source for macroeconomic data including data for unemployment, GDP, interest rates, the money supply, etc. It can be accessed from DBnomics. IEX

IEX provides access to US equity data via https://iextrading.com/developer/. ECB FX

The European Central Bank Statistical Data Warehouse and its corresponding SDMX interface allow for retrieval of daily Euro FX data.

The entire dataset is here http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip, and it can be accessed using

wget http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip -O eurofxref-hist.zip

EOD also has this data. Alpha Vantage

Alpha Vantage provides free daily and realtime stock price data, and API access is available in both R and Python. Its data source appears to be the same as Yahoo Finance and hence subject to the same errors. Quandl

Quandl provides a common API access for R and Python to a large number of commercial databases, and some that are free. While comprehensive, one may need to subscribe to data from a number of providers. Fama-French Data Library:

The Fama-French Data Library provides a large amount of historical data compiled by Eugene Fama and Kenneth French. The data is updated regularly, and the Fama-French 3-factor data is especially useful for analyzing fund and portfolio performance.

1.2.3 Other useful databases

Some useful databases that can be assessed in different ways, include: CRSP

For historical US stock market data, one of the major sources is The Center for Research in Security Prices (CRSP, pronounced “crisp”), headquartered at the University of Chicago.

CRSP can be accessed via WRDS. BIS

The BIS provides a very useful database on credit and banking statistics. You can either access it directly at stats.bis.org or DBnomics. World bank

The World Bank provides a lot of economic development data. The best way to access that is via DBnomics. OECD

The OECD provides a lot of data on member states. The best way to access that is via DBnomics.