7  Loading data

7.1 Libraries

library(reshape2)

7.2 CSV, RData, Excel and the rest

There is a large number of ways to store data. Perhaps the most common is Excel, but while Excel can be useful to manipulate data, we do not recommend storing the types of data we use here as an Excel spreadsheet. The reason is that Excel often changes data without telling us. It might erroneously interpret something as a date, and then automatically converting a number into a date. It can lose significant digits or transform data in other unexpected ways. Consequently, in the type of work we do here, we don’t recommend storing data as Excel spreadsheets. I know someone who’s PhD got delayed by a year because he used Excel for his data, didn’t keep a backup of the original data he hand collected, and then Excel mangled the data to the point of it becoming useless.

The best format for data is generally CSV. It is a simple text file that can be edited in any text editor. It can be read by any software package, including Excel, and so is universally accessible, and if there is problem with the data, it is easy to inspect it in a text editor.

There is an exception, since R has a native data format, RData, that allows us to save and load R variables. In what follows, we will obtain data from a data vendor in CSV format and then convert it into RData, and use that for subsequent analysis.

CSV files have some disadvantages. They are uncompressed and therefore can be quite large. Furthermore, by default, a lot of software truncates significant digits from floating point numbers before saving (both R and Python’s Pandas do that, Julia does not) and one may requires an extra option to include all digits. That can cause problems when one needs to see if numbers change.

The best alternative is the parquet format. It compresses the data, and preserves the full floating point precision. It further has the advantage that it is easy to exchange parquet files with other software packages. The disadvantage is that because it comes in a common binary form, it is hard to inspect the files and edit them. We use parquet files for the data we make for extremerisk.org since we use R, Python and Julia for that site.

7.2.1 Things to look out for

It is important to check the documentation of the functions we are using to make sure we are doing things right. For example, we need to be careful if:

  1. The CSV file has headers or not. A header is the title of a column. We can have a CSV file where the first row includes the column titles, or a csv with no titles;
  2. The separator is a comma (,), a semicolon (;), a tab ( or another symbol;
  3. We are getting data from a country where the decimal symbol is a comma (,) instead of a period (.).

By checking the documentation of a function, we will see how to deal with each of these specific cases and more. To do so, run ?read.csv

7.3 Data used here

There are three data sets we use below, US stock prices, the SP-500 and SP-500 TR indices. All obtained with permission from EOD.

The raw data arrives as CSV files, called stocks.csv, sp500.csv and sp500tr.csv.

See Chapter 3.

7.3.1 Processing the sp500.csv file

We can import directly a .csv file into R as a data.frame object using the function read.csv(). The first three lines in the file are:

date,Close
20030102,909.03
20030103,908.59

and the last two:

20221229,3849.28
20221230,3839.5

sp500=read.csv('data/sp500.csv')

Now let’s check that our variable df is actually a data frame:

class(sp500)
'data.frame'

We can check the structure of a data frame using the function str(). This can be quite useful to see what every column of the data frame holds, and what type of data it has:

str(sp500)
'data.frame':   5035 obs. of  2 variables:
 $ date : int  20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
 $ Close: num  909 909 929 923 910 ...
head(sp500,2)
tail(sp500,2)
A data.frame: 2 × 2
dateClose
<int><dbl>
120030102909.03
220030103908.59
A data.frame: 2 × 2
dateClose
<int><dbl>
5034202212293849.28
5035202212303839.50
names(sp500)
  1. 'date'
  2. 'Close'

The name Close is not that convenient, so we rename it

names(sp500)[2]="price"
names(sp500)
  1. 'date'
  2. 'price'

We can make a new column with log returns

\[ y_t=\log P_1 - \log P_{t-1}. \]

When we do this we will not have an observation for day 1, and since the new column needs to have the same number of rows as the data frame, we have to add a first observation, and put NA into it:

sp500$y = c(NA,diff(log(sp500$price)))
head(sp500)
A data.frame: 6 × 3
datepricey
<int><dbl><dbl>
120030102909.03 NA
220030103908.59-0.0004841496
320030106929.01 0.0222255557
420030107922.93-0.0065661110
520030108909.93-0.0141857185
620030109927.57 0.0192005899

It will nicer to remove the first row. One way to do that is

sp500=sp500[2:dim(sp500)[1],]

A more elegant way is to

sp500=sp500[!is.na(sp500$y),]
head(sp500,2)
A data.frame: 2 × 3
datepricey
<int><dbl><dbl>
220030103908.59-0.0004841496
320030106929.01 0.0222255557

We finally save the data frame into a RData file.

save(sp500,file="data/sp500.RData")

And can load it later by

load("data/sp500.RData")

7.3.2 Processing the sp500tr.csv file

We do the same with the SP-500 TR file

sp500tr=read.csv('data/sp500tr.csv')
names(sp500tr)[2]="price"
sp500tr$y = c(NA,diff(log(sp500tr$price)))
save(sp500,file="data/sp500.RData")

7.3.3 Processing the stocks.csv file

The start of stocks.csv file looks like

ticker,date,Close,Adjusted_close
MCD,20030102,16.55,9.6942
MCD,20030103,16.12,9.4423

And the end

INTC,20221229,26.21,25.8945
INTC,20221230,26.43,26.1118

stocks=read.csv('data/stocks.csv')
str(stocks)
'data.frame':   30210 obs. of  4 variables:
 $ ticker        : chr  "MCD" "MCD" "MCD" "MCD" ...
 $ date          : int  20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
 $ Close         : num  16.6 16.1 16.6 16.7 16.8 ...
 $ Adjusted_close: num  9.69 9.44 9.75 9.76 9.86 ...

We see the data frame has 5 columns and 30210 rows. The str() function shows us the names of each column, along with the data type it holds, and some observations. We see there is a column for every stock, which holds numeric data, and a column for dates, which is of type character.

We can also check parts of the structure with different functions:

dim(stocks)
nrow(stocks)
ncol(stocks)
colnames(stocks)
  1. 30210
  2. 4
30210
4
  1. 'ticker'
  2. 'date'
  3. 'Close'
  4. 'Adjusted_close'

It might be good to rename some of the colums

names(stocks)[3:4]=c("UnAdjustedPrice","price")
names(stocks)
  1. 'ticker'
  2. 'date'
  3. 'UnAdjustedPrice'
  4. 'price'

The reason is that we will not work much with the unadjusted prices, so price is transarent and simple.

We can show the first and last row, and rbind put that into the same data frame

rbind(head(stocks,1),tail(stocks,1))
A data.frame: 2 × 4
tickerdateUnAdjustedPriceprice
<chr><int><dbl><dbl>
1MCD 2003010216.55 9.6942
30210INTC2022123026.4326.1118

We can now see how which stocks we have:

unique(stocks$ticker)
  1. 'MCD'
  2. 'DIS'
  3. 'AAPL'
  4. 'GE'
  5. 'JPM'
  6. 'INTC'

7.3.3.1 Reshaping

The stocks data frame isn’t particularly useful as it, it would be much better to have the prices, for each stock in its own column. While there are many ways to do that, the best is to use the package reshape2.

If we want to create a data frame with the returns for every stock, we can use the dcast() function. date ~ means each row is data from same date and each column is data from same ticker, value.var is the column of the input data frame that will be filled into the new data frame.

Price = dcast(stocks, date ~ ticker, value.var = "price")
head(Price,2)
UnAdjustedPrice = dcast(stocks, date ~ ticker, value.var = "UnAdjustedPrice")
A data.frame: 2 × 7
dateAAPLDISGEINTCJPMMCD
<int><dbl><dbl><dbl><dbl><dbl><dbl>
1200301020.225013.754788.50279.786514.50779.6942
2200301030.226513.834488.22489.698514.79299.4423

This works since all the stocks are American, and all from the same exchange, so all with observations on the same days.

We can now make a data frame with returns. There are several ways to do that. Note that the first column is dates and we dont want to transform those.

Return=Price
for (i in 2:dim(Price)[2]) Return[,i]=c(NA,diff(log(Price[,i])))
head(Return,2)
A data.frame: 2 × 7
dateAAPLDISGEINTCJPMMCD
<int><dbl><dbl><dbl><dbl><dbl><dbl>
120030102 NA NA NA NA NA NA
2200301030.0066445430.00577766-0.003144957-0.0090326510.01946779-0.02632817

And to remove the NA.

Price=Price[!is.na(Return[,2]),]
UnAdjustedPrice=UnAdjustedPrice[!is.na(Return[,2]),]
Return=Return[!is.na(Return[,2]),]
head(Price,2)
head(UnAdjustedPrice,2)
head(Return,2)
A data.frame: 2 × 7
dateAAPLDISGEINTCJPMMCD
<int><dbl><dbl><dbl><dbl><dbl><dbl>
2200301030.226513.834488.2248 9.698514.79299.4423
3200301060.226514.543690.482510.073815.95629.7527
A data.frame: 2 × 7
dateAAPLDISGEINTCJPMMCD
<int><dbl><dbl><dbl><dbl><dbl><dbl>
22003010314.901617.129225.382616.5425.9416.12
32003010614.901618.007426.032217.1827.9816.65
A data.frame: 2 × 7
dateAAPLDISGEINTCJPMMCD
<int><dbl><dbl><dbl><dbl><dbl><dbl>
2200301030.0066445430.00577766-0.003144957-0.0090326510.01946779-0.02632817
3200301060.0000000000.04999279 0.025268359 0.0379667600.07570013 0.03234458

7.4 Saving RData files

Once we have finished cleaning and handling our dataset, we can save it so we don’t have to repeat the procedure next time we want to use it.

We can save each variable in its own file, like

save(Return, file = "data/Return.RData")

To save your data frame as a .csv file you can use the function write.csv():

write.csv(Return, file = "data/Return.csv")

But it can be more convenient, and safer to put all the data frames into one list:

x=list(Return=Return,Price=Price,UnAdjustedPrice=UnAdjustedPrice)
save(stocks, file = "data/stocks.RData")
names(x)
  1. 'Return'
  2. 'Price'
  3. 'UnAdjustedPrice'

Even better is simply to put all variables into the same list

data=list(
    Return=Return,
    Price=Price,
    UnAdjustedPrice=UnAdjustedPrice,
    sp500=sp500,
    sp500tr=sp500tr,
    Ticker=unique(stocks$ticker)
)
save(data, file = "data/data.Rdata")
names(data)
  1. 'Return'
  2. 'Price'
  3. 'UnAdjustedPrice'
  4. 'sp500'
  5. 'sp500tr'
  6. 'Ticker'