library(reshape2)
6 Data used here
We use three data files in these notes, all with daily US stock prices between 2003 and 2022, both unadjusted and adjusted. The data was obtained from the data vendor we usually use, EOD, and is distributed here with their permission. The files can be found on https://www.financialriskforecasting.com/data as CSV files.
- SP500, daily prices of the standard and Poor’s 500 index (S&P 500), the data we used to illustrate univariate applications;
- SP500TR, total returns on the S&P 500;
- Stocks, the prices of six different stocks, chosen to represent the various sectors of the US economy, and including both winners and losers, “
stocks.csv
”. The file contains both unadjusted and adjusted prices.
The raw data arrives as CSV files called stocks.csv
, sp500.csv
and sp500tr.csv
.
6.1 Libraries
6.2 Processing the sp500.csv
file
We can import directly a .csv
file into R as a data.frame
object using the function read.csv()
. The first three lines in the file are:
date,Close
20030102,909.03
20030103,908.59
And the last two:
20221229,3849.28
20221230,3839.5
=read.csv('data/sp500.csv') sp500
Now let’s check that our variable df
is actually a dataframe:
class(sp500)
[1] "data.frame"
We can check the structure of a dataframe using the function str()
. This can be quite useful to see what every column of the dataframe holds and what type of data it has:
str(sp500)
'data.frame': 5035 obs. of 2 variables:
$ date : int 20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
$ Close: num 909 909 929 923 910 ...
head(sp500,2)
date | Close |
---|---|
20030102 | 909.03 |
20030103 | 908.59 |
tail(sp500,2)
date | Close | |
---|---|---|
5034 | 20221229 | 3849.28 |
5035 | 20221230 | 3839.50 |
names(sp500)
[1] "date" "Close"
The name Close
is not that convenient, so we renamed it.
names(sp500)[2]="price"
names(sp500)
[1] "date" "price"
We can make a new column with log returns.
When we do this, we will not have an observation for day 1, and since the new column needs to have the same number of rows as the dataframe, we have to add a first observation and put NA
into it:
$y = c(NA,diff(log(sp500$price)))
sp500head(sp500)
date | price | y |
---|---|---|
20030102 | 909.03 | NA |
20030103 | 908.59 | -0.0004841 |
20030106 | 929.01 | 0.0222256 |
20030107 | 922.93 | -0.0065661 |
20030108 | 909.93 | -0.0141857 |
20030109 | 927.57 | 0.0192006 |
It will be nicer to remove the first row. One way to do that is
=sp500[2:dim(sp500)[1],] sp500
A more elegant way is to
=sp500[!is.na(sp500$y),]
sp500head(sp500,2)
date | price | y | |
---|---|---|---|
2 | 20030103 | 908.59 | -0.0004841 |
3 | 20030106 | 929.01 | 0.0222256 |
6.2.1 Processing the sp500tr.csv
file
We do the same with the SP-500 TR file
=read.csv('data/sp500tr.csv')
sp500trnames(sp500tr)[2]="price"
$y = c(NA,diff(log(sp500tr$price))) sp500tr
6.3 Processing the stocks.csv
file
The start of the stocks.csv
file looks like
ticker,date,Close,Adjusted_close
MCD,20030102,16.55,9.6942
MCD,20030103,16.12,9.4423
And the end
INTC,20221229,26.21,25.8945
INTC,20221230,26.43,26.1118
=read.csv('data/stocks.csv')
stocksstr(stocks)
'data.frame': 30210 obs. of 4 variables:
$ ticker : chr "MCD" "MCD" "MCD" "MCD" ...
$ date : int 20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
$ Close : num 16.6 16.1 16.6 16.7 16.8 ...
$ Adjusted_close: num 9.69 9.44 9.75 9.76 9.86 ...
The dataframe has five columns and 30210 rows. The str()
function shows us the names of each column, along with the data type it holds and some observations. There is a column for every stock, which holds numeric data, and a column for dates, which is of type character.
We can also check parts of the structure with different functions:
dim(stocks)
[1] 30210 4
nrow(stocks)
[1] 30210
ncol(stocks)
[1] 4
colnames(stocks)
[1] "ticker" "date" "Close" "Adjusted_close"
We should rename some of the columns.
names(stocks)[3:4]=c("UnAdjustedPrice","price")
names(stocks)
[1] "ticker" "date" "UnAdjustedPrice" "price"
The reason is that we will not work much with the unadjusted prices, so price
is transparent and simple.
We can show the first and last row, and rbind
put that into the same dataframe
rbind(head(stocks,1),tail(stocks,1))
ticker | date | UnAdjustedPrice | price | |
---|---|---|---|---|
1 | MCD | 20030102 | 16.55 | 9.6942 |
30210 | INTC | 20221230 | 26.43 | 26.1118 |
We can now see how which stocks we have:
unique(stocks$ticker)
[1] "MCD" "DIS" "AAPL" "GE" "JPM" "INTC"
6.3.0.1 Reshaping
The stocks
dataframe isn’t particularly useful as it would be much better to have the prices for each stock in its column. While there are many ways to do that, the best is to use the package reshape2
.
If we want to create a dataframe with the returns for every stock, we can use the dcast()
function. date ~
means each row is data from the same date
and each column is data from the same ticker
, value.var
is the column of the input dataframe that will be filled into the new dataframe.
= dcast(stocks, date ~ ticker, value.var = "price")
Price head(Price,2)
date | AAPL | DIS | GE | INTC | JPM | MCD |
---|---|---|---|---|---|---|
20030102 | 0.2250 | 13.7547 | 88.5027 | 9.7865 | 14.5077 | 9.6942 |
20030103 | 0.2265 | 13.8344 | 88.2248 | 9.6985 | 14.7929 | 9.4423 |
= dcast(stocks, date ~ ticker, value.var = "UnAdjustedPrice") UnAdjustedPrice
This works since all the stocks are American and from the same exchange, so they have observations on the same days.
We can now make a dataframe with returns. There are several ways to do that. Note that the first column is dates and we dont want to transform those.
=Price
Returnfor (i in 2:dim(Price)[2]) Return[,i]=c(NA,diff(log(Price[,i])))
head(Return,2)
date | AAPL | DIS | GE | INTC | JPM | MCD |
---|---|---|---|---|---|---|
20030102 | NA | NA | NA | NA | NA | NA |
20030103 | 0.0066445 | 0.0057777 | -0.003145 | -0.0090327 | 0.0194678 | -0.0263282 |
And to remove the NA
.
=Price[!is.na(Return[,2]),]
Price=UnAdjustedPrice[!is.na(Return[,2]),]
UnAdjustedPrice=Return[!is.na(Return[,2]),]
Returnhead(Price,2)
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
2 | 20030103 | 0.2265 | 13.8344 | 88.2248 | 9.6985 | 14.7929 | 9.4423 |
3 | 20030106 | 0.2265 | 14.5436 | 90.4825 | 10.0738 | 15.9562 | 9.7527 |
head(UnAdjustedPrice,2)
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
2 | 20030103 | 14.9016 | 17.1292 | 25.3826 | 16.54 | 25.94 | 16.12 |
3 | 20030106 | 14.9016 | 18.0074 | 26.0322 | 17.18 | 27.98 | 16.65 |
head(Return,2)
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
2 | 20030103 | 0.0066445 | 0.0057777 | -0.0031450 | -0.0090327 | 0.0194678 | -0.0263282 |
3 | 20030106 | 0.0000000 | 0.0499928 | 0.0252684 | 0.0379668 | 0.0757001 | 0.0323446 |
But it can be more convenient and safer to put all the dataframes into one list:
=list(Return=Return,Price=Price,UnAdjustedPrice=UnAdjustedPrice) x
6.4 Process raw data
Even better is to put all variables into the same list and load them all with ProcessRawData()
. We put that into functions.r
.
=function(){
ProcessRawData
=read.csv('data/sp500.csv')
sp500names(sp500)[2]="price"
$y = c(NA,diff(log(sp500$price)))
sp500=sp500[!is.na(sp500$y),]
sp500$date.ts = ymd(sp500$date)
sp500$y.ts = zoo(sp500$y,order.by=sp500$date.ts)
sp500
=read.csv('data/sp500tr.csv')
sp500trnames(sp500tr)[2]="price"
$y = c(NA,diff(log(sp500tr$price)))
sp500tr$date.ts = ymd(sp500tr$date)
sp500tr$y.ts = zoo(sp500tr$y,order.by=sp500tr$date.ts)
sp500tr
=read.csv('data/stocks.csv')
stocksnames(stocks)[3:4]=c("UnAdjustedPrice","price")
= dcast(stocks, date ~ ticker, value.var = "price")
Price head(Price,2)
= dcast(stocks, date ~ ticker, value.var = "UnAdjustedPrice")
UnAdjustedPrice
=Price
Returnfor (i in 2:dim(Price)[2]) Return[,i]=c(NA,diff(log(Price[,i])))
=Price[!is.na(Return[,2]),]
Price=UnAdjustedPrice[!is.na(Return[,2]),]
UnAdjustedPrice=Return[!is.na(Return[,2]),]
Return
=list(
dataReturn=Return,
Price=Price,
UnAdjustedPrice=UnAdjustedPrice,
sp500=sp500,
sp500tr=sp500tr,
Ticker=unique(stocks$ticker)
)
return(data)
}