Before seminar two, please make sure you have the following packages installed: "zoo", "lubridate", "car", "tseries", "rugarch", "rmgarch" and "reshape2". You can install a package using the install.packages() command.
The plan for this week:
1. Install and load packages
2. Basic Data Handling
3. Save Data Frames
4. Create, customize and export plots
One of the most useful features of R is its large library of packages. Although the base R package is very powerful, there are thousands of community-built packages to perform specific tasks that can save us the effort of programming some things from scratch. The complete list of available R packages can be found here: https://cran.r-project.org.
To install a package, type in the console: install.packages("name_of_package"). Once it is installed, you can load it with: library(name_of_package).
It is considered good practice to start your .R script by loading all the libraries that will be used in your program.
In the Information pane of RStudio, you can access the list of packages that are installed in your R environment.
By clicking on a package's name, you will see all the information it contains.
Make sure you have the following packages installed:
install.packages("zoo")
install.packages("lubridate")
install.packages("tseries")
install.packages("rugarch")
install.packages("rmgarch")
install.packages("reshape2")
install.packages("car")
library(reshape2)
library(lubridate)
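If some of these packages are already installed, reinstalling them is unnecessary. The following is a minimal sketch, using only base R, that lists which of the required packages are missing; the variable names `required` and `missing_pkgs` are chosen here for illustration. The install call is left commented out so that nothing is downloaded unless you want it.

```r
# Packages needed for this seminar
required <- c("zoo", "lubridate", "car", "tseries",
              "rugarch", "rmgarch", "reshape2")

# Names of all packages currently installed in this R environment
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet
missing_pkgs <- setdiff(required, installed)
print(missing_pkgs)

# Uncomment to install only what is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

install.packages() accepts a character vector, so several packages can be installed in a single call.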
The first thing to do when working with new data is cleaning it. Let's load the data downloaded in the previous class and take a look at it:
data <- read.csv("crsp.csv")
head(data)
The format of this table is a data.frame. This is a type of variable that allows us to store data of different types (numbers, characters, dates, etc.).
class(data)
We are interested in having a data frame that holds the price of each stock over time, and another one that holds the returns. We will use the current TICKER of the stocks as the identifying name for each company.
Let's start with prices. Before building our data frame, we need to adjust the prices with the Cumulative Factor to Adjust Prices, or CFACPR. The reason is that the PRC variable does not take into account stock splits, which can lead us to believe that the price of a stock halved in a day when the drop is due to nothing more than a stock split. To adjust for this, we will divide the column PRC by CFACPR and store the result in a new column, Adjusted_PRC. For comparison, we will keep the unadjusted prices in a column named Unadjusted_PRC.
# Keeping the unadjusted prices in a new column
data$Unadjusted_PRC <- data$PRC
# Storing the split-adjusted prices in a new Adjusted_PRC column
data$Adjusted_PRC <- data$PRC / data$CFACPR
head(data)
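To see why the adjustment matters, here is a small sketch on made-up numbers (not taken from the CRSP file). It assumes a 2-for-1 split between the second and third observation, with the adjustment factor equal to 2 before the split and 1 after it. The data frame `toy` and the columns `raw_change` and `adj_change` are hypothetical names used only for this illustration.

```r
# Made-up prices around an assumed 2-for-1 stock split
toy <- data.frame(
  date   = c(20100104, 20100105, 20100106, 20100107),
  PRC    = c(100, 102, 51, 52),
  CFACPR = c(2, 2, 1, 1)
)

# Split-adjusted price, as in the main text
toy$Adjusted_PRC <- toy$PRC / toy$CFACPR

# Day-on-day relative change, with and without the adjustment
toy$raw_change <- c(NA, diff(toy$PRC) / head(toy$PRC, -1))
toy$adj_change <- c(NA, diff(toy$Adjusted_PRC) / head(toy$Adjusted_PRC, -1))

print(toy)
```

On the split date the unadjusted price appears to fall by half, while the adjusted price does not move.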
Now that we have the correct prices, we will select the date and price columns for each stock and put them into a variable with the ticker name. Afterwards, we have to rename the price column to the ticker name. For example, for MSFT:
# Getting the date and Adjusted_PRC variables for Microsoft
MSFT <- data[data$PERMNO == 10107, c("date", "Adjusted_PRC")]
# Renaming Adjusted_PRC to MSFT
names(MSFT)[2] <- "MSFT"
# Now we do the same for the five others
XOM <- data[data$PERMNO==11850, c("date", "Adjusted_PRC")]
names(XOM)[2] <- "XOM"
GE <- data[data$PERMNO==12060, c("date", "Adjusted_PRC")]
names(GE)[2] <- "GE"
JPM <- data[data$PERMNO==47896, c("date", "Adjusted_PRC")]
names(JPM)[2] <- "JPM"
INTC <- data[data$PERMNO==59328, c("date", "Adjusted_PRC")]
names(INTC)[2] <- "INTC"
C <- data[data$PERMNO==70519, c("date", "Adjusted_PRC")]
names(C)[2] <- "C"
# And merge all into a single table called PRC using the merge() function
PRC <- merge(MSFT, XOM)
PRC <- merge(PRC, GE)
PRC <- merge(PRC, JPM)
PRC <- merge(PRC, INTC)
PRC <- merge(PRC, C)
head(PRC)
We got the output we wanted, but it involved several lines of basically copy-pasting the same code.
As a challenge, you can try to replicate the process using a for loop.
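One possible way to approach the challenge is sketched below on a small made-up data set with three stocks. The data frame `toy`, the named vector `tickers` and the result `PRC_loop` are names invented for this illustration; with the real data you would use `data` and all six PERMNO codes.

```r
# Made-up long-format data standing in for the CRSP file
toy <- data.frame(
  PERMNO       = rep(c(10107, 11850, 12060), each = 3),
  date         = rep(c(20100104, 20100105, 20100106), times = 3),
  Adjusted_PRC = c(30.9, 31.0, 30.8, 69.2, 69.4, 70.0, 15.5, 15.6, 15.5)
)

# Ticker names mapped to their PERMNO codes
tickers <- c(MSFT = 10107, XOM = 11850, GE = 12060)

PRC_loop <- NULL
for (ticker in names(tickers)) {
  # Select date and price for one stock and rename the price column
  one_stock <- toy[toy$PERMNO == tickers[[ticker]], c("date", "Adjusted_PRC")]
  names(one_stock)[2] <- ticker

  # The first stock starts the table, the others are merged on the date column
  if (is.null(PRC_loop)) {
    PRC_loop <- one_stock
  } else {
    PRC_loop <- merge(PRC_loop, one_stock)
  }
}

print(PRC_loop)
```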
Also, we could have saved ourselves this trouble by using a package. R has thousands of packages with functions that can help us easily get the output we are looking for. We are going to create another table using the dcast function from the reshape2 package.
To get an overview of the arguments of the dcast function, we can check its documentation with:
?dcast
The formula date ~ PERMNO means that each row holds the data for one date and each column holds the data for one PERMNO, while value.var is the column of the input data frame whose values fill the new data frame.
# Remove the previous variable
rm(PRC)
# Create a new data frame
PRC <- dcast(data, date ~ PERMNO, value.var = "Adjusted_PRC")
names(PRC) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
head(PRC)
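Renaming the columns by position only works if we know the order in which dcast returns them, which is sorted by PERMNO. The sketch below, on made-up data for two stocks, shows the reshaping and a renaming step based on a lookup vector instead of position. The names `toy`, `wide` and `lookup` are hypothetical and used only here.

```r
library(reshape2)

# Made-up long-format data; note that XOM appears before MSFT
toy <- data.frame(
  PERMNO       = rep(c(11850, 10107), each = 2),
  date         = rep(c(20100104, 20100105), times = 2),
  Adjusted_PRC = c(69.2, 69.4, 30.9, 31.0)
)

# One row per date, one column per PERMNO
wide <- dcast(toy, date ~ PERMNO, value.var = "Adjusted_PRC")
print(names(wide))

# Rename by matching each PERMNO to its ticker, not by column position
lookup <- c("10107" = "MSFT", "11850" = "XOM")
names(wide)[-1] <- unname(lookup[names(wide)[-1]])

print(wide)
```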
Not only did this save us time and lines of code, but it also gave us a data.frame object as output (the sibling function acast would return a matrix instead), which is easy to handle.
We can now directly create the data frame for returns:
RET <- dcast(data, date ~ PERMNO, value.var = "RET")
names(RET) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
head(RET)
length(RET$"C")
The returns in our dataset are simple returns, which, as we saw in lectures, are calculated like this:
$$R_{t}=\frac{P_{t} - P_{t-1}}{P_{t-1}}$$
We prefer to work with continuously compounded returns, which are defined as:
$$Y_{t} = \log\left(\frac{P_{t}}{P_{t-1}}\right) = \log\left(1+R_{t}\right)$$
To transform them into compound returns, we use the log() function:
# We choose all the columns except the first one
# And transform them into a new Y data frame
Y <- log(1 + RET[,2:7])
Y$date <- RET$date
head(Y)
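As a quick check of the two formulas, the sketch below uses three made-up prices and confirms that log(1 + R) gives the same numbers as the difference of log prices. The vector names `P`, `R`, `Y_from_R` and `Y_from_P` are chosen here for illustration.

```r
# Made-up price series
P <- c(100, 105, 99.75)

# Simple returns: (P_t - P_{t-1}) / P_{t-1}
R <- diff(P) / head(P, -1)

# Compound returns computed in two equivalent ways
Y_from_R <- log(1 + R)
Y_from_P <- diff(log(P))

print(R)
print(Y_from_R)
print(all.equal(Y_from_R, Y_from_P))

# Compound returns add up over time to the total log price change
print(all.equal(sum(Y_from_P), log(P[3] / P[1])))
```

The additivity over time is one reason compound returns are convenient to work with.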
In our RET data frame, we have a column for the dates of the observations, which we copied into Y. R has a special variable type for working with dates called Date, which will make our lives easier when producing plots and running some analyses. However, by default the date column in our dataset is not in this format:
class(Y$date)
There are several ways to transform data into the Date type. We will use the lubridate package that we installed earlier, in particular the function ymd(), which stands for year-month-day. It is a flexible function that turns a wide range of strings (or numbers) written in that order into a Date. For example, it can handle:
ymd("20101005")
ymd("2010-10-05")
ymd("2010/10/05")
Likewise, you could also use dmy() or mdy() for different formats.
# Save the original int type date to int_date
Y$int_date <- Y$date
# Use the function ymd() to transform the column into Dates
Y$date <- ymd(Y$date)
# Check if it worked
class(Y$date)
# Let's do the same for PRC
PRC$int_date <- PRC$date
PRC$date <- ymd(PRC$date)
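For comparison, the same conversion can be done without lubridate by using the base R function as.Date() with an explicit format string. This is only a sketch of an alternative, not the approach used in the rest of these notes; the variable names are made up for the example.

```r
# An integer date as it appears in the raw data
int_date <- 20101005

# as.Date() needs a character input and the format it is written in
d <- as.Date(as.character(int_date), format = "%Y%m%d")
print(class(d))
print(d)

# Other orderings only need a different format string
d_dmy <- as.Date("05/10/2010", format = "%d/%m/%Y")
print(d == d_dmy)

# Date objects support arithmetic, for example the number of days between two dates
print(d - as.Date("2010-10-01"))
```

The base approach requires spelling out the format, which is the step ymd() handles automatically.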
After handling data, we want to make sure we do not have to repeat the same procedure every time we open our program. For this, we can save the data frames we have created as .RData files that can be loaded the next time we open R. To do so, we need to make sure we are in the working directory where we want to save the data, and use the save() function:
# Saving the data frame of returns
save(Y, file = "Y.RData")
# Saving the data frame of prices
save(PRC, file = "PRC.RData")
To load a .RData file, we need to use the function load():
# Remove the existing data frame of returns
rm(Y)
# Load the saved file
load("Y.RData")
head(Y)
We can also easily save our data frames as a .csv file using the write.csv() function:
write.csv(Y, file = "Y.csv")
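The two formats behave differently when the data is read back. The sketch below, on a made-up two-row data frame written to a temporary folder, suggests that an .RData file restores the object with its Date column intact, whereas a .csv file stores plain text, so the date column has to be converted again after reading. The names `toy`, `rdata_path`, `csv_path` and `from_csv` are hypothetical.

```r
# Made-up data frame with a Date column
toy <- data.frame(date = as.Date(c("2010-01-04", "2010-01-05")),
                  JPM  = c(0.01, -0.02))

# Round trip through .RData in a temporary folder
rdata_path <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_path)
rm(toy)
restored <- load(rdata_path)  # load() returns the names of the restored objects
print(restored)
print(class(toy$date))        # still a Date

# Round trip through .csv; row.names = FALSE avoids an extra index column
csv_path <- file.path(tempdir(), "toy.csv")
write.csv(toy, file = csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
print(class(from_csv$date))   # no longer a Date
```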
Base R has an easy-to-use plot function called plot(). In this section we will learn how to customize our plots, add titles, change axis labels, add legends, select types of lines, vary the colors, and create subplots.
To get an overview of what plot() can do, we can check its documentation with:
?plot
We will start by plotting the returns for JP Morgan:
# Simple plot, if we do not specify an X variable, plot() will use an index
plot(Y$JPM)
# By default, plot() uses points, we can plot a line with the option "type", "l" denotes line
plot(Y$JPM, type = "l")
# We can add a title with the option "main"
# Change the axes labels with "xlab" and "ylab"
# Choose a color for the graph with "col"
plot(Y$JPM, type = "l", main = "Compound returns for JP Morgan",
ylab = "Returns", xlab = "Observation", col = "red")
We would like to have the dates in the x axis to understand our data better:
# The first data argument is used as the X variable
plot(Y$date, Y$JPM, type = "l", main = "Compound returns for JP Morgan",
ylab = "Returns", xlab = "Date", col = "red")
Also, by using the option las = 1, we make sure that the tick labels on both axes are always horizontal. This is good practice for easier reading:
# las = 1 keeps the axis tick labels horizontal
plot(Y$date, Y$JPM, type = "l", main = "Compound returns for JP Morgan",
ylab = "Returns", xlab = "Date", col = "red", las = 1)
What if we wanted to plot the prices for JP Morgan and Citi, the two major banks in our dataset?
There are many options to do so. One is to plot both in the same graph. We can achieve this by using the lines() function after a plot is created, or by using the function matplot():
# First we plot the prices of JPM
plot(PRC$date, PRC$JPM, type = "l", main = "Prices for JP Morgan and Citi",
ylab = "Price", xlab = "Date", col = "red")
# Then we add the prices of C
lines(PRC$date, PRC$C, col = "blue")
# And we create a legend
legend("bottomright",legend = c("JPM", "C"), col = c("red", "blue"), lty=1)
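One thing to watch for with this approach: plot() sets the y-axis range from the first series only, so a second series added with lines() can be partly cut off if it falls outside that range. A minimal sketch of one way to handle this, on made-up prices, is to compute the range over both series and pass it as ylim. The vectors `dates`, `jpm`, `citi` and `y_limits` are hypothetical names for this example.

```r
# Made-up dates and prices for two stocks
dates <- seq(as.Date("2010-01-04"), by = "day", length.out = 5)
jpm   <- c(41, 42, 41.5, 43, 44)
citi  <- c(34, 33.5, 35, 36, 35.5)

# Range that covers both series
y_limits <- range(c(jpm, citi))

# The first series sets up the plot, using the common y-axis range
plot(dates, jpm, type = "l", col = "red", ylim = y_limits, las = 1,
     main = "Prices for JP Morgan and Citi", xlab = "Date", ylab = "Price")

# The second series is added to the existing plot
lines(dates, citi, col = "blue")

legend("bottomright", legend = c("JPM", "C"), col = c("red", "blue"), lty = 1)
```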