Support notebook for Financial Risk Forecasting

This should serve to accompany the Book Code from Jón Daníelsson's Financial Risk Forecasting and the Seminar notebooks for the class. All the code used is in R.

Thanks to Alvaro Aguirre for developing the notebook and material, and to Jia Rong Fan for the information on financial data sources.

References

If you want further material, the following list provides information on various topics:


1 | Financial data sources

While financial markets generate a vast amount of data (sometimes with new observations every nanosecond), it is generally difficult and costly to get this data. Most data providers are commercial with complicated interfaces, and free data providers tend to have erratic access and errors in data.

Online data providers generally provide an application programming interface (API), an interface which allows users to submit data requests and receive responses. For most languages, there are often open-source or proprietary packages available which allow end-users to retrieve data using a relatively simple function call.

Today, many universities maintain subscriptions to data providers — this makes obtaining complete and accurate data much easier. Below we discuss some of the data providers that can be of use.

While this information is accurate at the time of writing (July 2020), given the rapid developments in the space, it is likely to become quickly outdated.

While we have used each of these data sources below, we have no specific recommendation.

Common Sources of Financial Data

Bloomberg

One of the most ubiquitous data sources in finance is the Bloomberg Terminal. Due to its pervasiveness throughout the industry, there are numerous packages in practically every language to access its APIs.

LSE students have access to a number of Bloomberg terminals in the library and the Masters students common rooms.

Wind

The Wind Financial Terminal (WFT) also provides market data like the Bloomberg Terminal, but with a specific focus on the Chinese financial markets. It supports APIs for MATLAB, R, C++ and Python, among others.

WRDS

The Wharton business school at University of Pennsylvania provides service called Wharton Research Data Services (WRDS) that many universities subscribe to. This provides a common interface to a number of databases, including CRSP and TAC high-frequency data. WRDS and many of its databases are available to LSE students and staff via wrds.wharton.upenn.edu.

CRSP

For historical US stock market data, one of the major sources is The Center for Research in Security Prices (CRSP, pronounced "crisp"), headquartered at the University of Chicago. LSE students can access CRSP via WRDS.

Yahoo Finance

The go-to place for many researchers requiring financial data has been finance.yahoo.com. This data can be automatically downloaded into many software packages, including Matlab, R and Python.

There are three problems with Yahoo Finance.

  1. Yahoo occasionally changes how the API works, requiring updates to software;
  2. It often is unavailable for days or weeks;
  3. There are errors in the data. For example, UK prices, quoted in pence by convention, sometimes appear in pounds for one or two days, reverting back to pence. On other occasions, numbers are simply wrong.

EOD Historical Data

End of Day Historical Data provides fundamental data API, live and end of day historical prices for stocks, ETFs and mutual funds from exchanges all around the world.

DBnomics

DBnomics is an open source data aggregator that provides access to dozens of data providers, including several countries' statistics.

Federal Reserve Economic Data (FRED)

The [FRED Economic Reserach] (https://fred.stlouisfed.org) is a good source for macroeconomic data including data for unemployment, GDP, interest rates, the money supply, etc. It can be accessed from DBnomics.

IEX

IEX provides access to US equity data via https://iextrading.com/developer/.

ECB FX

The European Central Bank Statistical Data Warehouse and its corresponding SDMX interface allow for retrieval of daily Euro FX data.

The entire dataset is here http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip, and it can be accessed in R using

wget http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip -O eurofxref-hist.zip

Alpha Vantage

Alpha Vantage provides free daily and realtime stock price data, and API access is available in both R and Python. Its data source appears to be the same as Yahoo Finance and hence subject to the same errors.

Quandl

Quandl provides a common API access for R and Python to a large number of commercial databases, and some that are free. While comprehensive, one may need to subscribe to data from a number of providers.

Fama-French Data Library:

The Fama-French Data Library provides a large amount of historical data compiled by Eugene Fama and Kenneth French. The data is updated regularly, and the Fama-French 3-factor data is especially useful for analyzing fund and portfolio performance.

Preliminary Considerations

Downloading and importing data

Data from this sources can either be downloaded directly, as a .csv or .xls file, or in most cases be imported into R using an API or package.

Symbols and Names

All stocks that are trading in the market are associated with a ticker symbol, serving as an identifier for the specific security. These are often specific to the particular exchange or a country of listing. This can often lead to confusion or ambiguity when a company has cross-listings — for instance, the Japanese car manufacturer Toyota is listed as 7203 on the Tokyo Stock Exchange, TYT on the London Stock Exchange, and TM on the New York Stock Exchange.

Some identical securities have different names. Depending on the source, searching for data on the S&P-500 (a major American stock market index) might require the ticker names SPX (Bloomberg), ^GSPC (Yahoo Finance), INX (Google Finance) and so on.

When searching for a particular security from the data source, it is good practice to verify that the ticker symbol used indeed corresponds to the correct data series by checking the data description.

Tickers, ISIN and PERMNO

It is important to understand that Tickers can change over time (for example due to a merger or name change) and be recycled. This can create problems if not taken into consideration when querying financial data. There are other securities identifiers that do not change over time. The ISIN is the International Securities Identification Number, which is an internationally recognized 12-characters code that is unique for each stock. Unlike Tickers, the ISIN for a stock is the same regardless of the market. Another important identifier is the PERMNO, which is the permanent issue identifier of the CRSP dataset.

Dates

A date generally has three components, year (YYYY), month (MM) and day (DD). Depending on the granularity of data, there may also be a time component: hours, minutes, seconds and fractions of a second. Software packages often define some arbitrary date as origin. For example, R's origin date/time is set as January 1 1970 00:00:00, and all dates are relative to that.

Dates often present problems when using multiple languages or programs to carry out numerical work. Excel has two different conventions for dates depending on version — the origin year can either be 1900 or 1904 — this requires verification before use. Furthermore, Excel does not allow dates before 1900.

A date can be represented numerically in many different ways. Consider the date 13 September 2018:

Format Example
DD-MM-YYYY 13-09-2018
MM-DD-YYYY 09-13-2018
YYYY-MM-DD 2018-09-13
YYYYMMDD 20180913

The best way is to use the YYYYMMDD convention for two reasons:

  1. It can be represented as an integer, not as a string, making data handling more convenient
  2. It sorts naturally (in chronological order) even as integers

Best practice in data downloading

Given the need to ensure that accurate data is used, it may be best to have a separate step in downloading and running the data. Get data from a provider, save it as a CSV (or Excel) file and then have a separate run for processing the data.

There are a few reasons why this is preferable:

  1. Limits for API calls:
    • Many data providers place limits on the amount/speed of data downloads
  2. Minimize time spent importing data
    • Especially true for large datasets, since CSV/xlsx reading is relatively quick
  3. Quick checking for data omissions
    • Useful to have a pre-analysis check of data quality using plots and tables

Never manipulate your data in Excel. Perform any analysis in R or the programming language of your choice. This will make it tractable and can prevent mistakes.


2 | Control Flow in R

In many cases we will want to repeat a piece of code several times, or execute part of our code only if some conditions are met. Programming languages allow us to do that quite easily with loops and conditionals, which are specific statements that give instructions to computers. In this section we will cover the most important loops and conditionals in R.

For more details you can check the official introduction to R (Section

For statements

A for loop evaluates a piece of code for different values. Usually, we think of for loops in two types, depending on the values we are evaluating: Loops over sequences, and loops over elements.

Looping over sequences

Probably the most common way to build a for statement is to loop over a numeric sequence. For example, if we wanted to print() the integers from 1 to 5:

In [1]:
for (i in 1:5) {
    print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

In line 1, we are telling R that we want to create an integer sequence from 1 to 5, and use the name i to refer to values from this sequence. Then, line 2 is executed for every possible value specified in line 1.

Basically we are telling R:

  1. Create a sequence from a to b
  2. Take the first element of the sequence and assign it to "i"
  3. Execute the code between { } using the current value of "i"
  4. When done, check if there are more elements in the sequence
    4.1 If there are, take the next element and assign that value to "i", then go back to step 3
    4.2 If there are not, exit the loop

Recall the syntax 1:5 means "Create a sequence from 1 to 5", and if we don't specify the step, it takes 1 by default:

In [2]:
1:5
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5

This is equivalent to doing:

In [3]:
seq(1,5,1)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5

So our for loop could also be written as:

In [4]:
for (i in seq(1,5,1)) {
    print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

It is a convention to create integer sequences using first_element:last_element

The main usage of this type of loops is working with indexes. For example, consider we have a vector of 500 elements filled with NA:

In [5]:
vec <- rep(NA, 500)

Let's say we want to replace all odd elements with a 1, and keep the even elements as NA. We could build a for loop over the sequence seq(1, 500, by = 2)

In [6]:
for (i in seq(1, 500, by = 2)) {    # Loop over i = 1, 3, 5, ...
    vec[i] <- 1                     # Replace the i-th element in vec for 1
}

An interesting application for using for loops over sequences is constructing the Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, 21, ...

Recall the j-th term of the Fibonacci sequence is the sum of the j-1-th and j-2-th terms.

Let's say we want to create a vector with the first 50 terms of the Fibonacci sequence:

In [7]:
fibonacci <- rep(NA, 50)                  # Create the empty vector
fibonacci[1:2] <- 1                        # The first two values are 1
for (i in 3:50){                     # Loop over the indexes 3 to 100
    fibonacci[i] <- fibonacci[i-1] + fibonacci[i-2]  # Replace the i-th term by the sum of the two previous terms
}
In [8]:
fibonacci
  1. 1
  2. 1
  3. 2
  4. 3
  5. 5
  6. 8
  7. 13
  8. 21
  9. 34
  10. 55
  11. 89
  12. 144
  13. 233
  14. 377
  15. 610
  16. 987
  17. 1597
  18. 2584
  19. 4181
  20. 6765
  21. 10946
  22. 17711
  23. 28657
  24. 46368
  25. 75025
  26. 121393
  27. 196418
  28. 317811
  29. 514229
  30. 832040
  31. 1346269
  32. 2178309
  33. 3524578
  34. 5702887
  35. 9227465
  36. 14930352
  37. 24157817
  38. 39088169
  39. 63245986
  40. 102334155
  41. 165580141
  42. 267914296
  43. 433494437
  44. 701408733
  45. 1134903170
  46. 1836311903
  47. 2971215073
  48. 4807526976
  49. 7778742049
  50. 12586269025

We could put this in to a function that gives us the i-th fibonacci number:

In [9]:
fibonacci <- function(N) {
    fibo <- rep(NA, N)
    fibo[1:2] <- 1
    for (i in 3:N) {
        fibo[i] <- fibo[i-1] + fibo[i-2]
    }
    paste0("The ", N, "-th Fibonacci number is: ", fibo[N])
}

fibonacci(10)
fibonacci(50)
'The 10-th Fibonacci number is: 55'
'The 50-th Fibonacci number is: 12586269025'

Or, for example, if you have a vector with different model names:

In [10]:
model <- c("GARCH", "EWMA", "HS")

And another vector with a value associated to each model respectively:

In [11]:
violation_ratio <- c(1.06, 0.95, 0.88)

You can output this as:

In [12]:
for (i in 1:3){
    cat("The value for the", model[i], "model is:", violation_ratio[i], "\n")
}
The value for the GARCH model is: 1.06 
The value for the EWMA model is: 0.95 
The value for the HS model is: 0.88 

Looping over elements

The other common way to think of for loops although technically analogous, is to loop over elements of a vector, which don't necessarily have to be numbers. The syntax is similar.

For example, consider the model vector defined above:

In [13]:
model
  1. 'GARCH'
  2. 'EWMA'
  3. 'HS'

You can loop over the vector model by doing for (element in model). Note that element is just a placeholder, it can be anything else. Let's say you want to say hello to everyone on the list:

In [14]:
for (model_type in model){
    cat("My favorite model is", model_type, "\n")
}
My favorite model is GARCH 
My favorite model is EWMA 
My favorite model is HS 

Another example:

In [15]:
estimation_windows = c(300, 500, 1000, 2000)   # Vector of different "windows"
S = 5000
rand <- rnorm(S)                            # Create a vector of random numbers of length 5k
for (window in estimation_windows){            # Loop over estimation windows  
    # Get last window-length elements of rand
    tmp <- rand[(S-window):S]
    cat("Average of last", window, "elements: ", mean(tmp), "\n")
}
Average of last 300 elements:  -0.05571649 
Average of last 500 elements:  -0.03766916 
Average of last 1000 elements:  0.01084457 
Average of last 2000 elements:  0.01729325 

If statements

A conditional, or if, statement, evaluates a piece of code that can be either TRUE or FALSE and takes a different action depending on the value. If the condition is TRUE, then it will execute the piece of code inside the brackets { }, if it is FALSE, it will not. You can then specify a different action to be performed if the condition is FALSE, which would go in an else statement. You can even include else if statements to evaluate another condition if the first one was not met.

A classic example:

In [16]:
X = 5

if (X > 0) {
    print("Positive")
} else if (X < 0) {
    print("Negative")
} else {
    print("Zero")
}
[1] "Positive"

With this we can easily define a function that checks if a number is even:

In [17]:
is_even <- function(x){
    if (x %% 2 == 0) {
        cat(x, "is even", "\n")
    } else {
        cat(x, "is odd", "\n")
    }
}

is_even(4)
is_even(7)
4 is even 
7 is odd 

While loops

A while loop will evaluate a condition and as long as it is TRUE, it will execute a piece of code. NOTE: you need to make sure that the loop will not run indefinitely. To do this, at some point in the loop the condition needs to be set to FALSE.

A basic usage is:

In [18]:
a <- 1

while (a < 6) {
    print(a)
    a = a + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Normally, you use while loops when the number of times the loop has to be executed is not pre-determined. For example, you can create a simple function that uses a while loop to tell you how many times you need to halve a number before its absolute value is less than 0.5:

In [19]:
less_half <- function(x){
    steps <- 0
    while (abs(x) > 0.5) {
        x = x/2
        steps = steps + 1
    }
    cat("Number of steps:", steps, "\n")
}

less_half(10)
less_half(0.2)
less_half(-500)
Number of steps: 5 
Number of steps: 0 
Number of steps: 10 


3 | Data frames handling

Objects in R can be of different classes. For example, we can have a vector, which is an ordered array of observations of data of the same type (numbers, characters, logicals). We can also have a matrix, which is a rectangular arrangement of elements of the same data type. You can check what class an object is by running class(object).

A data frame is a two-dimensional structure in which each column contains values of one variable and each rows contains one set of values, or "observation" from each column. It is perhaps the most common way of storing data in R and the one we will use the most.

One of the main advantaged of a data frame in comparison to a matrix, is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to only one data type. Keep in mind that a data frame needs all its columns to be of the same length.

Creating Data Frames

There are different ways to create a data frame. We will focus on three:

1. Reading a file

In finance it is very common to have a .csv file with the data you want to analyze. For example, this data could be the output of any of the financial data sources we mentioned above. We can import directly a .csv file into R as a data.frame object using the function read.csv().

The file stocks.csv holds daily returns for 6 different companies for the years 2015 to 2019. It has been downloaded from the CRSP database. Let's import it:

In [20]:
stocks <- read.csv("stocks.csv")    # Importing data to a dataframe

After importing a file, it is recommended to see how it looks like before starting to perform any type of analysis. For this, we can use the function head() to see the first elements of our object:

In [21]:
head(stocks)
MSFTXOMGEJPMINTCCdate
0.006651827 0.004101577 -0.0083447205 0.004941769 0.001927142 0.002768165 2015-01-02
-0.009346543 -0.027743311 -0.0185265618-0.031537108 -0.011340056 -0.032022284 2015-01-05
-0.014678200 -0.005330180 -0.0217804820-0.026271083 -0.018812857 -0.035839635 2015-01-06
0.012624969 0.010082005 0.0004149139 0.001524837 0.020758053 0.009227297 2015-01-07
0.028993594 0.016507990 0.0119710601 0.022099986 0.018430117 0.014935902 2015-01-08
-0.008440521 -0.001410995 -0.0140502440-0.017539929 0.001906182 -0.022586158 2015-01-09

It is very important to check the documentation of the functions we are using to make sure we are doing things right. For example, we need to be careful if:

  • The CSV file has headers or not - A header is the title of a column. We can have a csv file where the first row includes the column titles, or a csv with no titles
  • The separator is a comma (,), a semicolon (;), a tab (\t) or another symbol
  • We are getting data from a country where the decimal symbol is a comma (,) instead of a period (.)

By checking the documentation of a function, we will see how to deal with each of these specific cases and more. To do so, run ?read.csv

In [22]:
?read.csv

Now let's check that our variable df is actually a data frame:

In [23]:
class(stocks)
'data.frame'

We can check the structure of a data frame using the function str(). This can be quite useful to see what every column of the data frame holds, and what type of data it has:

In [24]:
str(stocks)
'data.frame':	1258 obs. of  7 variables:
 $ MSFT: num  0.00665 -0.00935 -0.01468 0.01262 0.02899 ...
 $ XOM : num  0.0041 -0.02774 -0.00533 0.01008 0.01651 ...
 $ GE  : num  -0.008345 -0.018527 -0.02178 0.000415 0.011971 ...
 $ JPM : num  0.00494 -0.03154 -0.02627 0.00152 0.0221 ...
 $ INTC: num  0.00193 -0.01134 -0.01881 0.02076 0.01843 ...
 $ C   : num  0.00277 -0.03202 -0.03584 0.00923 0.01494 ...
 $ date: Factor w/ 1258 levels "2015-01-02","2015-01-05",..: 1 2 3 4 5 6 7 8 9 10 ...

We see the data frame has 8 columns and 1258 rows. The structure function shows us the names of each column, along with the data type it holds, and some observations. We see there is a column called X that serves as an index, with variable type integer, a column for every stock, which holds numeric data, and finally a column for dates, which is of type factor.

We can also check parts of the structure with different functions:

In [25]:
dim(stocks)
nrow(stocks)
ncol(stocks)
colnames(stocks)
  1. 1258
  2. 7
1258
7
  1. 'MSFT'
  2. 'XOM'
  3. 'GE'
  4. 'JPM'
  5. 'INTC'
  6. 'C'
  7. 'date'

2. Creating a Data Frame from scratch

In some cases you might want to create a data frame from a list of vectors. This can easily be done with the data.frame() function:

In [26]:
df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3))

You have to specify the name of each column, and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:

In [27]:
str(df)

dim(df)
colnames(df)
'data.frame':	3 obs. of  4 variables:
 $ col1: int  1 2 3
 $ col2: Factor w/ 3 levels "A","B","C": 1 2 3
 $ col3: logi  TRUE TRUE FALSE
 $ col4: num  1 2.2 3.3
  1. 3
  2. 4
  1. 'col1'
  2. 'col2'
  3. 'col3'
  4. 'col4'

Note that the elements of col2 are considered Factors, since it is the default in data.frame(). We can check the documentation and use an option to keep these are chr:

In [28]:
df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3),
                stringsAsFactors = FALSE)

str(df)
'data.frame':	3 obs. of  4 variables:
 $ col1: int  1 2 3
 $ col2: chr  "A" "B" "C"
 $ col3: logi  TRUE TRUE FALSE
 $ col4: num  1 2.2 3.3

3. Transforming a different object into a Data Frame

You might want to transform a matrix into a data frame (or viceversa). For example, you need an object to be a matrix to perform linear algebra operations, but you would like to keep the results as a data frame after the operations. You can easily switch from matrix to data frame using as.data.frame() (and analogously, from data frame to matrix with as.matrix(), however remember all columns need to have the same data type to be a matrix).

For example, let's say we have the matrix:

In [29]:
my_matrix <- matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
class(my_matrix)
my_matrix
'matrix'
1 2
3 4
5 6
7 8
9 10

We can now transform it into a data frame:

In [30]:
df <- as.data.frame(my_matrix)
class(df)
df
str(df)
'data.frame'
V1V2
1 2
3 4
5 6
7 8
9 10
'data.frame':	5 obs. of  2 variables:
 $ V1: int  1 3 5 7 9
 $ V2: int  2 4 6 8 10

And we can change the column names:

In [31]:
colnames(df) <- c("Odd", "Even")
str(df)
'data.frame':	5 obs. of  2 variables:
 $ Odd : int  1 3 5 7 9
 $ Even: int  2 4 6 8 10

Subsetting Data Frames

There are many different ways to subset a data frame and access its elements. We can subset a data frame by its column/row position, by names, using logicals, etc. Each way can be useful for a different situation:

In [32]:
head(stocks)
MSFTXOMGEJPMINTCCdate
0.006651827 0.004101577 -0.0083447205 0.004941769 0.001927142 0.002768165 2015-01-02
-0.009346543 -0.027743311 -0.0185265618-0.031537108 -0.011340056 -0.032022284 2015-01-05
-0.014678200 -0.005330180 -0.0217804820-0.026271083 -0.018812857 -0.035839635 2015-01-06
0.012624969 0.010082005 0.0004149139 0.001524837 0.020758053 0.009227297 2015-01-07
0.028993594 0.016507990 0.0119710601 0.022099986 0.018430117 0.014935902 2015-01-08
-0.008440521 -0.001410995 -0.0140502440-0.017539929 0.001906182 -0.022586158 2015-01-09
In [33]:
# Subsetting by row numbers
stocks[2:3,]
MSFTXOMGEJPMINTCCdate
2-0.009346543-0.02774331 -0.01852656 -0.03153711 -0.01134006 -0.03202228 2015-01-05
3-0.014678200-0.00533018 -0.02178048 -0.02627108 -0.01881286 -0.03583964 2015-01-06
In [34]:
# Subsetting by column number
head(stocks[, 4])
  1. 0.00494176931922958
  2. -0.0315371077982067
  3. -0.0262710827191262
  4. 0.00152483684516575
  5. 0.0220999863448632
  6. -0.0175399291295306
In [35]:
# Subsetting by col names
head(stocks[c("MSFT", "GE")])
MSFTGE
0.006651827 -0.0083447205
-0.009346543 -0.0185265618
-0.014678200 -0.0217804820
0.012624969 0.0004149139
0.028993594 0.0119710601
-0.008440521 -0.0140502440
In [36]:
# Subsetting using $
head(stocks$MSFT)
  1. 0.00665182746034677
  2. -0.00934654316964306
  3. -0.0146781996320895
  4. 0.0126249686402503
  5. 0.028993593997765
  6. -0.00844052118960295
In [37]:
# Subsetting both rows and columns
stocks[1:2, c("JPM", "C")]
JPMC
0.004941769 0.002768165
-0.031537108-0.032022284
In [38]:
# Subsetting using a logical
head(stocks[stocks$JPM > 0, c("JPM", "date")])
JPMdate
10.00494176932015-01-02
40.00152483682015-01-07
50.02209998632015-01-08
80.00016998562015-01-13
110.01694954152015-01-16
130.00322579152015-01-21
In [39]:
# Subsetting using a vector
v <- c(1, 2, 4, 8, 16)
stocks[v,]
MSFTXOMGEJPMINTCCdate
1 0.006651827 0.004101577 -0.0083447205 0.0049417693 0.001927142 0.002768165 2015-01-02
2-0.009346543 -0.027743311 -0.0185265618-0.0315371078-0.011340056 -0.032022284 2015-01-05
4 0.012624969 0.010082005 0.0004149139 0.0015248368 0.020758053 0.009227297 2015-01-07
8-0.005270867 -0.003659688 -0.0050165619 0.0001699856-0.002735739 -0.007171655 2015-01-13
16-0.003609506 0.009526479 0.0044829366 0.0015867405-0.017717024 0.005537639 2015-01-26

Handling

Let's store the stock returns for JP Morgan and dates in a different variable:

In [40]:
jpm <- stocks[, c("JPM", "date")]
head(jpm)
class(jpm)
dim(jpm)
JPMdate
0.0049417692015-01-02
-0.0315371082015-01-05
-0.0262710832015-01-06
0.0015248372015-01-07
0.0220999862015-01-08
-0.0175399292015-01-09
'data.frame'
  1. 1258
  2. 2

Note that if we had only taken the returns, the output would have been a vector and not a data frame:

In [41]:
jpm2 <- stocks[, c("JPM")]
head(jpm2)
class(jpm2)
  1. 0.00494176931922958
  2. -0.0315371077982067
  3. -0.0262710827191262
  4. 0.00152483684516575
  5. 0.0220999863448632
  6. -0.0175399291295306
'numeric'

What if we wanted to create another column called negative, that tells us if the returns for a particular day where negative?

We can easily add a new variable by using the syntax: data_frame$new_variable

In [42]:
# We can use an ifelse to assign values conditional on another value
jpm$negative <- ifelse(jpm$JPM < 0, TRUE, FALSE)
head(jpm)
JPMdatenegative
0.0049417692015-01-02 FALSE
-0.0315371082015-01-05 TRUE
-0.0262710832015-01-06 TRUE
0.0015248372015-01-07 FALSE
0.0220999862015-01-08 FALSE
-0.0175399292015-01-09 TRUE

Merging data frames

If we have datasets with a common column, we can use the merge() function and R will create a new dataset using that common column as a joiner. For example:

In [43]:
MSFT <- stocks[, c("date", "MSFT")]
INTC <- stocks[, c("date", "INTC")]

head(MSFT)
head(INTC)
dateMSFT
2015-01-02 0.006651827
2015-01-05 -0.009346543
2015-01-06 -0.014678200
2015-01-07 0.012624969
2015-01-08 0.028993594
2015-01-09 -0.008440521
dateINTC
2015-01-02 0.001927142
2015-01-05 -0.011340056
2015-01-06 -0.018812857
2015-01-07 0.020758053
2015-01-08 0.018430117
2015-01-09 0.001906182

In case there are several common variables, we can specify how to join the data frames (for more information run ?merge), but in a simple case as this, R will be smart enough to know what we mean:

In [44]:
new_df <- merge(MSFT, INTC)
head(new_df)
dateMSFTINTC
2015-01-02 0.006651827 0.001927142
2015-01-05 -0.009346543-0.011340056
2015-01-06 -0.014678200-0.018812857
2015-01-07 0.012624969 0.020758053
2015-01-08 0.028993594 0.018430117
2015-01-09 -0.008440521 0.001906182

Reshaping

The package reshape2 provides various useful functions to transform a dataset. Consider the following raw output from the CRSP database, including the same stocks' returns and time period:

In [45]:
crsp <- read.csv("crsp.csv")
head(crsp)
PERMNOdateTICKERCOMNAMPRCRETCFACPR
10107 20150102 MSFT MICROSOFT CORP 46.76 0.006674 1
11850 20150102 XOM EXXON MOBIL CORP 92.83 0.004110 1
12060 20150102 GE GENERAL ELECTRIC CO25.06 -0.008310 1
47896 20150102 JPM JPMORGAN CHASE & CO62.49 0.004954 1
59328 20150102 INTC INTEL CORP 36.36 0.001929 1
70519 20150102 C CITIGROUP INC 54.26 0.002772 1

If we want to create a data frame with the returns for every stock, we can use the dcast() function

In [46]:
library(reshape2)
RET <- dcast(crsp, date ~ PERMNO, value.var = "RET")
head(RET)
date101071185012060478965932870519
20150102 0.006674 0.004110-0.008310 0.004954 0.001929 0.002772
20150105 -0.009303-0.027362-0.018356-0.031045-0.011276-0.031515
20150106 -0.014571-0.005316-0.021545-0.025929-0.018637-0.035205
20150107 0.012705 0.010133 0.000415 0.001526 0.020975 0.009270
20150108 0.029418 0.016645 0.012043 0.022346 0.018601 0.015048
20150109 -0.008405-0.001410-0.013952-0.017387 0.001908-0.022333

Since we used the PERMNO as the variable to separate the RET observations, this is now the name of the columns. We can easily rename them:

In [47]:
names(RET) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
head(RET)
dateMSFTXOMGEJPMINTCC
20150102 0.006674 0.004110-0.008310 0.004954 0.001929 0.002772
20150105 -0.009303-0.027362-0.018356-0.031045-0.011276-0.031515
20150106 -0.014571-0.005316-0.021545-0.025929-0.018637-0.035205
20150107 0.012705 0.010133 0.000415 0.001526 0.020975 0.009270
20150108 0.029418 0.016645 0.012043 0.022346 0.018601 0.015048
20150109 -0.008405-0.001410-0.013952-0.017387 0.001908-0.022333

Saving the data frame

Once we have finished cleaning and handling our dataset, we can save it so we don't have to repeat the procedure next time we want to use it. The most common file formats to save our data frame are either a .RData or a .csv file.

.RData files are specific to R and can store as many objects as you’d like within a single file. You can save your entire environment (all your variables) in a single file and then easily load it and continue working where you left it. To save our data frame in a .RData file we can run:

In [48]:
save(RET, file = "RET.RData")

Then it can be directly loaded into R using load("RET.RData").

To save your dataframe as a .csv file you can use the function write.csv():

In [49]:
write.csv(RET, file = "RET.csv")


4 | Time Series

Most of financial applications involve working with dates. It can be monthly, weekly, daily, or even intra-day data. Storing data as text is not helpful since we cannot order or subset it easily. R has a specific data type called Date. In this section we will explore some packages that will help us to work with Date objects.

The lubridate package

It provides functions to work with date-times and time-spans. The full documentation is here: https://cran.r-project.org/web/packages/lubridate/lubridate.pdf

The main two functions we will use are ymd() and dmy(), which stand for year-month-day, and day-month-year respectively. These are quite powerful functions that transform a string object into date. As long as the object we pass on follow the structure of year-month-day for ymd(), the function will recognize it and transform it into date. For example:

In [50]:
library(lubridate)
ymd("20200110")
class(ymd("20200110"))
'Date'
In [51]:
ymd("2015JAN11")
class(ymd("20200110"))
'Date'
In [52]:
ymd("04-MAR-5")
class(ymd("04MAR5"))
'Date'
In [53]:
dmy("1/june/2019")
class(dmy("1/june/2019"))
'Date'
In [54]:
dmy("28-december-14")
class(dmy("28-december-14"))
'Date'

The zoo package

This package functions to work with ordered indexed observations. It is very commonly used in time series analysis. The complete documentation is available here: https://cran.r-project.org/web/packages/zoo/zoo.pdf

In [55]:
ret <- c(0.18, 0.02, -0.29, 0.00, 0.15)
class(ret)
'numeric'
In [56]:
dates <- c("20190121", "20190122", "20190123", "20190124", "20190125")
class(dates)
'character'
In [57]:
library(zoo)
ret.ts <- zoo(ret, order.by = ymd(dates))
class(ret.ts)
'zoo'
In [58]:
ret.ts
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
      0.18       0.02      -0.29       0.00       0.15 

Lag (and lead) functions

This function allows us to take the lag or leads of a time series object. The syntax is:
lag(x, k, na.pad = F) where:

  • x = a time series object to lag
  • k = number of lags (in units of observations); could be positive or negative
  • na.pad = if TRUE, adds NAs for missing observations
In [59]:
ret.ts
lag(ret.ts, k = 2)
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
      0.18       0.02      -0.29       0.00       0.15 
2019-01-21 2019-01-22 2019-01-23 
     -0.29       0.00       0.15 
In [60]:
ret.ts
lag(ret.ts, k = 2, na.pad = TRUE)
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
      0.18       0.02      -0.29       0.00       0.15 
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
     -0.29       0.00       0.15         NA         NA 
In [61]:
ret.ts
lag(ret.ts, k = -2)
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
      0.18       0.02      -0.29       0.00       0.15 
2019-01-23 2019-01-24 2019-01-25 
      0.18       0.02      -0.29 

The diff function

Takes the lagged difference of a time series. Syntax:
diff(x, lag, differences, na.pad = F) where:

  • x = a time series object
  • lag = number of lags(in unit of observations)
  • differences = the order of the difference
In [62]:
ret.ts
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
      0.18       0.02      -0.29       0.00       0.15 
In [63]:
diff(ret.ts, lag = 1, na.pad = TRUE)
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
        NA      -0.16      -0.31       0.29       0.15 
In [64]:
diff(ret.ts, lag = 1, differences = 2, na.pad = TRUE)
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
        NA         NA      -0.15       0.60      -0.14 
In [65]:
diff(ret.ts, lag = 2, differences = 1, na.pad = TRUE)
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
        NA         NA      -0.47      -0.02       0.44 

The roallapply function

Rollapply is a function for applying another function to rolling margins of an array. For example, you can construct a cumulative sum or mean. Syntax is:
rollapply(data, width, FUN, fill, align...) where:

  • data = a time series object
  • width = the window width
  • FUN = the function to apply
  • fill = NA ro replace missing observations
  • align = etierh "left", "right", or "center" (default)
In [66]:
# Example: Sum the last two consecutive returns
ret.ts
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
      0.18       0.02      -0.29       0.00       0.15 
In [67]:
rollapply(ret.ts, width = 2, FUN = sum, align = "left")
2019-01-21 2019-01-22 2019-01-23 2019-01-24 
      0.20      -0.27      -0.29       0.15 
In [68]:
rollapply(ret.ts, width = 2, FUN = sum, align = "right", fill = NA)
2019-01-21 2019-01-22 2019-01-23 2019-01-24 2019-01-25 
        NA       0.20      -0.27      -0.29       0.15 

Plotting zoo objects

When plotting a zoo object, R will automatically use the dates as the x-axis. Let's create a series of fictitious returns to see this:

In [69]:
ret <- rt(5000, df = 4)    # Random draws from a Student t with 3 dof
dates <- seq(Sys.Date(), by = "-1 day", length.out = 2000) # 2000 days back from today
ret.ts <- zoo(ret, order.by = ymd(dates))   # Transform to zoo
In [70]:
plot(ret.ts, main = "Returns from the past 2000 days",
    xlab = "Date", ylab = "Returns", col = "mediumblue")

Zooming into a time period

We can use the window() function to subset a zoo object to a given time period. For example, let's say we are interested in the returns of 2019. Since it is a zoo object, R will automatically adjust the x-axis to show months:

In [71]:
sub_ret.ts <- window(ret.ts, start = "2019/1/1", end = "2019/12/31")
plot(sub_ret.ts, main = "Returns from 2019",
    xlab = "Date", ylab = "Returns", col = "mediumblue")


5 | OLS and Maximum Likelihood

Consider the linear model:

$$y_i = \alpha + \beta_1 x_{i,1} + \beta_2 x{i,2} + \dots + \beta_K x_{i,K} + \varepsilon_i$$

Where E[$\varepsilon$|X] = 0

In matrix notation:

$$y = X \beta + \varepsilon$$

Where $X$ is a matrix containing one independent variable for each column, $\beta$ is a vector of coefficients and $\varepsilon$ is a vector of errors.

The Ordinary Least Squares (OLS) estimator $\hat{\beta}$ minimizes the sum of squared residuals:

$$\hat{\beta} = \textrm{arg min}\ \sum_i (y_i - \alpha - \beta_1 x_{i,1} - \beta_2 x{i,2} - \dots - \beta_K x_{i,K})^2$$

Or in matrix notation:

$$\hat{\beta} = \textrm{arg min}\ (y- X \beta)'(y -X \beta)$$

Taking the first order conditions:

$$-X'y - yX + 2X'X\hat{\beta} = 0$$

Assuming invertibility of (X'X), we have:

$$\hat{\beta} = (X'X)^{-1}X'y$$

Notation and definitions:

  • Fitted values: $\hat{y} = X\hat{\beta}$
  • Residuals: $\hat{\varepsilon} = y - \hat{y}$
  • Total Sum of Squares (SST): $(y-\bar{y})'(y-\bar{y})$
  • Sum of Squares of Residuals (SSR): $\hat{\varepsilon}'\hat{\varepsilon}$
  • R squared: 1 - SSR/SST

OLS "by hand"

We can use matrix algebra in R to solve OLS. We will use the file "Shiller.csv". This dataset was developed by Robert Shiller and includes historical data on monthly stock price, dividends, and earnings, consumer price index (to allow conversion to real values), all starting January 1871. More details on the dataset can be found here: http://www.econ.yale.edu/~shiller/data.htm

In [72]:
# Data load and cleaning
df <- read.csv("Shiller.csv", header = T, na.strings = NA)
head(df)

df$date <- paste0(df$date, "01") # Add days to dates
# Create time series
ret.ts <- zoo(df$R_r, order.by = ymd(df$date))
PDdiff.ts <- diff(zoo(df$PD_r, order.by = ymd(df$date)), na.pad = T)
head(ret.ts)
head(PDdiff.ts)
# Adjust length
ret.ts <- ret.ts[2:length(ret.ts)]*100
PDdiff.ts <- PDdiff.ts[2:length(PDdiff.ts)]
head(ret.ts)
head(PDdiff.ts)
datePDECPIDate_fraclong_intP_rD_rE_rP_E10R_rPD_r
187101 4.44 0.26 0.4 12.46 1871.04 5.32 77.17 4.52 6.95 NA NA17.07301
187102 4.50 0.26 0.4 12.84 1871.13 5.32 75.90 4.39 6.75 NA -0.01645717217.28929
187103 4.61 0.26 0.4 13.03 1871.21 5.33 76.62 4.32 6.65 NA 0.00948616617.73611
187104 4.74 0.26 0.4 12.56 1871.29 5.33 81.76 4.48 6.90 NA 0.06708431218.25000
187105 4.86 0.26 0.4 12.27 1871.38 5.33 85.78 4.59 7.06 NA 0.04916829718.68845
187106 4.82 0.26 0.4 12.08 1871.46 5.34 86.41 4.66 7.17 NA 0.00734436918.54292
  1871-01-01   1871-02-01   1871-03-01   1871-04-01   1871-05-01   1871-06-01 
          NA -0.016457172  0.009486166  0.067084312  0.049168297  0.007344369 
1871-01-01 1871-02-01 1871-03-01 1871-04-01 1871-05-01 1871-06-01 
        NA  0.2162850  0.4468173  0.5138889  0.4384532 -0.1455347 
1871-02-01 1871-03-01 1871-04-01 1871-05-01 1871-06-01 1871-07-01 
-1.6457172  0.9486166  6.7084312  4.9168297  0.7344369 -1.8632103 
1871-02-01 1871-03-01 1871-04-01 1871-05-01 1871-06-01 1871-07-01 
 0.2162850  0.4468173  0.5138889  0.4384532 -0.1455347 -0.3454936 

Now that we have cleaned the data, let's follow OLS notation. We want to regress returns on the price to dividend ratio:

In [73]:
X <- cbind(rep(1,NROW(PDdiff.ts)), as.vector(PDdiff.ts))   # Add column of 1s
y <- as.vector(ret.ts)
Xt <- t(X)

We use the function solve() to get a matrix inverse. Recall that we need to enclose operators in % signs to make matrix operations:

In [74]:
betaHat <- solve(Xt %*% X) %*% (Xt %*% y)
retHat <- X %*% betaHat
In [75]:
betaHat
0.1873181
3.0064311
In [76]:
retHat.ts <- zoo(retHat, order.by = index(ret.ts))
resid.ts <- ret.ts - retHat.ts
In [77]:
SSR = sum((ret.ts - retHat.ts)^2)
SST = sum((ret.ts - mean(ret.ts))^2)
R2 = 1 - (SSR/SST)

The lm( ) function

You should never do OLS by hand. It is a way to waste time and be prone to mistakes. The R function that takes care of this is called lm(), which stands for linear models. It is intuitive, easy to use, and provides you with much more information than just parameters. Syntax is:
lm(formula, data) where:

  • formula = obejct depvar ~ indvar1 + ... + indvarK - R introduces a constant by default, add 0 among independent variables to exclude it
  • data = the object containing the variables in the model
In [78]:
reg <- lm(ret.ts ~ PDdiff.ts)    # First dependent variable followed by ~ and formula
summary(reg)
Call:
lm(formula = ret.ts ~ PDdiff.ts)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.512  -0.873   0.024   0.995  37.710 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.18732    0.05777   3.243  0.00121 ** 
PDdiff.ts    3.00643    0.05079  59.196  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.358 on 1665 degrees of freedom
Multiple R-squared:  0.6779,	Adjusted R-squared:  0.6777 
F-statistic:  3504 on 1 and 1665 DF,  p-value: < 2.2e-16

Let's explore the attributes that we can obtain from an object created by lm():

In [79]:
# Everything part of the object reg
names(reg)
  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'
In [80]:
reg$coefficients
(Intercept)
0.187318131314459
PDdiff.ts
3.0064310968334
In [81]:
alpha <- reg$coefficients[1]
alpha

# To compare with our OLS by hand
betaHat[1]
(Intercept): 0.187318131314459
0.187318131314459
In [82]:
beta <- reg$coefficients[2]
beta

# To compare with our OLS by hand
betaHat[2]
PDdiff.ts: 3.0064310968334
3.0064310968334
In [83]:
std_alpha <- summary(reg)$coefficient[1,2]
std_alpha
0.057768211461676
In [84]:
std_beta <- summary(reg)$coefficient[2,2]
std_beta
0.050787950218705
In [85]:
y_fitted <- reg$fitted.values
In [86]:
R2_lm <- summary(reg)$r.squared
R2_lm

# To compare with our OLS by hand
R2
0.677895973961083
0.677895973961083
In [87]:
resid_lm <- reg$residuals
head(resid.ts)
head(resid_lm)
                     
1871-02-01 -2.4832813
1871-03-01 -0.5820268
1871-04-01  4.9761415
1871-05-01  3.4113324
1871-06-01  0.9846588
1871-07-01 -1.0118258
1871-02-01 1871-03-01 1871-04-01 1871-05-01 1871-06-01 1871-07-01 
-2.4832813 -0.5820268  4.9761415  3.4113324  0.9846588 -1.0118258 

Advanced linear models packages

Another package that is worth mentioning is dynlm. This package is focused in dynamic linear models and time series regressions. You can review the documentation here.

OLS by Maximum Likelihood

The Maximum Likelihood Estimator (MLE) chooses the parameters that most likely generated the observed data.

Hence:

$$\hat{\beta}_{\textrm{MLE}} = \textrm{arg max}\ f(y_1,y_2, ..., y_T | \beta)$$

We do not know the joint distribution of returns. However, if we strengthen the OLS assumptions so that we assume a distribution for the error terms: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, we can write the conditional density of y given x, where x is a row of the stacked matrix X defined above:

$$f(y|x) = \frac{1}{(2\pi \sigma^2)^\frac{1}{2}}\textrm{exp}\left(-\frac{1}{2\sigma^2}(y-x'\beta)^2\right)$$

Unrder the assumption that observations are mutually independent, this implies that the conditiona density of $(y_1, ..., y_N)$ given $(x_1, ..., x_N)$ is:

$$f(y_1, ..., y_N | x_1, ..., x_N) = \prod_{i=1}^{n} f(y_i | x_i)$$$$ = \prod_{i=1}^{n} \frac{1}{(2\pi \sigma^2)^\frac{1}{2}}\textrm{exp}\left(-\frac{1}{2\sigma^2}(y_i-x_i'\beta)^2\right) $$$$ = \frac{1}{(2\pi \sigma^2)^\frac{n}{2}}\textrm{exp}\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i-x_i'\beta)^2\right) $$$$ := \mathcal{L}(\beta, \sigma^2)$$

This is called the Likelihood function. For convenience, it is typical to work with the natural logarithm, and the log-likelihood function:

$$\textrm{log}\mathcal{L}(\beta, \sigma^2) = \textrm{log} f(y_1, ..., y_N | x_1, ..., x_N)$$$$ = -\frac{n}{2} \textrm{log} (2\pi \sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i-x_i'\beta)^2$$

So the ML estimators are those components of $\beta$ and $\sigma$ such that the log-likelihood is maximized.

To solve this in R, we can write our own log-likelihood function. The inputs we are going to use are:

  • par = vector of parameters (in our simple case c($\sigma, \alpha, \beta$)
  • depvar = the vector y
  • indvar = the matrix X
In [88]:
# Log-likelihood function for OLS 

LL.OLS <- function(par, depVar, indVar) {
    
    sigma = par[1]
    alpha = par[2]
    beta = par[3]
    betaVec = c(alpha, beta)
    T = nrow(indVar)
    
    out = - (T/2)*log(2*pi*sigma^2) - 1/(2*sigma^2)* (t(depVar - indVar %*% betaVec) %*% (depVar - indVar %*% betaVec))
    
    return(-out) # Return a negative since by default an optimizer minimizes, doesn't maximize
}

Now that we have the log likelihood function ready, we can use the function optim() to optimize:

In [89]:
# Estimate ML
res <- optim(c(1,0,0), LL.OLS, gr = NULL, y, X, hessian = T)
In [90]:
# Getting standard errors
H <- res$hessian # Hessian
I <- solve(H)    # Invert Hessian
se <- sqrt(diag(I)) # Sqrt of diagonal elements of the inverse
In [91]:
# Getting the parameters
sigmaMLE <- res$par[1]
betaVecMLE <- c(res$par[2], res$par[3])
yHatMLE <- X %*% betaVecMLE
tstat <- betaVecMLE / se[2:3]
print(betaVecMLE)
[1] 0.187737 3.005867


6 | GARCH Maximum Likelihood

In constrast to OLS models, GARCH models have to be solved using Maximum Likelihood methods, which is essentially how the GARCH packages, like rugarch, fit their models. Before exploring the GARCH packages further, let's use R to estimate an ARCH(1) and GARCH(1,1) models.

The data we are going to use comes is the prices of the S&P500 index. Let's load the data, transform them to returns, and create a lagged vector, which will be necessary for the likelihood functions:

In [92]:
sp <- read.csv("sp500.csv")
ret  <- diff(log(sp$price))*100
ret     <- ret[2:length(ret)]

ARCH(1)

$Z_t$ in ARCH(1) is standard normally distributed: $$Y_t = \sigma_t Z_t$$ $$\sigma^2_t = \omega + \alpha Y_{t-1}^2$$
$$Z_t \sim \mathcal{N}(0,1)$$

The presence of lagged returns implies the density for t = 1 is unknown since $y_0$ is unkown. At t = 2, the density is:

$$f(y_2|y_1) = \frac{1}{\sqrt{2\pi(\omega + \alpha y_{1}^2)}} exp\left(-\frac{1}{2} \frac{y_2^2}{\omega + \alpha y_{1}^2}\right)$$

So the joint density is:

$$\prod_{t=2}^{T} f(y_t) = \prod_{t=2}^{T} \frac{1}{\sqrt{2\pi(\omega + \alpha y_{t-1}^2)}} exp\left(-\frac{1}{2} \frac{y_t^2}{\omega + \alpha y_{t-1}^2}\right)$$

Therefore the log-likelihood function for the ARCH(1) model is:

$$ \log \mathcal{L} = -\frac{T-1}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{T}\left(\log(\omega + \alpha y_{t-1}^2) + \frac{y_t^2}{\omega + \alpha y_{t-1}^2}\right)$$

Let's now write the function in R:

In [93]:
# ARCH(1) MLE

LL.ARCH1 <- function(par, x) { 
  
    omega <- par[1] # First element of par
    alpha <- par[2] # Second element of par

    T <- length(x)	# Number of rows 
    loglikelihood <- -(T-1)/2 * log(2*pi)
    
    # Loop (create the sum) and add every instance fo loglikelihood
    for(i in 2:T){
        loglikelihood = loglikelihood - 1/2 * (log(omega + alpha*x[i-1]^2)+x[i]^2/(omega + alpha*x[i-1]^2))
    }

    return(-loglikelihood) # output the log likelihood
}

res <- optim(c(0.1,0.5) , LL.ARCH1, gr = NULL, ret, hessian = T) # Optimization

cat("Omega:", res$par[1], "\n",
   "Alpha:", res$par[2])
Omega: 0.9956736 
 Alpha: 0.3486696

GARCH(1,1)

In a GARCH(1,1) model, the conditional volatility follows:

$$\sigma_t^2 = \omega + \alpha Y_{t-1}^2 + \beta \sigma_{t-1}^2$$

The conditional density is:

$$f(y_2 | y_1) = \frac{1}{\sqrt{2\pi(\omega + \alpha y_{1}^2 + \beta\hat{\sigma}_1^2)}} exp\left(-\frac{1}{2} \frac{y_2^2}{\omega + \alpha y_{1}^2 + \beta\hat{\sigma}_1^2} \right)$$

So the log-likelihood function is:

$$\log \mathcal{L} = -\frac{T-1}{2}\log(2\pi) - \frac{1}{2} \sum_{t=2}^{T} \left(\log(\omega + \alpha y_{t-1}^2 + \beta\hat{\sigma}_{t-1}^2) + \frac{y_{t}^2}{\omega + \alpha y_{t-1}^2 + \beta\hat{\sigma}_{t-1}^2} \right)$$

Writing the function in R:

In [94]:
# GARCH(1,1) MLE

LL.GARCH1_1 <- function(par, x) { 
  
    omega <- par[1] # First element of par
    alpha <- par[2] # Second element of par
    beta <- par[3]  # Third element of par

    T <- length(x)	# Number of rows 
    loglikelihood <- -(T-1)/2 * log(2*pi)
    sigma_2 <- rep(NA, T)
    sigma_2[1] <- var(x) # Initialize the sigma with unconditional volatility

    # Loop (create the sum) and add every instance to loglikelihood
    for(i in 2:T){
        loglikelihood = loglikelihood - 1/2 * (log(omega + alpha*x[i-1]^2 + beta*sigma_2[i-1])
                                               + x[i]^2/(omega + alpha*x[i-1]^2 + beta*sigma_2[i-1]))
        
        sigma_2[i] <- omega + alpha*x[i-1]^2 + beta*sigma_2[i-1]    # Update sigma
    }

    return(-loglikelihood) # output the log likelihood
}

res <- optim(c(0.1, 0, 0) , LL.GARCH1_1, gr = NULL, ret, hessian = T, 
             method = 'L-BFGS-B', lower = c(0,0,0)) # Optimization

cat("Omega:", res$par[1], "\n",
   "Alpha:", res$par[2], "\n",
   "Beta:", res$par[3])

## Note that the values obtained in GARCH ML will depend on the optimization method used
Omega: 0.01810722 
 Alpha: 0.1030158 
 Beta: 0.883244

Some important components of the optimization process are:

  • Value of $\sigma_1^2$: We need to assign a value to $\sigma_1^2$, the first iteration of the conditional volatility. General, we choose the unconditional variance of the data for this
  • Optimization method: We have chosen the L-BFGS-B algorithm, since it allows us to place bounds on the parameters. We have specified non-negativity by setting lower = c(0,0,0), since this is desired in GARCH
  • Initial values of the parameters: Optimization works iteratively. You need to assign values to initialize the parameters, making sure these are compliant with the restrictions of the function and the parameter bounds. This will have a small effect on the parameter values, but can have a large effect in computing time

Optimization packages

In this section we used the function optim() to fit the maximum likelihood estimation. This was enough for our purposes, but if you have to carry on more complex optimization problems, optim() can fail. We recommend the package nloptr, which currently is the state of the art for minimizing functions in non-linear settings.


7 | rugarch and rmgarch usage

The packages that we use in the course for estimating GARCH models are rugarch for univariate models, and rmgarch for multivariate models. The complete documentation of the packages is:

Both packages were developed by Alexios Galanos and are constantly mantained and updated. The development code can be found here: https://bitbucket.org/alexiosg/ If you are curious on learning more programming, you are encourage to browse the page to see how the package was created.

Univariate GARCH models with rugarch

To estimate a univariate GARCH model you need to follow two steps:

  1. Create an object of the uGARCHspec class which is the specification of the model you want to estimate. This includes the type of model (standard, asymmetric, power, etc), the GARCH order, the distribution, and model for the mean
  2. Fit the specified model to the data

1. ugarchspec( )

First let's take a look at ugarchspec(), which is the function to create an instance of the uGARCHspec class. The complete syntax with its default values is:

ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1), submodel = NULL, external.regressors = NULL, variance.targeting = FALSE), mean.model = list(armaOrder = c(1, 1), include.mean = TRUE, archm = FALSE, archpow = 1, arfima = FALSE, external.regressors = NULL, archex = FALSE), distribution.model = "norm", start.pars = list(), fixed.pars = list(), ...)

You can check the details of this function and what every argument means by running ?ugarchspec.

For the purposes of our course, we are going to focus on three arguments of the ugarchspec() function:

variance.model

This argument takes a list with the specifications of the GARCH model. Its most important components are:

  • model: Specify the type of model, currently implemented models are “sGARCH”, “fGARCH”, “eGARCH”, “gjrGARCH”, “apARCH” and “iGARCH” and “csGARCH”
  • garchOrder: Specify the ARCH(q) and GARCH(p) orders

Other components includes options to do variance.targetingor specify external regressors.

mean.model

This argument takes a list with the specifications of the mean model, if assumed to have one. A traditional assumption is that the model has zero mean, in which case we specify armaOrder = c(0,0), include.mean = FALSE. However, it is also common to assume that the mean follows a certain ARMA process.

distribution.model

Here you can specify the conditional density to use for innovations. The default is a normal distribution, but we can specify the use of a Student-t distribution by setting the value to std.

2. ugarchfit( )

Once the specification has been created, you can fit this to the data using ugarchfit( spec = ..., data = ...). The result will be an object of the class uGARCHfit, which is a list that contains useful information which will be shown below.

GARCH(1,1)

Let's use rugarch to estimate a GARCH(1,1) model.

First, let's load the library, and the S&P500 data:

In [95]:
library(rugarch)
sp <- read.csv("sp500.csv")
ret <- diff(log(sp$price))*100
ret <- ret[2:length(ret)]

Now let's use ugarchspec() to specify the type of model we want. In this case, we want to use a GARCH(1,1), so the garchOrder in variance.model will be set as c(1,1). Note that even if we only have one component, we need to put it inside a list.

We will assume zero mean, so we need to include this into mean.model:

In [96]:
spec1 <- ugarchspec(
    variance.model = list(garchOrder = c(1,1)),
    mean.model = list(armaOrder = c(0,0), include.mean = FALSE)
)

We can call spec1 to see what is inside:

In [97]:
spec1
*---------------------------------*
*       GARCH Model Spec          *
*---------------------------------*

Conditional Variance Dynamics 	
------------------------------------
GARCH Model		: sGARCH(1,1)
Variance Targeting	: FALSE 

Conditional Mean Dynamics
------------------------------------
Mean Model		: ARFIMA(0,0,0)
Include Mean		: FALSE 
GARCH-in-Mean		: FALSE 

Conditional Distribution
------------------------------------
Distribution	:  norm 
Includes Skew	:  FALSE 
Includes Shape	:  FALSE 
Includes Lambda	:  FALSE 

Now we can fit the specified model to our data using the ugarchfit() function:

In [98]:
garch1_1 <- ugarchfit(spec = spec1, data = ret)

We can check the class of this new object:

In [99]:
class(garch1_1)
'uGARCHfit'

Objects from the uGARCHfit class have two slots. You can think of every slot as a list of components. The two slots are:

  1. @model
  2. @fit

To access a slot you need to use the syntax: object@slot

Let's explore these to understand better:

@model

The model slot includes all the information that was needed to estimate the GARCH model. This includes both the model specification and the data. To see every that is included in this slot, you can use the names() function:

In [100]:
names(garch1_1@model)
  1. 'modelinc'
  2. 'modeldesc'
  3. 'modeldata'
  4. 'pars'
  5. 'start.pars'
  6. 'fixed.pars'
  7. 'maxOrder'
  8. 'pos.matrix'
  9. 'fmodel'
  10. 'pidx'
  11. 'n.start'

And you can access each element with the $ sign. For example, let's see what is inside pars:

In [101]:
garch1_1@model$pars
LevelFixedIncludeEstimateLBUB
mu0.00000000 0 0 0 NA NA
ar0.00000000 0 0 0 NA NA
ma0.00000000 0 0 0 NA NA
arfima0.00000000 0 0 0 NA NA
archm0.00000000 0 0 0 NA NA
mxreg0.00000000 0 0 0 NA NA
omega0.01802185 0 1 1 2.220446e-161454.914
alpha10.10279939 0 1 1 0.000000e+00 1.000
beta10.88352744 0 1 1 0.000000e+00 1.000
gamma0.00000000 0 0 0 NA NA
eta10.00000000 0 0 0 NA NA
eta20.00000000 0 0 0 NA NA
delta0.00000000 0 0 0 NA NA
lambda0.00000000 0 0 0 NA NA
vxreg0.00000000 0 0 0 NA NA
skew0.00000000 0 0 0 NA NA
shape0.00000000 0 0 0 NA NA
ghlambda0.00000000 0 0 0 NA NA
xi0.00000000 0 0 0 NA NA

We have a matrix with all the possible parameters to be used in fitting a GARCH model. We see there is a value of 1 for the parameters that were included in our GARCH specification (omega, alpha1 and beta1). The matrix also includes what are the lower and upper bounds for these parameters, which has nothing to do with our particular data, only with what is expected from the model.

We can also see the model description:

In [102]:
garch1_1@model$modeldesc
$distribution
'norm'
$distno
1
$vmodel
'sGARCH'

And inside garch1_1@model@modeldata we will find the data vector we have used when fitting the model. You can think of the @model slot as everything that R needs to know in order to be able to estimate the model.

@fit

Inside the fit slot we will find the estimated model, including the coefficients, likelihood, fitted conditional variance, and more. Let's check everything that is included:

In [103]:
names(garch1_1@fit)
  1. 'hessian'
  2. 'cvar'
  3. 'var'
  4. 'sigma'
  5. 'condH'
  6. 'z'
  7. 'LLH'
  8. 'log.likelihoods'
  9. 'residuals'
  10. 'coef'
  11. 'robust.cvar'
  12. 'A'
  13. 'B'
  14. 'scores'
  15. 'se.coef'
  16. 'tval'
  17. 'matcoef'
  18. 'robust.se.coef'
  19. 'robust.tval'
  20. 'robust.matcoef'
  21. 'fitted.values'
  22. 'convergence'
  23. 'kappa'
  24. 'persistence'
  25. 'timer'
  26. 'ipars'
  27. 'solver'

Let's see the coefficients, along with their standard errors, in matcoef:

Note: You will see these values are very similar to what we found using Maximum Likelihood. Since it is an optimization problem, we can expect to see some small differences

In [104]:
garch1_1@fit$matcoef
Estimate Std. Error t valuePr(>|t|)
omega0.01802185 0.002790007 6.459428 1.050997e-10
alpha10.10279939 0.009098806 11.298119 0.000000e+00
beta10.88352744 0.009724544 90.855407 0.000000e+00

We can also see the log-likelihood:

In [105]:
garch1_1@fit$LLH
-6542.99177744059

This makes it easy to extract the estimated conditional volatility:

In [106]:
plot(garch1_1@fit$sigma, type = "l")

We can also access the coefficients are likelihood using the following functions:

In [107]:
coef(garch1_1)
omega
0.0180218480714706
alpha1
0.102799391800216
beta1
0.883527438351233
In [108]:
likelihood(garch1_1)
-6542.99177744059

Other specifications

Instead of using a conditionally normal GARCH(1,1) model specification, we can use a t-distribution or introduce asymmetric and power effects very easily.

Let's first fit a GARCH(1,1) using a Student-t as conditional distribution. We can see that the coefficients now include shape, which is the estimated degrees of freedom:

In [109]:
# Student t

spec2 <- ugarchspec(
    variance.model = list(garchOrder = c(1,1)),
    mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
    distribution.model = "std"
)

tGARCH <- ugarchfit(spec = spec2, data = ret)

tGARCH@fit$matcoef
Estimate Std. Error t valuePr(>|t|)
omega0.00927054 0.002577062 3.597328 0.0003215026
alpha10.09976745 0.010615749 9.398060 0.0000000000
beta10.89873275 0.010206685 88.053343 0.0000000000
shape6.65154269 0.646320576 10.291399 0.0000000000

Now let's fit an apARCH model with fat tails. The paramter list will now include gamma1 and delta, which are part of the apARCH model specification:

In [110]:
# tapARCH model
spec3 <- ugarchspec(
    variance.model = list(model = "apARCH"),
    mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
    distribution.model = "std"
)

tapARCH <- ugarchfit(spec = spec3, data = ret)

tapARCH@fit$matcoef
Estimate Std. Error t valuePr(>|t|)
omega0.02231419 0.0015974472 13.96866 0
alpha10.08930565 0.0043439698 20.55853 0
beta10.90967493 0.0052901005 171.95797 0
gamma10.99999971 0.00031139513211.35291 0
delta1.04334904 0.0924854533 11.28122 0
shape7.72471637 0.8248865782 9.36458 0

Issues

Sample size

It is important to consider that the more parameter a model has, the more likely it is to run into estimation problems, and the more data we need. For example, as a rule of thumb, for a GARCH(1,1) we need at least 500 observations, while for a Student-t GARCH that minimum number is around 3,000 observations. In general, the more parameters are estimated, the more data is needed to be confident in the estimation.

Optimization failure

ugarchfit() essentially performs Maximum Likelihood to fit the specified model to our data. This optimization problem required a solver, which is the numerical method used to maximize the likelihood function. In some case, the default solver in ugarchfit() can fail to converge. We recommend using the option solver = "hybrid" when fitting the model, since this will try different optimization methods in case the default one does not converge.

Multivariate models with rmgarch

The package rmgarch allows to easily estimate models with multiple assets in the same fashion as rugarch. The procedure is analogous, we first need to specify the model we want to use, and then fit it to the data. However, specifying the model has extra steps. To explain this, we will focus on the estimation of a DCC model.

In a DCC model, we assume that each individual asset follows some type of univariate model, usually a GARCH. And then we model the correlation between the assets using an ARMA-like process.

The process for fitting a DCC model using rmgarch is then:

  1. Specify the univariate volatility model each asset follows using ugarchspec()
  2. Use the multispec() function to create a multivariate specification. This is essentially a list of univariate specifications. If we are going to use the same for every asset, we can use replicate()
  3. Then we need to create a dccspec() object, which takes in the list of univariate specifications for every asset, and the additional DCC joint specifications, like dccOrder and distribution
  4. Fit the specification to the data

We will use the returns for JPM and Citigroup from the file Y.RData to go through this process:

In [111]:
library(rmgarch)
load("RET.RData")
y <- RET[c("JPM", "C")]

We will assume a simple GARCH(1,1) with mean zero for each stock. We will create a single univariate specification, and then replicate it into multispec():

In [112]:
# Create the univariate specification
uni_spec <- ugarchspec(
    variance.model = list(garchOrder = c(1,1)),
    mean.model = list(armaOrder = c(0,0), include.mean = FALSE)
)

# Replicate it into a multispec element
mspec <- multispec(replicate(2, uni_spec))

Let's take a look inside the mspec object. We will see there are two univariate specifications, one per asset, and they are equal:

In [113]:
mspec
*-----------------------------*
*     GARCH Multi-Spec        *
*-----------------------------*
Multiple Specifications	: 2
Multi-Spec Type			: equal

You can check the specifications with mspec@spec

Now we proceed to create the specification for the DCC model using dccspec():

In [114]:
spec <- dccspec(
                # Univariate specifications - Needs to be multispec
                uspec = mspec,
                
                # DCC specification. We will assume an ARMA(1,1)-like process
                dccOrder = c(1,1),
                
                # Distribution, here multivariate normal
                distribution = "mvnorm"
)

We can call spec to see what is inside:

In [115]:
spec
*------------------------------*
*       DCC GARCH Spec         *
*------------------------------*
Model          :  DCC(1,1)
Estimation     :  2-step
Distribution   :  mvnorm
No. Parameters :  9
No. Series     :  2

We can again see more details in spec@model and spec@umodel.

Now we can proceed to fit the specification to the data:

In [116]:
res <- dccfit(spec, data = y)
res
*---------------------------------*
*          DCC GARCH Fit          *
*---------------------------------*

Distribution         :  mvnorm
Model                :  DCC(1,1)
No. Parameters       :  9
[VAR GARCH DCC UncQ] : [0+6+2+1]
No. Series           :  2
No. Obs.             :  1258
Log-Likelihood       :  8156.696
Av.Log-Likelihood    :  6.48 

Optimal Parameters
-----------------------------------
              Estimate  Std. Error  t value Pr(>|t|)
[JPM].omega   0.000029    0.000016   1.8689 0.061633
[JPM].alpha1  0.141068    0.062791   2.2466 0.024663
[JPM].beta1   0.691315    0.135507   5.1017 0.000000
[C].omega     0.000021    0.000013   1.6703 0.094864
[C].alpha1    0.118458    0.056954   2.0799 0.037535
[C].beta1     0.795242    0.098281   8.0915 0.000000
[Joint]dcca1  0.100491    0.028630   3.5100 0.000448
[Joint]dccb1  0.685564    0.090675   7.5606 0.000000

Information Criteria
---------------------
                    
Akaike       -12.953
Bayes        -12.917
Shibata      -12.954
Hannan-Quinn -12.940


Elapsed time : 0.933454 

We can check what slots are inside:

In [117]:
names(res@model)
names(res@mfit)
  1. 'modelinc'
  2. 'modeldesc'
  3. 'modeldata'
  4. 'varmodel'
  5. 'pars'
  6. 'start.pars'
  7. 'fixed.pars'
  8. 'maxgarchOrder'
  9. 'maxdccOrder'
  10. 'pos.matrix'
  11. 'pidx'
  12. 'DCC'
  13. 'mu'
  14. 'residuals'
  15. 'sigma'
  16. 'mpars'
  17. 'ipars'
  18. 'midx'
  19. 'eidx'
  20. 'umodel'
  1. 'coef'
  2. 'matcoef'
  3. 'garchnames'
  4. 'dccnames'
  5. 'cvar'
  6. 'scores'
  7. 'R'
  8. 'H'
  9. 'Q'
  10. 'stdresid'
  11. 'llh'
  12. 'log.likelihoods'
  13. 'timer'
  14. 'convergence'
  15. 'Nbar'
  16. 'Qbar'
  17. 'plik'
In [118]:
# Coefficient matrix
res@mfit$matcoef
Estimate Std. Error t valuePr(>|t|)
[JPM].omega2.930068e-051.567784e-051.868924 6.163342e-02
[JPM].alpha11.410681e-016.279070e-022.246641 2.466300e-02
[JPM].beta16.913155e-011.355069e-015.101699 3.366172e-07
[C].omega2.118301e-051.268232e-051.670279 9.486425e-02
[C].alpha11.184580e-015.695379e-022.079896 3.753505e-02
[C].beta17.952415e-019.828105e-028.091504 6.661338e-16
[Joint]dcca11.004908e-012.862958e-023.510034 4.480492e-04
[Joint]dccb16.855639e-019.067528e-027.560648 4.019007e-14
In [119]:
# Log likelihood
res@mfit$llh
8156.69553672898

The matrix H inside res@mfit includes the covariances. It is 3-dimensional, since it includes the 2x2 covariance matrix for each of the T time periods:

In [120]:
H <- res@mfit$H
dim(H)
  1. 2
  2. 2
  3. 1258
In [121]:
# First period's covariances
H[,,1]
0.00017546970.0001841418
0.00018414180.0002449290

We can extract the conditional correlation in two ways. One is computing it from H:

In [122]:
# Initializing the vector
rhoDCC <- vector(length = dim(y)[1])

# Populate with the correlations
rhoDCC <- H[1,2,] / sqrt(H[1,1,]*H[2,2,])

Or we can directly extract it from res@mfit:

In [123]:
# Initializing the vector
rhoDCC2 <- vector(length = dim(y)[1])

for (i in 1:dim(y)[1]) {
    rhoDCC2[i] <- rcor(res)[1,2,i]
}
In [124]:
plot(rhoDCC2, type = "l")