This should serve to accompany the Book Code from Jón Daníelsson's *Financial Risk Forecasting* and the Seminar notebooks for the class. All the code used is in R.

Thanks to Alvaro Aguirre for developing the notebook and material, and to Jia Rong Fan for the information on financial data sources.

If you want further material, the following list provides information on various topics:

- A very short introduction to R: Concise tutorial on the basics of R
- The official introduction to R: Official documentation of the language which covers its usage
- R Programming tutorial by freeCodeCamp.org: Complete video for learning R
- R Markdown tutorial: Using R to create documents
- R Markdown cheat sheet: Practical guide to use R Markdown
- Dplyr cheat sheet: Guide on data wrangling using dplyr
- ggplot2 tutorial: Complete tutorial on ggplot2

1 | Financial data sources

2 | Control Flow

3 | Data frames handling

4 | Time Series: Dates and Zoo

5 | OLS and Maximum Likelihood

6 | GARCH Maximum Likelihood

7 | rugarch and rmgarch usage

8 | VaR probabilities

9 | Simulated VaR with simple and continuous returns

10 | Statistical testing

11 | The Apply family of functions

12 | Introduction to dplyr

13 | ggplot2

14 | R Markdown and Jupyter Notebooks

While financial markets generate a vast amount of data (sometimes with new observations every nanosecond), it is generally difficult and costly to get this data. Most data providers are commercial with complicated interfaces, and free data providers tend to have erratic access and errors in data.

Online data providers generally provide an *application programming interface* (API), an interface which allows users to submit data requests and receive responses. For most languages, there are often open-source or proprietary packages available which allow end-users to retrieve data using a relatively simple function call.

Today, many universities maintain subscriptions to data providers — this makes obtaining complete and accurate data much easier. Below we discuss some of the data providers that can be of use.

While this information is accurate at the time of writing (July 2020), given the rapid developments in the space, it is likely to become quickly outdated.

While we have used each of these data sources below, we have no specific recommendation.

One of the most ubiquitous data sources in finance is the Bloomberg Terminal. Due to its pervasiveness throughout the industry, there are numerous packages in practically every language to access its APIs.

LSE students have access to a number of Bloomberg terminals in the library and the Master's students' common rooms.

The Wind Financial Terminal (WFT) also provides market data like the Bloomberg Terminal, but with a specific focus on the Chinese financial markets. It supports APIs for MATLAB, R, C++ and Python, among others.

The Wharton School at the University of Pennsylvania provides a service called Wharton Research Data Services (WRDS), to which many universities subscribe. It provides a common interface to a number of databases, including CRSP and TAQ high-frequency data. WRDS and many of its databases are available to LSE students and staff via wrds.wharton.upenn.edu.

For historical US stock market data, one of the major sources is The Center for Research in Security Prices (CRSP, pronounced "crisp"), headquartered at the University of Chicago. LSE students can access CRSP via WRDS.

The go-to place for many researchers requiring financial data has been finance.yahoo.com. This data can be automatically downloaded into many software packages, including Matlab, R and Python.

There are three problems with Yahoo Finance.

- Yahoo occasionally changes how the API works, requiring updates to software;
- It is often unavailable for days or weeks;
- There are errors in the data. For example, UK prices, quoted in pence by convention, sometimes appear in pounds for a day or two before reverting to pence. On other occasions, numbers are simply wrong.

End of Day Historical Data provides a fundamental data API, as well as live and end-of-day historical prices for stocks, ETFs and mutual funds from exchanges all around the world.

DBnomics is an open source data aggregator that provides access to dozens of data providers, including several countries' statistics.

[FRED Economic Research](https://fred.stlouisfed.org) is a good source for macroeconomic data, including unemployment, GDP, interest rates, the money supply, etc. It can also be accessed via DBnomics.

IEX provides access to US equity data via https://iextrading.com/developer/.

The European Central Bank Statistical Data Warehouse and its corresponding SDMX interface allow for retrieval of daily Euro FX data.

The entire dataset is available at http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip, and can be downloaded from the command line using

`wget http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip -O eurofxref-hist.zip`
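The same download can be done from within R using base functions. This is a sketch, assuming the archive contains a file named `eurofxref-hist.csv` (true at the time of writing):

```r
# Download the ECB FX archive and read the CSV inside it
url <- "http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip"
download.file(url, destfile = "eurofxref-hist.zip", mode = "wb")
rates <- read.csv(unz("eurofxref-hist.zip", "eurofxref-hist.csv"))
head(rates[, 1:5])
```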

Alpha Vantage provides free daily and real-time stock price data, and API access is available in both R and Python. Its data source appears to be the same as Yahoo Finance, and is hence subject to the same errors.

Quandl provides a common API access for R and Python to a large number of commercial databases, and some that are free. While comprehensive, one may need to subscribe to data from a number of providers.

The Fama-French Data Library provides a large amount of historical data compiled by Eugene Fama and Kenneth French. The data is updated regularly, and the Fama-French 3-factor data is especially useful for analyzing fund and portfolio performance.

Data from these sources can either be downloaded directly as a .csv or .xls file, or in most cases imported into R using an API or package.

All stocks trading in a market are associated with a *ticker symbol*, which serves as an identifier for the specific security. Tickers are often specific to a particular exchange or country of listing. This can lead to confusion or ambiguity when a company has cross-listings — for instance, the Japanese car manufacturer Toyota is listed as `7203` on the Tokyo Stock Exchange, `TYT` on the London Stock Exchange, and `TM` on the New York Stock Exchange.

Some identical securities have different names. Depending on the source, searching for data on the S&P 500 (a major American stock market index) might require the ticker `SPX` (Bloomberg), `^GSPC` (Yahoo Finance), `INX` (Google Finance) and so on.

When searching for a particular security from the data source, it is good practice to verify that the ticker symbol used indeed corresponds to the correct data series by checking the data description.

It is important to understand that tickers can change over time (for example due to a merger or name change) and can be recycled. This can create problems if not taken into consideration when querying financial data. Other security identifiers do not change over time. The *ISIN*, or International Securities Identification Number, is an internationally recognized 12-character code that is unique to each stock; unlike tickers, a stock's ISIN is the same regardless of the market. Another important identifier is the *PERMNO*, the permanent issue identifier of the CRSP dataset.

A date generally has three components, year (YYYY), month (MM) and day (DD). Depending on the granularity of data, there may also be a time component: hours, minutes, seconds and fractions of a second. Software packages often define some arbitrary date as origin. For example, R's origin date/time is set as January 1 1970 00:00:00, and all dates are relative to that.
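A small sketch of the origin convention in R:

```r
# Dates in R are stored as the number of days since the origin, 1970-01-01
as.numeric(as.Date("1970-01-01"))  # 0
as.numeric(as.Date("1970-01-02"))  # 1
# Going the other way, a day count can be turned back into a Date
as.Date(17787, origin = "1970-01-01")  # 2018-09-13
```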

Dates often present problems when using multiple languages or programs to carry out numerical work. Excel has two different conventions for dates depending on version — the origin year can either be 1900 or 1904 — this requires verification before use. Furthermore, Excel does not allow dates before 1900.
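As an illustrative sketch of the two Excel conventions (the serial number 43356 here is just an example value): Excel's 1900 date system is usually converted with origin `"1899-12-30"`, an offset which absorbs Excel's spurious leap day 1900-02-29, while the 1904 system uses origin `"1904-01-01"` with serials 1462 days smaller.

```r
# Converting Excel serial dates to R Dates under the two systems
as.Date(43356, origin = "1899-12-30")         # 1900-system serial -> 2018-09-13
as.Date(43356 - 1462, origin = "1904-01-01")  # the same date under the 1904 system
```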

A date can be represented numerically in many different ways. Consider the date 13 September 2018:

| Format | Example |
|---|---|
| DD-MM-YYYY | 13-09-2018 |
| MM-DD-YYYY | 09-13-2018 |
| YYYY-MM-DD | 2018-09-13 |
| YYYYMMDD | 20180913 |

The YYYYMMDD convention is the best choice, for two reasons:

- It can be stored as an integer rather than a string, making data handling more convenient
- It sorts naturally (in chronological order) even as an integer
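A quick sketch of why the integer form sorts correctly:

```r
# YYYYMMDD integers sort chronologically with a plain numeric sort
d <- c(20180913, 20171231, 20180101, 20161105)
sort(d)  # 20161105 20171231 20180101 20180913
# The same dates written as DD-MM-YYYY strings do not sort chronologically
sort(c("13-09-2018", "31-12-2017", "01-01-2018", "05-11-2016"))
```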

Given the need to ensure accurate data, it is best to separate downloading from processing: get the data from a provider, save it as a CSV (or Excel) file, and then process the data in a separate run.

There are a few reasons why this is preferable:

- Limits on API calls: many data providers place limits on the amount or speed of data downloads
- Minimizing the time spent importing data: especially relevant for large datasets, since reading a local CSV/xlsx file is relatively quick
- Quick checking for data omissions: it is useful to have a pre-analysis check of data quality using plots and tables

**Never** manipulate your data in Excel. Perform any analysis in R or the programming language of your choice. This keeps your analysis traceable and helps prevent mistakes.
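A minimal sketch of this download-once, process-separately pattern; the filename and the toy "download" step are placeholders for your provider-specific code:

```r
# Download the data only if a local copy does not exist yet;
# all later runs read the cached CSV instead of calling the API again
get_data <- function(file = "returns_cache.csv") {
  if (!file.exists(file)) {
    # placeholder for the real API call to your data provider
    df <- data.frame(date = c(20200102, 20200103, 20200106),
                     ret = c(0.84, -0.01, 0.36))
    write.csv(df, file, row.names = FALSE)
  }
  read.csv(file)
}
stocks <- get_data()
head(stocks)
```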

In many cases we will want to repeat a piece of code several times, or execute part of our code only if some conditions are met. Programming languages allow us to do that quite easily with loops and conditionals, which are specific statements that give instructions to computers. In this section we will cover the most important loops and conditionals in R.

For more details you can check the official introduction to R.

A *for loop* evaluates a piece of code for different values. We usually think of *for loops* in two types, depending on the values we iterate over: loops over sequences, and loops over elements.

Probably the most common way to build a *for* statement is to loop over a numeric sequence. For example, if we wanted to `print()` the integers from 1 to 5:

In [1]:

```
for (i in 1:5) {
  print(i)
}
```

In line 1, we are telling R that we want to create an integer sequence from 1 to 5, and to use the name `i` to refer to values from this sequence. Then line 2 is executed for every possible value specified in line 1.

Basically we are telling R:

1. Create a sequence from a to b
2. Take the first element of the sequence and assign it to `i`
3. Execute the code between { } using the current value of `i`
4. When done, check if there are more elements in the sequence:
   - If there are, take the next element, assign its value to `i`, and go back to step 3
   - If there are not, exit the loop

Recall that the syntax `1:5` means "create a sequence from 1 to 5"; if we don't specify the step, it defaults to 1:

In [2]:

```
1:5
```

This is equivalent to doing:

In [3]:

```
seq(1,5,1)
```

So our *for loop* could also be written as:

In [4]:

```
for (i in seq(1,5,1)) {
  print(i)
}
```

It is a convention to create integer sequences using `first_element:last_element`.

The main use of this type of loop is working with **indexes**. For example, suppose we have a vector of 500 elements filled with `NA`:

In [5]:

```
vec <- rep(NA, 500)
```

Let's say we want to replace all odd-indexed elements with a 1, and keep the even ones as `NA`. We can build a *for loop* over the sequence `seq(1, 500, by = 2)`:

In [6]:

```
for (i in seq(1, 500, by = 2)) { # Loop over i = 1, 3, 5, ...
  vec[i] <- 1 # Replace the i-th element of vec with 1
}
```

An interesting application of *for loops* over sequences is constructing the Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, 21, ... Recall that the j-th term of the Fibonacci sequence is the sum of the (j-1)-th and (j-2)-th terms.

Let's say we want to create a vector with the first 50 terms of the Fibonacci sequence:

In [7]:

```
fibonacci <- rep(NA, 50) # Create the empty vector
fibonacci[1:2] <- 1      # The first two values are 1
for (i in 3:50) {        # Loop over the indexes 3 to 50
  fibonacci[i] <- fibonacci[i-1] + fibonacci[i-2] # i-th term is the sum of the two previous terms
}
```

In [8]:

```
fibonacci
```

We could put this into a function that gives us the N-th Fibonacci number:

In [9]:

```
fibonacci <- function(N) {
  fibo <- rep(NA, N)
  fibo[1:2] <- 1
  for (i in 3:N) {
    fibo[i] <- fibo[i-1] + fibo[i-2]
  }
  paste0("The ", N, "-th Fibonacci number is: ", fibo[N])
}
fibonacci(10)
fibonacci(50)
```

Or, for example, if you have a vector with different model names:

In [10]:

```
model <- c("GARCH", "EWMA", "HS")
```

And another vector with a value associated with each model respectively:

In [11]:

```
violation_ratio <- c(1.06, 0.95, 0.88)
```

You can output this as:

In [12]:

```
for (i in 1:3){
  cat("The value for the", model[i], "model is:", violation_ratio[i], "\n")
}
```

The other common way to think of *for loops*, although technically analogous, is to loop over the elements of a vector, which don't necessarily have to be numbers. The syntax is similar.

For example, consider the `model` vector defined above:

In [13]:

```
model
```

You can loop over the vector `model` by writing `for (element in model)`. Note that `element` is just a placeholder; it can be any name. Let's say we want to print a message for each model on the list:

In [14]:

```
for (model_type in model){
  cat("My favorite model is", model_type, "\n")
}
```

Another example:

In [15]:

```
estimation_windows <- c(300, 500, 1000, 2000) # Vector of different "windows"
S <- 5000
rand <- rnorm(S) # Create a vector of 5000 random numbers
for (window in estimation_windows){ # Loop over estimation windows
  # Get the last window elements of rand
  tmp <- rand[(S - window + 1):S]
  cat("Average of last", window, "elements: ", mean(tmp), "\n")
}
```

A conditional, or *if*, statement evaluates an expression that is either `TRUE` or `FALSE` and takes a different action depending on the value. If the condition is `TRUE`, the code inside the braces { } is executed; if it is `FALSE`, it is not. You can specify a different action for the `FALSE` case with an *else* statement, and you can even include *else if* statements to evaluate another condition if the first one was not met.

A classic example:

In [16]:

```
X <- 5
if (X > 0) {
  print("Positive")
} else if (X < 0) {
  print("Negative")
} else {
  print("Zero")
}
```

With this we can easily define a function that checks if a number is even:

In [17]:

```
is_even <- function(x){
  if (x %% 2 == 0) {
    cat(x, "is even", "\n")
  } else {
    cat(x, "is odd", "\n")
  }
}
is_even(4)
is_even(7)
```

A *while* loop evaluates a condition and, as long as it is `TRUE`, executes a piece of code. **NOTE:** you need to make sure the loop will not run indefinitely, so at some point inside the loop the condition must become `FALSE`.

A basic usage is:

In [18]:

```
a <- 1
while (a < 6) {
  print(a)
  a <- a + 1
}
```

Normally, you use *while* loops when the number of times the loop has to be executed is not pre-determined. For example, you can create a simple function that uses a *while* loop to tell you how many times you need to halve a number before its absolute value is less than 0.5:

In [19]:

```
less_half <- function(x){
  steps <- 0
  while (abs(x) > 0.5) {
    x <- x / 2
    steps <- steps + 1
  }
  cat("Number of steps:", steps, "\n")
}
less_half(10)
less_half(0.2)
less_half(-500)
```

Objects in R can be of different *classes*. For example, we can have a vector, which is an ordered array of observations of the same data type (numbers, characters, logicals). We can also have a matrix, which is a rectangular arrangement of elements of the same data type. You can check what *class* an object is by running `class(object)`.

A data frame is a two-dimensional structure in which each column contains the values of one variable and each row contains one set of values, or "observation", from each column. It is perhaps the most common way of storing data in R and the one we will use the most.

One of the main advantages of a data frame over a matrix is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to a single data type. Keep in mind that all columns of a data frame must have the same length.
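A short sketch of that difference: mixing types in a matrix silently coerces everything, while a data frame keeps each column's own type.

```r
# A matrix holds one data type: mixing numbers and text coerces all to character
m <- cbind(1:3, c("A", "B", "C"))
class(m[1, 1])  # "character" - the numbers became text
# A data frame keeps each column's own type
df <- data.frame(num = 1:3, txt = c("A", "B", "C"), stringsAsFactors = FALSE)
sapply(df, class)  # num is still integer, txt is character
```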

There are different ways to create a data frame. We will focus on three:

In finance it is very common to have a `.csv` file with the data you want to analyze — for example, the output of any of the financial data sources mentioned above. We can import a `.csv` file directly into R as a `data.frame` object using the function `read.csv()`.

The file *stocks.csv* holds daily returns for 6 different companies for the years 2015 to 2019. It has been downloaded from the CRSP database. Let's import it:

In [20]:

```
stocks <- read.csv("stocks.csv") # Importing data to a dataframe
```

After importing a file, it is recommended to see what it looks like before performing any type of analysis. For this, we can use the function `head()` to see the first elements of our object:

In [21]:

```
head(stocks)
```

It is **very** important to check the documentation of the functions we are using to make sure we are doing things right. For example, we need to be careful if:

- The CSV file has headers or not. A *header* is the title of a column; the first row of a csv file may or may not contain the column titles
- The separator is a comma (,), a semicolon (;), a tab (\t) or another symbol
- We are getting data from a country where the decimal symbol is a comma (,) instead of a period (.)
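A small self-contained sketch of these options, writing a toy European-style file first so the example runs on its own (the filename is just for illustration):

```r
# Toy file: semicolon separator, comma as decimal symbol, no header row
writeLines(c("1;2,5", "2;3,7"), "european_demo.csv")
df <- read.csv("european_demo.csv", header = FALSE, sep = ";", dec = ",")
df
# read.csv2() assumes this European convention (sep = ";", dec = ",") by default
read.csv2("european_demo.csv", header = FALSE)
```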

By checking the documentation of a function, we will see how to deal with each of these specific cases and more. To do so, run `?read.csv`

In [22]:

```
?read.csv
```

Now let's check that our variable `stocks` is actually a data frame:

In [23]:

```
class(stocks)
```

We can check the *structure* of a data frame using the function `str()`. This can be quite useful to see what every column of the data frame holds, and what type of data it has:

In [24]:

```
str(stocks)
```

We see the data frame has 8 columns and 1258 rows. The *structure* function shows us the names of each column, along with the data type it holds, and some observations. We see there is a column called X that serves as an index, with variable type integer, a column for every stock, which holds numeric data, and finally a column for dates, which is of type factor.

We can also check parts of the structure with different functions:

In [25]:

```
dim(stocks)
nrow(stocks)
ncol(stocks)
colnames(stocks)
```

In some cases you might want to create a data frame from a list of vectors. This can easily be done with the `data.frame()` function:

In [26]:

```
df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3))
```

You have to specify the name of each column, and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:

In [27]:

```
str(df)
dim(df)
colnames(df)
```

Note that the elements of `col2` are stored as `Factors`, since this is the default in `data.frame()` (in R versions before 4.0). We can check the documentation and use an option to keep them as `chr`:

In [28]:

```
df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3),
                 stringsAsFactors = FALSE)
str(df)
```

You might want to transform a matrix into a data frame (or vice versa). For example, you may need an object to be a matrix to perform linear algebra operations, but would like to keep the results as a data frame afterwards. You can easily switch from matrix to data frame using `as.data.frame()` (and analogously, from data frame to matrix with `as.matrix()`; remember all columns need to have the same data type to form a matrix).

For example, let's say we have the matrix:

In [29]:

```
my_matrix <- matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
class(my_matrix)
my_matrix
```

We can now transform it into a data frame:

In [30]:

```
df <- as.data.frame(my_matrix)
class(df)
df
str(df)
```

And we can change the column names:

In [31]:

```
colnames(df) <- c("Odd", "Even")
str(df)
```

There are many different ways to subset a data frame and access its elements. We can subset a data frame by its column/row position, by names, using logicals, etc. Each way can be useful for a different situation:

In [32]:

```
head(stocks)
```

In [33]:

```
# Subsetting by row numbers
stocks[2:3,]
```

In [34]:

```
# Subsetting by column number
head(stocks[, 4])
```

In [35]:

```
# Subsetting by col names
head(stocks[c("MSFT", "GE")])
```

In [36]:

```
# Subsetting using $
head(stocks$MSFT)
```

In [37]:

```
# Subsetting both rows and columns
stocks[1:2, c("JPM", "C")]
```

In [38]:

```
# Subsetting using a logical
head(stocks[stocks$JPM > 0, c("JPM", "date")])
```

In [39]:

```
# Subsetting using a vector
v <- c(1, 2, 4, 8, 16)
stocks[v,]
```

Let's store the stock returns for JP Morgan and dates in a different variable:

In [40]:

```
jpm <- stocks[, c("JPM", "date")]
head(jpm)
class(jpm)
dim(jpm)
```

Note that if we had only taken the returns, the output would have been a vector and not a data frame:

In [41]:

```
jpm2 <- stocks[, c("JPM")]
head(jpm2)
class(jpm2)
```

What if we wanted to create another column called `negative` that tells us whether the returns for a particular day were negative?

We can easily add a new variable by using the syntax:
`data_frame$new_variable`

In [42]:

```
# We can use an ifelse to assign values conditional on another value
jpm$negative <- ifelse(jpm$JPM < 0, TRUE, FALSE)
head(jpm)
```

If we have datasets with a common column, we can use the `merge()` function and R will create a new dataset using that common column as a join key. For example:

In [43]:

```
MSFT <- stocks[, c("date", "MSFT")]
INTC <- stocks[, c("date", "INTC")]
head(MSFT)
head(INTC)
```

If there are several common variables, we can specify how to join the data frames (for more information run `?merge`), but in a simple case like this R is smart enough to know what we mean:

In [44]:

```
new_df <- merge(MSFT, INTC)
head(new_df)
```

The package `reshape2` provides various useful functions to reshape a dataset. Consider the following raw output from the CRSP database, covering the same stocks' returns and time period:

In [45]:

```
crsp <- read.csv("crsp.csv")
head(crsp)
```

If we want to create a data frame with the returns for every stock, we can use the `dcast()` function:

In [46]:

```
library(reshape2)
RET <- dcast(crsp, date ~ PERMNO, value.var = "RET")
head(RET)
```

Since we used `PERMNO` as the variable to split the `RET` observations, the PERMNOs are now the column names. We can easily rename them:

In [47]:

```
names(RET) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
head(RET)
```

Once we have finished cleaning and handling our dataset, we can save it so we don't have to repeat the procedure next time we want to use it. The most common file formats for saving a data frame are `.RData` and `.csv`.

`.RData` files are specific to R and can store as many objects as you'd like within a single file. You can save your entire environment (all your variables) in a single file and then easily load it and continue working where you left off. To save our data frame in a `.RData` file we can run:

In [48]:

```
save(RET, file = "RET.RData")
```

It can then be loaded directly into R using `load("RET.RData")`.

To save your data frame as a `.csv` file you can use the function `write.csv()`:

In [49]:

```
write.csv(RET, file = "RET.csv")
```

Most financial applications involve working with dates, whether monthly, weekly, daily, or even intra-day data. Storing dates as text is not helpful, since we cannot order or subset them easily. R has a specific data type called `Date`. In this section we will explore some packages that help us work with `Date` objects.

The `lubridate` package provides functions to work with date-times and time-spans. The full documentation is here: https://cran.r-project.org/web/packages/lubridate/lubridate.pdf

The two main functions we will use are `ymd()` and `dmy()`, which stand for year-month-day and day-month-year respectively. These are quite powerful functions that transform a string into a `Date` object. As long as the string we pass follows the year-month-day structure for `ymd()`, the function will recognize it and convert it into a date. For example:

In [50]:

```
library(lubridate)
ymd("20200110")
class(ymd("20200110"))
```

In [51]:

```
ymd("2015JAN11")
class(ymd("2015JAN11"))
```

In [52]:

```
ymd("04-MAR-5")
class(ymd("04-MAR-5"))
```

In [53]:

```
dmy("1/june/2019")
class(dmy("1/june/2019"))
```

In [54]:

```
dmy("28-december-14")
class(dmy("28-december-14"))
```

The `zoo` package provides functions to work with ordered, indexed observations and is very commonly used in time series analysis. The complete documentation is available here: https://cran.r-project.org/web/packages/zoo/zoo.pdf

In [55]:

```
ret <- c(0.18, 0.02, -0.29, 0.00, 0.15)
class(ret)
```

In [56]:

```
dates <- c("20190121", "20190122", "20190123", "20190124", "20190125")
class(dates)
```

In [57]:

```
library(zoo)
ret.ts <- zoo(ret, order.by = ymd(dates))
class(ret.ts)
```

In [58]:

```
ret.ts
```

The `lag()` function allows us to take lags or leads of a time series object. The syntax is:

`lag(x, k, na.pad = F)`

where:

- x = a time series object to lag
- k = number of lags (in units of observations); can be positive or negative
- na.pad = if TRUE, pads with NAs for missing observations so the original length is preserved

In [59]:

```
ret.ts
lag(ret.ts, k = 2)
```

In [60]:

```
ret.ts
lag(ret.ts, k = 2, na.pad = TRUE)
```

In [61]:

```
ret.ts
lag(ret.ts, k = -2)
```

`diff()` takes the lagged difference of a time series. The syntax is:

`diff(x, lag, differences, na.pad = F)`

where:

- x = a time series object
- lag = number of lags (in units of observations)
- differences = the order of the difference

In [62]:

```
ret.ts
```

In [63]:

```
diff(ret.ts, lag = 1, na.pad = TRUE)
```

In [64]:

```
diff(ret.ts, lag = 1, differences = 2, na.pad = TRUE)
```

In [65]:

```
diff(ret.ts, lag = 2, differences = 1, na.pad = TRUE)
```

`rollapply()` is a function for applying another function to rolling windows of an array. For example, you can construct a rolling sum or mean. The syntax is:

`rollapply(data, width, FUN, fill, align, ...)`

where:

- data = a time series object
- width = the window width
- FUN = the function to apply
- fill = the value (e.g. NA) used to replace missing observations
- align = either "left", "right", or "center" (default)

In [66]:

```
# Example: Sum the last two consecutive returns
ret.ts
```

In [67]:

```
rollapply(ret.ts, width = 2, FUN = sum, align = "left")
```

In [68]:

```
rollapply(ret.ts, width = 2, FUN = sum, align = "right", fill = NA)
```

When plotting a zoo object, R will automatically use the dates as the x-axis. Let's create a series of fictitious returns to see this:

In [69]:

```
ret <- rt(2000, df = 4) # Random draws from a Student-t with 4 degrees of freedom
dates <- seq(Sys.Date(), by = "-1 day", length.out = 2000) # 2000 days back from today
ret.ts <- zoo(ret, order.by = ymd(dates)) # Transform to zoo
```

In [70]:

```
plot(ret.ts, main = "Returns from the past 2000 days",
xlab = "Date", ylab = "Returns", col = "mediumblue")
```

We can use the `window()` function to subset a zoo object to a given time period. For example, let's say we are interested in the returns of 2019. Since it is a zoo object, R will automatically adjust the x-axis to show months:

In [71]:

```
sub_ret.ts <- window(ret.ts, start = "2019/1/1", end = "2019/12/31")
plot(sub_ret.ts, main = "Returns from 2019",
xlab = "Date", ylab = "Returns", col = "mediumblue")
```

Consider the linear model:

$$y_i = \alpha + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_K x_{i,K} + \varepsilon_i$$

where $E[\varepsilon|X] = 0$.

In matrix notation:

$$y = X \beta + \varepsilon$$

where $X$ is a matrix containing one independent variable in each column, $\beta$ is a vector of coefficients and $\varepsilon$ is a vector of errors.

The Ordinary Least Squares (OLS) estimator $\hat{\beta}$ minimizes the sum of squared residuals:

$$\hat{\beta} = \textrm{arg min}\ \sum_i (y_i - \alpha - \beta_1 x_{i,1} - \beta_2 x_{i,2} - \dots - \beta_K x_{i,K})^2$$

Or in matrix notation:

$$\hat{\beta} = \textrm{arg min}\ (y- X \beta)'(y -X \beta)$$

Taking the first order condition:

$$-2X'y + 2X'X\hat{\beta} = 0$$

Assuming invertibility of $(X'X)$, we have:

$$\hat{\beta} = (X'X)^{-1}X'y$$

Notation and definitions:

- Fitted values: $\hat{y} = X\hat{\beta}$
- Residuals: $\hat{\varepsilon} = y - \hat{y}$
- Total Sum of Squares (SST): $(y-\bar{y})'(y-\bar{y})$
- Sum of Squares of Residuals (SSR): $\hat{\varepsilon}'\hat{\varepsilon}$
- R squared: 1 - SSR/SST

We can use matrix algebra in R to solve OLS. We will use the file "Shiller.csv". This dataset was developed by Robert Shiller and includes historical monthly data on stock prices, dividends, earnings, and the consumer price index (to allow conversion to real values), all starting in January 1871. More details on the dataset can be found here: http://www.econ.yale.edu/~shiller/data.htm

In [72]:

```
# Data load and cleaning
df <- read.csv("Shiller.csv", header = T, na.strings = NA)
head(df)
df$date <- paste0(df$date, "01") # Add days to dates
# Create time series
ret.ts <- zoo(df$R_r, order.by = ymd(df$date))
PDdiff.ts <- diff(zoo(df$PD_r, order.by = ymd(df$date)), na.pad = T)
head(ret.ts)
head(PDdiff.ts)
# Adjust length
ret.ts <- ret.ts[2:length(ret.ts)]*100
PDdiff.ts <- PDdiff.ts[2:length(PDdiff.ts)]
head(ret.ts)
head(PDdiff.ts)
```

Now that we have cleaned the data, let's follow the OLS notation. We want to regress returns on changes in the price-dividend ratio:

In [73]:

```
X <- cbind(rep(1,NROW(PDdiff.ts)), as.vector(PDdiff.ts)) # Add column of 1s
y <- as.vector(ret.ts)
Xt <- t(X)
```

We use the function `solve()` to get a matrix inverse. Recall that we need to enclose matrix operators in `%` signs, e.g. `%*%` for matrix multiplication:

In [74]:

```
betaHat <- solve(Xt %*% X) %*% (Xt %*% y)
retHat <- X %*% betaHat
```

In [75]:

```
betaHat
```

In [76]:

```
retHat.ts <- zoo(retHat, order.by = index(ret.ts))
resid.ts <- ret.ts - retHat.ts
```

In [77]:

```
SSR = sum((ret.ts - retHat.ts)^2)
SST = sum((ret.ts - mean(ret.ts))^2)
R2 = 1 - (SSR/SST)
```

You should never do OLS by hand: it wastes time and invites mistakes. The R function that takes care of this is `lm()`, which stands for linear models. It is intuitive, easy to use, and provides much more information than just the parameters. The syntax is:

`lm(formula, data)`

where:

- formula = an object of the form *depvar* ~ *indvar1 + ... + indvarK*; R includes a constant by default, add 0 among the independent variables to exclude it
- data = the object containing the variables in the model
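A small sketch of the formula syntax, including dropping the constant with the 0; `x` and `y` here are simulated, not the Shiller data:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)  # true slope 2, true intercept 0
coef(lm(y ~ x))          # intercept and slope estimated
coef(lm(y ~ 0 + x))      # the 0 removes the intercept; slope only
```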

In [78]:

```
reg <- lm(ret.ts ~ PDdiff.ts) # First dependent variable followed by ~ and formula
summary(reg)
```

Let's explore the attributes we can obtain from an object created by `lm()`:

In [79]:

```
# Everything part of the object reg
names(reg)
```

In [80]:

```
reg$coefficients
```

In [81]:

```
alpha <- reg$coefficients[1]
alpha
# To compare with our OLS by hand
betaHat[1]
```

In [82]:

```
beta <- reg$coefficients[2]
beta
# To compare with our OLS by hand
betaHat[2]
```

In [83]:

```
std_alpha <- summary(reg)$coefficient[1,2]
std_alpha
```

In [84]:

```
std_beta <- summary(reg)$coefficient[2,2]
std_beta
```

In [85]:

```
y_fitted <- reg$fitted.values
```

In [86]:

```
R2_lm <- summary(reg)$r.squared
R2_lm
# To compare with our OLS by hand
R2
```

In [87]:

```
resid_lm <- reg$residuals
head(resid.ts)
head(resid_lm)
```

Another package worth mentioning is `dynlm`, which focuses on dynamic linear models and time series regressions. Its documentation is available on CRAN.

The Maximum Likelihood Estimator (MLE) chooses the parameters that most likely generated the observed data.

Hence:

$$\hat{\beta}_{\textrm{MLE}} = \textrm{arg max}\ f(y_1,y_2, ..., y_T | \beta)$$

We do not know the joint distribution of returns. However, if we strengthen the OLS assumptions and assume a distribution for the error terms, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, we can write the conditional density of y given x, where x is a row of the stacked matrix X defined above:

$$f(y|x) = \frac{1}{(2\pi \sigma^2)^\frac{1}{2}}\textrm{exp}\left(-\frac{1}{2\sigma^2}(y-x'\beta)^2\right)$$

Under the assumption that observations are mutually independent, the conditional density of $(y_1, ..., y_n)$ given $(x_1, ..., x_n)$ is:

$$f(y_1, ..., y_n | x_1, ..., x_n) = \prod_{i=1}^{n} f(y_i | x_i)$$

$$= \prod_{i=1}^{n} \frac{1}{(2\pi \sigma^2)^\frac{1}{2}}\textrm{exp}\left(-\frac{1}{2\sigma^2}(y_i-x_i'\beta)^2\right)$$

$$= \frac{1}{(2\pi \sigma^2)^\frac{n}{2}}\textrm{exp}\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i-x_i'\beta)^2\right)$$

$$:= \mathcal{L}(\beta, \sigma^2)$$

This is called the **likelihood** function. For convenience, it is typical to work with its natural logarithm, the **log-likelihood** function:

$$\log \mathcal{L}(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi \sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i-x_i'\beta)^2$$

So the ML estimators are the values of $\beta$ and $\sigma$ that maximize the log-likelihood.

To solve this in R, we can write our own log-likelihood function. The inputs we are going to use are:

- par = vector of parameters (in our simple case c($\sigma, \alpha, \beta$))
- depvar = the vector y
- indvar = the matrix X

In [88]:

```
# Log-likelihood function for OLS
LL.OLS <- function(par, depVar, indVar) {
  sigma = par[1]
  alpha = par[2]
  beta = par[3]
  betaVec = c(alpha, beta)
  T = nrow(indVar)
  out = -(T/2)*log(2*pi*sigma^2) - 1/(2*sigma^2) * (t(depVar - indVar %*% betaVec) %*% (depVar - indVar %*% betaVec))
  return(-out) # Return the negative since by default an optimizer minimizes, not maximizes
}
```

Now that we have the log-likelihood function ready, we can use the function `optim()` to optimize:

In [89]:

```
# Estimate ML
res <- optim(c(1,0,0), LL.OLS, gr = NULL, y, X, hessian = T)
```

In [90]:

```
# Getting standard errors
H <- res$hessian # Hessian
I <- solve(H) # Invert Hessian
se <- sqrt(diag(I)) # Sqrt of diagonal elements of the inverse
```
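
As a sanity check on this approach (a sketch with simulated data, not the data above), the Hessian-based standard errors from `optim()` should roughly match those reported by `lm()`:

```r
# Compare MLE standard errors (inverse Hessian) with lm() standard errors
set.seed(11)
n <- 1000
x <- rnorm(n)
y <- 0.5 + 1.5 * x + rnorm(n)
X <- cbind(1, x)
# Negative log-likelihood of the linear model with normal errors
nll <- function(par) {
  s <- par[1]
  e <- y - X %*% par[2:3]
  (n/2)*log(2*pi*s^2) + sum(e^2)/(2*s^2)
}
fit <- optim(c(1, 0, 0), nll, hessian = TRUE)
se_mle <- sqrt(diag(solve(fit$hessian))) # sqrt of diagonal of inverse Hessian
se_lm <- summary(lm(y ~ x))$coefficients[, 2]
round(cbind(MLE = se_mle[2:3], lm = se_lm), 4)
```

The two columns differ only by small-sample corrections and numerical error.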

In [91]:

```
# Getting the parameters
sigmaMLE <- res$par[1]
betaVecMLE <- c(res$par[2], res$par[3])
yHatMLE <- X %*% betaVecMLE
tstat <- betaVecMLE / se[2:3]
print(betaVecMLE)
```

In contrast to OLS models, GARCH models have to be estimated using Maximum Likelihood methods, which is essentially how GARCH packages like `rugarch` fit their models. Before exploring the GARCH packages further, let's use R to estimate an ARCH(1) and a GARCH(1,1) model.

The data we are going to use is the price series of the S&P 500 index. Let's load the data and transform the prices to returns:

In [92]:

```
sp <- read.csv("sp500.csv")
ret <- diff(log(sp$price))*100
ret <- ret[2:length(ret)]
```

In an ARCH(1) model, returns follow:
$$Y_t = \sigma_t Z_t$$
$$\sigma^2_t = \omega + \alpha Y_{t-1}^2$$
$$Z_t \sim \mathcal{N}(0,1)$$

The presence of lagged returns implies the density for t = 1 is unknown, since $y_0$ is unknown. At t = 2, the density is:

$$f(y_2|y_1) = \frac{1}{\sqrt{2\pi(\omega + \alpha y_{1}^2)}} exp\left(-\frac{1}{2} \frac{y_2^2}{\omega + \alpha y_{1}^2}\right)$$

So the joint density is:

$$\prod_{t=2}^{T} f(y_t) = \prod_{t=2}^{T} \frac{1}{\sqrt{2\pi(\omega + \alpha y_{t-1}^2)}} exp\left(-\frac{1}{2} \frac{y_t^2}{\omega + \alpha y_{t-1}^2}\right)$$

Therefore the log-likelihood function for the ARCH(1) model is:

$$\log \mathcal{L} = -\frac{T-1}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{T}\left(\log(\omega + \alpha y_{t-1}^2) + \frac{y_t^2}{\omega + \alpha y_{t-1}^2}\right)$$

Let's now write the function in R:

In [93]:

```
# ARCH(1) MLE
LL.ARCH1 <- function(par, x) {
  omega <- par[1] # First element of par
  alpha <- par[2] # Second element of par
  T <- length(x) # Number of observations
  loglikelihood <- -(T-1)/2 * log(2*pi)
  # Loop (create the sum) and add every term to the log-likelihood
  for (i in 2:T) {
    loglikelihood = loglikelihood - 1/2 * (log(omega + alpha*x[i-1]^2) + x[i]^2/(omega + alpha*x[i-1]^2))
  }
  return(-loglikelihood) # Return the negative log-likelihood for minimization
}
res <- optim(c(0.1,0.5), LL.ARCH1, gr = NULL, ret, hessian = T) # Optimization
cat("Omega:", res$par[1], "\n",
    "Alpha:", res$par[2])
```
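
As a sanity check (a sketch with simulated data and hypothetical true parameter values, using a vectorized version of the same likelihood), we can verify that this approach recovers known ARCH(1) parameters:

```r
# Simulate an ARCH(1) series with known (hypothetical) parameters
set.seed(42)
T <- 5000
omega_true <- 0.2; alpha_true <- 0.5
y <- numeric(T)
y[1] <- rnorm(1, sd = sqrt(omega_true / (1 - alpha_true))) # unconditional sd
for (t in 2:T) y[t] <- sqrt(omega_true + alpha_true * y[t-1]^2) * rnorm(1)

# Vectorized negative log-likelihood, equivalent to the loop in LL.ARCH1
nll <- function(par, x) {
  s2 <- par[1] + par[2] * x[-length(x)]^2 # sigma_t^2 for t = 2, ..., T
  0.5 * sum(log(2*pi*s2) + x[-1]^2 / s2)
}
fit <- optim(c(0.1, 0.3), nll, x = y, method = "L-BFGS-B", lower = c(1e-6, 0))
fit$par # estimates should be close to omega_true and alpha_true
```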

In a GARCH(1,1) model, the conditional volatility follows:

$$\sigma_t^2 = \omega + \alpha Y_{t-1}^2 + \beta \sigma_{t-1}^2$$

The conditional density is:

$$f(y_2 | y_1) = \frac{1}{\sqrt{2\pi(\omega + \alpha y_{1}^2 + \beta\hat{\sigma}_1^2)}} exp\left(-\frac{1}{2} \frac{y_2^2}{\omega + \alpha y_{1}^2 + \beta\hat{\sigma}_1^2} \right)$$

So the log-likelihood function is:

$$\log \mathcal{L} = -\frac{T-1}{2}\log(2\pi) - \frac{1}{2} \sum_{t=2}^{T} \left(\log(\omega + \alpha y_{t-1}^2 + \beta\hat{\sigma}_{t-1}^2) + \frac{y_{t}^2}{\omega + \alpha y_{t-1}^2 + \beta\hat{\sigma}_{t-1}^2} \right)$$

Writing the function in R:

In [94]:

```
# GARCH(1,1) MLE
LL.GARCH1_1 <- function(par, x) {
  omega <- par[1] # First element of par
  alpha <- par[2] # Second element of par
  beta <- par[3] # Third element of par
  T <- length(x) # Number of observations
  loglikelihood <- -(T-1)/2 * log(2*pi)
  sigma_2 <- rep(NA, T)
  sigma_2[1] <- var(x) # Initialize sigma with the unconditional variance
  # Loop (create the sum) and add every term to the log-likelihood
  for (i in 2:T) {
    loglikelihood = loglikelihood - 1/2 * (log(omega + alpha*x[i-1]^2 + beta*sigma_2[i-1])
                                           + x[i]^2/(omega + alpha*x[i-1]^2 + beta*sigma_2[i-1]))
    sigma_2[i] <- omega + alpha*x[i-1]^2 + beta*sigma_2[i-1] # Update sigma
  }
  return(-loglikelihood) # Return the negative log-likelihood for minimization
}
res <- optim(c(0.1, 0, 0), LL.GARCH1_1, gr = NULL, ret, hessian = T,
             method = 'L-BFGS-B', lower = c(0,0,0)) # Optimization
cat("Omega:", res$par[1], "\n",
    "Alpha:", res$par[2], "\n",
    "Beta:", res$par[3])
## Note that the values obtained in GARCH ML will depend on the optimization method used
```

Some important components of the optimization process are:

- Value of $\sigma_1^2$: we need to assign a value to $\sigma_1^2$, the first iteration of the conditional variance. Generally, we choose the unconditional variance of the data
- Optimization method: we have chosen the L-BFGS-B algorithm, since it allows us to place bounds on the parameters. We have imposed non-negativity by setting `lower = c(0,0,0)`, since this is desired in GARCH
- Initial values of the parameters: optimization works iteratively, so you need to assign starting values to the parameters, making sure these comply with the restrictions of the function and the parameter bounds. This usually has a small effect on the estimated parameter values, but can have a large effect on computing time

In this section we used the function `optim()` to carry out the maximum likelihood estimation. This was enough for our purposes, but for more complex optimization problems `optim()` can fail. We recommend the package *nloptr*, which currently is the state of the art for minimizing functions in non-linear settings.

The packages that we use in the course for estimating GARCH models are *rugarch* for univariate models, and *rmgarch* for multivariate models. The complete documentation of the packages is:

- rugarch: https://cran.r-project.org/web/packages/rugarch/rugarch.pdf
- rmgarch: https://cran.r-project.org/web/packages/rmgarch/rmgarch.pdf

Both packages were developed by Alexios Galanos and are constantly maintained and updated. The development code can be found here: https://bitbucket.org/alexiosg/ If you are curious about learning more programming, you are encouraged to browse the page to see how the packages were created.

To estimate a univariate GARCH model you need to follow two steps:

- Create an object of the `uGARCHspec` class, which is the specification of the model you want to estimate. This includes the type of model (standard, asymmetric, power, etc.), the GARCH order, the distribution, and the model for the mean
- Fit the specified model to the data

First let's take a look at `ugarchspec()`, which is the function to create an instance of the `uGARCHspec` class. The complete syntax with its default values is:

```
ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1),
submodel = NULL, external.regressors = NULL, variance.targeting = FALSE),
mean.model = list(armaOrder = c(1, 1), include.mean = TRUE, archm = FALSE,
archpow = 1, arfima = FALSE, external.regressors = NULL, archex = FALSE),
distribution.model = "norm", start.pars = list(), fixed.pars = list(), ...)
```

You can check the details of this function and what every argument means by running `?ugarchspec`.

For the purposes of our course, we are going to focus on three arguments of the `ugarchspec()` function:

`variance.model`

This argument takes a `list` with the specifications of the GARCH model. Its most important components are:

- `model`: specifies the type of model; currently implemented models are “sGARCH”, “fGARCH”, “eGARCH”, “gjrGARCH”, “apARCH”, “iGARCH” and “csGARCH”
- `garchOrder`: specifies the ARCH(q) and GARCH(p) orders

Other components include options for `variance.targeting` or specifying `external.regressors`.

`mean.model`

This argument takes a `list` with the specifications of the mean model, if the model is assumed to have one. A traditional assumption is that the model has zero mean, in which case we specify `armaOrder = c(0,0), include.mean = FALSE`. However, it is also common to assume that the mean follows a certain ARMA process.

`distribution.model`

Here you can specify the conditional density to use for the innovations. The default is a normal distribution, but we can specify a Student-t distribution by setting the value to `"std"`.

Once the specification has been created, you can *fit* it to the data using `ugarchfit(spec = ..., data = ...)`. The result will be an object of the class `uGARCHfit`, which is a list containing useful information that will be shown below.

Let's use *rugarch* to estimate a GARCH(1,1) model.

First, let's load the library, and the S&P500 data:

In [95]:

```
library(rugarch)
sp <- read.csv("sp500.csv")
ret <- diff(log(sp$price))*100
ret <- ret[2:length(ret)]
```

Now let's use `ugarchspec()` to specify the type of model we want. In this case we want a GARCH(1,1), so the `garchOrder` in `variance.model` will be set to `c(1,1)`. Note that even if we only have one component, we need to put it inside a `list`.

We will assume zero mean, so we need to include this in `mean.model`:

In [96]:

```
spec1 <- ugarchspec(
  variance.model = list(garchOrder = c(1,1)),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE)
)
```

We can call `spec1` to see what is inside:

In [97]:

```
spec1
```

Now we can fit the specified model to our data using the `ugarchfit()` function:

In [98]:

```
garch1_1 <- ugarchfit(spec = spec1, data = ret)
```

We can check the class of this new object:

In [99]:

```
class(garch1_1)
```

Objects of the `uGARCHfit` class have two *slots*. You can think of each slot as a list of components. The two slots are:

- `@model`
- `@fit`

To access a slot you use the syntax `object@slot`.

Let's explore these to understand better:

The *model* slot includes all the information that was needed to estimate the GARCH model: both the model specification and the data. To see everything included in this slot, you can use the `names()` function:

In [100]:

```
names(garch1_1@model)
```

And you can access each element with the `$` sign. For example, let's see what is inside `pars`:

In [101]:

```
garch1_1@model$pars
```

We get a matrix with all the possible parameters that can be used when fitting a GARCH model. There is a value of 1 for the parameters that were included in our GARCH specification (omega, alpha1 and beta1). The matrix also includes the lower and upper bounds for these parameters, which depend only on the model, not on our particular data.

We can also see the model description:

In [102]:

```
garch1_1@model$modeldesc
```

Inside `garch1_1@model$modeldata` we will find the data used when fitting the model. You can think of the `@model` slot as everything that R needs to know in order to be able to estimate the model.

Inside the *fit* slot we will find the *estimated* model, including the coefficients, likelihood, fitted conditional variance, and more. Let's check everything that is included:

In [103]:

```
names(garch1_1@fit)
```

Let's see the coefficients, along with their standard errors, in `matcoef`:

**Note**: You will see these values are very similar to what we found with our own Maximum Likelihood estimation. Since it is a numerical optimization problem, we can expect some small differences.

In [104]:

```
garch1_1@fit$matcoef
```

We can also see the log-likelihood:

In [105]:

```
garch1_1@fit$LLH
```

This makes it easy to extract the estimated conditional volatility:

In [106]:

```
plot(garch1_1@fit$sigma, type = "l")
```

We can also access the coefficients and likelihood using the following functions:

In [107]:

```
coef(garch1_1)
```

In [108]:

```
likelihood(garch1_1)
```

Instead of a conditionally normal GARCH(1,1) model specification, we can use a t-distribution or introduce asymmetric and power effects very easily.

Let's first fit a GARCH(1,1) using a Student-t conditional distribution. We can see that the coefficients now include `shape`, which is the estimated degrees of freedom:

In [109]:

```
# Student t
spec2 <- ugarchspec(
  variance.model = list(garchOrder = c(1,1)),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
  distribution.model = "std"
)
tGARCH <- ugarchfit(spec = spec2, data = ret)
tGARCH@fit$matcoef
```

Now let's fit an apARCH model with fat tails. The parameter list will now include `gamma1` and `delta`, which are part of the apARCH model specification:

In [110]:

```
# tapARCH model
spec3 <- ugarchspec(
  variance.model = list(model = "apARCH"),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
  distribution.model = "std"
)
tapARCH <- ugarchfit(spec = spec3, data = ret)
tapARCH@fit$matcoef
```

It is important to consider that the more parameters a model has, the more likely it is to run into estimation problems, and the more data we need. For example, as a rule of thumb, a GARCH(1,1) needs at least 500 observations, while a Student-t GARCH needs around 3,000. In general, the more parameters are estimated, the more data is needed to be confident in the estimation.

`ugarchfit()` essentially performs Maximum Likelihood estimation to fit the specified model to our data. This optimization problem requires a *solver*, the numerical method used to maximize the likelihood function. In some cases, the default solver in `ugarchfit()` can fail to converge. We recommend using the option `solver = "hybrid"` when fitting the model, since this will try different optimization methods in case the default one does not converge.
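
The idea behind a hybrid solver can be sketched in base R (this is an illustration of the fallback logic with a toy likelihood, not rugarch's actual implementation):

```r
# Try several optim() methods in turn and keep the first one that converges
set.seed(7)
x <- rnorm(300, mean = 1, sd = 2)
nll <- function(par) -sum(dnorm(x, par[1], abs(par[2]), log = TRUE))
fit_fallback <- function(start, methods = c("L-BFGS-B", "Nelder-Mead", "BFGS")) {
  for (m in methods) {
    res <- tryCatch(optim(start, nll, method = m), error = function(e) NULL)
    if (!is.null(res) && res$convergence == 0) # 0 means successful convergence
      return(list(par = res$par, method = m))
  }
  stop("No solver converged")
}
fit_fallback(c(0, 1))
```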

The package *rmgarch* allows us to easily estimate models with multiple assets, in the same fashion as *rugarch*. The procedure is analogous: we first specify the model we want to use, and then fit it to the data. However, specifying the model has extra steps. To explain this, we will focus on estimating a DCC model.

In a DCC model, we assume that each individual asset follows some univariate volatility model, usually a GARCH, and then we model the correlation between the assets using an ARMA-like process.
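
The correlation dynamics can be sketched in base R (a toy illustration with simulated standardized residuals and hypothetical DCC parameters, not the rmgarch estimator):

```r
# DCC recursion: Q_t = (1 - a - b) Qbar + a z_{t-1} z_{t-1}' + b Q_{t-1}
set.seed(3)
T <- 500
z <- matrix(rnorm(2 * T), ncol = 2) # stand-in standardized residuals
a <- 0.05; b <- 0.9                 # hypothetical DCC parameters
Qbar <- cov(z)                      # unconditional covariance of z
Q <- Qbar
rho <- numeric(T)
for (t in 2:T) {
  Q <- (1 - a - b) * Qbar + a * tcrossprod(z[t-1, ]) + b * Q
  rho[t] <- Q[1, 2] / sqrt(Q[1, 1] * Q[2, 2]) # rescale Q to a correlation
}
range(rho[-1]) # all conditional correlations lie in [-1, 1]
```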

The process for fitting a DCC model using *rmgarch* is then:

- Specify the univariate volatility model each asset follows using `ugarchspec()`
- Use the `multispec()` function to create a multivariate specification. This is essentially a list of univariate specifications. If we are going to use the same one for every asset, we can use `replicate()`
- Create a `dccspec()` object, which takes in the list of univariate specifications for every asset and the additional joint DCC specifications, like `dccOrder` and `distribution`
- Fit the specification to the data

We will use the returns for JPM and Citigroup from the file `RET.RData` to go through this process:

In [111]:

```
library(rmgarch)
load("RET.RData")
y <- RET[c("JPM", "C")]
```

We will assume a simple GARCH(1,1) with zero mean for each stock. We will create a single univariate specification, and then replicate it into `multispec()`:

In [112]:

```
# Create the univariate specification
uni_spec <- ugarchspec(
  variance.model = list(garchOrder = c(1,1)),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE)
)
# Replicate it into a multispec element
mspec <- multispec(replicate(2, uni_spec))
```

Let's take a look inside the `mspec` object. We will see there are two univariate specifications, one per asset, and they are identical:

In [113]:

```
mspec
```

You can check the specifications with `mspec@spec`.

Now we proceed to create the specification for the DCC model using `dccspec()`:

In [114]:

```
spec <- dccspec(
  # Univariate specifications - needs to be a multispec object
  uspec = mspec,
  # DCC specification. We will assume an ARMA(1,1)-like process
  dccOrder = c(1,1),
  # Distribution, here multivariate normal
  distribution = "mvnorm"
)
```

We can call `spec` to see what is inside:

In [115]:

```
spec
```

We can again see more details in `spec@model` and `spec@umodel`.

Now we can proceed to fit the specification to the data:

In [116]:

```
res <- dccfit(spec, data = y)
res
```

We can check what slots are inside:

In [117]:

```
names(res@model)
names(res@mfit)
```

In [118]:

```
# Coefficient matrix
res@mfit$matcoef
```

In [119]:

```
# Log likelihood
res@mfit$llh
```

The matrix `H` inside `res@mfit` contains the conditional covariances. It is 3-dimensional, since it includes the 2x2 covariance matrix for each of the T time periods:

In [120]:

```
H <- res@mfit$H
dim(H)
```

In [121]:

```
# First period's covariances
H[,,1]
```

We can extract the conditional correlations in two ways. One is computing them from H:

In [122]:

```
# Conditional correlation from the covariances
rhoDCC <- H[1,2,] / sqrt(H[1,1,]*H[2,2,])
```

Or we can extract them directly with the `rcor()` function:

In [123]:

```
# Initializing the vector
rhoDCC2 <- vector(length = dim(y)[1])
# Populate with the conditional correlations extracted by rcor()
for (i in 1:dim(y)[1]) {
  rhoDCC2[i] <- rcor(res)[1,2,i]
}
```

In [124]:

```
plot(rhoDCC2, type = "l")
```