4  R and risk forecasting

The programming language we use in this notebook is R. We could also have used Julia, Matlab, or Python, but as we argued in Chapter 3, R is the best choice for the type of work we are doing here.

We don’t expect anybody intending to work with this notebook to have ever programmed in any language before, but we do expect you to have an open mind and be willing to learn. In our experience, students become quite proficient in R with a one-semester course in risk forecasting with R.

4.1 Learning resources

There are many resources for learning R, and we do not wish to duplicate that here. The Big Book of R is a very comprehensive list of resources, but it can be hard to identify what is particularly useful. Here are some resources.

If you encounter a specific R programming problem (or one for any other programming language), Stack Overflow is an excellent place to request help.

4.2 User interfaces

There are several ways one can use R.

4.2.1 RStudio

It is usually best to use the free RStudio software for programming in R.

4.2.2 Quarto

The RStudio vendor also produces Quarto, a publishing system for notebooks and documents that supports R as well as Python and Julia, which we use for these notes. Chapter 11 has more information.

4.2.3 Jupyter

Jupyter notebooks, produced by Project Jupyter, are interactive documents that combine computer code (e.g., R, Python, Julia) with rich text elements (paragraphs, equations, images, links, etc.). They are a great way to make reports and documents. They can be edited on a web server and exported as HTML, PDF via LaTeX, or other file formats, and you can run pieces of code independently.

Some users find the Jupyter way of coding really good, while others find it very confining.

4.2.4 VSCode

One of the best text editors is Visual Studio Code, or VS Code, which provides very useful extensions for R and Quarto. VS Code is our go-to editor for most things and is how we made these notes, along with Quarto.

4.2.5 Base R editor

R comes with a basic editor and environment for executing code. It has the advantage of being very lightweight and efficient but is rather basic. It is what we use by default because it is so lightweight and simple. For all its benefits, RStudio is neither.

4.2.6 Command line

We can also access R on the command line, which is very useful for long parallel applications or when using remote computers.

4.3 Variables

In computer science, the term variable refers to a container that stores information that we will subsequently reference and manipulate using a computer program. Variables allow us to describe data with labels so both the program and we understand the data.

It can be surprisingly difficult to name variables. While it is often tempting to use one character, like x or P, we might not remember what x actually stands for when looking at the code later. It is usually better to use more descriptive names like Price or VaR99.

4.3.1 Assignment: = or <-

In mathematics and most programming languages, we use the equal sign, =, to assign a value to a variable. The variable’s name is on the left, and the value to be stored is on the right.

However, it is more complicated in R.

By convention, R uses <- for assignment rather than the equal sign =. You can use either in the vast majority of cases, but there is a single, very infrequent exception where one needs to use <-. We generally prefer =. Two equal signs, ==, are a test for equality.

x = 3
y <- 3
x == y # test for equality
[1] TRUE

Some people have very strong opinions on this topic, usually insisting <- is the right way. Just ignore them and do what you find most convenient.
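
The infrequent exception mentioned above arises inside function calls, where = is read as naming an argument rather than creating a variable. The following is a minimal sketch using only base R; the variable names vals, total and res are ours, chosen for illustration.

```r
# Inside a function call, <- assigns a variable and passes the value on
total = sum(vals <- c(1, 5, 9))   # creates vals, then sums it
vals                               # 1 5 9
total                              # 15

# Here = is read as an argument named z, which system.time() does not have,
# so the call fails instead of creating a variable z
res = tryCatch(system.time(z = sum(1:10)), error = function(e) "error")
res                                # "error"
```

Outside function calls, the two forms behave the same for our purposes.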

4.4 Objects

In computer science, a class is a description or definition of an object. An object is an instance (realisation) of a class.

Everything in R is an object. R has several basic classes. Relevant to us:

  • numeric (contains both integers and doubles)
  • logical
  • character (string)

It also has many more complex classes, which we discuss below.

We can use the following to identify classes.

  • typeof()
  • class()
  • storage.mode()
  • length()
  • attributes() # metadata

4.4.1 Integer and real

In mathematics, an integer is a discrete number like 0, -1, 10.

A real number captures a continuous one-dimensional quantity, such as a price of 1.2344. On a computer, the number of digits in a number is finite.

In R and most other software, real numbers are referred to as double precision, which just implies a particular number of digits. We use the terms real, float, and double interchangeably, recognising that in some special cases, not relevant in the vast majority of applications, these are actually different.

x = 422
y = 2.33445
class(x)
[1] "numeric"
class(y)
[1] "numeric"
typeof(x)
[1] "double"
typeof(y)
[1] "double"

Note that even if we set x as an integer, it is actually a double. This is not one of R’s best aspects.
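
If a true integer is needed, one option is the L suffix. This is a small sketch with illustrative variable names, not something the rest of these notes rely on.

```r
x = 422       # stored as a double
xi = 422L     # the L suffix asks R for an integer
typeof(x)     # "double"
typeof(xi)    # "integer"
class(xi)     # "integer"
x == xi       # TRUE: the values compare as equal
```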

4.4.2 Character or strings

A character is just that: a, Q, \(\alpha\).

A string is one or more characters, such as "Risk".

x = 's'
y = 'Risk'
z = "Risk"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
x+y # will not work because x and y are strings
Error in x + y: non-numeric argument to binary operator
z = Risk
Error in eval(expr, envir, enclos): object 'Risk' not found

Note that we need the ' or "; without them, R thinks Risk is a variable.

4.4.3 Not available: NA

We use the special value NA (not available) to indicate a missing value, i.e. we don’t know what the value is. R also has NaN for a genuine not-a-number, such as 0/0. NA becomes useful in backtesting in Chapter 18.

[1] NA
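
A short sketch of how NA behaves in calculations, using a hypothetical vector of returns that we made up for illustration:

```r
returns = c(0.01, NA, -0.02)   # hypothetical returns, one value missing
is.na(returns)                 # FALSE TRUE FALSE
mean(returns)                  # NA: missing values propagate
mean(returns, na.rm = TRUE)    # -0.005: drop the NA before averaging
```

Many base R functions, such as mean(), sum() and sd(), accept na.rm = TRUE.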

4.4.4 Logical — TRUE and FALSE

We use logical variables, TRUE or FALSE, quite often. Note they are spelt all uppercase. The operator ! inverts TRUE or FALSE.

W = FALSE
r = !W
r
[1] TRUE

4.4.5 NULL

R is an old language that has evolved erratically over time. This means it has many undesirable features that can lead to difficult bugs. One is a variable type called NULL, which means nothing. While it can be useful, the problem is that NULL is not only used inconsistently, but it can be outright dangerous.

1+What # variable What is not defined
Error in eval(expr, envir, enclos): object 'What' not found

That makes sense. But consider a list l instead.

The variable What does not exist in the list l, but we can still access it with l$What, which returns NULL rather than an error.

And when doing math with it, it fails silently.
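
The behaviour described above can be sketched as follows. The list l and its element a are illustrative, since the original example is not reproduced here.

```r
l = list(a = 1)       # a list with no element called What
l$What                # NULL, and no error is raised
is.null(l$What)       # TRUE
res = 1 + l$What      # no error either: the result is an empty numeric vector
length(res)           # 0
```

A defensive check such as is.null(l$What) before using the value is one way to catch this.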

When we need to delete columns from a dataframe or an element from a list, we assign NULL to it.

df$DeleteMe = NULL

It would be more intuitive.

4.5 Data structures

4.5.1 Vectors

R comes with vectors. Note that R does not know if they are column vectors or row vectors, which becomes important in matrix algebra.

v = vector(length=4)
v[] = NA
v[2:3] = 2
[1] NA  2  2 NA
[1] 1 2 3 4 5
[1] -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
[1]  3.0  9.0 21.0  9.0  1.2

One way to create vectors is with c().

[1] "1"   "4"   "0.9" "ss" 

Here, we used both numbers and strings, and all became a string.

[1] 1.0 4.0 0.9

While here, we only have numbers, and they stay numbers.
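
Commands along the following lines would produce vectors like those shown above. This is a sketch; the names mixed and nums are ours.

```r
1:5                           # 1 2 3 4 5
seq(-1, 2, by = 0.5)          # -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
v = c(3, 9, 21, 9, 1.2)       # a numeric vector

mixed = c(1, 4, 0.9, "ss")    # numbers and a string: all become strings
class(mixed)                  # "character"
nums = c(1, 4, 0.9)           # only numbers: they stay numbers
class(nums)                   # "numeric"
```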

4.5.2 Matrices

R can create two-dimensional and three-dimensional matrices. We usually only work with the two-dimensional type, but we will encounter the three-dimensional type in the multivariate volatility models.

Matrices can have column names, which can be quite useful.

[1,]   NA
[2,]   NA
[3,]   NA
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
[3,]   NA   NA
     [,1] [,2]
[1,]    3    3
[2,]    3    3
[3,]    3    3
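
Matrices like those printed above could be created with calls such as the following. This is a sketch, and the names m1, m2, m3 and a are illustrative.

```r
m1 = matrix(nrow = 3, ncol = 1)        # 3x1, filled with NA by default
m2 = matrix(NA, nrow = 3, ncol = 2)    # 3x2 of NA
m3 = matrix(3, nrow = 3, ncol = 2)     # 3x2, the value 3 is recycled
dim(m3)                                # 3 2

a = array(0, dim = c(2, 2, 3))         # a three-dimensional array
dim(a)                                 # 2 2 3
```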

We can also make, or add to, matrices with cbind and rbind, something we do quite often in these notes.

        v    v
[1,]  3.0  3.0
[2,]  9.0  9.0
[3,] 21.0 21.0
[4,]  9.0  9.0
[5,]  1.2  1.2
  [,1] [,2] [,3] [,4] [,5]
v    3    9   21    9  1.2
v    3    9   21    9  1.2

We can access individual elements of matrices and vectors.

v v 
9 9 
[1]  3.0  9.0 21.0  9.0  1.2
[1] 21.0  9.0  1.2
[1]  9 21
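
The binding and indexing steps above can be sketched as follows, assuming the vector v holds the values shown in the output. The names m and r are ours.

```r
v = c(3, 9, 21, 9, 1.2)
m = cbind(v, v)     # two columns: a 5x2 matrix
r = rbind(v, v)     # two rows: a 2x5 matrix
dim(m)              # 5 2
dim(r)              # 2 5

m[2, ]              # second row of m
m[, 1]              # first column of m
v[3:5]              # elements 3 to 5 of v
v[2:3]              # elements 2 and 3 of v
```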

We can name the columns with “colnames()”. Unfortunately, that command name is different than what we use for dataframes below.

           [,1]       [,2]
[1,] -1.1811409 -1.0948463
[2,]  0.3116072  1.2280149
[3,]  1.3538716 -1.5215029
[4,]  0.2789277 -0.3151759
colnames(m)=c("Stock A","Stock B")
        Stock A    Stock B
[1,] -1.1811409 -1.0948463
[2,]  0.3116072  1.2280149
[3,]  1.3538716 -1.5215029
[4,]  0.2789277 -0.3151759

4.5.3 Lists

We often need to keep track of many variables that belong together, and the R list object is very useful. It allows us to group multiple variables in one list.

l$b= "R is great."
w$q= "my list"
w$l = l 
[1] "my list"

[1] 2 3

[1] "Risk"

  X1  X2
1  1 VaR
2  2  ES

We can find out what is in a list

[1] "q"  "l"  "df"

and access individual elements

[1] 2 3

[1] "Risk"

We make extensive use of lists in these notes.
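
A self-contained sketch of building and querying a nested list. The element names and values are illustrative and only loosely follow the output above.

```r
l = list()
l$a = c(2, 3)
l$b = "Risk"

w = list()
w$q = "my list"
w$l = l                                             # a list inside a list
w$df = data.frame(X1 = 1:2, X2 = c("VaR", "ES"))    # a dataframe inside a list

names(w)        # "q" "l" "df"
w$l$a           # 2 3
w[["l"]]$b      # "Risk": double brackets access an element by name
```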

4.6 Dataframes

Matrices have some limitations. Most importantly, all the columns must be of the same type; for example, we can’t have one column with strings and another with numbers. To deal with that, R comes with dataframes, which can be thought of as more flexible matrices. We usually have to use both. For example, it is quite costly to insert new data into a dataframe, perhaps by df[3,4]=42, but cheap to do the same for a matrix.

Also, some functions insist on a matrix or a data frame, even if they should be able to handle both.

A dataframe is a two-dimensional structure in which each column contains values of one variable, and each row contains one set of values, or “observation”, from each column. It is the most common way of storing data in R and the one we will use the most.

One of the main advantages of a dataframe over a matrix is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to only one data type. Keep in mind that a dataframe needs all its columns to be of the same length.

Only a few things come for free, and there are downsides to dataframes. One is that accessing elements can be very slow. For example, in code that iteratively inserts numbers, as we do in the backtesting chapter (Chapter 18), dataframes can be too slow. Then, it might be best to use matrices and subsequently convert them into dataframes. Some R functions, especially those belonging to old libraries, only accept dataframes and not matrices (or vice versa).

4.6.1 Iteratively creating data frames

It can be very tempting to iteratively increase the size of a dataframe in a for loop, a situation you will encounter in many applications in these notes. One should almost never do that, because each increase takes a long time. A few increases are fine, but anything beyond that should be avoided. Instead, and what we do extensively in these notes, not least in the backtesting examples, is to pre-allocate a dataframe or a matrix and then insert values into it. For speed, it is best to do that with a matrix, not a dataframe.
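
The two approaches can be contrasted in a small sketch. The column names t and value, and the size n, are made up for illustration.

```r
n = 500

# Slow pattern: the dataframe is extended on every iteration
grown = data.frame()
for (i in 1:n) {
  grown = rbind(grown, data.frame(t = i, value = i^2))
}

# Faster pattern: pre-allocate a matrix, fill it, convert once at the end
m = matrix(NA, nrow = n, ncol = 2)
colnames(m) = c("t", "value")
for (i in 1:n) {
  m[i, ] = c(i, i^2)
}
df = as.data.frame(m)
```

Both produce the same values; wrapping each loop in system.time() is one way to see the difference in speed as n grows.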

4.6.2 Accessing the data from columns

We can access data from columns by number, like df[,3], but since all the columns have names, it is usually much better to access them by column name, like df$returns.

4.6.3 Creating a dataframe from scratch

There are several different ways to create a dataframe. One is loading from a file, which we will do later. Alternatively, we could create a dataframe from a list of vectors. This can easily be done with the data.frame() function:

df <- data.frame(col1 = 1:3,
 col2 = c("A", "B", "C"),
 col3 = c(TRUE, TRUE, FALSE),
 col4 = c(1.0, 2.2, 3.3))

You have to specify the name of each column and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:

str(df) # display the structure of the dataframe
'data.frame':   3 obs. of  4 variables:
 $ col1: int  1 2 3
 $ col2: chr  "A" "B" "C"
 $ col3: logi  TRUE TRUE FALSE
 $ col4: num  1 2.2 3.3
dim(df) # dimension
[1] 3 4
colnames(df) # column names
[1] "col1" "col2" "col3" "col4"

4.6.4 Transforming a different object into a dataframe

You might want to transform a matrix into a dataframe (or vice versa). For example, you need an object to be a matrix to perform linear algebra operations, but you would like to keep the results as a dataframe after the operations. You can easily switch from matrix to dataframe using as.data.frame(), and analogously from dataframe to matrix with as.matrix(). Remember, however, that all columns need to have the same data type to be a matrix.

For example, let’s say we have the matrix:

myMatrix <- matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
[1] "matrix" "array" 
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10

We can now transform it into a dataframe:

df = as.data.frame(myMatrix)
[1] "data.frame"
V1 V2
1 2
3 4
5 6
7 8
9 10
'data.frame':   5 obs. of  2 variables:
 $ V1: int  1 3 5 7 9
 $ V2: int  2 4 6 8 10

And we can change the column names:

colnames(df) = c("Odd", "Even")
Odd Even
1 2
3 4
5 6
7 8
9 10

4.6.5 Alternatives to dataframes

R dataframes suffer from having been proposed decades ago and therefore lack some very useful features one might expect, and they can be very slow. In response, there are two alternatives, each with its pros and cons. We will not use either of those in this book because we want to use base R packages wherever possible.

We discuss one case where one of those is needed in Section 5.2.2, which deals with compressed CSV files.

Data.table

The data.table class is designed for performance and features. It is by far the fastest when using large datasets, but it also has very useful features built into it that really facilitate data work. In our work, we use data.table.

Tidy

The other alternative is tidy data, part of the tidyverse. It has many useful features and is the richest data manipulation tool in R.

4.6.6 Dataframes, data.tables or tidy data?

While choice is good, it can be overwhelming. How should one choose between dataframes, data.tables or tidy data? Dataframes have the advantage of being built into R; they are relatively simple, and for basic calculations that don’t need a lot of performance, they might be the best choice.

data.tables have the best performance, so if one has large datasets or is performing complicated data science operations on data, it is generally the best choice.

The tidyverse has the richest and most coherent way of doing data wrangling, that is, performing complicated operations on data. For many users in data science, the tidyverse is really the only thing they use in R. Consequently, the tidyverse is the best choice for applications that are mostly in data science unless one really needs performance.

If you search for opinions on dataframes vs. data.tables vs. tidy, you often find very strong views in favour of one of these. Best ignore such people. All three have their own pros and cons, so just pick whatever you are most comfortable with and works best in your use case.

We only use dataframes in these notes because we want to keep the number of packages at a minimum and dataframes are sufficient for our purposes.

4.7 Some relevant issues

4.7.1 Special characters

R uses both single quotes ' and double quotes " for strings, and you can use either. That is particularly useful if you have to include a quotation mark inside a string, like

s= 'This is a quote character" in the middle of a string\n'
This is a quote character" in the middle of a string

The special character “\n” means a new line, quite handy for printing.

4.7.2 Scope and global assignment <<-

Variables in every programming language have a scope, meaning which part of the code can see them. For example, if you define a variable directly in R, it can be seen everywhere in your code, but if you define it inside a function, it is only visible within the function. The former is a global variable, while the latter is a local variable.

Global variables are seen as undesirable for a good reason. They can lead to difficult-to-find bugs, which is a particular problem in R because it has a rather unfortunate way of dealing with missing variables.

Sometimes, global variables can only be avoided at the cost of making the code more complex, so it is a tricky tradeoff. We use <<- inside a function to put something into the global namespace.

GlobalVariable <<- 123.456

You sometimes hear very strong opinions on why one should always avoid global variables. These voices are best ignored. One should make a pragmatic choice: if one can do something efficiently and safely with globals that would otherwise require a complex workaround, by all means use the global.
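
A small sketch of the difference between local assignment and <<- inside a function. The names counter, increment and local_only are hypothetical. Strictly speaking, <<- searches the enclosing environments for the variable and falls back to the global one if it is not found.

```r
counter = 0                    # defined at the top level, so global

increment = function() {
  counter <<- counter + 1      # modifies the variable outside the function
}

local_only = function() {
  counter = 100                # creates a local variable; the outer one is untouched
  counter
}

increment()
increment()
inside = local_only()          # 100
counter                        # 2
```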

4.7.3 Some useful functions

R has many functions. Below is a list of some of the most widely used in this book.

  • head: return the first part of an object
  • tail: return the last part of an object
  • cbind: combine by column
  • rbind: combine by row
  • cat: concatenate and print
  • print: print values
  • paste and paste0: concatenate strings

4.8 Printing: cat() vs. print()

To print variables, we can use cat() or print(). To turn numbers into strings, we can use paste() or paste0(). In strings, a new line is \n.

x = 10
y = 1.2
w = "risk"

 10 risk 
cat("Important number for",w," is x=",x,"\n")
Important number for risk  is x= 10 
s=paste0("The return is ",round(100*y,1),"%")
The return is 120% 

There are two ways to print to the screen and to text files in R: “cat()” and “print()”. The former allows for all sorts of text formatting, while the latter simply dumps something on the screen. Both have their uses.

cat("This is the answer. x=",x,", and y=",y,".\n")
This is the answer. x= 10 , and y= 1.2 .
[1] 10

4.9 Packages/libraries

R comes with a lot of functionality as standard, but its strength is in all the packages available for it. The ecosystem is much richer than that of any other language when it comes to statistics. Some of these packages come with R, but most have to be downloaded separately, either using the install.packages() command or a menu in RStudio.

We load the packages using the library() command. Some of them come with annoying start-up messages, which can be suppressed by the suppressPackageStartupMessages() command.

The best practice is to load all the packages used in the code file at the top.
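
As a sketch, the top of a code file might look as follows. We use the parallel and stats packages here only because they ship with R, so the example runs without downloading anything. The install.packages() line is commented out and shown for illustration.

```r
# install.packages("zoo")   # one-off download of a package, not run here

# Load everything the file needs at the top
suppressPackageStartupMessages(library(parallel))   # loads quietly
library(stats)                                      # part of every R installation

detectCores()   # a function from parallel: number of cores on this machine
```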

Here are the packages we make the most use of in this book.

  • reshape2 re-shape dataframes. Very useful when data is arranged in an unfriendly way;
  • moments skewness and kurtosis;
  • tseries time series analysis;
  • zoo time-series objects;
  • lubridate date manipulation;
  • car QQ plots;
  • parallel multi-core calculations;
  • nloptr optimisation algorithms;
  • rugarch univariate volatility models;
  • rmgarch multivariate volatility models.

4.10 Matrix algebra

When dealing with vectors and matrices, * is element-by-element multiplication, while %*% is matrix multiplication. This becomes important when dealing with portfolios. Note that R vectors only have one dimension. They are not row or column vectors.

weight = c(0.3,0.7)
[1] 0.3 0.7
           [,1]     [,2]
[1,] 56.4287184 58.58493
[2,] 61.4435148 74.42369
[3,] 90.7129287 65.47341
[4,]  0.6279398 72.34061
[5,] 41.4738598 46.82814
weight * prices # element-by-element multiplication
           [,1]     [,2]
[1,] 16.9286155 41.00945
[2,] 43.0104604 22.32711
[3,] 27.2138786 45.83138
[4,]  0.4395579 21.70218
[5,] 12.4421579 32.77970
weight %*% prices # matrix multiplication
Error in weight %*% prices: non-conformable arguments
weight %*% t(prices) # matrix multiplication
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 57.93807 70.52964 73.04526 50.82681 45.22186
prices %*% weight # matrix multiplication
         [,1]
[1,] 57.93807
[2,] 70.52964
[3,] 73.04526
[4,] 50.82681
[5,] 45.22186
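
One thing to watch with * is that R recycles the shorter vector down the columns of the matrix, which is consistent with the output above: the first column is multiplied by 0.3 in row 1 and by 0.7 in row 2. The following sketch uses a small made-up price matrix to show the effect, along with one way to weight each column instead, using the base R function sweep().

```r
weight = c(0.3, 0.7)
prices = matrix(c(10, 20, 30,
                  100, 200, 300), ncol = 2)   # 3 days, 2 assets (hypothetical)

recycled = weight * prices                  # weights alternate down each column
recycled[2, 1]                              # 14, i.e. 20 * 0.7, not 20 * 0.3

by_asset = sweep(prices, 2, weight, "*")    # column 1 times 0.3, column 2 times 0.7
portfolio = drop(prices %*% weight)         # weighted sum per day: 73 146 219
rowSums(by_asset)                           # the same three numbers
```

For portfolio values, the matrix product prices %*% weight is usually what we want.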

4.11 Source files — source('functions.r')

It can be very useful to include other R files in some R code. The function to do that is “source('file.r')”.

Some of the code we develop below can be reused later. For that reason, we collect all the useful functions into an R source file called functions.r and include it in our code with source('functions.r'). Written this way, the file has to be in the same folder as the source code used here, but you can keep it anywhere you want as long as you adjust the path in source().
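
A self-contained sketch of the mechanism. Since the actual functions.r is not shown here, we write a tiny stand-in file to a temporary folder; the function AddOne is hypothetical.

```r
# Create a small stand-in for functions.r in a temporary folder
f = file.path(tempdir(), "functions.r")
writeLines("AddOne = function(x) x + 1", f)

source(f)       # runs the file, so AddOne is now available
AddOne(41)      # 42
```

The same call works with a relative path such as source('functions.r') or with a full path to wherever the file is kept.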