6  R and risk forecasting

There are many resources for learning R, and we do not wish to duplicate them here. See the discussion in Section 6.1.

However, we follow particular conventions, and some parts of R are particularly useful for our purposes, so we provide an overview below. We suggest being aware of the issues raised by Patrick Burns in The R Inferno.

6.1 R Resources

There are many free and commercial resources for learning R. The Big Book of R is a very comprehensive list of resources, but it can be hard to identify what is particularly useful.

Here are some resources you might find useful.

If you run into a specific programming problem, in R or any other language, Stack Overflow is an excellent place to ask for help.

6.2 User interfaces

There are several ways one can use R.

6.2.1 RStudio

It is usually best to use the free RStudio software for programming in R.

6.2.2 Quarto

The RStudio vendor has recently come out with a product designed for producing reports in R, as well as Python and Julia, called Quarto, which is what we use for these notes.

6.2.3 Jupyter

Jupyter notebooks are files with the extension .ipynb, produced by Project Jupyter, which contain both computer code (e.g. R, Python, Julia) and rich text elements (paragraphs, equations, images, links, etc.). They are a great way to make reports and documents. They can be edited on a web server and exported as HTML, PDF via LaTeX, or other file formats. They are interactive, and you can run pieces of code independently.

Some users find the Jupyter way of coding really good, while others find it very confining.

6.2.4 VSCode

One of the best text editors is Visual Studio Code, or VS Code, and it provides a very useful extension for R. VS Code is our go-to editor for most things, and is how we made these notes, along with Quarto.

6.2.5 Base R editor

R comes with a basic editor and environment for executing code. It has the advantage of being very lightweight and efficient, but is rather basic. It is what we generally use for writing R code, because of its simplicity and light weight.

6.2.6 Command line

We can also access R on the command line, which can be very useful, perhaps for long parallel applications or when using remote computers.

6.3 Some relevant issues

6.3.1 Special characters

R accepts both single quotes ' and double quotes " for strings, and you can use either. That is particularly useful if you have to include a quotation mark inside a string, like

s = 'This is a quote character " in the middle of a string\n'
cat(s)
This is a quote character " in the middle of a string

The special character \n means a new line, quite handy for printing.

6.3.2 Assignment: = or <-

By convention, R uses <- for assignment, and not the equal sign =. You can use both in the vast majority of cases, but there is a single, very infrequent, exception where one needs to use <-. We generally prefer =.

x = 3
y <- 3
x==y
[1] TRUE
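The text does not spell out the exception. One well-known place where the two operators behave differently is inside a function call, where = names an argument while <- performs an assignment. The following is a minimal sketch, assuming that is the kind of case intended; the variable names s1, s2 and vals are illustrative.

```r
# sd() has an argument called x, so x = ... only names the argument
# and does not create or change a variable called x
s1 = sd(x = c(1, 2, 3, 4))

# With <-, the vector is assigned to vals and then passed on to sd()
s2 = sd(vals <- c(1, 2, 3, 4))

s1 == s2          # both calls compute the same standard deviation
exists("vals")    # TRUE: <- created the variable vals as a side effect
```

Outside function calls, at the top level of a script, the two forms can be used interchangeably.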

6.3.3 Global assignment <<-

Global variables are seen as undesirable, for good reason. Sometimes they cannot be avoided without making the code more complex, so there is a tricky tradeoff. We use <<- to put something into the global namespace.

GlobalVariable <<- 123.456
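Strictly speaking, <<- assigns to an existing variable in an enclosing environment and falls back to the global environment if none is found. A minimal sketch of the typical use, updating a global variable from inside a function, might look as follows; the names CallCount and CountCalls are hypothetical.

```r
CallCount = 0                    # variable defined outside the function

CountCalls = function() {
  CallCount <<- CallCount + 1    # updates the outer variable, not a local copy
  invisible(CallCount)
}

CountCalls()
CountCalls()
CallCount                        # 2, since the function was called twice
```

With a plain = inside the function, the update would be lost as soon as the function returned.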

6.3.4 Printing: cat() vs. print()

There are two ways one can print to the screen and to text files in R, cat() and print(). The former allows for all sorts of text formatting while the latter simply dumps something on the screen. They both have their uses.

cat("This is the answer. x=",x,", and y=",y,".\n")
This is the answer. x= 3 , and y= 3 .
print(x)
[1] 3

6.3.5 Some useful functions

R comes with a large number of functions. Below is a list of those most widely used in this book.

  • head: return the first part of an object
  • tail: return the last part of an object
  • cbind: combine by column
  • rbind: combine by row
  • cat: concatenate and print
  • print: print values
  • paste and paste0: concatenate strings
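As a quick illustration of the functions listed above, here is a short sketch on made-up data; the variable names are only for illustration.

```r
a = 1:10
head(a, 3)                 # first three elements: 1 2 3
tail(a, 2)                 # last two elements: 9 10

byCol = cbind(1:3, 4:6)    # the vectors become columns, giving a 3 x 2 matrix
byRow = rbind(1:3, 4:6)    # the vectors become rows, giving a 2 x 3 matrix
dim(byCol)
dim(byRow)

paste("VaR", 99)           # "VaR 99", separated by a space
paste0("VaR", 99)          # "VaR99", no separator
```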

6.4 Statistical distributions

R provides functions for just about every distribution imaginable. We will use several of those: the normal, Student-t (and skewed Student-t), chi-square, binomial and Bernoulli. Each comes with functions for the PDF, CDF, inverse CDF and random numbers. The first letter of the function name (d, p, q or r) indicates which of these four it is, and the remainder of the name is the distribution, for example:

  • dnorm, pnorm, qnorm, rnorm;
  • dt, pt, qt, rt.
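The naming convention can be tried out directly. The sketch below uses the normal and Student-t functions named above; the probability 0.01, the 4 degrees of freedom and the seed are arbitrary choices for illustration.

```r
dnorm(0)                # PDF of the standard normal at 0, equal to 1/sqrt(2*pi)
pnorm(0)                # CDF at 0, which is 0.5 by symmetry
q01 = qnorm(0.01)       # inverse CDF: the 1% quantile, roughly -2.33
pnorm(q01)              # applying the CDF to the quantile recovers 0.01
qt01 = qt(0.01, df = 4) # 1% quantile of a Student-t with 4 degrees of freedom
qt01 < q01              # TRUE, because the Student-t has fatter tails

set.seed(1)             # arbitrary seed, for reproducibility
r = rnorm(5)            # five standard normal random numbers
```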

6.4.1 Distributions and densities

par(mar=c(4,4,3,3))  # set margin
x=seq(-3,3,length.out=100)  # grid of values for the PDF and CDF
z=seq(0,1,length.out=100)   # grid of probabilities for the quantile function
plot(x,dnorm(x), main="Normal Density")  # plot normal density

plot(x,pnorm(x), main="Cumulative Distribution")  # plot normal cumulative distribution function

plot(z,qnorm(z), main="Normal Quantile")  # plot normal quantile

6.4.2 Random numbers

We can easily simulate random numbers and do that quite frequently. One should always set the seed with set.seed() so that the results can be reproduced.

rnorm(1)
[1] 0.2304895
rnorm(3)
[1]  0.2768963  1.6398099 -0.7027706
rnorm(3)
[1]  0.9095197 -0.6499590  0.7984155
set.seed(666)
rnorm(3)
[1]  0.7533110  2.0143547 -0.3551345
set.seed(666)
rnorm(3)
[1]  0.7533110  2.0143547 -0.3551345

6.5 Packages/libraries

As standard, R comes with a lot of functionality, but the strength of R is in all the packages available for it. The ecosystem is much richer than for any other language when it comes to statistics. Some of these packages come with R, but most have to be downloaded separately, either using the install.packages() command or a menu in RStudio.

We load the packages using the library() command. Some of them come with annoying start-up messages, which can be suppressed by the suppressPackageStartupMessages() command.

Best practice is to load all the packages used in a code file at the top.
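A minimal sketch of what the top of a script might look like is shown below. To keep it runnable without installing anything, it only loads parallel and stats, both of which ship with R; the install.packages() line is left as a comment.

```r
# Top of the script: load everything the code needs in one place.
# Packages that do not ship with R must be installed once first, e.g.
# install.packages("zoo")
suppressPackageStartupMessages(library(parallel))
library(stats)

detectCores()   # number of cores visible to the parallel package
```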

Here are the packages we make most use of in this book.

  • reshape2 re-shape data frames. Very useful when data is arranged in an unfriendly way;
  • moments skewness and kurtosis;
  • tseries time series analysis;
  • zoo time series objects;
  • lubridate date manipulation;
  • car QQ plots;
  • parallel multi-core calculations;
  • nloptr optimisation algorithms;
  • rugarch univariate volatility models;
  • rmgarch multivariate volatility models.

6.6 Variables

Objects in R can be of different classes. For example, we can have a vector, which is an ordered array of observations of data of the same type (numbers, characters, logicals). We can also have a matrix, which is a rectangular arrangement of elements of the same data type. You can check what class an object is by running class(object).

6.6.1 Integers, reals and strings

The most used variable type is a real number, followed by integers and strings.

x = 1
y = 5.3
w=0.034553
z = "Learn R"
x+y
[1] 6.3
class(x)
[1] "numeric"
class(y)
[1] "numeric"
class(z)
[1] "character"
x+z  # will not work because z is a string
Error in x + z: non-numeric argument to binary operator

To print variables we can use cat() or print(). To turn numbers into strings we can use paste() or paste0(). In strings, a new line is \n.

cat(x)
1
cat(y)
5.3
cat('\n',x,'\n')

 1 
cat(y,'\n')
5.3 
cat(w,'\n')
0.034553 
cat("Important number is x=",x,"and y=",y,"\n")
Important number is x= 1 and y= 5.3 
s=paste0("The return is ",round(100*w,1),"%")
cat(s,"\n")
The return is 3.5% 

6.6.2 Missing values NA

We use the special value NA ("not available") to indicate a missing value, i.e. we don’t know what the value is. R has a separate value, NaN, for not-a-number results such as 0/0. NA becomes useful in backtesting.

a=NA
a
[1] NA
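Since NA propagates through calculations, it helps to know how to detect and skip it. A short sketch on a made-up vector:

```r
b = c(1, NA, 3)
is.na(b)                  # FALSE TRUE FALSE
mean(b)                   # NA, because one element is unknown
mean(b, na.rm = TRUE)     # 2, the NA is removed before averaging
NA == NA                  # NA rather than TRUE, so test with is.na() instead
```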

6.6.3 TRUE and FALSE

We use logical variables, which are either true or false, quite often. Note that TRUE and FALSE are spelt in all uppercase.

W = TRUE
r = !W
W
[1] TRUE
r
[1] FALSE

6.6.4 Vectors

R comes with vectors. Note that R does not know if they are column vectors or row vectors, which becomes important in matrix algebra.

v = vector(length=4)
v 
[1] FALSE FALSE FALSE FALSE
v[] = NA
v
[1] NA NA NA NA
v[2:3] = 2
v
[1] NA  2  2 NA
v=seq(1,5)
v
[1] 1 2 3 4 5
v=seq(-1,2,by=0.5)
v
[1] -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
v=c(1,3,7,3,0.4)*3
v
[1]  3.0  9.0 21.0  9.0  1.2

6.6.5 Matrices

R can create two- and three-dimensional matrices (R calls the three-dimensional ones arrays). We usually only use the two-dimensional type, but will encounter the three-dimensional type in the multivariate volatility models.

Matrices have column names which can be quite useful.

m=matrix(ncol=2,nrow=3)
m
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
[3,]   NA   NA
m=matrix(3,ncol=2,nrow=3)
m
     [,1] [,2]
[1,]    3    3
[2,]    3    3
[3,]    3    3
m=cbind(v,v)
m 
        v    v
[1,]  3.0  3.0
[2,]  9.0  9.0
[3,] 21.0 21.0
[4,]  9.0  9.0
[5,]  1.2  1.2
m=rbind(v,v)
m 
  [,1] [,2] [,3] [,4] [,5]
v    3    9   21    9  1.2
v    3    9   21    9  1.2

We can access individual elements of matrices and vectors:

m[1,2]
v 
9 
m[,2]
v v 
9 9 
m[2,]
[1]  3.0  9.0 21.0  9.0  1.2
m[1,3:5]
[1] 21.0  9.0  1.2
v[2:3]
[1]  9 21

We can name the columns with colnames(). The same command also works for data frames, as we do below, where names() is an alternative.

m=cbind(rnorm(4),rnorm(4))
m
           [,1]        [,2]
[1,]  2.0281678 -0.80251957
[2,] -2.2168745 -1.79224083
[3,]  0.7583962 -0.04203245
[4,] -1.3061853  2.15004262
colnames(m)=c("Stock A","Stock B")
m
        Stock A     Stock B
[1,]  2.0281678 -0.80251957
[2,] -2.2168745 -1.79224083
[3,]  0.7583962 -0.04203245
[4,] -1.3061853  2.15004262

6.6.6 Lists

We quite often need to keep track of a lot of variables that belong together, and then the R list is very useful. It allows us to group multiple variables in one list.

l=list()
l$a=2
l$b="R is great"
l=list(l=c(2,3),b="Risk")
w=list()
w$q="my list"
w$l = l 
w$df=data.frame(cbind(c(1,2),c("VaR","ES")))
w
$q
[1] "my list"

$l
$l$l
[1] 2 3

$l$b
[1] "Risk"


$df
  X1  X2
1  1 VaR
2  2  ES
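Elements of a list can be read back either with $ or with [[ ]]. The following sketch uses a small made-up list; the names p, name, weights and n are illustrative.

```r
p = list(name = "Portfolio A", weights = c(0.3, 0.7))
names(p)                  # "name" "weights"
p$weights                 # access by name with $
p[["weights"]]            # same element, using [[ ]] and a string
p[["weights"]][2]         # second weight, 0.7
p$n = length(p$weights)   # add a new element to the list
```

The [[ ]] form is convenient when the element name is itself stored in a variable.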

6.6.7 NULL

R is an old language that has evolved erratically over time. This means it has many undesirable features that can lead to difficult bugs. One is a variable type called NULL, meaning nothing. While it can be useful, the problem is that NULL is not only used inconsistently, it can be outright dangerous. Using an undefined variable gives an error:

1+What # variable What is not defined
Error in eval(expr, envir, enclos): object 'What' not found

which makes sense. But consider this

l=list()
l$What
NULL

The variable What does not exist in the list l, but we can still access it with l$What:

l$What+3
numeric(0)

When we do math with it, R fails silently, returning an empty vector instead of an error.

When we need to delete columns from a data frame or an element from a list, we assign NULL to it.

df$DeleteMe = NULL

I know, not very intuitive.
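One way to guard against the silent failure shown above is to check for a missing element explicitly before using it. This is only a sketch of one possible approach; the helper GetElement is hypothetical and not part of R.

```r
l = list(a = 2)
is.null(l$What)            # TRUE: What is not in the list
"What" %in% names(l)       # FALSE

# Hypothetical helper: stop with an error instead of silently
# carrying a NULL into later calculations
GetElement = function(lst, name) {
  if (!(name %in% names(lst))) stop("element '", name, "' not found")
  lst[[name]]
}

GetElement(l, "a") + 3     # 5
```

Calling GetElement(l, "What") stops with an error message rather than returning NULL.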

6.7 Matrix algebra

When dealing with vectors and matrices, * is element-by-element multiplication, while %*% is matrix multiplication. This becomes important when dealing with portfolios. Note that R vectors only have one dimension; they are neither row nor column vectors.

weight = c(0.3,0.7)
prices=cbind(runif(5)*100,runif(5)*100)
weight
[1] 0.3 0.7
prices
          [,1]     [,2]
[1,]  3.834435 61.21745
[2,] 14.149569 55.33484
[3,] 80.638553 85.35008
[4,] 26.668568 46.97785
[5,]  4.270205 39.76166
weight * prices # element-by-element multiplication; weight is recycled down the columns
          [,1]     [,2]
[1,]  1.150330 42.85222
[2,]  9.904698 16.60045
[3,] 24.191566 59.74505
[4,] 18.667997 14.09336
[5,]  1.281062 27.83316
weight %*% prices # matrix multiplication
Error in weight %*% prices: non-conformable arguments
weight %*% t(prices) # matrix multiplication
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 44.00255 42.97926 83.93662 40.88507 29.11422
prices %*% weight # matrix multiplication
         [,1]
[1,] 44.00255
[2,] 42.97926
[3,] 83.93662
[4,] 40.88507
[5,] 29.11422
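Note that in weight * prices above, the weight vector is recycled down each column, so the two weights alternate across rows rather than being applied one per column. Below is a sketch of two ways to multiply each column by its own weight, using a small made-up price matrix so the numbers are easy to check by hand.

```r
weight = c(0.3, 0.7)
prices = cbind(c(10, 20, 30), c(100, 200, 300))  # 3 x 2, made-up prices

# sweep() multiplies column j by weight[j]
w1 = sweep(prices, 2, weight, "*")

# Equivalent: post-multiply by a diagonal matrix of the weights
w2 = prices %*% diag(weight)

w1
rowSums(w1)          # the same numbers as prices %*% weight
```

Summing the weighted columns row by row reproduces the matrix product prices %*% weight.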

6.8 Data frames

Matrices have some limitations. In particular, all the columns must be of the same type; for example, we can’t have one column with strings and another with numbers. To deal with that, R comes with data frames, which can be thought of as more flexible matrices. We usually have to use both. For example, it is quite costly to insert new data into a data frame, perhaps by df[3,4]=42, but not to do the same for a matrix.

A data frame is a two-dimensional structure in which each column contains values of one variable and each row contains one set of values, or “observation”, from each column. It is perhaps the most common way of storing data in R and the one we will use the most.

One of the main advantages of a data frame in comparison to a matrix is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to only one data type. Keep in mind that a data frame needs all its columns to be of the same length.

6.8.1 Accessing the data from columns

We can access data from columns by number, like df[,3] but since all the columns have names, it is usually much better to access them by column name, like df$returns.
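The different ways of getting at a column can be sketched on a small made-up data frame; the name d, the column names and the numbers are illustrative only.

```r
d = data.frame(date = as.Date(c("2024-01-02", "2024-01-03", "2024-01-04")),
               price = c(100, 102, 101),
               returns = c(NA, 0.02, -0.0098))  # made-up numbers

d[, 3]               # third column, by position
d$returns            # the same column, by name
d[["returns"]]       # the same again, handy when the name is in a variable
d[d$price > 100, ]   # only the rows where the price exceeds 100
```

Access by name keeps working if columns are later added or reordered, which is why it is usually the safer choice.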

6.8.2 Creating a data frame from scratch

There are several different ways to create a data frame. One is loading from a file, which we do later. Alternatively, we might want to create a data frame from a list of vectors. This can easily be done with the data.frame() function:

df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3))

You have to specify the name of each column, and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:

str(df)  # display the structure of the data frame
'data.frame':   3 obs. of  4 variables:
 $ col1: int  1 2 3
 $ col2: chr  "A" "B" "C"
 $ col3: logi  TRUE TRUE FALSE
 $ col4: num  1 2.2 3.3
dim(df)  # dimension
[1] 3 4
colnames(df)  # column names
[1] "col1" "col2" "col3" "col4"

6.8.3 Transforming a different object into a Data Frame

You might want to transform a matrix into a data frame (or vice versa). For example, you need an object to be a matrix to perform linear algebra operations, but you would like to keep the results as a data frame after the operations. You can easily switch from matrix to data frame using as.data.frame(), and analogously from data frame to matrix with as.matrix(). Remember, however, that all columns need to have the same data type to be a matrix.

For example, let’s say we have the matrix:

my_matrix <- matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
class(my_matrix)
[1] "matrix" "array" 
my_matrix
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10

We can now transform it into a data frame:

df = as.data.frame(my_matrix)
class(df)
[1] "data.frame"
df
  V1 V2
1  1  2
2  3  4
3  5  6
4  7  8
5  9 10
str(df)
'data.frame':   5 obs. of  2 variables:
 $ V1: int  1 3 5 7 9
 $ V2: int  2 4 6 8 10

And we can change the column names:

colnames(df) = c("Odd", "Even")
df
  Odd Even
1   1    2
2   3    4
3   5    6
4   7    8
5   9   10

6.8.4 Alternatives to data frames

The R data frames suffer from having been proposed decades ago and therefore lack some very useful features one might expect, and they can be very slow. In response, there are two alternatives, each with their own pros and cons. We will not use either of those in this book because we want to use base R packages wherever possible.

6.8.4.1 Data.table

The data.table class is designed for performance and features. It is by far the fastest when using large datasets, and it also has very useful features built in that really facilitate data work. In our work we use data.table.

6.8.4.2 Tidy

The other alternative is the tibble, part of the tidyverse. The tidyverse has a lot of useful features, with the richest data manipulation tools in R.

6.9 Source files

It can be very useful to include other R files in some R code. The function to do that is source('file.r').

We make use of source when we include an R file called functions.r containing useful functions that are used frequently.
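To see how source() behaves, the sketch below writes a one-line file to a temporary location and then sources it. The temporary file stands in for functions.r, and the function Square is a made-up example.

```r
# Write a small file of functions to a temporary location
f = tempfile(fileext = ".r")
writeLines("Square = function(x) x^2", f)

source(f)      # runs the file, so Square() is now defined
Square(4)      # 16
```

In practice the file would live alongside the main script and be sourced near the top, next to the library() calls.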