s = 'This is a quote character " in the middle of a string\n'
cat(s)
This is a quote character " in the middle of a string
There are many resources for learning R, and we do not wish to duplicate them here. See the discussion in Section 6.1.
However, there are particular conventions we use and some parts of R that are particularly useful, and we provide an overview below. We suggest being aware of the issues raised by Patrick Burns in The R Inferno.
There are many free and commercial resources for learning R. The Big Book of R is a very comprehensive list of resources, but it can be hard to identify what is particularly useful.
Here are some resources you might find useful.
If you run into a specific programming problem, in R or in any other language, Stack Overflow is an excellent place to ask for help.
There are several ways one can use R.
It is usually best to use the free RStudio software for programming in R.
The RStudio vendor has recently come out with a product designed for producing reports in R, as well as Python and Julia, called Quarto, which is what we use for these notes.
Jupyter notebooks are files with the extension .ipynb, produced by Project Jupyter, which contain both computer code (e.g. R, Python, Julia) and rich text elements (paragraphs, equations, images, links, etc.). They are a great way to make reports and documents. They can be edited on a web server and exported as HTML, PDF via LaTeX, or other file formats. They are interactive, and you can run pieces of code independently.
Some users find the Jupyter way of coding really good, while others find it very confining.
One of the best text editors is Visual Studio Code, or VS Code, and it provides a very useful extension for R. VS Code is our go-to editor for most things, and is how we made these notes, along with Quarto.
R comes with a basic editor and environment for executing code. It has the advantage of being very lightweight and efficient, but is rather basic. It is what we generally use for writing R code, because of its simplicity and light weight.
We can also access R on the command line, which can be very useful, perhaps for long parallel applications or when using remote computers.
R accepts both single quotes ' and double quotes " for strings, and you can use either. That is particularly useful if you have to include a quotation mark inside a string, like
This is a quote character " in the middle of a string
The special character \n means a new line, which is quite handy for printing.
= or <-
By convention, R uses <- for assignment, and not the equal sign =. You can use both in the vast majority of cases, but there is a single, very infrequent, exception where one needs to use <-. We generally prefer =.
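The text does not say which exception it has in mind. As a minimal sketch, one place where the two operators are known to behave differently is inside a function call, where = names an argument and <- performs an assignment. The variable names below are purely illustrative.

```r
# At the top level the two operators do the same thing
a = 1
b <- 2

# Inside a function call they differ:
m1 = mean(x = c(1, 2, 3))   # = names the argument x of mean(); no variable x is created
m2 = mean(y <- c(4, 5, 6))  # <- creates y as a side effect and passes its value to mean()

cat("m1 =", m1, " m2 =", m2, "\n")
cat("y exists after the call:", exists("y"), "\n")
```
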
<<-
Global variables are seen as undesirable, and for good reason. Sometimes, however, they cannot be avoided without making the code more complex, which makes for a tricky tradeoff. We use <<- to put something into the global namespace.
GlobalVariable <<- 123.456
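As an illustration only, the sketch below uses <<- inside a function to update a variable defined outside it. The names Counter and Increment are hypothetical and do not come from the text.

```r
Counter = 0   # defined outside the function

Increment = function() {
  # <<- looks for Counter in the enclosing environments and updates it there,
  # instead of creating a new local variable
  Counter <<- Counter + 1
  invisible(Counter)
}

Increment()
Increment()
cat("Counter is now", Counter, "\n")
```

With the ordinary = or <- inside the function, Counter would remain unchanged outside it.
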
cat() vs. print()
There are two ways one can print to the screen and to text files in R, cat() and print(). The former allows for all sorts of text formatting, while the latter simply dumps something on the screen. They both have their uses.
R comes with a large number of functions. Below is a list of some of those most widely used in this book.
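A short sketch of the difference, using made-up values. The file name out.txt and the use of a temporary directory are assumptions for the example.

```r
x = c(1.5, 2.25)

print(x)                                # prints the value with an index prefix
cat("x =", x, "\n")                     # plain text; the newline must be added explicitly
cat(sprintf("Mean: %.3f\n", mean(x)))   # sprintf() gives control over the formatting

# cat() can also write to a text file
OutFile = file.path(tempdir(), "out.txt")
cat("Mean:", mean(x), "\n", file = OutFile)
```
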
head: return the first part of an object
tail: return the last part of an object
cbind: combine by column
rbind: combine by row
cat: concatenate and print
print: print values
paste and paste0: concatenate strings
R provides functions for just about every distribution imaginable. We will use several of those: the normal, Student-t (and skewed Student-t), chi-square, binomial and Bernoulli. They all provide the PDF, CDF, inverse CDF and random numbers. The first letter of the function name indicates which of these four types it is (d for the PDF, p for the CDF, q for the inverse CDF and r for random numbers), and the remainder of the name is the distribution, for example:
dnorm, pnorm, qnorm, rnorm;
dt, pt, qt, rt.
We can easily simulate random numbers and do that quite frequently. One should always set the seed with set.seed().
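The naming scheme can be sketched as follows. The seed value and the degrees of freedom are arbitrary choices for the example.

```r
# Setting the seed makes the simulated numbers reproducible
set.seed(888)
x1 = rnorm(3)
set.seed(888)
x2 = rnorm(3)
cat("Same draws after resetting the seed:", identical(x1, x2), "\n")

# d = PDF, p = CDF, q = inverse CDF, r = random numbers
Density  = dnorm(0)            # standard normal density at 0
Prob     = pnorm(1.96)         # P(Z <= 1.96)
Quantile = qnorm(0.975)        # the 97.5% quantile of the standard normal
QuantT   = qt(0.975, df = 5)   # the same quantile for a Student-t with 5 degrees of freedom

cat(Density, Prob, Quantile, QuantT, "\n")
```
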
As standard, R comes with a lot of functionality, but the strength of R is in all the packages available for it. The ecosystem is much richer than for any other language when it comes to statistics. Some of these packages come with R, but most have to be downloaded separately, either using the install.packages() command or a menu in RStudio.
We load the packages using the library() command. Some of them come with annoying start-up messages, which can be suppressed with the suppressPackageStartupMessages() command.
Best practice is to load all the packages used in a code file at the top of that file.
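A minimal sketch of what the top of a code file might look like. It assumes that parallel ships with R, and it only loads zoo if it happens to be installed, so the example runs either way. The variable HaveZoo is hypothetical.

```r
# Run once to download packages (commented out so the script does not reinstall every time)
# install.packages(c("zoo", "moments"))

# Load packages at the top of the file, silencing start-up messages
suppressPackageStartupMessages(library(parallel))

# requireNamespace() checks whether a package is installed without attaching it
HaveZoo = requireNamespace("zoo", quietly = TRUE)
if (HaveZoo) suppressPackageStartupMessages(library(zoo))

cat("zoo available:", HaveZoo, "\n")
```
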
Here are the packages we make most use of in this book.
reshape2: re-shape data frames. Very useful when data is arranged in an unfriendly way;
moments: skewness and kurtosis;
tseries: time series analysis;
zoo: time series objects;
lubridate: date manipulation;
car: QQ plots;
parallel: multi-core calculations;
nloptr: optimisation algorithms;
rugarch: univariate volatility models;
rmgarch: multivariate volatility models.
Objects in R can be of different classes. For example, we can have a vector, which is an ordered array of observations of data of the same type (numbers, characters, logicals). We can also have a matrix, which is a rectangular arrangement of elements of the same data type. You can check what class an object is by running class(object).
The most used variable type is a real number, followed by integers and strings.
[1] 6.3
[1] "numeric"
[1] "numeric"
[1] "character"
To print variables we can use cat() or print(). And to turn numbers into strings we can use paste() or paste0(). In strings, a new line is \n.
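A small sketch of turning a number into a string and printing it. The value 6.3 is taken from the output above, while the labels are made up.

```r
x = 6.3

s1 = paste("Value is", x)   # paste() separates its arguments with a space
s2 = paste0("Value_", x)    # paste0() uses no separator

cat(s1, "\n")               # cat() needs an explicit new line
print(s2)
cat("Class of x:", class(x), " Class of s1:", class(s1), "\n")
```
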
We use the special value NA (not available) to indicate a missing value, i.e. we don’t know what the value is. This becomes useful in backtesting.
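As a sketch of how NA behaves in calculations, with made-up numbers:

```r
x = c(1, NA, 3)

Missing = is.na(x)               # which elements are missing
Mean1 = mean(x)                  # any calculation involving NA returns NA by default
Mean2 = mean(x, na.rm = TRUE)    # na.rm = TRUE drops the missing values first

print(Missing)
cat("Mean with NA:", Mean1, " Mean without NA:", Mean2, "\n")
```
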
We use logical variables, which are either TRUE or FALSE, quite often. Note that they are spelt in all uppercase.
R comes with vectors. Note that R does not know if they are column vectors or row vectors, which becomes important in matrix algebra.
R can create two and three-dimensional matrices. We usually only do the two-dimensional type, but will encounter the three-dimensional in the multivariate volatility models.
Matrices have column names which can be quite useful.
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] NA NA
[,1] [,2]
[1,] 3 3
[2,] 3 3
[3,] 3 3
v v
[1,] 3.0 3.0
[2,] 9.0 9.0
[3,] 21.0 21.0
[4,] 9.0 9.0
[5,] 1.2 1.2
[,1] [,2] [,3] [,4] [,5]
v 3 9 21 9 1.2
v 3 9 21 9 1.2
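The code that produced the output above is not shown. One way to obtain objects of the same shape is sketched below, where the values of v are read off the printed output and all other names are hypothetical.

```r
v = c(3, 9, 21, 9, 1.2)                 # a vector has a length but no dimensions
cat("length:", length(v), " dim is NULL:", is.null(dim(v)), "\n")

EmptyM = matrix(NA, nrow = 3, ncol = 2) # 3 x 2 matrix filled with NA
ThreeM = matrix(3, nrow = 3, ncol = 2)  # 3 x 2 matrix filled with the number 3

Cols = cbind(v, v)                      # v becomes the columns: 5 x 2
Rows = rbind(v, v)                      # v becomes the rows: 2 x 5

Cube = array(0, dim = c(2, 2, 3))       # a three-dimensional array

print(dim(Cols))
print(dim(Rows))
print(dim(Cube))
```
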
We can access individual elements of matrices and vectors.
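A brief sketch of indexing with square brackets, using a small made-up matrix. Note that R fills a matrix column by column.

```r
v = c(3, 9, 21, 9, 1.2)
m = matrix(1:6, nrow = 3, ncol = 2)   # first column is 1,2,3 and second is 4,5,6

Second = v[2]      # second element of the vector
Corner = m[3, 2]   # row 3, column 2
Col1   = m[, 1]    # the whole first column
Row2   = m[2, ]    # the whole second row

cat(Second, Corner, "\n")
print(Col1)
print(Row2)
```
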
We can name the columns with colnames(). Unfortunately, that command name is different from what we use for data frames below.
[,1] [,2]
[1,] 2.0281678 -0.80251957
[2,] -2.2168745 -1.79224083
[3,] 0.7583962 -0.04203245
[4,] -1.3061853 2.15004262
Stock A Stock B
[1,] 2.0281678 -0.80251957
[2,] -2.2168745 -1.79224083
[3,] 0.7583962 -0.04203245
[4,] -1.3061853 2.15004262
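The output above could come from something like the following sketch. The seed is an arbitrary assumption, so the numbers will not match those printed above.

```r
set.seed(42)                               # arbitrary seed, for reproducibility
m = matrix(rnorm(8), nrow = 4, ncol = 2)   # 4 x 2 matrix of standard normal draws
print(m)

colnames(m) = c("Stock A", "Stock B")      # name the columns
print(m)

# Columns can now be selected by name
StockA = m[, "Stock A"]
print(StockA)
```
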
We quite often need to keep track of a lot of variables that belong together, and then the R list is very useful. It allows us to group multiple variables in one list.
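A minimal sketch of grouping related variables in a list. The list name Model and its elements are hypothetical.

```r
# Elements of a list can be of different types and lengths
Model = list(
  name    = "example",
  lags    = 1,
  returns = c(0.01, -0.02, 0.005)
)

cat("Name:", Model$name, "\n")         # access by name with $
cat("Lags:", Model[["lags"]], "\n")    # or with double square brackets

Model$fitted = TRUE                    # add a new element
print(names(Model))
```
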
R is an old language that has evolved erratically over time. This means it has many undesirable features that can lead to difficult bugs. One is a variable type called NULL, meaning nothing. While it can be useful, the problem is that NULL is not only used inconsistently, it can be outright dangerous.
which makes sense. But consider this:
The variable What does not exist in the list l, but we can access it by l$What.
And when doing math with it, it fails silently.
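The behaviour described above can be sketched as follows. The contents of the list l are made up, and only the name What comes from the text.

```r
l = list(a = 1, b = 2)

Missing = l$What             # What is not in the list, yet this gives no error
cat("Is NULL:", is.null(Missing), "\n")

Result = l$What * 2          # no error or warning, just an empty numeric vector
cat("Length of result:", length(Result), "\n")

# A safer pattern is to check that the name exists before using it
cat("What in list:", "What" %in% names(l), "\n")
```
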
When we need to delete columns from a data frame or an element from a list, we assign NULL to it.
df$DeleteMe = NULL
I know, not very intuitive.
When dealing with vectors and matrices, * is element-by-element multiplication, while %*% is matrix multiplication. This becomes important when dealing with portfolios. Note that R vectors only have one dimension; they are not row or column vectors.
[1] 0.3 0.7
[,1] [,2]
[1,] 3.834435 61.21745
[2,] 14.149569 55.33484
[3,] 80.638553 85.35008
[4,] 26.668568 46.97785
[5,] 4.270205 39.76166
[,1] [,2]
[1,] 1.150330 42.85222
[2,] 9.904698 16.60045
[3,] 24.191566 59.74505
[4,] 18.667997 14.09336
[5,] 1.281062 27.83316
[,1] [,2] [,3] [,4] [,5]
[1,] 44.00255 42.97926 83.93662 40.88507 29.11422
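To make the difference concrete, here is a small sketch with made-up numbers, where w plays the role of portfolio weights and each column of Prices is one asset. Both names are hypothetical.

```r
w = c(0.3, 0.7)
Prices = matrix(c(1, 2, 3, 10, 20, 30), nrow = 3, ncol = 2)

# Matrix multiplication: one portfolio value per row
Portfolio = Prices %*% w
print(Portfolio)

# Element-by-element multiplication recycles w down the columns,
# so the weights are NOT applied column by column
Recycled = Prices * w
print(Recycled)

# To scale each column by its own weight, sweep() over the columns can be used
Weighted = sweep(Prices, 2, w, "*")
print(Weighted)
```

In this sketch, the row sums of Weighted equal the matrix product, while Recycled mixes the two weights within each column.
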
Matrices have some limitations. Most importantly, all the columns must be of the same type; for example, we can’t have a column with a string and another with numbers. To deal with that, R comes with data frames, which can be thought of as more flexible matrices. We usually have to use both. For example, it is quite costly to insert new data into a data frame, perhaps by df[3,4]=42, but not to do the same for a matrix.
A data frame is a two-dimensional structure in which each column contains values of one variable and each row contains one set of values, or “observation”, from each column. It is perhaps the most common way of storing data in R and the one we will use the most.
One of the main advantages of a data frame in comparison to a matrix is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to only one data type. Keep in mind that a data frame needs all its columns to be of the same length.
We can access data from columns by number, like df[,3], but since all the columns have names, it is usually much better to access them by column name, like df$returns.
There are several different ways to create a data frame. One is loading from a file, which we do later. Alternatively, we might want to create a data frame from a list of vectors. This can easily be done with the data.frame() function:
You have to specify the name of each column, and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:
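A minimal sketch of creating a data frame from vectors and checking its structure. The column names and values are made up for the example.

```r
df = data.frame(
  date    = as.Date(c("2024-01-02", "2024-01-03", "2024-01-04")),
  ticker  = c("A", "A", "A"),
  returns = c(0.01, -0.02, 0.005)
)

str(df)            # shows the type of each column

ByName   = df$returns   # access a column by name
ByNumber = df[, 3]      # or by position
print(ByName)
```
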
You might want to transform a matrix into a data frame (or vice versa). For example, you need an object to be a matrix to perform linear algebra operations, but you would like to keep the results as a data frame after the operations. You can easily switch from matrix to data frame using as.data.frame() (and analogously, from data frame to matrix with as.matrix(); however, remember that all columns need to have the same data type to be a matrix).
For example, let’s say we have the matrix:
[1] "matrix" "array"
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
We can now transform it into a data frame:
[1] "data.frame"
V1 V2
1 1 2
2 3 4
3 5 6
4 7 8
5 9 10
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 3 5 7 9
$ V2: int 2 4 6 8 10
And we can change the column names:
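The steps above might look like the following sketch. The matrix is built to match the printed output, while the new column names Odd and Even are hypothetical.

```r
m = matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)   # rows are 1 2, 3 4, ...
print(class(m))

df = as.data.frame(m)      # columns get the default names V1 and V2
str(df)

names(df) = c("Odd", "Even")   # rename the columns
print(df)

Back = as.matrix(df)       # and back to a matrix
print(class(Back))
```
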
R data frames suffer from having been designed decades ago; they lack some very useful features one might expect, and they can be very slow. In response, there are two alternatives, each with its own pros and cons. We will not use either of those in this book because we want to use base R packages wherever possible.
The data.table class is designed for performance and features. It is by far the fastest when using large datasets, and it also has very useful features built in that really facilitate data work. In our work we use data.table.
The other alternative is tidy, part of the tidyverse. It has a lot of useful features, with the richest data manipulation tools in R.
It can be very useful to include other R files in some R code. The function to do that is source('file.r').
We make use of source() when we include an R file called functions.r containing useful functions that are used frequently.
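To illustrate, the sketch below writes a tiny functions.r to a temporary directory and then includes it. The function Square and the use of tempdir() are assumptions made so the example is self-contained; in practice the file would already exist.

```r
# Create a small file containing one function (only so the example can run on its own)
FunctionFile = file.path(tempdir(), "functions.r")
writeLines("Square = function(x) x^2", FunctionFile)

# source() runs the file, which makes its functions available
source(FunctionFile)

cat("Square(4) =", Square(4), "\n")
```
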