There are many resources for learning R, and we do not wish to duplicate that here. See discussion in sec-rresources.
However, there are particular conventions we use and some parts of the R language that are particularly useful, and we provide an overview below. We suggest being aware of the issues raised by Patrick Burns in his The R Inferno.
R uses both single quotes ' and double quotes " for strings, and you can use either. That is particularly useful if you have to include a quotation mark inside a string, like:
s = 'This is a quote character " in the middle of a string\n'
cat(s)
This is a quote character " in the middle of a string
The special character \n means a new line, which is quite handy for printing.
= or <-
By convention, R uses <- for assignment, and not the equal sign =. You can use both in the vast majority of cases, but there is a single, very infrequent, exception where one needs to use <-. We generally prefer =.
x = 3
y <- 3
x == y
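One place where the two operators behave differently is inside a function call, which may or may not be the exception referred to above. There, = names an argument while <- performs an assignment. A minimal sketch, where the function square and the variable names are made up for illustration:

```r
square <- function(x) x^2            # hypothetical helper, for illustration only

# Inside a call, <- assigns y_arrow and passes its value on as the argument
a <- square(y_arrow <- 3)
a                                    # 9
y_arrow                              # 3

# Inside a call, = is read as an argument name, so nothing is assigned;
# square() has no argument called y_equal, so the call fails
res <- tryCatch(square(y_equal = 3), error = function(e) "error")
res                                  # "error"
exists("y_equal")                    # FALSE
```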
<<-
Global variables are seen as undesirable, for good reason. Sometimes they cannot be avoided without making the code more complex, so there is a tricky tradeoff. We use <<- to put something into the global namespace.
GlobalVariable <<- 123.456
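The difference matters mostly inside functions. The following sketch, with made-up names, contrasts an ordinary assignment, which only creates a local copy, with <<-, which changes the variable defined outside the function:

```r
counter <- 0                       # global variable, name chosen for illustration

increment_local <- function() {
  counter <- counter + 1           # creates a local variable; the global is unchanged
  invisible(counter)
}

increment_global <- function() {
  counter <<- counter + 1          # modifies the global counter
  invisible(counter)
}

increment_local()
after_local <- counter             # still 0
increment_global()
increment_global()
counter                            # 2
```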
cat() vs. print()
There are two ways to print to the screen and to text files in R, cat() and print(). The former allows for all sorts of text formatting while the latter simply dumps something on the screen. They both have their uses.
cat("This is the answer. x=",x,", and y=",y,".\n")
print(x)
This is the answer. x= 3 , and y= 3 .
[1] 3
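As a further illustration, the sketch below shows a few formatting options with cat() and how it can write to a text file. The file is a temporary one created only for this example, and sprintf() is an additional base R function not discussed above.

```r
x <- 3
v <- c(a = 1.5, b = 2.5)

cat("x=", x, "\n", sep = "")                     # sep = "" removes the spaces: x=3
cat("Values:", v, "\n")                          # Values: 1.5 2.5
print(v)                                         # print() keeps the names a and b
cat(sprintf("x with two decimals: %.2f\n", x))   # x with two decimals: 3.00

# cat() can also write to a text file through its file argument
out_file <- tempfile(fileext = ".txt")           # temporary file, for illustration
cat("x=", x, "\n", sep = "", file = out_file)
readLines(out_file)                              # "x=3"
```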
R comes with a large number of functions. Below is a list of some of those most widely used in this book.
head
tail
cbind
rbind
cat
print
paste and paste0
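A quick sketch of what these functions do, using small made-up inputs:

```r
v <- 1:10
head(v, 3)                 # first three elements: 1 2 3
tail(v, 3)                 # last three elements: 8 9 10

a <- c(1, 2)
b <- c(3, 4)
cb <- cbind(a, b)          # 2 x 2 matrix with a and b as columns
rb <- rbind(a, b)          # 2 x 2 matrix with a and b as rows

paste("VaR", 0.99)         # "VaR 0.99", joined with a space
paste0("VaR", 0.99)        # "VaR0.99", joined without a separator
```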
R provides functions for just about every distribution imaginable. We will use several of those: the normal, Student-t (and skewed Student-t), chi-square, binomial and Bernoulli. Each comes with four functions, for the PDF, the CDF, the inverse CDF and random numbers. The first letter of the function name (d, p, q or r) indicates which of these four it is, and the remainder of the name is the distribution, for example:
dnorm, pnorm, qnorm, rnorm; dt, pt, qt, rt.
par(mar=c(4,4,0.2,0.1))
x=seq(-3,3,length=100)
z=seq(0,1,length=100)
plot(x,dnorm(x))
plot(x,pnorm(x))
plot(z,qnorm(z))
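The relationship between the four function types can also be checked numerically. A small sketch, where the probability 0.975 and the 4 degrees of freedom are arbitrary choices:

```r
p <- 0.975
q <- qnorm(p)                     # inverse CDF: about 1.96
pnorm(q)                          # CDF undoes the inverse CDF: 0.975
dnorm(0)                          # density at zero: 1/sqrt(2*pi), about 0.3989
qt(p, df = 4)                     # about 2.776, the Student-t has fatter tails

set.seed(1)
r <- rnorm(5, mean = 0, sd = 2)   # five random draws with standard deviation 2
```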
We can easily simulate random numbers and do that quite frequently. One should always set the seed with set.seed(), so that the results can be reproduced.
rnorm(1)
rnorm(3)
rnorm(3)
set.seed(666)
rnorm(3)
set.seed(666)
rnorm(3)
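The effect of the seed can be verified directly. In this sketch the same seed gives the same draws, while continuing without resetting it gives new ones:

```r
set.seed(666)
first <- rnorm(3)
second <- rnorm(3)          # continues the random stream, so differs from first

set.seed(666)
again <- rnorm(3)           # same seed, so the same numbers as first

identical(first, again)     # TRUE
identical(first, second)    # FALSE
```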
R has several ways to plot data. The simplest, known as base plots, is what we will use for the remainder of this book. You can make better looking plots with the ggplot2 package, which is used by the BBC and New York Times for their plots. Other packages exist, like plotly, which is especially useful for plots viewed in a browser.
There are four reasons why we generally use base plots.
The default R plot is very ugly.
par(mar=c(4,4,0.2,0.1))
plot(x,dnorm(x))
But there are many ways it can be made more visually appealing. Furthermore, it is always helpful to control the plot margins, and quite possibly set other characteristics.
The margins are set by par(mar=c(bottom, left side, top, right side)).
par(mar=c(4,4,1,0))
plot(x,dnorm(x),
type='l',
lwd=1.5,
col="blue",
las=1,
bty='l',
xlab="Outcomes",
ylab="Probability",
main="The normal density"
)
w=seq(-3,3,by=0.5)
axis(1,at=w,label=FALSE,tcl=-0.3)
As standard, R comes with a lot of functionality, but the strength of R is in all the packages available for it. The ecosystem is much richer than for any other language when it comes to statistics. Some of these packages come with R, but most have to be downloaded separately, either using the install.packages() command, or a menu in RStudio.
We load the packages using the library() command. Some of them come with annoying start-up messages, which can be suppressed by the suppressPackageStartupMessages() command.
Best practice is to load all the packages used in a code file at the top.
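A sketch of how the top of a code file might look. It uses parallel because that package ships with R, and zoo only as an example of an add-on package. The requireNamespace() check is an extra safeguard, not something the text requires.

```r
# install.packages("zoo")   # run once if the package is not installed (needs internet)

# parallel ships with R, so it can be loaded without installing anything
suppressPackageStartupMessages(library(parallel))

# Check whether an add-on package is available before relying on it
has_zoo <- requireNamespace("zoo", quietly = TRUE)
if (!has_zoo) message("Package zoo is not installed")
```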
Here are the packages we make most use of in this book.
reshape2: reshape data frames. Very useful when data is arranged in an unfriendly way;
moments: skewness and kurtosis;
tseries: time series analysis;
zoo: time series objects;
lubridate: date manipulation;
car: QQ plots;
parallel: multi-core calculations;
nloptr: optimisation algorithms;
rugarch: univariate volatility models;
rmgarch: multivariate volatility models.
Objects in R can be of different classes. For example, we can have a vector, which is an ordered array of observations of data of the same type (numbers, characters, logicals). We can also have a matrix, which is a rectangular arrangement of elements of the same data type. You can check what class an object is by running class(object).
The most used variable type is a real number, followed by integers and strings.
x = 1
y = 5.3
w = 0.034553
z = "Lean R"
x+y
class(x)
class(y)
class(z)
x+z # will not work because z is a string
ERROR: Error in x + z: non-numeric argument to binary operator
To print variables we can use cat() or print(). And to turn numbers into strings we can use paste() or paste0(). In strings, a new line is \n.
cat(x)
cat(y)
cat('\n',x,'\n')
cat(y,'\n')
cat(w,'\n')
cat("Important number is x=",x,"and y=",y,"\n")
s=paste0("The return is ",round(100*w,1),"%")
cat(s,"\n")
1
5.3
1
5.3
0.034553
Important number is x= 1 and y= 5.3
The return is 3.5%
We use a special value NA to indicate a missing value (not available), i.e. we don’t know what the value is. This becomes useful in backtesting.
a = NA
a
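Because NA propagates through calculations, it helps to know how to detect and skip it. A small sketch with made-up numbers:

```r
returns <- c(0.01, NA, -0.02)    # made-up numbers with one missing value

is.na(returns)                   # FALSE TRUE FALSE
returns + 1                      # 1.01 NA 0.98, the NA carries through
mean(returns)                    # NA
mean(returns, na.rm = TRUE)      # -0.005, computed from the two known values
sum(is.na(returns))              # 1, the number of missing values
```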
We quite often use logical variables, which are either TRUE or FALSE. Note they are spelt all uppercase.
W = TRUE
r = !W
W
r
R comes with vectors. Note that R does not know if they are column vectors or row vectors, which becomes important in matrix algebra.
v = vector(length=4)
v
v[] = NA
v
v[2:3] = 2
v
v=seq(1,5)
v
v=seq(-1,2,by=0.5)
v
v=c(1,3,7,3,0.4)*3
v
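That a vector is neither a row nor a column can be seen from its missing dimension attribute. A minimal sketch that turns a vector into an explicit column or row when matrix algebra needs one:

```r
v <- c(1, 2, 3)
dim(v)                            # NULL, a vector has no dimensions
length(v)                         # 3

col_v <- matrix(v, ncol = 1)      # explicit column vector
row_v <- t(v)                     # t() of a plain vector gives a row vector
dim(col_v)                        # 3 1
dim(row_v)                        # 1 3
```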
R can create two and three-dimensional matrices. We usually only do the two-dimensional type, but will encounter the three-dimensional in the multivariate volatility models.
Matrices have column names which can be quite useful.
m=matrix(ncol=2,nrow=3)
m
m=matrix(3,ncol=2,nrow=3)
m
m=cbind(v,v)
m
m=rbind(v,v)
m
NA | NA |
NA | NA |
NA | NA |

3 | 3 |
3 | 3 |
3 | 3 |

v | v |
---|---|
3.0 | 3.0 |
9.0 | 9.0 |
21.0 | 21.0 |
9.0 | 9.0 |
1.2 | 1.2 |

v | 3 | 9 | 21 | 9 | 1.2 |
v | 3 | 9 | 21 | 9 | 1.2 |
We can access individual elements of matrices and vectors:
m[1,2]
m[,2]
m[2,]
m[1,3:5]
v[2:3]
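A few more indexing patterns, sketched on a small made-up matrix and vector. Negative and logical indices and the drop argument are standard base R features that go slightly beyond the examples above.

```r
m <- matrix(1:10, nrow = 2)      # 2 x 5 matrix, filled column by column
v <- c(3, 9, 21, 9, 1.2)

m[1, 2]                          # row 1, column 2: 3
m[, 2]                           # whole second column: 3 4
m[2, ]                           # whole second row: 2 4 6 8 10
m[1, 3:5]                        # row 1, columns 3 to 5: 5 7 9

v[-1]                            # everything except the first element
v[v > 5]                         # elements larger than 5: 9 21 9
dim(m[, 2, drop = FALSE])        # 2 1, drop = FALSE keeps the matrix shape
```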
We can name the columns with colnames(). Unfortunately, that command name is different from what we use for data frames below.
m=cbind(rnorm(4),rnorm(4))
m
colnames(m)=c("Stock A","Stock B")
m
2.0281678 | -0.80251957 |
-2.2168745 | -1.79224083 |
0.7583962 | -0.04203245 |
-1.3061853 | 2.15004262 |
Stock A | Stock B |
---|---|
2.0281678 | -0.80251957 |
-2.2168745 | -1.79224083 |
0.7583962 | -0.04203245 |
-1.3061853 | 2.15004262 |
We quite often need to keep track of a lot of variables that belong together, and then the R list is very useful. It allows us to group multiple variables in one list.
l=list()
l$a=2
l$b="R is great"
l=list(l=c(2,3),b="Risk")
w=list()
w$q="my list"
w$l = l
w$df=data.frame(cbind(c(1,2),c("VaR","ES")))
w
X1 | X2 |
---|---|
<chr> | <chr> |
1 | VaR |
2 | ES |
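Elements of a list can be retrieved in several ways. A short sketch, using a made-up list similar to the one above:

```r
w <- list(q = "my list", l = c(2, 3))

w$q                 # "my list"
w[["l"]]            # 2 3, the same as w$l
w["l"]              # single brackets return a list of length one
names(w)            # "q" "l"

w$extra <- TRUE     # add a new element
length(w)           # 3
```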
R is an old language that has evolved erratically over time. This means it has many undesirable features that can lead to difficult bugs. One is a variable type called NULL, meaning nothing. While it can be useful, the problem is that NULL is not only used inconsistently, it can be outright dangerous.
1+What # variable What is not defined
ERROR: Error in eval(expr, envir, enclos): object 'What' not found
which makes sense. But consider this
l=list()
l$What
NULL
The variable What does not exist in the list l, but we can access it by l$What.
l$What+3
And when doing math with it, it fails silently, returning an empty numeric vector rather than an error.
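One way to guard against this is to test for NULL explicitly. The sketch below shows what the arithmetic returns and defines a hypothetical helper, get_element(), that stops with an error instead of quietly handing back NULL:

```r
l <- list()
res <- l$What + 3
length(res)                 # 0, the result is an empty numeric vector
is.null(l$What)             # TRUE

# Hypothetical helper: fail loudly when the element does not exist
get_element <- function(lst, name) {
  if (!name %in% names(lst)) stop("no element called ", name)
  lst[[name]]
}

outcome <- tryCatch(get_element(l, "What"), error = function(e) "error")
outcome                     # "error"
```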
When we need to delete columns from a data frame or an element from a list, we assign NULL to it.
df$DeleteMe = NULL
I know, not very intuitive.
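A self-contained sketch of both cases, with a made-up data frame and list:

```r
df <- data.frame(keep = 1:3, DeleteMe = c("a", "b", "c"))
df$DeleteMe <- NULL          # removes the column
colnames(df)                 # "keep"

l <- list(a = 1, b = 2)
l$b <- NULL                  # removes the element
names(l)                     # "a"
```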
When dealing with vectors and matrices, * is element-by-element multiplication, while %*% is matrix multiplication. This becomes important when dealing with portfolios. Note that R vectors only have one dimension, they are not row or column vectors.
weight = c(0.3,0.7)
prices = cbind(runif(5)*100,runif(5)*100)
weight
prices
3.834435 | 61.21745 |
14.149569 | 55.33484 |
80.638553 | 85.35008 |
26.668568 | 46.97785 |
4.270205 | 39.76166 |
weight * prices # element-by-element multiplication
1.150330 | 42.85222 |
9.904698 | 16.60045 |
24.191566 | 59.74505 |
18.667997 | 14.09336 |
1.281062 | 27.83316 |
weight %*% prices # matrix multiplication
ERROR: Error in weight %*% prices: non-conformable arguments
weight %*% t(prices) # matrix multiplication
44.00255 | 42.97926 | 83.93662 | 40.88507 | 29.11422 |
prices %*% weight # matrix multiplication
44.00255 |
42.97926 |
83.93662 |
40.88507 |
29.11422 |
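One detail of the element-by-element output above is worth spelling out: R recycles the weight vector down the columns, so the first column is multiplied by 0.3, 0.7, 0.3, and so on (for instance 14.149569 times 0.7 gives 9.904698), not by 0.3 throughout. The sketch below uses made-up prices to show this, and uses sweep(), a base R function not mentioned in the text, as one possible way to scale each column by its own weight.

```r
prices <- cbind(c(10, 20, 30), c(100, 200, 300))   # made-up prices, 3 x 2
weight <- c(0.3, 0.7)

# weight is recycled down the columns, element by element:
# column 1 is multiplied by 0.3, 0.7, 0.3 and column 2 by 0.7, 0.3, 0.7
recycled <- weight * prices

# sweep() over the columns scales column 1 by 0.3 and column 2 by 0.7
weighted <- sweep(prices, 2, weight, "*")

# The row sums of the weighted prices equal the matrix product
portfolio <- prices %*% weight                     # 3 x 1 matrix: 73, 146, 219
all.equal(rowSums(weighted), as.vector(portfolio)) # TRUE
```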
Matrices have some limitations. They don’t have row names and all the columns must be of the same type; for example, we can’t have a column with a string and another with numbers. To deal with that, R comes with data frames, which can be thought of as more flexible matrices. We usually have to use both. For example, it is quite costly to insert new data into a data frame, perhaps by df[3,4]=42, but not to do the same for a matrix.
A data frame is a two-dimensional structure in which each column contains values of one variable and each row contains one set of values, or “observation”, from each column. It is perhaps the most common way of storing data in R and the one we will use the most.
One of the main advantages of a data frame in comparison to a matrix is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to only one data type. Keep in mind that a data frame needs all its columns to be of the same length.
We can access data from columns by number, like df[,3], but since all the columns have names, it is usually much better to access them by column name, like df$returns.
There are several different ways to create a data frame. One is loading from a file, which we do later. Alternatively, we might want to create a data frame from a list of vectors. This can easily be done with the data.frame() function:
df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3))
You have to specify the name of each column, and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:
str(df)
dim(df)
colnames(df)
'data.frame': 3 obs. of 4 variables:
$ col1: int 1 2 3
$ col2: chr "A" "B" "C"
$ col3: logi TRUE TRUE FALSE
$ col4: num 1 2.2 3.3
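Once the data frame exists, columns can be read by name or number, rows can be filtered, and new columns can be added. A short sketch on the same data, where col5 is a made-up name:

```r
df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3))

df$col4                        # access by name: 1.0 2.2 3.3
df[, 3]                        # access by number: TRUE TRUE FALSE
subset_rows <- df[df$col3, ]   # keep only the rows where col3 is TRUE
df$col5 <- df$col1 * df$col4   # add a new column computed from two others
dim(df)                        # 3 5
```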
You might want to transform a matrix into a data frame (or vice versa). For example, you need an object to be a matrix to perform linear algebra operations, but you would like to keep the results as a data frame after the operations. You can easily switch from matrix to data frame using as.data.frame() (and analogously, from data frame to matrix with as.matrix(), however remember all columns need to have the same data type to be a matrix).
For example, let’s say we have the matrix:
my_matrix <- matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
class(my_matrix)
my_matrix
1 | 2 |
3 | 4 |
5 | 6 |
7 | 8 |
9 | 10 |
We can now transform it into a data frame:
df = as.data.frame(my_matrix)
class(df)
df
str(df)
V1 | V2 |
---|---|
<int> | <int> |
1 | 2 |
3 | 4 |
5 | 6 |
7 | 8 |
9 | 10 |
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 3 5 7 9
$ V2: int 2 4 6 8 10
And we can change the column names:
colnames(df) = c("Odd", "Even")
df
Odd | Even |
---|---|
<int> | <int> |
1 | 2 |
3 | 4 |
5 | 6 |
7 | 8 |
9 | 10 |
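The round trip described above, from data frame to matrix for the linear algebra and back again, might look like the following sketch. The numbers are made up, and the last lines show what happens when the columns have different types.

```r
df <- data.frame(Odd = c(1, 3, 5), Even = c(2, 4, 6))

m <- as.matrix(df)                 # numeric matrix, column names are kept
cross <- t(m) %*% m                # 2 x 2 cross-product matrix
back <- as.data.frame(cross)       # store the result as a data frame again

# With mixed column types, as.matrix() converts everything to character
mixed <- data.frame(n = 1:2, s = c("VaR", "ES"))
is.character(as.matrix(mixed))     # TRUE
```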
The R data frames suffer from having been proposed decades ago and therefore lack some very useful features one might expect, and they can be very slow. In response, there are two alternatives, each with their own pros and cons. We will not use either of those in this book because we want to use base R packages wherever possible.
The data.table class is designed for performance and features. It is by far the fastest when using large datasets, and also has very useful features built into it that really facilitate data work. In our work we use data.table.
The other alternative is tidy, part of the tidyverse. It has a lot of useful features, with the richest data manipulation tools in R.
It can be very useful to include other R files in some R code. The function to do that is source('file.r').
We make use of source() when we include an R file called functions.r, containing useful functions that are used frequently.
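A self-contained sketch of the mechanism. To keep it runnable, it first writes a small functions.r to a temporary directory. The helper simple_returns() is hypothetical and only stands in for whatever such a file would actually contain.

```r
# Write a small file of helper functions to a temporary location, for illustration
functions_file <- file.path(tempdir(), "functions.r")
writeLines(c(
  "# hypothetical helper: simple returns from a vector of prices",
  "simple_returns <- function(p) diff(p) / head(p, -1)"
), functions_file)

source(functions_file)                  # runs the file and defines simple_returns()
simple_returns(c(100, 110, 99))         # 0.1 -0.1
```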