par(mar=c(5,5,5,5)) # set margin
plot(NULL, xlim=c(0,10), ylim=c(0,10), xlab="Margin 1", ylab="Margin 2") # generate plot
title(main = "Base Graphics in R", cex.main=1.5, font.main=4,
col.main="blue", line=3) # add title
points(8, 8, pch=0) # add point (8,8)
points(2, 2, pch=1) # add point (2,2)
text(c(2,8), c(2,8), label=c("(2,4)","(8,6)"), pos=1) # add texts
text(5, 5, label="Plot Region", cex=1.25, col="red") # add texts
mtext("Margin 3", side=3, line=1) # add margin texts
mtext("Margin 4", side =4, line=1) # add margin texts
for (i in 1:4){
for (j in 0:3){
mtext(paste("Line", j), side=i, line=j, cex=0.5, at=9)
} }
8 Plots
This chapter mainly covers basic data visualization techniques using base R plots. Majority of the plots in the textbook can be replicated using the methods discussed here. We will start with how R views the plot it is drawing on, graphic functions and plot customization, and then move on to producing histograms, box plots, time series plots, etc. More sophisticated techniques will be discussed towards the end along with a note on how to export plots, which is particularly useful when you want to generate professional high-resolution graphs in academic settings.
8.1 Introduction
R has several ways to plot data. The simplest, which is known as base plots, is the focus of this chapter and what we will use for the remainder of this book. It is easy to make better looking plots with the ggplot2
package used by the BBC and New York Times. Other packages exist, like plotly
, designed to generate interactive plots viewed in a browser and dygraphs. Two simple examples are included at the end of this chapter to briefly demonstrate how these two packages work.
There are five reasons why we tend to use base plots.
- They are simpler to use than the alternatives;
- The alternatives are more buggy;
- Base plots can be included in \(\LaTeX\) documents with the same font as the document itself, and it further supports \(\LaTeX\) equations;
- It is really hard to make sub-tickmarks with
ggplots
; - We aim to use as much of what is supplied with R as possible, without relying on other libraries.
Generally, the best graphics package for R is ggplot2
.
8.2 Base Graphics in R
The display is divided into several regions in R base graphics. The plot region, is where data and plots are drawn. R maintains a coordinate system within the plot region. The plot region is surrounded by margins. Margins are outside the main plot region and generally include axes, labels, and other texts. Margins are numbered clockwise from 1 to 4, starting from the bottom. As opposed to the coordinate system used in the plot region, R positions objects in the margin based on a count of lines. See the plot and codes below for an illustration.
The width of margins can be set by par(mar=c(bottom, left side, top, right side))
. This should precede the plot function, i.e., before the plot is generated. plot(NULL, xlim=c(0,10), ylim=c(0,10), xlab="Margin 1", ylab="Margin 2")
generates an empty plot with the range of \(x\) and \(y\) from 0 to 10. xlab
and ylab
specify the label of the \(x\) and \(y\) axes. The title of the plot can be set by title(main="Plot Title", ...)
, with parameters such as cex
, font
, col
specifying the size, font and colour respectively. text
adds texts in the plot region at the given coordinates, and mtext
adds texts in the margin, positioned by a count of lines.
Graphic functions in base R plots are divided into three categories:
- Low level functions that allow users to customise the plot. They add new components (line segments, arrows, points, texts, etc.) to an existing plot, often generated by high level functions first.
points
andmtext
in the codes above fall into this category; - High level functions that generate a complete plot, usually with labels and axes. Examples include histograms, box plots, scatter plots, etc. We will discuss them in detail in the next section;
- Interactive functions that allow you to add/extract information interactively from an existing plot, which are beyond the scope of our discussion.
Several useful low level functions that add new components to an existing plot are listed below:
lines(x, y, ...) # add lines
points(x, y, ...) # add points at (x,y)
text(x, y, ...) # add text at (x,y)
abline(a, b, ...) # add line y=ax+b
abline(h=a) # add a horizontal line y=a
abline(v=b) # add a vertical line x=b
polygon(x, y, ...) # add a closed polygon
segments(a, b, c, d, ...) # add a line segment from (a,b) to (c,d)
arrow(a, b, c, d, ...) # add an arrow from (a,b) to (c,d)
symbols(x, y, ...) # add shapes including circles, squares etc.
box(...) # draws a box around the plot
One should take advantage of the built-in help
command in R to see the details of each function. For instance, to check the options provided by arrows
, run ?arrows
in the R console. You can then find a detailed list of arguments and parameters that can be accepted by this function, such as length, angle, arrow types, etc.
As a practice, try to replicate the following graph with the functions listed above.
Instructions:
- The colour of the title should be brown, with size 1.2;
- Plot \(y=x\) in red, with line width 2;
- Draw \(x=2\) and \(y=2\) in dashed lines;
- Draw two polygons as shown above, one in blue and another in orange. There should be no boundaries for the orange one;
- Draw a green arrow from (8,2) to (6,2), the color should be set as
green4
; - Draw a circle with centre (9,1) and radius 1.
- Add (2,6) and (4,4) to the plot, with coordinates clearly marked above/below the points;
- Add a violet, dashed box to the plot.
You can start by
plot(NULL, type="n", xlim=c(0,10), ylim=c(0,10), xlab="X Label",
ylab="Y Label", bty="n", asp=1)
This will generate an empty plot without boundary. Note that we set the aspect ratio (asp
) to be 1 to make sure the circles and polygons are in the right shape. Aspect ratio refers to the ratio between the \(x\) and \(y\) coordinates. In some cases, you may need to specify the aspect ratio to ensure the plots are demonstrated accurately. For example, the circle in the above exercise should be tangent to \(x=2\). An incorrect aspect ratio will twist this fact and may result in confusions.
# Answer
plot(NULL, type="n", xlim=c(0,10), ylim=c(0,10), xlab="X Label",
ylab="Y Label", bty="n", asp=1)
title(main="Practice Plot", col.main="brown", cex.main=1.2) # add title
abline(0, 1, col="red", lwd=2) # add line
abline(v=8, lty=2) # add line
abline(h=2, lty=2) # add line
polygon(c(1,1,3,3),c(6,8,6,8),col="blue") # generate a polygon
polygon(c(6,8,6,8),c(10,10,8,8),col="orange", border = NA) # generate a polygon
arrows(8,2,6,2, col="green4", lwd=3, length=0.1) # add an arrow
symbols(9,1, circles=1, add=T, inch=F) # add a circle
points(c(2,6), c(4,4), pch=c(1,0)) # add points
text(c(2,6), c(4,4), label=c("(2,6)","(4,4)"), pos=c(1,3)) # add texts
box(lty=2, lwd=2, col="violet") # add a box
8.3 High Level Functions
In this section, we discuss plots most commonly used in financial data analysis, generated by high level functions. We mainly use Return.csv
generated in Chapter 7.
8.3.1 Return Distributions
In many financial theories, returns are usually assumed to be normally distributed. In reality, stock returns are usually heavy-tailed and skewed. To visualize the distribution of stock returns, we can use histogram. hist
command in base R plot allow us to create histograms easily. Here, we plot the return distribution of Apple, from 2003 to 2022.
=read.csv("data/Return.csv", header=T)
stkrthist(stkrt$AAPL, freq=F, main="Return Histogram - AAPL", xlab="Return",
col="deepskyblue3", breaks=30) # histogram
=seq(-0.2,0.15,0.001)
xlines(x,dnorm(x, mean(stkrt$AAPL), sd(stkrt$AAPL)), lty=2, col="red", lwd=2) # normal density line
freq=F
displays histograms with density instead of frequency. This is more convenient if we want to compare the empirical distributions with the normal density. main="plot_name"
sets the title of the plot and breaks=30
sets the number of bins. To add a normal density line, use lines
command. The mean and standard deviation are set to match the sample mean and sample volatility. From the plot, we see that although the empirical distribution has a bell-shape, the returns have fat tails and are skewed. Returns around mean are more frequent than normal distribution, so Apple stock returns are not normally distributed. More rigorous ways of checking normality include Jarque-Bera test, Kolmogorov-Smirnov test, etc. Q-Q plot, discussed later in this book, is also a good way to visually check whether the normality assumption is violated.
Another way to see the return distributions is the box plot. The upper and lower edges of the box are the 75th and 25th percentile, and the thick line in the middle represents the median. All points beyond the boundary of whiskers are classified as outliers.
boxplot(stkrt$AAPL, main="Box Plot - AAPL", col="deepskyblue3", ylab="Return")
We see a significant numbers of outlines in AAPL stock returns. This indicates extremely high/low returns are not rare. If you wish to make a horizontal box plot, add horizontal=T
.
8.3.2 Time Series Plots
Time series is important for financial data analysis. A more comprehensive discussion for time series is in Chapter 8. For now, we will briefly discuss how to create time series plots. R itself has a time series class ts
.
ts(data, start, end, frequency, ...) # convert a numeric vector to a time series object
plot(ts) # generate a time series plot
ts
command converts a numeric vector into a time series object. Users can specify the time of the start and the end, and frequency
stands for number of observations in one period of time. Apart from the time series data, date (time) is also an attribute of ts
objects. For this reason, there is no need to include a date vector when plotting ts
objects.
However, the property of ts
can be undesirable in some cases, especially when we are dealing with daily data with small sample sizes. As stocks are not traded on weekends and holidays, missing values can be very problematic when R interpolates missing values. Time series plots can look bizarre as a result of interpolation, and you may wish to treat dates as discrete values depending on specific datasets. Therefore, we sometimes prefer to use plot(date_object, data)
instead of plot(ts_object)
.
We illustrate this by using the stock returns of Apple from 2003 to 2022 using Return.csv
. To plot the time series we will need to convert the date column in the dataset into a date class object. This can be completed using the as.Date()
function. Note that as.Date()
converts between character objects to date objects. As the dates in the dataset are numeric values, we need to convert numeric objects into character objects first. This can also be achieved by lubridate
package (See Chapter 8).
$date=as.Date(as.character(stkrt$date), format = "%Y%m%d")
stkrtplot(stkrt$date, stkrt$AAPL, main="Daily Return - AAPL", xlab="Date", ylab="Return", type="l")
abline(h=0, lty=2, col="deepskyblue3")
=which(stkrt$AAPL==min(stkrt$AAPL)) # find the index which has the lowest return
idxmin=which(stkrt$AAPL==max(stkrt$AAPL)) # find the index which has the highest return
idxmaxpoints(c(stkrt$date[idxmax], stkrt$date[idxmin]), c(stkrt$AAPL[idxmax], stkrt$AAPL[idxmin]),
col=c("red", "blue"), pch=16)
which
returns the index of the value which satisfies the given condition. With this we can easily locate the maximum and minimum of the given time series. We can see the most extreme values both occurred in 2008 during the Great Financial Crisis. If you wish to combine two sub-plots into one plot for a side-to-side comparison, use par(mfrow=c(# of rows, # of columns))
. For example, to compare the stock returns of apple and GE, we use
par(mfrow=c(1,2)) # layout of sub-plots
plot(stkrt$date, stkrt$AAPL, main="Daily Return - AAPL", xlab="Date", ylab="Return",
type="l", ylim=c(-0.2, 0.2))
plot(stkrt$date, stkrt$GE, main="Daily Return - GE", xlab="Date", ylab="Return",
type="l", ylim=c(-0.2, 0.2))
From the plot, it is evident that GE returns are more cyclical than Apple over the entire horizon.
8.3.3 3D Plots
Univariate models have their limitations and multivariate extensions can be applied more widely for a portfolio of assets. Base R plots support creating 3D plots, and higher-dimensional data can be visualized with contour plots. Let’s consider the bivarite normal case where \(\mathbf{X}_1, \mathbf{X}_2\) follow a bivariate normal distribution, i.e.,
\[
\mathbf{X}=\begin{pmatrix}\mathbf{X}_1\\\mathbf{X}_2\end{pmatrix}\qquad \mathbf{X}\mathbf\sim \mathcal{N}\begin{pmatrix}\mathbf{\mu}\\\mathbf{\Sigma}\end{pmatrix}
\] with \[
\mathbf{\mu}=\begin{pmatrix}0\\0\end{pmatrix}\qquad
\mathbf{\Sigma}=\begin{pmatrix}1&0.5\\0.5&1\end{pmatrix}
\] To generate random variables that are bivariate normal, it is useful to load the mnormt
package.
library(mnormt)
The built-in contour
and persp
command can produce basic contour plots and 3D surface plots.
=seq(-3, 3, 0.1)
x1=seq(-3, 3, 0.1)
x2=c(0, 0) # mean
mu=matrix(c(1, 0.5, 0.5, 1), nrow = 2) # var-cov matrix
sigma=function(x1, x2){dmnorm(cbind(x1, x2), mu, sigma)} # bivariate function
fx=outer(x1, x2, fx)
ycontour(x1, x2, y, main="Contour Plot", xlab="x1", ylab="x2") # contour plot
persp(x1, x2, y, theta=-30, phi=15,
shade=0.2, col="deepskyblue3", expand=0.5, r=5, ticktype="detailed",
main="Bivariate Normal Surface") # 3D plot
theta
and phi
define angles of the viewing directions. Others arguments such as expand
and r
control how the plots are presented, and it is up to the users to select a set of appropriate values to produce nice plots. As a reminder, the length of \(\mathbf{X}_1, \mathbf{X}_2\) should not be too long when using persp
, otherwise it will take a long time to generate the plot and the plot might appear uniformly black if grids are included.
8.3.4 Other Plots
Base R supports several other plots, such as pie chart (pie(...)
), bar chart (barplot(...)
), scatter plot, etc. The commands are very similar as what we have discussed.
8.4 More Examples
In this section, we attempt to create several more sophisticated plots using the techniques discussed in previous sections. These plots are relevant if you want to highlight something or to make the plots visually appealing.
8.4.1 Normal vs t-distribution
We start by generating a plot comparing normal and t-distributions. t-distributions are heavy-tailed and are convenient to use when we need fat-tailed distributions. The following plot highlights the difference between normal and t-distributions.
=seq(-3,3,0.01)
xplot(x,dnorm(x),type="l", lty=1, lwd=2, main="Normal vs t-distribution",
xlab="x", ylab="Probability") # normal density
for (i in 1:5){
lines(x,dt(x,i), lty=i+1, lwd=2)
# for degrees of freedom 1 to 5, plot the density line
} legend("topleft", c("Normal", "t(1)", "t(2)", "t(3)", "t(4)", "t(5)"),
lty=1:6, lwd=2, cex=0.8, inset=c(0.01, 0.01)) # add legend
=seq(-3, -2, 0.01)
y=seq(2, 3, 0.01)
zlines(y, dnorm(y), lwd=5, col="red")
lines(z, dnorm(z), lwd=5, col="red")
lines(y, dt(y,1), lwd=5, col="blue")
lines(z, dt(z,1), lwd=5, col="blue") # highlight tails
for
loops are significantly more efficient when similar elements need to be added to the same graph. In this example, we plotted the probability density function of t-distributions with degrees of freedom from 1 to 5. This can be achieved by using a for
loop, as the only change involved is the degrees of freedom. The second half of the codes are intended to highlight the heavy tails of \(t(1)\) distribution, using a line width of 5 (lwd=5
).
8.4.2 Volatility Clusters
Volatility tend to come in cycles. Using sp500tr.csv
, we examine the volatility clusters of S&P500 from 2003 to 2022. To visualize this, we make the following plot.
=read.csv("data/sp500tr.csv")
rt=as.data.frame(cbind(rt$date[-1], (rt$Close[-1]-rt$Close[-nrow(rt)])/rt$Close[-nrow(rt)]))
rt.dfcolnames(rt.df)=c("date", "rt")
$date=as.Date(as.character(rt.df$date), format="%Y%m%d")
rt.df=cbind(rt.df,as.numeric(format(rt.df$date,'%Y')))
rt.dfcolnames(rt.df)=c("date", "rt", "year")
=aggregate(rt.df$rt, list(rt.df$year), FUN=sd) # compute the standard deviation each year
sd1=rbind(pmin(sd1[,2], mean(sd1[,2])), pmax(sd1[,2]-mean(sd1[,2]), 0))
sd2barplot(sd2*100, names.arg = sd1[,1], main="Clusters in SP500 Volatility", ylab="Volatility",
col=c("deepskyblue3", "red3"), border=NA) # stacked bar chart
abline(h=0)
abline(h=mean(sd1[,2])*100, lty=3, col="red3")
text(x=16, y=1.2, "mean")
This is a variant of stacked barplot, to highlight the years with above-average volatilities. Note that to add text to a bar chart, \(x\) coordinates are categorical instead of numerical. For example, if you wish to add texts at year 2003, you should use \(x=1\) instead of \(x=2003\).
8.4.3 Empirical Density vs Normal
The empirical cumulative distribution can be generated by ecdf
. Using S&P500 return from 2003 to 2022, we have the following plot.
=(rt$Close[-1]-rt$Close[-nrow(rt)])/rt$Close[-nrow(rt)]*100
rt.newplot(ecdf(rt.new), xlim=c(-5,5), col="deepskyblue3", lwd=3,
main="Empirical Density vs Normal", xlab= "Outcomes",
ylab="Cumulative Probability") # empirical distribution plot
curve(pnorm(x, mean(rt.new), sd(rt.new)), from = -5, to=5,
add = TRUE, col='red', lwd = 2) # add normal curve
abline(v=0.1, col="orange", lwd=2)
abline(v=-2.15, col="orange", lwd=2)
abline(v=2.6, col="orange", lwd=2)
text(-4, 0.1, "data higher\n than normal", cex=0.8)
text(1.3, 0.1, "data\n higher", cex=0.8)
text(4, 0.1, "data lower\n than normal", cex=0.8)
legend("topleft", legend=c("Returns", "Normal"), lwd=2, col=c("deepskyblue3","red"),
cex=0.8, bty="n", inset=c(0.01,0.02)) # add legend
polygon(c(-2.15,0.1,0.1,-2.15), c(par("usr")[3], par("usr")[3], par("usr")[4], par("usr")[4]),
col=rgb(230, 180, 193, max = 255, alpha = 110), border=NA) # shade a region
text(-0.5, 0.1, "data\n lower", cex=0.8)
curve()
draws a function specified by the user. If you wish to shade a specific region in the plot, use polygon
. This usually requires setting the colour to be transparent to make the lower layers visible, which can be customised by the rgb
command.
8.4.4 VaR and Expected Shortfall
You may wish to shade the region under a curve. This again can be achieved by polygon
function. To illustrate this, we compare VaR and Expected Shortfall in a single plot.
=seq(-3, 3, 0.001)
x=seq(-3, -1.62, 0.001)
yplot(x, dnorm(x), type="l", lwd=2, xlab="", ylab="",
main="VaR vs Expected Shortfall", ylim=c(0,1.1))
lines(y, dnorm(y)*10, lwd=2)
=x[(x>=-3)&(x<=-1.645)] # select the tail region
x1=dnorm(x1)
y1=c(-3, x1, -1.645) # add start and end points
x1=c(0, y1, 0) # add start and end points
y1=y1*10
y2polygon(x1, y2, col="orange", border=NA)
polygon(x1, y1, col="brown", border=NA)
abline(v=-1.645, lwd=2, lty=2, col="black")
text(-0.5, 0.8, "Scale tail area to 1")
For the polygon
command to work we will have to add the first point and the last point to ensure the entire area is enclosed. This demonstrates how Expected Shortfall is related to VaR: ES is the expected loss conditional on VaR being violated, i.e., the orange region has an area of 1 in the plot.
8.5 Exporting Plots
After the plots are generated in R, you may wish to export and include them in your academic paper or slides. It is never a good idea to include screenshots in your writings, as they look unprofessional and are usually blurred. Instead, it is recommended to export plots from R by choosing appropriate image formats, such as PNG, JPEG, TIFF, SVG, EPS, PDF, etc.
In RStudio, it is possible to export the graphs from the bottom right preview panel. Click the Export
icon, and you will be able to select through a range of image formats R supports, set directory, change file name, and specify the image size in the pop-up window. Alternatively, you can save plots with the following codes:
# save a plot in PDF
pdf("plot_name.pdf") # specify the plot name and format
plot(...) # generate your plot here
dev.off() # close the plot device
Different image formats have different characteristics. PDF, or Portable Document Format, is most widely used in academic settings. PDF plots are in high resolution and can be zoomed in without loss of quality. SVG (Scalable Vector Graphics) and EPS (Encapsulated PostScript) are also vector-based, and are good alternatives to PDF. However, more commonly seen image formats such as PNG or JPEG are not scalable – PNG and JPEG plots are likely to be blurred and lose details once scaled up.
The selection of image formats depends on which word processor/text editor you are using. If you are using Word or PowerPoint, SVG is the recommended format. Importing SVG plots into Word and PowerPoint is simple: export the plot from R, and then in Word/PowerPoint, select Insert > Pictures > This Device
. Navigate to the directory where the plot is saved, and then choose the one you want to insert. You can easily scale plots in Word/PowerPoint to proper sizes – SVG graphs are infinitely expandable without losing any resolution. They are also very suitable for webpages.
If you are writing with \(\LaTeX\), PDF and EPS are preferred because \(\LaTeX\) back-end does not support SVG directly. Plots can be easily imported using \includegraphics{filename}
command with graphicx
package loaded in the preamble. Although it is not impossible to include SVG plots in \(\LaTeX\), there are prerequisites for them to work and can sometimes cause errors.
8.6 ggplot2
and plotly
ggplot2
and plotly
are powerful data visualization packages in R. They can significantly improve the aesthetics and quality of plots. plotly
is specially intended for interactive graphs.
Compared with base R plots, ggplot
commands are different. ggplot
starts by calling ggplot(data, aes(...))
, where data
represents the dataset and aes()
captures the variables. ggplot
then draws the graph layer by layer, with a plus sign at the end of each line indicating a new layer is added. For example, geom_area(fill=..., alpha=...)
shades the area under curve, where alpha
is the degree of transparency. By default, ggplots
have a grey background and white grids. This can be customized, with several themes available to choose from. Below is a time series plot of the S&P500 index in the past 20 years.
library(ggplot2) # load the package first
$date=as.Date(as.character(rt$date),format = '%Y%m%d') # convert into Date class
rt
=ggplot(rt, aes(x=date, y=Close)) +
pgeom_area(fill="deepskyblue3", alpha=0.4) + # shade the area under the curve
geom_line(color="deepskyblue3", size=0.8) + # set the colour and size of the line
theme_bw() + # set theme of the graph
xlab("Date") + # x-axis label
ylab("Index") + # y-axis label
ggtitle("S&P500 Index (2003-2022)") # plot title
p
ggplot2
can be used jointly with plotly
. An interactive plot can be easily generated based on ggplot objects.
library(plotly) # load the package first
ggplotly(p)
This plot allows you to interactively visualize the data. If you place the cursor on a data point, it will show explicitly the date and the index of each point.
The above plots are simple illustrations of how ggplot2
and plotly
work. There are infinitely many good-looking plots you can generate with these packages. For more examples with reproducible codes, visit R Graph Gallary
, a collection of hundreds of beautiful plots created with R.
8.7 dygraphs
library(dygraphs)
<- cbind(mdeaths, fdeaths)
lungDeaths dygraph(lungDeaths) %>%
dySeries("mdeaths", label = "Male") %>%
dySeries("fdeaths", label = "Female") %>%
dyOptions(stackedGraph = TRUE) %>%
dyRangeSelector(height = 20)