12  Best practice

There are many ways to approach the type of work we are doing here, and some are more efficient than others. For the very smallest projects, it doesn’t matter very much how one goes about the work beyond the actual programming, but for anything larger, an efficient workflow becomes increasingly crucial in minimising errors and maximising productivity. As a general rule, the larger the project, the greater the relative amount of time that should be allocated to managing the workflow if one wants to be efficient.

12.1 Libraries

source("common/functions.r",chdir=TRUE)

12.2 Avoid intermediate files

In most projects, it is tempting to write intermediate results to files with a view to subsequently reading them back in for further work. Unless there is a very good reason to, this should be avoided. The best workflow reads in the raw data and writes the report in a single run, without creating any files along the way besides the report itself. In the very best workflow, you run a single function with a name like RunProject(), perhaps with options like RunProject(Data=G20,Model=LeverageVol,Report=PDFSlides), in a file called something like RunProject.r that has three lines: one to load all libraries, one to load all your code and one with the run command.
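
A minimal sketch of such a RunProject.r; the file names are illustrative and RunProject() is assumed to be defined in your own code:

# RunProject.r -- the whole project in three lines
source("Code/libraries.r")     # load all libraries
source("Code/functions.r")     # load all your code (hypothetical file name)
RunProject(Data = "G20", Model = "LeverageVol", Report = "PDFSlides")  # run everything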

12.3 Use functions and legos

If your program is very short, just a few lines, then it’s most efficient to structure the code as lines to be executed. If, however, your program is more involved, as most projects in financial risk forecasting are, it is better to use functions to structure the code. Functions allow you to separate the different parts of the execution, so you can ask the code to run something without thinking about what’s behind it. That lets you focus your attention on the task at hand and gives you a much better overview of what you’re doing.

It can be helpful to think of code as Lego blocks, where each type of execution is kept in its own function, and a group of related functions is kept in a separate source file.

12.4 Workflow

In the type of work we are doing here, six steps make up the workflow:

  1. Download data;
  2. Process (clean) data;
  3. Identify the models to be applied to the data;
  4. Program up the code to load the data and run the models;
  5. Create output;
  6. Make a report.

Figure: the workflow (Download data → Process data → Modelling approach → Write programmes → Make output → Make report).

12.4.1 Downloading data

The most common source of data is outside data vendors, such as EOD, WRDS, Bloomberg or DBnomics (see Section 2.3). These vendors usually distribute data as CSV or JSON files. In some cases, we can, or need to, download the data manually on the command line or in a browser.

However, it is usually better to download it automatically from within a software package via an application programming interface (API), a way for two or more computer programs to communicate with each other; see the EOD implementation in Section 2.3.1.1. That minimises errors, because we don’t have to download the data manually, and allows for automatic updates as needed. EOD, Bloomberg and DBnomics have an API interface, while WRDS requires a separate download.
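
For illustration only, a minimal sketch of an API download; the URL, token and returned columns are hypothetical placeholders, not a real vendor API:

library(jsonlite)
# Hypothetical endpoint returning daily prices as JSON
url = "https://example-vendor.com/api/eod/SP500?api_token=YOUR_TOKEN&fmt=json"
raw = fromJSON(url)   # fromJSON() can read directly from a URL
head(raw)             # inspect the first few rows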

Given the need to ensure that accurate data is used, it is usually, though not always, best to download data separately from the rest of the workflow. There are a few reasons why:

  1. It limits API calls. Many data providers place limits on the amount or speed of data downloads;
  2. It minimises the time spent importing data. This is especially true for large datasets, since reading a local CSV file is relatively quick compared to downloading it again;
  3. The vendor may be offline;
  4. Your network access may be having difficulties;
  5. It allows checking for data errors and omissions. It is useful to have a pre-analysis check of data quality using plots and tables.

The main exception is if we are doing real-time processing.

If using an API, it is usually best to save the data in the native binary format of the programming language used, like RData for R; see Section 5.5. Alternatively, it can be saved as a CSV or Parquet file, which is especially useful if the data needs to be read by several different applications.
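
A minimal sketch of saving the data, assuming raw is the data frame returned by the download step and, for the Parquet option, that the arrow package is installed:

save(raw, file = "RawData/sp500.RData")                 # native R binary format
write.csv(raw, "RawData/sp500.csv", row.names = FALSE)  # readable by other applications
# arrow::write_parquet(raw, "RawData/sp500.parquet")    # Parquet, if the arrow package is available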

12.4.1.1 Structuring code

It is generally best to have a separate function, perhaps called GetRawData(), that downloads the data if it comes from an API or reads it in if it is in a local file. If the code is more than a handful of lines, it might be best to keep it in a separate file, perhaps ReadRawData.r.
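
A minimal sketch of GetRawData(), assuming the data has already been saved to RawData/sp500.csv (the path is illustrative):

GetRawData = function(filename = "RawData/sp500.csv") {
  # Read the locally saved raw data; replace with an API call if needed
  raw = read.csv(filename)
  return(raw)
}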

12.4.2 Process data

The data you download usually needs to be transformed before it is useful for analysis. Chapter 6 discusses the steps involved in transforming the sample data used here into a useful form.

12.4.2.1 Don’t transform the data with Excel

It might be tempting to use Excel to transform the downloaded data into a form useful for analysis. You should almost never do that, however tempting it may be. Okay, there are exceptions, but they are rare. The reasons are:

  1. Excel likes to transform data without telling you. It might think a number is a telephone number or a date, and when you save the file, the numbers have changed. There are many other examples of how Excel can mangle data;
  2. It is difficult to repeat. If you need to repeat the transformation, which is often the case, perhaps because the data has been updated, it is often impossible to know what actually was done;
  3. Every time you update your data, you have to repeat the Excel manipulation;
  4. Manipulating data in Excel is not transparent. You might not know what you did a few months back, and a collaborator might never know what you did in Excel;
  5. By contrast, data manipulation in R is transparent and repeatable. You know exactly how the data is transformed, and you can repeat the analysis every time you update your data.

12.4.2.2 Hash

It can be useful to download data from a vendor when the code is run and compare it to data already saved. One way to do this is to make a hash, a digital fingerprint, of the old and new data frames using the digest() function from the library of the same name.
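
A minimal sketch, assuming OldData holds the previously saved data frame and NewData the fresh download:

library(digest)
OldHash = digest(OldData)   # fingerprint of the saved data
NewHash = digest(NewData)   # fingerprint of the new download
if (OldHash != NewHash) cat("The data has changed since it was last saved\n")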

12.4.2.3 Structuring code

It is generally best to have a separate function, perhaps called Data=ProcessRawData(), that calls GetRawData(). If the code is more than a handful of lines, it might be best to keep it in a separate file, perhaps ProcessRawData.r.
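
A sketch of such a function, assuming GetRawData() returns a data frame with an adjusted_close column (the column name is an assumption):

ProcessRawData = function() {
  raw    = GetRawData()
  Price  = raw$adjusted_close       # assumed column name
  Return = diff(log(Price))         # log returns
  Data   = list(Price = Price, Return = Return)
  return(Data)
}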

Every rule has an exception. Sometimes, processing can take a very long time, because the raw files are very large or require a lot of computation. In such cases, by all means add a separate step where you load the raw data, process it, and save the processed data to its own file.

12.4.3 Identify the models

The most important step is to decide which models to apply to the data. While a discussion of model choice is outside the scope of these notes, we refer you to Financial Risk Forecasting and its slides.

12.4.4 Program

The code that runs the models will likely be by far the longest, and you will probably be changing it all the time. Generally, it is best to keep the various parts of the code separate and use functions to call the actual calculations you need. You will be experimenting with different ways of programming and different model specifications and need to be able to keep track of what changes have been made. Perhaps the biggest temptation will be to have either very long files with a large number of lines or multiple files with names like code-1.r or code-monday.r. While some of this may be unavoidable, it is generally best to use descriptive filenames and even version management, as discussed below in Section 12.7.

12.4.4.1 Structuring code

It is generally best to have a separate function, perhaps called Results=RunModels(options), that calls Data=ProcessRawData(). If the code is more than a handful of lines, it might be best to keep it in a separate file, perhaps RunModels.r.
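
A sketch of what RunModels() might look like, with the actual model fitting left as a placeholder comment; the structure and the returned list, not the fitting code, are the point:

RunModels = function(Data, Model = "GARCH11") {
  if (Model == "GARCH11") {
    Fit = NULL  # placeholder: fit a GARCH(1,1) here, e.g. with the rugarch package
  } else {
    stop("Unknown model: ", Model)
  }
  Results = list(Data = Data, Model = Model, Fit = Fit)
  return(Results)
}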

12.4.5 Create output

Once you have done the analysis, you need to create output, like tables and figures. The best way to do that is to have a separate file, perhaps called MakeOutput.r, that has one or more functions to make output, like MakeImportantTable(). It is best to have one function per table or figure. The reason is that you may need to make different types of output; perhaps you want to save images as jpeg and pdf and also automatically make a Word file with them.
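
A sketch of one such function, saving the same figure as both pdf and png; Results$Data$Return, the function name and the file path are illustrative assumptions:

MakeImportantFigure = function(Results) {
  for (Device in c("pdf", "png")) {
    filename = paste0("Output/sp500-returns.", Device)
    if (Device == "pdf") pdf(filename) else png(filename)
    plot(Results$Data$Return, type = "l", main = "S&P 500 returns")
    dev.off()
  }
}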

Under (almost) no circumstances should one use a screen grab to get figures.

12.4.5.1 Structuring code

It is generally best to have a separate function for the output, perhaps called MakeOutput(Results), that takes the results from Results=RunModels(). If the code is more than a handful of lines, it might be best to keep it in a separate file, perhaps MakeOutput.r.

12.4.6 Make a report

See Chapter 11.

12.5 Structuring projects

For the simplest projects, you need only one R file, which contains all the code to process the data, run the models and make the output. Even then, keep all the code in functions.

For projects that are not super simple, it is generally best to use multiple files for the code: one for data processing, one to run the models, and one for the reports. Often, it is also useful to have a separate file with common functions that are loaded into different bits of code.

For larger projects, keep the code in multiple directories with a number of input files.

We have created an example project that runs a small set of the analyses presented in this notebook. It is kept on GitHub; see Section 12.7 for details. You can find it at https://github.com/Jon-Danielsson/Financial-Risk-Forecasting-Example-Project.

You can see the directory structure below. The S&P 500 index is in RawData, the code to do all the analysis is in Code, the Quarto files to generate the HTML, PDF, Word and PowerPoint reports are in Output-code, and finally, you find an example of the output in Output.

├── Code
│   ├── FitModels.r
│   ├── ProcessRawData.r
│   ├── RunModels.r
│   └── libraries.r
├── Output
│   ├── pdf-presentation.pdf
│   ├── pdf-report.pdf
│   ├── powerpoint-presentation.pptx
│   ├── sp500-returns-sd.pdf
│   ├── sp500-returns-sd.png
│   ├── sp500-returns-sd.svg
│   └── word-report.docx
├── Output-code
│   ├── pdf-presentation.qmd
│   ├── pdf-report.qmd
│   ├── powerpoint-presentation.qmd
│   ├── presentation.qmd
│   ├── report.qmd
│   └── word-report.qmd
└── RawData
    └── sp500.csv

12.6 Use list()

A common problem is that we often have many variants of input data, run parameters, and output. This means that there are many variables to keep track of, and when calling functions with many arguments, as often is the case, it is easy to lose track of what is where.

Generally, for all but the simplest projects, it is best to use lists to keep track of what we are using. For example,

data=ProcessRawData()
names(data)
[1] "Return"          "Price"           "UnAdjustedPrice" "sp500"          
[5] "sp500tr"         "Ticker"         

Here, data contains all the various types of data we need. If we get new data, we include it in ProcessRawData(), and then it is available everywhere we call that function.
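
Individual elements are then accessed by name, and str() gives a quick overview of the whole list; a minimal sketch, assuming data is the list shown above:

head(data$Return)          # one element of the list, accessed by name
str(data, max.level = 1)   # a one-level overview of everything in the list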

Similarly, if we have many input parameters, it is sensible to put them into a list:

Parameters       = list()
Parameters$Model ="GARCH11"
Parameters$WE    = 1000
Parameters$p     = 0.01
Parameters$value = 1000

And then, when we run a model, we might get

 Data   = ProcessRawData()
 Fit    = RunModels(Data=Data,Model="GARCH11")
 VaR    = RunVaR(Fit=Fit)
 Report = MakeVaRReport(VaR=VaR)

Then each object bundles its inputs and outputs:

 Report = {VaR,SummaryTable,Details}
 VaR    = {VaR,Fit}
 Fit    = {Data,Model,Fit}
 Data   = {sp500,stocks}

So, the object Report contains all the relevant information in one place.

12.7 Managing code versions — git

One problem that usually emerges is how to keep track of different versions of the code. As we keep editing, we experiment with different code, and we might want to keep the old versions of the code just in case what we are trying doesn’t work out.

In practice, what is likely to happen is that we create files like version-1.r and version-2.r, and so on. The problem is that this quickly becomes unwieldy. We would have to show discipline in naming the files, and it may not be easy to figure out what is what.

The way this is managed in most serious projects, and certainly in professional environments, is to use what is known as a version control system, specialised software designed to manage changes in files. With such a system, we don’t explicitly keep old versions of files with names indicating when they were created; instead, the system keeps track of all changes to all files. If you have ever used track changes in Word, you have used such a system.

The most popular version control system is git. You can run git on your own computer, either directly, within RStudio or with a program designed to manage git, and you can do all three at the same time. For example, in RStudio, File -> New project -> New directory -> New project presents you with an option to create a git repository. We use a program called GitKraken for this purpose.

In most cases, you can just run git on your own computer. However, for more complicated projects, it might be sensible to keep the project on a separate system, such as GitHub, which is what we use. You can download our example project from https://github.com/Jon-Danielsson/Financial-Risk-Forecasting-Example-Project. If you want to, you can contribute to it with what is known as a pull request. This means that when faced with a public repository, like https://github.com/Jon-Danielsson/Financial-Risk-Forecasting-Example-Project, you submit proposed code changes, and the administrator of the project can either approve them and merge them into the project or reject them.

However, it is important to recognise that managing versions is complicated, and it might only be sensible to invest in learning git if one intends to do a fair amount of programming.

In more professional environments, most companies and teams will use git or similar software, and you will be expected to know how to use it.

12.8 Reproducible environments — renv

One problem that can emerge for long-duration projects is what happens when the libraries and the R version you are using are updated. This happens quite frequently and usually without any adverse consequences. Sometimes, however, new versions of libraries are incompatible with other libraries or change the way they do calculations, resulting in different results.

This can be particularly problematic when one needs to be able to run some code in the future, even several years into the future, which is what happens with academic papers that are submitted to journals and frequently in professional environments.

There are several ways to deal with this. One is to run the code in a virtual machine or container; perhaps the most popular way to do that is Docker. While powerful, it can be challenging.

A simpler way is to use the R package renv, which stands for reproducible environments. This package allows you to record the particular versions of R and the libraries a project uses. Then, if you run the code in the future, you’ll always use the same versions.
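
The core renv workflow takes only a few calls; a minimal sketch:

install.packages("renv")
renv::init()       # set up a project-local library and a lockfile (renv.lock)
renv::snapshot()   # record the exact package versions currently in use
renv::restore()    # later, or on another machine: reinstall exactly those versions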

This adds another layer of complexity, and for most simple projects, there is no reason to do this. However, for complicated projects that might be in use for a long time, reproducible environments, like renv, can be essential.