5 File formats
There is a large number of ways to store data, some better than others.
5.1 Excel
The most common file format is Excel spreadsheets. However, while Excel can be useful for manipulating data, we do not recommend storing the types of data we use here as an Excel spreadsheet. The reason is that Excel often changes/mangles data without telling us. It might erroneously interpret something as a date and then automatically convert a number into a date. It can lose significant digits or transform data in other unexpected ways. Excel can have problems when the decimal symbol is a comma (,) instead of a period (.), or vice versa. Consequently, for the type of work we do here, we do not recommend storing data in Excel spreadsheets.
We know someone whose PhD was delayed by a year because he used Excel for his data, didn’t keep a backup of the original data he had collected by hand, and then Excel mangled the data to the point that it became useless.
5.1.1 Writing to Excel files
This does not mean Excel files don’t have their place in our workflow. Often, we want to do subsequent analysis on our results in Excel or send the results to someone who prefers Excel. Fortunately, R and the other three languages discussed can easily write Excel files, and many reports start their lives as R code that writes Excel files with summary results that subsequently get further processed.
5.1.2 Reading from Excel files
Many vendors and collaborators distribute data in Excel files, and in those cases, it is usually best to have R read the data directly from those files instead of converting them first.
5.2 CSV
The best format for data is generally CSV. It is a simple text file that can be edited in any text editor. Any software package, including Excel, can be read, so it is universally accessible. If there is a problem with the data, it is easy to inspect it in a text editor since CSV files are pure text files and can be directly opened in any text editor. For example, below, you see the beginning of one of the files we will work with.
ticker,date,Close,Adjusted_close
MCD,20030102,16.55,9.6942
MCD,20030103,16.12,9.4423
The first row contains the name of the columns, and then we have one row per day.
CSV files have some disadvantages. They are uncompressed and, therefore, can be quite large. Furthermore, by default, a lot of software packages by default truncate significant digits from floating point numbers before saving (both R and Python’s Pandas do that, but Julia does not) and one may require an extra option to include all digits. That can cause problems when one needs to see if numbers change.
5.2.1 Things to look out for
It is important to check the documentation of the functions we are using to make sure we are doing things right. For example, we need to be careful if:
- Does the CSV file have headers or not? A header is the title of a column. We can have a CSV file where the first row includes the column titles or a CSV with no titles;
- The separator is a comma (,), a semicolon (;), a tab ( or another symbol;
- We are getting data from a country where the decimal symbol is a comma (,) instead of a period (.).
By checking the documentation of a function, we will see how to deal with each of these specific cases and more. To do so, run ?read.csv
5.2.2 Compressed CSV files
If your CSV files are large, perhaps in the hundreds of Mb. or more, it can become difficult to manage them. We have one project were the total amount of space taken by the CSV files is over 3 TB. In those cases, it is much better to compress the files. R can easily read and write compressed CSV files, and the file size usually shrinks by well over 90%. This means it is easier to move the files around, take backups, send them to colleagues and even better, reading the writing them is much faster.
However, then you cannot use the build in read.csv
and write.csv
, and instead need to use either data.table or tidy, as discussed in Section Section 4.6.5.
5.3 Parquet
The best alternative file format is the parquet. It compresses the data, and preserves the full floating point precision. It further has the advantage that it is easy to exchange parquet files with other software packages. The disadvantage is that because it comes in a common binary form, it is hard to inspect the files and edit them. We use parquet files for the data we make for extremerisk.org since we use R, Python and Julia for that site.
5.4 JSON
One of the most popular file formats is JSON. You can see an example on the EOD website, EOD.
[
{
"ticker": "MCD",
"date": "2024-04-08",
"open": 267.24,
"high": 268.69,
"low": 265.87,
"close": 267.56,
"adjusted_close": 267.56,
"volume": 3753200
},
{
"ticker": "MCD",
"date": "2024-04-09",
"open": 266.22,
"high": 269.67,
"low": 265.5,
"close": 269.44,
"adjusted_close": 269.44,
"volume": 3934900
}
]
The hard brackets []
indicate the beginning and end of the data, the curly brackets {}
beginning and end of each record, and then each observation has a label. The detailed structure makes JSON a particularly safe way to communicate between different systems, and a lot of data vendors deliver data by default in the JSON format. There are better choices for simply storing data.
5.5 Binary vendor formats — RData
All four languages we have discussed here have their own binary data format. In the case of R, RData. Saving data in such a binary format has many advantages. It allows us to save even complicated objects and read them in exactly the same way later, something usually not possible with other formats. The binary files are compressed and hence take up little space on the disk, and they load and save very quickly. The downside is that it may be difficult to share such files between different programming languages.
In Chapter 6, you will see how we load CSV data and save it as RData.