8 File formats
There are a large number of ways to store data, some better than others. Choosing appropriate file formats is a practical consideration in financial data analysis that affects data integrity, workflow efficiency and collaboration. Financial data often requires high precision, regular updates and sharing between different systems and analysts. The format you select influences how reliably your data survives these processes.
Different file formats offer distinct trade-offs between accessibility, precision, file size and compatibility with various software tools. Some formats excel at preserving numerical precision but create barriers for collaboration, whilst others prioritise universal accessibility at the cost of data integrity. Understanding these trade-offs helps you make informed decisions about data storage and exchange.
This chapter examines the main file formats used in financial analysis, discussing their strengths and limitations in the context of risk forecasting work. We focus on practical considerations rather than technical specifications, helping you select formats that best serve your analytical needs.
8.1 Excel
The most common file format is Excel spreadsheets. However, while Excel can be useful for manipulating data, we do not recommend storing the types of data we use here as an Excel spreadsheet. The reason is that Excel often changes/mangles data without telling us. It might erroneously interpret something as a date and then automatically convert a number into a date. It can lose significant digits or transform data in other unexpected ways. Excel can have problems when the decimal symbol is a comma (,) instead of a period (.), or vice versa. Consequently, for the type of work we do here, we do not recommend storing data in Excel spreadsheets.
We know someone whose PhD was delayed by a year because he used Excel for his data, did not keep a backup of the original data he had collected by hand, and then Excel mangled the data to the point that it became useless.
8.1.1 Writing to Excel files
This does not mean Excel files do not have their place in our workflow. Often, we want to do subsequent analysis on our results in Excel or send the results to someone who prefers Excel. Fortunately, R and the other three languages discussed can easily write Excel files, and many reports start their lives as R code that writes Excel files with summary results that subsequently get further processed.
8.1.2 Reading from Excel files
Many vendors and collaborators distribute data in Excel files, and in those cases, it is usually best to have R read the data directly from those files instead of converting them first.
8.2 CSV
The best format for data is generally CSV. It is a simple text file that can be edited in any text editor. Any software package, including Excel, can read it, so it is universally accessible. If there is a problem with the data, it is easy to inspect it in a text editor since CSV files are pure text files and can be directly opened in any text editor. For example, below, you see the beginning of one of the files we will work with.
ticker,date,Close,Adjusted_close
MCD,20030102,16.55,9.6942
MCD,20030103,16.12,9.4423
The first row contains the name of the columns, and then we have one row per day.
CSV files have some disadvantages. They are uncompressed and, therefore, can be quite large. Furthermore, by default, a lot of software packages truncate significant digits from floating point numbers before saving (both R and Python’s Pandas do that, but Julia does not) and one may require an extra option to include all digits. That can cause problems when one needs to see if numbers change.
8.2.1 Things to look out for
It is important to check the documentation of the functions we are using to make sure we are doing things right. For example, we need to be careful if:
- Does the CSV file have headers or not? A header is the title of a column. We can have a CSV file where the first row includes the column titles or a CSV with no titles;
- The separator is a comma (,), a semicolon (;), a tab ( or another symbol;
- We are getting data from a country where the decimal symbol is a comma (,) instead of a period (.).
By checking the documentation of a function, we will see how to deal with each of these specific cases and more. To do so, run ?read.csv
8.2.2 Compressed CSV files
If your CSV files are large, perhaps in the hundreds of Mb. or more, it can become difficult to manage them. We have one project where the total amount of space taken by the CSV files is over 3 TB. In those cases, it is much better to compress the files. R can easily read and write compressed CSV files, and the file size usually shrinks by well over 90%. This means it is easier to move the files around, take backups, send them to colleagues and even better, reading and writing them is much faster.
However, then you cannot use the built-in read.csv and write.csv, and instead need to use either data.table or tidy, as discussed in Section Section 7.4.
8.3 Parquet
The best alternative file format is the parquet. It compresses the data, and preserves the full floating point precision. It further has the advantage that it is easy to exchange parquet files with other software packages. The disadvantage is that because it comes in a common binary form, it is hard to inspect the files and edit them.
8.4 JSON
One of the most popular file formats is JSON. You can see an example on the EOD website.
[
{
"ticker": "MCD",
"date": "2024-04-08",
"open": 267.24,
"high": 268.69,
"low": 265.87,
"close": 267.56,
"adjusted_close": 267.56,
"volume": 3753200
},
{
"ticker": "MCD",
"date": "2024-04-09",
"open": 266.22,
"high": 269.67,
"low": 265.5,
"close": 269.44,
"adjusted_close": 269.44,
"volume": 3934900
}
]
The hard brackets [] indicate the beginning and end of the data, the curly brackets {} beginning and end of each record, and then each observation has a label. The detailed structure makes JSON a particularly safe way to communicate between different systems, and a lot of data vendors deliver data by default in the JSON format. There are better choices for simply storing data.
8.5 Binary formats — RData
All four languages we have discussed here have their own binary data format. In the case of R, RData. Saving data in such a binary format has many advantages. It allows us to save even complicated objects and read them in exactly the same way later, something usually not possible with other formats. The binary files are compressed and hence take up little space on the disk, and they load and save very quickly. The downside is that it may be difficult to share such files between different programming languages.
In Chapter 9, you will see how we load CSV data and save it as RData.
8.6 DuckDB
https://duckdb.org is an in-process analytical SQL database management system designed for lightweight, fast, and efficient data analysis. DuckDB is optimized for OLAP (Online Analytical Processing) workloads, making it useful for data science tasks.
It supports complex SQL queries, window functions, and aggregation while enabling integration with Python, R, and Julia.
One of DuckDB’s major strengths is its ability to work in-memory without the need for a separate database server, leading to minimal setup and very fast query execution on local datasets. Its support for reading directly from Parquet, CSV, and other columnar storage formats makes it useful for data workflows.
We generally keep all our data in DuckDB. We use DuckDB and parquet files for the data we make for extremerisk.org since we use R, Python and Julia for that site.