Raw Data
Plain Text Data in Open Research
Research must expose its raw data in formats that are durable,
transparent, and machine-readable. Proprietary spreadsheet files
such as Excel (.xlsx) are not suitable as canonical
research data formats.
Why Not Excel?
Spreadsheet software is useful for a quick look at raw data, but Excel files are organised the same way as Microsoft Word files: each is a zip archive of XML, styling, metadata, formulas, and interface instructions. Excel therefore shares the problems of other heavyweight Microsoft Office formats: it is slow and opaque, it ties the data to proprietary software with a rigid workflow, and its compressed file format cannot be searched or compared with ordinary text tools. It also has several problems specific to storing and analysing raw numerical and text ('string') data:
- Hidden formatting and hidden data mutations
- Automatic type conversion (dates, scientific notation errors)
- Embedded formulas that are difficult to audit
- Proprietary dependencies
- Difficult version comparison
There are good alternatives to a rigid, opaque, proprietary file format. No single format is perfect for storing and analysing raw numerical and string data; each does a different job and has its own advantages and disadvantages. At the very least, a raw data format should be an open standard: publicly specified, independently implementable, and readable with open-source software. A few rules of thumb for what raw data file formats should do:
Data should be stored independently of presentation and interface logic. Raw research data must be editable without proprietary software.
Plain Text Data Formats
Plain text data formats separate data from display. They are human-readable, version-controllable, and interoperable across systems.
1. CSV (Comma-Separated Values)
A simple tabular format where rows are lines and columns are separated by commas.
name,age,mark
Alice,29,88
Bob,31,92
Advantages
- Universally supported, tabular format
- Diff-friendly in version control
- No hidden formatting
Disadvantages
- Awkward for large or complex data, and for text fields that contain commas (these must be quoted)
- Parsing plain text is slow and error-prone for large files
- Some Windows software can misread UTF-8 encoded text files
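As a minimal sketch of how such a file can be read, the following uses Python's standard `csv` module on the example table above. The third row (`"Chen, Li"`) is a hypothetical addition, included only to show that a field containing a comma survives when it is quoted. The function name `read_students` is also an assumption made for illustration.

```python
import csv
import io

# Example table from above, plus one hypothetical row whose name contains a comma.
RAW_CSV = 'name,age,mark\nAlice,29,88\nBob,31,92\n"Chen, Li",27,75\n'

def read_students(text):
    """Parse CSV text into a list of dicts, converting numeric columns explicitly."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        rows.append({
            "name": row["name"],      # kept as text, never reinterpreted
            "age": int(row["age"]),   # conversion is explicit and visible
            "mark": int(row["mark"]),
        })
    return rows

students = read_students(RAW_CSV)
for student in students:
    print(student)
```

Every value arrives as a string and is converted only where the code says so, which is the opposite of a spreadsheet's silent type guessing.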
2. TSV (Tab-Separated Values)
Similar to CSV but uses tab characters as separators. Often safer when textual fields contain commas.
name	age	mark
Alice	29	88
Bob	31	92
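A short sketch of reading tab-separated data with the same standard `csv` module, assuming the columns are separated by real tab characters (written `\t` in the string below). The row for "Chen, Li" is hypothetical and is there only to show that a comma inside a field needs no quoting in TSV.

```python
import csv
import io

# The example table with explicit tab separators, plus one hypothetical row.
RAW_TSV = "name\tage\tmark\nAlice\t29\t88\nBob\t31\t92\nChen, Li\t27\t75\n"

# delimiter="\t" tells the csv module to split on tabs instead of commas.
reader = csv.reader(io.StringIO(RAW_TSV), delimiter="\t")
header, *records = list(reader)

print(header)
for record in records:
    print(record)
```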
3. JSON (JavaScript Object Notation)
Suitable for hierarchical or nested data structures.
{
  "students": [
    {"name": "Alice", "age": 29, "mark": 88},
    {"name": "Bob", "age": 31, "mark": 91}
  ]
}
- Supports structured, nested objects
- Widely used in APIs and databases
- Parsed natively by most programming languages
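To illustrate, here is a small sketch that loads the JSON example above with Python's standard `json` module and computes a mean mark. The variable names are illustrative. Writing the data back with sorted keys and fixed indentation is one possible convention for keeping version-control diffs small, not a requirement of JSON itself.

```python
import json

RAW_JSON = """
{
  "students": [
    {"name": "Alice", "age": 29, "mark": 88},
    {"name": "Bob", "age": 31, "mark": 91}
  ]
}
"""

data = json.loads(RAW_JSON)

# Numbers arrive as numbers and strings as strings; no guessing is involved.
marks = [student["mark"] for student in data["students"]]
mean_mark = sum(marks) / len(marks)
print(mean_mark)

# A stable serialisation: the same data always produces the same text.
canonical = json.dumps(data, indent=2, sort_keys=True)
print(canonical)
```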
4. YAML
A human-friendly structured format often used for configuration and metadata.
students:
  - name: Alice
    age: 29
    mark: 88
  - name: Bob
    age: 31
    mark: 91
5. TOML
A structured plain text format with explicit tables and typed values. For flat or shallow data it is often easier to read than YAML.
[[students]]
name = "Alice"
age = 29
mark = 88
[[students]]
name = "Bob"
age = 31
mark = 91
Advantages of Plain Text Data
- Transparent — no hidden calculations
- Version-controllable — line-by-line comparison
- Platform-independent
- Archivable long-term
- Scriptable and automatable
- Compatible with open repositories
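The line-by-line comparison mentioned above can be sketched with Python's standard `difflib` module. The two versions below are hypothetical: the second corrects one mark, and the file names `marks_v1.csv` and `marks_v2.csv` are labels invented for the example.

```python
import difflib

old = "name,age,mark\nAlice,29,88\nBob,31,92\n"
new = "name,age,mark\nAlice,29,88\nBob,31,95\n"  # hypothetical correction to one mark

# unified_diff yields the same kind of output a version-control system shows.
diff = list(difflib.unified_diff(
    old.splitlines(),
    new.splitlines(),
    fromfile="marks_v1.csv",
    tofile="marks_v2.csv",
    lineterm="",
))
print("\n".join(diff))

# Keep only the changed data lines, skipping the two file-name header lines.
changed = [line for line in diff
           if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
```

The change is visible as exactly one removed line and one added line, which is not possible with a compressed spreadsheet file.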
Plain text data can be validated, hashed, cited, and reproduced deterministically. It integrates directly with structured research documents and computational workflows.
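As one illustration of hashing, the sketch below computes a SHA-256 checksum with Python's standard `hashlib`. The helper name `sha256_of` is an assumption, and the data is held in memory rather than read from a file to keep the example self-contained.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

raw = "name,age,mark\nAlice,29,88\nBob,31,92\n".encode("utf-8")
digest = sha256_of(raw)
print(digest)

# Changing even one character produces a different checksum.
tampered = raw.replace(b"92", b"93")
print(sha256_of(tampered) == digest)
```

Publishing the checksum alongside the data lets a reader confirm that the file they downloaded is byte-for-byte the one that was analysed.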
Data and Reproducibility
Raw data should either be included directly in the research repository or linked to an external open repository. Data files must be:
- Complete
- Unmodified by hidden software behavior
- Accompanied by schema documentation
- Linked to code used for analysis
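Schema documentation becomes more useful when it can be checked by a script. The following is a minimal sketch under stated assumptions: the `SCHEMA` dictionary and the `validate` function are hypothetical, and the check covers only column names and basic types, not ranges or missing-value rules.

```python
import csv
import io

# Hypothetical schema: column name -> type each value must convert to.
SCHEMA = {"name": str, "age": int, "mark": int}

def validate(text, schema):
    """Return a list of problems found; an empty list means the data conforms."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(schema):
        return [f"header {reader.fieldnames} != expected {list(schema)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, expected in schema.items():
            try:
                expected(row[column])
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_no}: {column}={row[column]!r} is not {expected.__name__}"
                )
    return problems

good = "name,age,mark\nAlice,29,88\nBob,31,92\n"
bad = "name,age,mark\nAlice,twenty-nine,88\n"

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```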
In open research, data is not a screenshot of a spreadsheet. It is a structured, machine-readable artifact that can be independently verified.