Raw Data

Plain Text Data in Open Research

Research must expose its raw data in formats that are durable, transparent, and machine-readable. Proprietary spreadsheet files such as Excel (.xlsx) are not suitable as canonical research data formats.

Why Not Excel?

Spreadsheet software is useful for a quick look at raw data, but Excel files are organised in the same way as Microsoft Word files: each is a zip archive of XML, styling, metadata, formulas, and interface instructions. They therefore share the problems of other bloated Microsoft formats: they are slow to open, opaque, tied to proprietary software and a rigid workflow, and stored as a compressed archive that plain text tools cannot search. Excel also has several problems specific to storing and analysing raw numerical and text ('string') data:

Hidden formatting and hidden data mutations
Automatic type conversion (dates, scientific notation errors)
Embedded formulas that are difficult to audit
Proprietary dependencies
Difficult version comparison
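
The type-conversion problem can be illustrated with a minimal Python sketch. It assumes a made-up table (the column names and values are hypothetical) whose identifiers resemble a date, a number in scientific notation, and a zero-padded integer. The standard library csv module returns every field as a string, so nothing is converted unless the analyst asks for it.

```python
import csv
import io

# Hypothetical raw data: identifiers that look like a date ("SEPT2"),
# scientific notation ("1E5"), and a zero-padded integer ("00123").
RAW = "sample_id,count\nSEPT2,10\n1E5,20\n00123,30\n"

def read_rows(text):
    """Parse CSV text; every field comes back as an unmodified string."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(RAW)
ids = [row["sample_id"] for row in rows]
# Conversion happens only where it is requested explicitly.
counts = [int(row["count"]) for row in rows]

print(ids)
print(counts)
```

The identifiers survive unchanged because the parser has no opinion about what they mean; the only conversion is the explicit int() call on the count column.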

There are several good alternatives to a rigid, proprietary, opaque file format. I won't pretend that there is a 'perfect' format for storing and analysing raw numerical and string data; each does a different job and has its own advantages and disadvantages. At the very least, a format should be an open standard: publicly documented, independently implementable, and readable with open source software. Here are a few good rules of thumb about what raw data file formats should do:

Data should be stored independently of presentation and interface logic. Raw research data must be editable without proprietary software.

Plain Text Data Formats

Plain text data formats separate data from display. They are human-readable, version-controllable, and interoperable across systems.

1. CSV (Comma-Separated Values)

A simple tabular format where rows are lines and columns are separated by commas.


          name,age,mark
          Alice,29,88
          Bob,31,91

Advantages

Universally supported, tabular format
Diff-friendly in version control
No hidden formatting

Disadvantages

Not suited to large or complex data; text fields that contain commas must be quoted
Parsing plain text is slow and error-prone for large files
Some Windows software can misread UTF-8 encoded files, particularly those without a byte order mark
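
The comma problem is usually handled by quoting. As a minimal sketch, assuming a hypothetical two-column table with free-text comments, Python's standard csv module quotes such fields when writing and restores them when reading:

```python
import csv
import io

# Hypothetical records; one comment contains a comma, the other quote marks.
records = [
    {"name": "Alice", "comment": "Good, but late"},
    {"name": "Bob", "comment": 'Said "fine"'},
]

def to_csv(rows):
    """Serialise rows to CSV text; the csv module quotes fields as needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "comment"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def from_csv(text):
    """Parse CSV text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

text = to_csv(records)
print(text)
round_trip = from_csv(text)
```

Splitting lines on commas by hand would break the first record into three fields, which is why a real CSV parser is preferable to string splitting.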

2. TSV (Tab-Separated Values)

Similar to CSV but uses tab characters as separators. Often safer when textual fields contain commas.


          name	age	mark
          Alice	29	88
          Bob	31	91
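
Reading TSV needs only a different delimiter. The sketch below assumes a hypothetical table with a free-text column and writes the tab characters explicitly as \t so they are visible; commas inside a field then need no special treatment.

```python
import csv
import io

# Hypothetical table; \t marks the tab separators explicitly.
TSV_TEXT = "name\tnote\nAlice\tGood, but late\nBob\tOn time\n"

def read_tsv(text):
    """Parse tab-separated text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = read_tsv(TSV_TEXT)
print(rows[0]["note"])
```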

3. JSON (JavaScript Object Notation)

Suitable for hierarchical or nested data structures.

{
  "students": [
    {"name": "Alice", "age": 29, "mark": 88},
    {"name": "Bob", "age": 31, "mark": 91}
  ]
}

Advantages

Supports structured, nested objects
Widely used in APIs and databases
Parsed natively by most programming languages
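
As a rough illustration of working with nested data, the sketch below assumes a hypothetical variant of the student records in which each student holds a list of marks. It uses only Python's standard json module.

```python
import json

# Hypothetical nested document: each student carries a list of marks.
JSON_TEXT = """
{
  "students": [
    {"name": "Alice", "marks": [88, 90]},
    {"name": "Bob", "marks": [91, 85]}
  ]
}
"""

def mean_marks(text):
    """Return a mapping from student name to the mean of their marks."""
    data = json.loads(text)
    return {
        student["name"]: sum(student["marks"]) / len(student["marks"])
        for student in data["students"]
    }

means = mean_marks(JSON_TEXT)
print(means)
```

A flat CSV table would need either one row per mark or a made-up convention for packing several marks into one cell; JSON expresses the list directly.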

4. YAML

A human-friendly structured format often used for configuration and metadata.


          students:
            - name: Alice
              age: 29
              mark: 88
            - name: Bob
              age: 31
              mark: 91

5. TOML

A structured plain text format that supports nesting and is often more readable than YAML.


            [[students]]
            name = "Alice"
            age = 29
            mark = 88

            [[students]]
            name = "Bob"
            age = 31
            mark = 91

Advantages of Plain Text Data

Transparent — no hidden calculations
Version-controllable — line-by-line comparison
Platform-independent
Archivable long-term
Scriptable and automatable
Compatible with open repositories
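
Line-by-line comparison can be sketched with Python's standard difflib module. The two file versions below are hypothetical; in practice a version control system such as Git produces the same kind of diff automatically.

```python
import difflib

# Two hypothetical versions of the same small CSV file.
OLD = "name,age,mark\nAlice,29,88\nBob,31,90\n"
NEW = "name,age,mark\nAlice,29,88\nBob,31,95\n"

def changed_lines(old, new):
    """Return the added and removed lines from a unified diff."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="old.csv",
        tofile="new.csv",
        lineterm="",
    )
    # Skip the "---" / "+++" file headers and keep only real changes.
    return [
        line
        for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("---", "+++"))
    ]

changes = changed_lines(OLD, NEW)
print("\n".join(changes))
```

The diff pinpoints the single edited row, which is the kind of audit trail a compressed spreadsheet file cannot provide.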

Plain text data can be validated, hashed, cited, and reproduced deterministically. It integrates directly with structured research documents and computational workflows.
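
Hashing is one way to make a dataset citable and verifiable. The sketch below assumes hypothetical file contents held in memory and uses SHA-256 from Python's standard hashlib module; any other well-known hash function would serve the same purpose.

```python
import hashlib

# Hypothetical file contents, encoded explicitly as UTF-8 bytes.
DATA = "name,age,mark\nAlice,29,88\n".encode("utf-8")

def sha256_hex(payload):
    """Return the SHA-256 digest of a bytes payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()

digest = sha256_hex(DATA)
print(digest)
```

Publishing the digest alongside the data lets a reader confirm that the file they downloaded is byte-for-byte the one that was analysed; changing a single character produces a different digest.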

Data and Reproducibility

Raw data should either be included directly in the research repository or linked to an external open repository. Data files must be:

Complete
Unmodified by hidden software behaviour
Accompanied by schema documentation
Linked to code used for analysis
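
Schema documentation is most useful when it can also be checked by a program. As a minimal sketch, assuming a hypothetical schema that maps each column name to a Python type, a CSV file can be validated like this:

```python
import csv
import io

# Hypothetical schema: column name -> converter that raises on bad input.
SCHEMA = {"name": str, "age": int, "mark": int}

def validate(text, schema):
    """Return a list of (row number, column, value) for fields that fail."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(schema):
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    for number, row in enumerate(reader, start=1):
        for column, convert in schema.items():
            try:
                convert(row[column])
            except (TypeError, ValueError):
                errors.append((number, column, row[column]))
    return errors

GOOD = "name,age,mark\nAlice,29,88\n"
BAD = "name,age,mark\nBob,thirty-one,91\n"

print(validate(GOOD, SCHEMA))
print(validate(BAD, SCHEMA))
```

This is deliberately simple; dedicated schema languages exist for JSON and tabular data, but even a short script like this one makes the documented schema testable.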

In open research, data is not a screenshot of a spreadsheet. It is a structured, machine-readable artifact that can be independently verified.