Writing by Peter Hilton

Publish data using the CSVW standard

Use W3C’s ‘CSV on the Web’ (CSVW) standard to make data accessible - 28 December 2021 #data #software #design

A spider’s web

Shannon Potter

RFC 4180 standardised comma-separated values (CSV) in 2005, but didn’t address its data model limitations. With the exception of column headings, CSV files contain data but lack the metadata you need to use them reliably. Ten years later, the W3C CSV on the Web Working Group published standards for publishing data, described in CSV on the Web: A Primer.

CSVW extends standard CSV with the ingredients for a reliable, flexible and extensible data exchange format. Most importantly, CSVW accommodates metadata.

Document metadata

Like JSON, CSV has no syntax for comments or other metadata to describe the data in the file. As a result, when you publish CSV, you have to use external documentation to answer the kinds of questions that good code comments answer about code.

  • Where did this data come from?
  • Why does it exist?
  • What limitations does it have?
  • Who owns it?
  • How can you update it?

This poses less of a problem in JSON, whose flexible structure makes it easier to include metadata without making a mess of the original data. With CSV, document metadata and other documentation belongs in a separate file, much like the introduction sheets that advanced spreadsheet users tend to invent to document their multi-sheet spreadsheet workbooks, like word-processor document cover pages.

Column metadata

Whatever the data in a CSV file represents, CSV only contains text. CSV has no standard way to format numbers and dates as text, so you end up with incompatible application-specific approaches instead of reliable data interchange.

Some applications require the ability to specify column data more precisely, and validate data. Using standards such as ISO 8601 dates and times helps, but you still have to choose a specific date-time precision and format.

Data validation requires data type definitions that specify types, precision, and valid values. Similarly, internationalising text data requires content language metadata. Again, JSON’s flexibility allows more options than CSV, including using JSON Schema to ‘annotate and validate JSON documents’

CSVW

CSV on the web (CSVW) extends CSV with a specification for CSV metadata. CSVW includes:

  • UTF-8 encoding, by default
  • a JSON metadata file, e.g. data.csv-metadata.json to accompany data.csv
  • documentation for tables, columns, and cells
  • tabular data schema in the metadata, for validating files
  • standardised transformations from to JSON and RDF
  • metadata for content languages, e.g. alternate column headings

Like other schema languages, CSVW metadata lets you choose how richly to describe the data, making it flexible for a variety of applications. In particular, while CSVW defines machine-readable metadata, its readability makes it suitable as concise human-readable documentation.

One more thing… extensibility

Ad hoc structured metadata tends to lack extensibility. Without extensibility - support for changes to the data format - software that reads it will break when the data format inevitably changes. CSVW handles this with extensible data types, schemas, transformations and other metadata.

CSVW does a few things well. If you publish CSV on the web, use the CSVW standard.