Writing by Peter Hilton

Create RFC 4180-compliant CSV files

How to solve the worst comma-separated values (CSV) problems - 23 December 2021 #data #software #design

Happy woman

Bruce Mars

Comma-separated values (CSV) files often prove unreliable, due to problems that variations in the format cause. RFC 4180 resolves CSV’s historical lack of standardisation by defining a Common Format and MIME Type for Comma-Separated Values (CSV) Files. This short standard defines CSV syntax and the text/csv media type.

If you work on software that outputs CSV, make sure it complies with RFC 4180 and uses UTF-8 encoding. Follow the robustness principle.

Standard dialect

RFC 4180 defines a standard dialect for CSV, that specifies delimiters, quoting, and line breaks. As well as resolving these historical variations in CSV, RFC 4180 also resolves other potential inconsistencies, such as requiring the same number of fields on each line.

Delimiter dialects vary in the wild, with so-called CSV files using a seemingly random choice of comma, semi-colon or tab delimiters. RFC 4180 resolves this separator anxiety by specifying that CSV files use commas as delimiters.

Data may contain commas, line breaks and double quote (") characters. In general, this kind of format requires field quoting, because separating columns and rows with characters too obscure to appear in the data makes the resulting files hard to inspect in a text editor. RFC 4180 allows commas as field delimiters by requiring double quotes around any field that contains any of these characters.

Different operating systems have different conventions for which characters indicate a line break, for obscure 1960s historical reasons. RFC 4180 resolves line break confusion by specifying that a carriage return followed by a line-feed (CRLF - %x0D%x0A a.k.a. \r\n) separates each record (row).

Character encoding

RFC 4180 doesn’t specify the file encoding for CSV files, the way that JSON and XML both effectively specify UTF-8. However, things have changed since RFC 4180’s 2005 publication, and UTF-8 became the most popular text encoding on the web in 2008 according to Wikipedia.

Thirteen years on, you should only create UTF-8 encoded CSV. Especially if you publish it online or otherwise share it with anyone else. If you have a niche scenario that makes you consider a different character encoding, you should probably use a more sophisticated format than CSV.

Bad implementations

Software that exports CSV may get the format wrong and fail to properly quote values that include commas, or make each row the same length. Software that reads CSV may fail to accommodate these variations, or lack error handling that identifies where errors lie.

Don’t write your own CSV parser. You don’t have enough time to make it work correctly.

Standard CSV

Following decades of non-standard CSV, incompatible files, and unpredictable parser differences, RFC 4180 standardised CSV in 2005. Since then, software should have adopted this standard and has had plenty of time to do so.

Don’t produce software that causes CSV sadness. Build software that ignites RFC 4180 compliance happiness.