CSV character encoding
Comma-separated values (CSV) with the wrong characters 2021-12-14 #data
- CSV delimiters
- CSV character encoding ←
- Why CSV survives
- RFC-4180 compliant CSV
- CSV on the Web (CSVW)
- Excel’s broken CSV
As well as confusing delimiters, comma-separated values (CSV) files have other problems that make them a pain to work with. They keep using the wrong character encoding, for one.
Accidental metal to nerd conversion
CSV can cause problems when you use it to share data that includes, say, metal umlauts:
Name | Type | Diacritic score |
---|---|---|
Blue Öyster Cult | rock band | 1 |
Queensrÿche | heavy metal band | 1 |
Motörhead | rock band | 1 |
The Accüsed | thrash band | 1 |
Mötley Crüe | heavy metal band | 2 |
This Is Spın̈al Tap | film | 1 |
Möngöl Hörde | hardcore punk band | 3 |
G̈r̈oẗus̈ | industrial rock band | 4 |
Brütal Legend | video game | 1 |
Häagen-Dazs | dessert | 1 |
You save this as CSV, send it to someone else, and they get garbled names:
Name | Type | Diacritic score |
---|---|---|
Blue Öyster Cult | rock band | 1 |
Queensrÿche | heavy metal band | 1 |
Motörhead | rock band | 1 |
The Accüsed | thrash band | 1 |
Mötley Crüe | heavy metal band | 2 |
This Is Spın̈al Tap | film | 1 |
Möngöl Hörde | hardcore punk band | 3 |
GÃàrÃào·∫óusÃà | industrial rock band | 4 |
Brütal Legend | video game | 1 |
Häagen-Dazs | dessert | 1 |
The square root signs look more maths nerd than metal, so you have a problem. Specifically, you probably have Microsoft Excel, because opening the same CSV in Apple Numbers or Google Sheets just works.
Wrong default encoding
The example above results from saving the first table as CSV with
Unicode UTF-8 encoding,
and then assuming Macintosh
encoding when loading it.
On Windows, Motörhead
might become Motörhead
instead.
Using the wrong default encoding causes these encoding issues, which appear in software that uses default encodings from the last century for historical reasons. In theory, Unicode solves this problem by removing the need for different character encodings for different languages, so everyone can always use the same encoding. But the twenty years since Unicode’s introduction don’t appear to have given us enough time to switch to only saving CSV files with Unicode (UTF-8) encoding, and interpreting text files as UTF-8 by default.
No explicit encoding
CSV encoding issues illustrate a specific example of a broader problem, explained by Dyan Beattie in his conference presentation, There’s no such thing as plain text (slides, video). The notion of a plain text file relies on assumptions about people’s locations, languages, and cultures.
Instead of making assumptions, we might prefer to make our choices explicit. But in the same way that CSV doesn’t make its delimiter dialect explicit, a CSV file cannot declare its character encoding either.
Software developers have technical solutions to these problems, of course, but encoding problems persist in practice. Meanwhile, you can deal with broken CSV encoding by asking the person who sent you a broken file to save it as UTF-8 and try loading it again.