CSV character encoding

Comma-separated values (CSV) with the wrong characters 2021-12-14 #data

As well as confusing delimiters, comma-separated values (CSV) files have other problems that make them a pain to work with. They keep using the wrong character encoding, for one.

Accidental metal to nerd conversion

CSV can cause problems when you use it to share data that includes, say, metal umlauts:

Name	Type	Diacritic score
Blue Öyster Cult	rock band	1
Queensrÿche	heavy metal band	1
Motörhead	rock band	1
The Accüsed	thrash band	1
Mötley Crüe	heavy metal band	2
This Is Spın̈al Tap	film	1
Möngöl Hörde	hardcore punk band	3
G̈r̈oẗus̈	industrial rock band	4
Brütal Legend	video game	1
Häagen-Dazs	dessert	1

You save this as CSV, send it to someone else, and they get garbled names:

Name	Type	Diacritic score
Blue √ñyster Cult	rock band	1
Queensr√øche	heavy metal band	1
Mot√∂rhead	rock band	1
The Acc√ºsed	thrash band	1
M√∂tley Cr√ºe	heavy metal band	2
This Is Spƒ±nÃàal Tap	film	1
M√∂ng√∂l H√∂rde	hardcore punk band	3
GÃàrÃào·∫óusÃà	industrial rock band	4
Br√ºtal Legend	video game	1
H√§agen-Dazs	dessert	1

The square root signs look more maths nerd than metal, so you have a problem. Specifically, you probably have Microsoft Excel, because opening the same CSV in Apple Numbers or Google Sheets just works.

Wrong default encoding

The example above results from saving the first table as CSV with Unicode UTF-8 encoding, and then assuming Macintosh encoding when loading it. On Windows, Motörhead might become MotÃ¶rhead instead.

Using the wrong default encoding causes these encoding issues, which appear in software that uses default encodings from the last century for historical reasons. In theory, Unicode solves this problem by removing the need for different character encodings for different languages, so everyone can always use the same encoding. But the twenty years since Unicode’s introduction don’t appear to have given us enough time to switch to only saving CSV files with Unicode (UTF-8) encoding, and interpreting text files as UTF-8 by default.

No explicit encoding

CSV encoding issues illustrate a specific example of a broader problem, explained by Dyan Beattie in his conference presentation, There’s no such thing as plain text (slides, video). The notion of a plain text file relies on assumptions about people’s locations, languages, and cultures.

Instead of making assumptions, we might prefer to make our choices explicit. But in the same way that CSV doesn’t make its delimiter dialect explicit, a CSV file cannot declare its character encoding either.

Software developers have technical solutions to these problems, of course, but encoding problems persist in practice. Meanwhile, you can deal with broken CSV encoding by asking the person who sent you a broken file to save it as UTF-8 and try loading it again.

Writing by Peter Hilton

Accidental metal to nerd conversion

Wrong default encoding

No explicit encoding