CSV isn’t broken - Excel is broken
Reasons to blame Excel for comma-separated values (CSV) issues 2022-01-04 #data
- CSV delimiters
- CSV character encoding
- Why CSV survives
- RFC-4180 compliant CSV
- CSV on the Web (CSVW)
- Excel’s broken CSV ←
Excel causes some of the problems you see when you open a comma-separated values (CSV) file in Excel. Standards-compliant CSV works more reliably elsewhere.
Character encoding misidentification
CSV files don’t specify their own character encoding, so Excel guesses.
CSV character encoding issues lead, for example,
to saving data about Mötley Crüe, and later opening it to discover the rather less metal M√∂tley Cr√ºe
.
Excel mangles the non-ASCII characters because it parses the CSV using its default character encoding, not Unicode UTF-8. Excel has guessed badly for years: Wikipedia reports that UTF-8 became the most popular text encoding on the web in 2008.
Note that you can import a CSV file instead of opening it, and select the correct File origin:
Excel fails to acknowledge both that everyone else uses UTF-8, and that most files live on some else’s computer, accessed via the web. Excel’s perspective doesn’t extend beyond the local system and its default encoding. CSV on the web defines a more modern standard.
Unicode misidentification
On a more technical note, Excel does try to automatically identify Unicode CSV, but gets it wrong. It turns out that Excel saves UTF-8 CSV with a byte order mark (BOM), and when opening CSV only recognises the encoding as UTF-8 if a BOM is present.
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
Illicit typing
When you enter values by hand, Excel’s implicit typing infers data types for numbers and dates. If this annoys you, because you have to correct the corrupted values, consider the data quality implications of Excel doing the same thing when it opens CSV:
Name | Code | Type |
---|---|---|
James Bond | 007 | secret agent |
Play for Scala | 9781617290794 | book |
three-quarters | 3/4 | fraction |
Membrane Associated Ring-CH-Type Finger 1 | MARCH1 | gene |
Norway | NO | country |
If you save this as CSV and open it in Excel, you get secret agent 7
,
a book with ISBN 9.78162E+12
, the fraction 04-Mar
, and the gene 01-Mar
.
Excel clearly doesn’t know about non-numeric numbers.
Still, at least it doesn’t have
The YAML Norway Problem.
Workarounds
Since Excel only identifies UTF-8 when opening CSV that has a BOM, you can ‘fix’ a UTF-8 CSV for Excel by prepending the BOM, e.g. with the command line:
echo -ne "\xEF\xBB\xBF" | cat - data.csv > data-with-BOM.csv
You can also force Excel to interpret values as text by prefixing quoted cells with an equals sign:
Name,Code,Type
James Bond,="007",secret agent
Play for Scala,="9781617290794",book
three-quarters,="3/4",fraction
Membrane Associated Ring-CH-Type Finger 1,="MARCH1",gene
Excel does recognise dates in YYYY-MM-DD
format, which you should probably use anyway.
These and other Excel CSV workarounds explain why some software offers separate Export CSV and Export CSV (Excel) options. Alternatively, Publish data using the CSVW standard, and use Apple Numbers or Google Sheets instead of Microsoft Excel.
Note: this article refers to Excel for Mac, version 16.56.