- CSV delimiters
- CSV character encoding
- Why CSV survives
- RFC-4180 compliant CSV
- CSV on the Web (CSVW)
- Excel’s broken CSV ←
Excel causes some of the problems you see when you open a comma-separated values (CSV) file in Excel. Standards-compliant CSV works more reliably elsewhere.
Character encoding misidentification
CSV files don’t specify their own character encoding, so Excel guesses.
CSV character encoding issues lead, for example,
to saving data about Mötley Crüe, and later opening it to discover the rather less metal
Excel mangles the non-ASCII characters because it parses the CSV using its default character encoding, not Unicode UTF-8. Excel has guessed badly for years: Wikipedia reports that UTF-8 became the most popular text encoding on the web in 2008.
Note that you can import a CSV file instead of opening it, and select the correct File origin:
Excel fails to acknowledge both that everyone else uses UTF-8, and that most files live on some else’s computer, accessed via the web. Excel’s perspective doesn’t extend beyond the local system and its default encoding. CSV on the web defines a more modern standard.
On a more technical note, Excel does try to automatically identify Unicode CSV, but gets it wrong. It turns out that Excel saves UTF-8 CSV with a byte order mark (BOM), and when opening CSV only recognises the encoding as UTF-8 if a BOM is present.
Wikipedia explains further:
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
When you enter values by hand, Excel’s implicit typing infers data types for numbers and dates. If this annoys you, because you have to correct the corrupted values, consider the data quality implications of Excel doing the same thing when it opens CSV:
|Play for Scala
|Membrane Associated Ring-CH-Type Finger 1
If you save this as CSV and open it in Excel, you get secret agent
a book with ISBN
9.78162E+12, the fraction
04-Mar, and the gene
Excel clearly doesn’t know about non-numeric numbers.
Still, at least it doesn’t have
The YAML Norway Problem.
Since Excel only identifies UTF-8 when opening CSV that has a BOM, you can ‘fix’ a UTF-8 CSV for Excel by prepending the BOM, e.g. with the command line:
echo -ne "\xEF\xBB\xBF" | cat - data.csv > data-with-BOM.csv
You can also force Excel to interpret values as text by prefixing quoted cells with an equals sign:
James Bond,="007",secret agent
Play for Scala,="9781617290794",book
Membrane Associated Ring-CH-Type Finger 1,="MARCH1",gene
Excel does recognise dates in
YYYY-MM-DD format, which you should probably use anyway.
These and other Excel CSV workarounds explain why some software offers separate Export CSV and Export CSV (Excel) options. Alternatively, Publish data using the CSVW standard, and use Apple Numbers or Google Sheets instead of Microsoft Excel.
Note: this article refers to Excel for Mac, version 16.56.