Writing by Peter Hilton

CSV isn’t broken - Excel is broken

Reasons to blame Excel for comma-separated values (CSV) issues 2022-01-04 #data

freestocks

  1. CSV delimiters
  2. CSV character encoding
  3. Why CSV survives
  4. RFC-4180 compliant CSV
  5. CSV on the Web (CSVW)
  6. Excel’s broken CSV ←

Excel causes some of the problems you see when you open a comma-separated values (CSV) file in Excel. Standards-compliant CSV works more reliably elsewhere.

Character encoding misidentification

CSV files don’t specify their own character encoding, so Excel guesses. CSV character encoding issues lead, for example, to saving data about Mötley Crüe, and later opening it to discover the rather less metal M√∂tley Cr√ºe.

Excel mangles the non-ASCII characters because it parses the CSV using its default character encoding, not Unicode UTF-8. Excel has guessed badly for years: Wikipedia reports that UTF-8 became the most popular text encoding on the web in 2008.

Note that you can import a CSV file instead of opening it, and select the correct File origin:

Excel file import wizard

Excel fails to acknowledge both that everyone else uses UTF-8, and that most files live on some else’s computer, accessed via the web. Excel’s perspective doesn’t extend beyond the local system and its default encoding. CSV on the web defines a more modern standard.

Unicode misidentification

On a more technical note, Excel does try to automatically identify Unicode CSV, but gets it wrong. It turns out that Excel saves UTF-8 CSV with a byte order mark (BOM), and when opening CSV only recognises the encoding as UTF-8 if a BOM is present.

Wikipedia explains further:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.

Illicit typing

When you enter values by hand, Excel’s implicit typing infers data types for numbers and dates. If this annoys you, because you have to correct the corrupted values, consider the data quality implications of Excel doing the same thing when it opens CSV:

Name Code Type
James Bond 007 secret agent
Play for Scala 9781617290794 book
three-quarters 3/4 fraction
Membrane Associated Ring-CH-Type Finger 1 MARCH1 gene
Norway NO country

If you save this as CSV and open it in Excel, you get secret agent 7, a book with ISBN 9.78162E+12, the fraction 04-Mar, and the gene 01-Mar. Excel clearly doesn’t know about non-numeric numbers. Still, at least it doesn’t have The YAML Norway Problem.

Workarounds

Since Excel only identifies UTF-8 when opening CSV that has a BOM, you can ‘fix’ a UTF-8 CSV for Excel by prepending the BOM, e.g. with the command line:

echo -ne "\xEF\xBB\xBF" | cat - data.csv > data-with-BOM.csv

You can also force Excel to interpret values as text by prefixing quoted cells with an equals sign:

Name,Code,Type
James Bond,="007",secret agent
Play for Scala,="9781617290794",book
three-quarters,="3/4",fraction
Membrane Associated Ring-CH-Type Finger 1,="MARCH1",gene

Excel does recognise dates in YYYY-MM-DD format, which you should probably use anyway.

These and other Excel CSV workarounds explain why some software offers separate Export CSV and Export CSV (Excel) options. Alternatively, Publish data using the CSVW standard, and use Apple Numbers or Google Sheets instead of Microsoft Excel.

Note: this article refers to Excel for Mac, version 16.56.

Share on TwitterShare on LinkedIn