Please Offer An Excel Export Option

Posted by Ross Poulton on Mon 08 December 2014 #data #excel #csv

Evan Miller, writing in Please Offer An Excel Export Option:

Data is ultimately meant to be consumed by people. This may come as a shock, but most people are not computer programmers. They are not going to read your CSV with Python and process it with Pandas or Numpy. They are not versed in the libcsv API, nor do they possess ambitions to convert your CSV to JSON in order to build a web mashup. Of the grammar of graphics, they know nothing.

In all likelihood they are going to open your CSV file in Microsoft Excel 2004, or if you're lucky Microsoft Excel 2007, and they are going to spend hours building pixelated pie charts, bar charts, and 3D line graphs. You might not use Excel to analyze data, but they almost certainly will. So why not accommodate their intentions as well as you can? Isn't programming fundamentally about helping users achieve their goals?

But it isn't just human-readable files that can benefit from being in Excel over CSV. Over the last few years I've come to loathe the CSV format. Sure, they're "portable", but there are so many flaws that I never feel I can truly trust the data coming out of them. A list that nicely sums up a bunch of the challenges I regularly face with CSV files (or poorly built spreadsheets) can be found in Releasing data or statistics in spreadsheets, including these gems:

5 . Numbers should just be numbers. Don’t put commas in them, or stars after them, or anything else.

7 . Don’t use colour or other stylistic cues to encode information

Much of my day job involves migrating data from one line-of-business application to another. If the data is available in a raw SQL database I know I'm going to have a relatively easy time — but much of the migration tasks I work on involve exporting data to CSV files then trying to import them at the other end. The problem is, every system has it's own crappy definition of things like line-breaks inside cells, relational data, writing or reading numbers, and handling of non-ASCII characters.

For better or worse, I often print very small subsets of data when I'm trying to determine how to best process and manipulate it. I do this in table form, often on an A3 page, and then scribble all over it to identify what goes where. It's amazing how many carefully-prepared spreadsheets fail the simple black-and-white printer test due to important information being encoded in coloured text or varying shades of background fill.

I'll read the data directly from the source database whenever I can, but for all other cases an Excel file is a great deal nicer than CSV files - which should be used for only the simplest and tightly-defined of datasets.