Despite my preference for R over Excel, I still use a spreadsheet for many applications. But I find spreadsheets are often misused or poorly constructed, and they can create work later that exceeds the time and effort they saved you in the beginning. One way to prevent this is to think database when you are spreadsheeting.
An Excel spreadsheet is not a database and shouldn’t be thought of as one, even if the two words are used interchangeably by people like, say, upper management (if your organization is anything like mine). But if you are working with many rows or columns of data, it is always a good idea to construct the spreadsheet as if it were a database. This at least gives you a chance of being able to add data to, or extract data from, the spreadsheet later.
For example, we routinely receive spreadsheets thousands of rows long that are, outside the bounds of the spreadsheet format, almost useless. The addition of fancy titles, merged cells, subtotals, empty rows, and other spreadsheet junk that disrupts the data makes extracting it tedious at best and nearly impossible at worst. It would have been so much more useful had the author made one sheet hold the data—one header row with unique field names and just the raw data below, one record per row—and a second sheet hold the spreadsheet math that provides the summary and calculated information. This is actually easier to construct and doesn’t lock the data in the spreadsheet. It is easy enough to export the raw data as a DBF or comma-delimited file to work on it in a proper database program, or just to use the data without relying on the Excel structure.
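To illustrate the payoff, here is a minimal sketch of what that clean layout buys you: a data sheet exported as comma-delimited text can be consumed directly by any tool, with the "calculation sheet" work done in code instead. The column names and values below are hypothetical, standing in for whatever your data sheet holds.

```python
import csv
import io

# Hypothetical export of the data sheet: one header row with unique
# field names, one record per row -- no titles, merged cells, or subtotals.
raw_export = """region,product,units,unit_price
North,Widget,120,2.50
North,Gadget,45,10.00
South,Widget,200,2.50
"""

# Because the sheet is laid out like a database table, a plain CSV
# reader can consume it directly, with no cleanup pass first.
reader = csv.DictReader(io.StringIO(raw_export))
rows = list(reader)

# The summary math (here, revenue per region) lives outside the data,
# just as it would on a separate calculation sheet.
revenue = {}
for row in rows:
    total = int(row["units"]) * float(row["unit_price"])
    revenue[row["region"]] = revenue.get(row["region"], 0.0) + total

print(revenue)  # {'North': 750.0, 'South': 500.0}
```

The same file would load just as cleanly into R, a database, or back into Excel—which is exactly the point: the data isn't trapped in one program's formatting.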
I know there is a temptation to add the spreadsheet junk. Pretty colors, lines, comment blocks, font changes, and the like are fine on the sheet that does the calcs (even if all that is usually a waste of time); just don’t let the junk get in the way of the data. That’s the important stuff.
What do you think? Drop us a comment.