SQLite: A surprisingly good data analysis tool

about | archive


[ 2014-December-06 12:44 ]

At Twitter, I've needed to do some ad-hoc data analysis over many terabytes of logs, and I've been surprised to discover that SQLite is an incredibly useful tool. The sqlite3 command-line tool requires nearly no configuration, is usually already installed, and can execute complex ad-hoc queries pretty quickly. Its .import command makes it trivial to read a text file into a table. Now my workflow is to write a Scalding job to extract the subset of data that I want from terabytes of logs using Hadoop, then import a few GB of data into SQLite to be able to quickly and interactively compute aggregate statistics over subsets of the data. This is many, many times faster than my previous approach, which was to write a one off script to filter and aggregate the data in the way I wanted. Here are some random tips if you are working with the sqlite3 command line: