I worked on a columnar database engine deployed on commodity Linux clusters in 2001, initial release 2002. This was at Sensage. We built a very fast DB engine that could store log data with 40x compression over traditional RDBMSs and ran log queries much more quickly than commercial DBs could at the time. I think we were too early.
The core engineering team was CTO + 3 engineers. Best engineering experience of my life. I wasn't involved at the lowest DB storage level, the guys who did that did a great job.
Michael Stonebraker, technical advisor to Sensage, learned from the Sensage mistakes and built Vertica.
Thanks for the pointer! Looking at Cloudera's jobs board, most tech-heavy work is in SF or Palo Alto. I live in San Mateo, convenient to both of those, but ... I'm likely leaving for the North Carolina RTP area soon.
When I hear columnar storage, I can't help but remember all the stories my friends have told me about kx systems (I'd love to share them but my memory is so fuzzy that I wouldn't be doing them justice. Hopefully some of the HNers here can share some of theirs though!) [1][2].
Everything old is new again. I remember considering using Sybase IQ Multiplex (columnar engine) back in 2000 in my startup. Just like "nosql" was all the rage a few years ago, bringing memory of the Pick databases of yore.
The biggest advantage of a pure column-oriented DBMS comes from having the positional information indirectly available without the requirement to store this ID information. During query execution the required position lists can then be generated.
In addition. If you considere the record format of traditional row-oriented databases you will see that the overhead of storing a single attribute record is rather high. Since with column-oriented DBMS its all about IO performance (Disk/Memory, Memory/CPU) such overhead can diminish the advantage.
Thus typical column stores tend to use only single strings of sequential memory to store the data. This can even be enhanced by applying dictionary compression and as a result only storing integer values. And modern CPUs are good in processing lots of them.
The id column and row header (as depicted in the post) can significantly waste I/O bandwidth.
See the paper "Column-stores vs. row-stores: how different are they really?" in SIGMOD 2008 for performance comparisons between C-store and approaches of emulating column-store in row-store databases.
grundprinzip's answer is good. I'd like to add a remark.
In database systems it is important to distinguish between the logical and physical models.
When you design a relational database, you focus on the correct logical model.
"People can have multiple phone numbers".
"Phone numbers belong to a single phone".
"Mobile phones are possessed by one person. Landlines can be shared".
And so on. You express this logical model to the database, most likely in SQL.
Eventually you notice that query X is slow. Your first step is to check that your logical design was sound, because poorly designed schemata are hard for query planners to reason correctly about.
Then you start doing things to the physical representation. You say stuff like:
"I look up by phone numbers a lot."
Or, in SQL terms, you add an index to a column.
Similarly, as this article pointed out, there are times when grouping data by column rather than row is advantageous. So then you tell the database to use a columnar store.
And so on. Modern RDBMSes all support the same major logical model descriptions; but they can vary widely on what physical directives you give.
Current Sensage company blurb about the event-data warehouse: http://sensage.com/content/clustered-columnar-database and http://sensage.com/content/why-columnar%E2%80%A6not-row-base...
Patent work: http://www.patentgenius.com/patent/7024414.html
The core engineering team was CTO + 3 engineers. Best engineering experience of my life. I wasn't involved at the lowest DB storage level, the guys who did that did a great job.
Michael Stonebraker, technical advisor to Sensage, learned from the Sensage mistakes and built Vertica.