It's great to see all these new ideas in the db arena. But one thing I
don't understand at all is why they diss the idea of having data integrity. Am I the only developer who actually likes working with relational data and doesn't think of a schema as a straitjacket? Am I the only one who has been burnt by MySQL and SQLite playing loose with types and silently coercing data almost at random?
Check constraints are also great and have saved my ass so many times. It's so much easier to debug a problem when the crash happens as you insert the data than in a wrong calculation much further down the line. Postgres, for example, allows you to create regular expression constraints, which you could use to ensure that a text column only contains a valid variable name. Exclusion constraints are even better: if you have a table of allocations (date ranges, integer ranges, etc.), you can ensure that none of the allocations overlap. Plus there are about a million other ways you can constrain your data.
Those are the killer features for me. They have helped me debug problems so many times that I consider them almost essential to writing good web applications. So why do almost all new database projects seem to neglect them?
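For anyone who hasn't played with these, here's a minimal sketch of the kind of constraints I mean, driven from Python via psycopg2 (the tables and columns are made up purely for illustration):

    import psycopg2  # Postgres driver; any client would do

    conn = psycopg2.connect("dbname=example")  # placeholder connection string
    cur = conn.cursor()

    # Regex CHECK constraint: the column may only hold valid variable names.
    cur.execute("""
        CREATE TABLE variables (
            name text CHECK (name ~ '^[A-Za-z_][A-Za-z0-9_]*$')
        )
    """)

    # Exclusion constraint: no two reservations for the same room may overlap.
    cur.execute("CREATE EXTENSION IF NOT EXISTS btree_gist")
    cur.execute("""
        CREATE TABLE reservations (
            room   text,
            during tsrange,
            EXCLUDE USING gist (room WITH =, during WITH &&)
        )
    """)
    conn.commit()

    # This now fails at insert time instead of corrupting a calculation later:
    #   INSERT INTO variables VALUES ('not a valid name!')

The point being: the bad data never gets in, so the error points straight at the culprit.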
RethinkDB is full of developers who have conflicted thoughts on this. The difference between structured and unstructured database systems has always seemed very analogous to static vs. dynamic type systems to me. Most of us at RethinkDB like our type systems static, Haskell being one of the favored languages, and we've had many conversations about why our database doesn't reflect those same principles.
I'll admit that to some degree RethinkDB is responding to what we see as the direction the market is moving. People seem to really like developing in dynamic languages using structureless databases. I think that a lot of the time this gives people more ways to shoot themselves in the foot (and I sense you agree). But even if this style of development were purely a band of fools shooting off each other's feet, we might still have to respond to it.
That being said, I don't think it is quite so foolish an endeavor. The thing about structured databases that sucks, and a place where the analogy to dynamic languages breaks down, is that your data integrity becomes entrenched. Changing around static types in your code isn't anywhere near as hard as changing around a static structure in your database, because you need to modify all the old data to make it meet the new structure. This can be a very cumbersome process, made more cumbersome by the fact that a lot of structured databases make optimizations based on things like knowing the size of each row (very hard to change on the fly). So while RethinkDB doesn't dislike data integrity, we really do dislike data stasis, and that's what we wanted to avoid. We do have plans for eventually offering a way to enforce arbitrary constraints over your database; we just aren't willing to do it in a way that makes those constraints hard to change. Unfortunately, doing it that way is time-consuming and we haven't gotten to it yet.
In my experience, one never ends up having all of one's update/migration (to a new version) changes confined to schema alterations. This is because the static database structure can only capture so much metadata. One inevitably ends up putting migration logic into the application itself, or logic that says "if old data, treat differently." So one ends up with a mixed data-migration situation: both database-level and application-level.
Thus, it's quite a valid, pragmatic tradeoff to make: handle all data migration/schema changes in the application, let the store be completely dynamic, and keep it all at one level.
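In code, the "if old data, treat differently" branch tends to look something like this rough Python sketch (field names invented for illustration):

    # Normalize documents at read time because the store neither enforces
    # nor migrates a structure. Old and new shapes coexist indefinitely.
    def load_user(doc):
        if "name" in doc:  # old shape: a single "name" field
            first, _, last = doc["name"].partition(" ")
            doc = {**doc, "first_name": first, "last_name": last}
            doc.pop("name", None)
        doc.setdefault("signup_source", "unknown")  # field added in a later version
        return doc

Every reader of the data has to carry this logic (or call something that does) until the last old document is gone.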
> I'll admit that to some degree RethinkDB is responding to what we see as the direction the market is moving.
Is this really true? I mean, it is clear that the majority of stuff you read about is from people who don't like types, but honestly, as time has marched on, development has been "democratized" to a large extent... yet if I think about "people with both a lot of experience and a lot of money, who are willing to pay both for things to use and for people to maintain those things" (what I would call "the market" for a database solution), I see lots and lots of rigid types.
I mostly see "schemaless" from database systems that are highly popular (with the masses) but where I don't really see much money being made or even moved (on database software). It could be, however, that I'm just totally misunderstanding how well some of these companies are doing. However, when I see a company evaluating things to plunk money down on, those things seem to be 1) clustering support for PostgreSQL, 2) hosted management of MySQL, and 3) system administrators and licenses for Oracle.
In contrast, the people providing NoSQL "schemaless" systems mostly seem to be living in a world where the developers expect everything to be free; they sell support contracts, but I never hear of anyone actually buying them... do a lot of companies do this? It also always sounds like a magical accomplishment when you hear of one being used at a large company with a lot of money (such as the developers at UbiSoft giving a talk at re:Invent on how they are using CouchBase)... I haven't heard RDBMS systems have to justify "look, this company used us at scale and we worked fine" in years.
Note: to be clear, I do hear of companies with "NoSQL" systems who have a lot of money, but they are all using Cassandra, HBase, or SOLR. SOLR lets you define dynamic columns, but everything is still rigidly typed. HBase and Cassandra both have schemas with fixed columns; sure, Cassandra allows you to largely ignore the schema, but if you've seen CQL3, they are actually well on their way to turning into an RDBMS: it is hilarious, and I think awesome. Even then, it isn't clear to me how much these companies pay for their databases.
> Changing around static types in your code isn't anywhere near as hard as changing around a static structure in your database because you need to modify all the old data to make it meet the new structure.
This is not actually true, as the structure can be stored as metadata describing the data. When I add or delete fields in PostgreSQL, the operation is instantaneous, as is renaming existing fields. AFAIK, only MySQL sucks at this. The only time I need to do a table rewrite is if I change the data type of an existing field, and I am pretty certain that you can also handle the majority of those cases without a rewrite. Even with the rewrite, you shouldn't have to block incoming queries.
It isn't like not having a schema fixes this, as there is a schema after all... my code certainly has assumptions about how the data is stored. It just means I either a) can't change the schema at all, b) can change the schema, but have to do so by doing the table rewrite manually, or c) can manually implement the metadata behavior (but then why not just add this to the database server?). Given that the patterns one uses to solve these problems don't really depend on the data being stored, it simply makes sense to standardize them in the database server.
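Concretely, the cases I mean look something like this (a psycopg2-flavored sketch; the table and columns are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder connection string
    cur = conn.cursor()

    # Metadata-only in Postgres: no table rewrite, effectively instantaneous
    # no matter how many rows exist.
    cur.execute("ALTER TABLE users ADD COLUMN nickname text")
    cur.execute("ALTER TABLE users RENAME COLUMN nickname TO handle")
    cur.execute("ALTER TABLE users DROP COLUMN handle")

    # Changing an existing column's type is the case that may rewrite the table
    # (and, as noted, many of those cases can be avoided too).
    cur.execute("ALTER TABLE users ALTER COLUMN age TYPE bigint")
    conn.commit()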
slava @ rethink here. DB innovation moves slowly. It takes time for people to adjust to new technology, for new technology to adjust to market demands (DBs are fundamentally more difficult to develop than most other software, so it takes more time), and for existing large companies to adopt new technology. As a result, the NoSQL field is still evolving, and there hasn't been a product yet that really nails it. We hope Rethink will be it, but it will take a little bit of time for that to happen.
> I haven't heard RDBMS systems have to justify "look, this company used us at scale and we worked fine" in years.
The main reason you've not heard this recently is that RDBMSs have been around for 30+ years. If we could go back to those times, we'd probably see exactly the same things. The Oracle Wikipedia page actually lists the first customer of their solution [1].
Also, if you take a look at the analytic databases (which are built on top of the relational model, but more recent), you'll notice exactly the same behavior.
IMO this has nothing to do with the maturity or approach of the products, but rather with the adoption cycle [2].
Well, that is obvious: I certainly knew that while writing this, which is why I provided the comparison between schemaless and with-schema newer database systems and the trajectories they seem to have (such as Cassandra becoming more rigid as time moves on), and looked at the kinds of people who are interested in each. Money spent on database systems that involve schemas seems to be declining mostly because free alternatives are improving in quality and becoming easier to support, not because demand for them has slowed.
However, it isn't clear to me that companies making schemaless systems are really making any real money at all, and they seem to be marketing themselves towards people who don't even have budgets for database tools, and probably have no people dedicated to database administration. (Does UbiSoft even pay for CouchBase support? It didn't sound like it ;P.) Can you demonstrate in these other cases that Oracle, for example, was first used by people who thought it was "cool" and insisted on it being free, and worked their way up to "people should pay $150k a year on this"?
FWIW, I really am curious about this, and would be perfectly happy being wrong. It just seems like a weird decision, and it isn't clear that it is based on "the needs of people who have budgets for this sort of thing are changing or will change" (which is what I'd argue was the case for "eventually you will outgrow your filing cabinet and even your spreadsheets and yes, even your mainframe"; the business case for those earlier shifts was clear: "large companies spend $Xm per year on record retention departments, and those could be replaced by computers!") so much as on "people who don't spend money on database systems at all find this more exciting".
Looking into it, it seems I had forgotten you are the new MemBase, so a lot of companies are using you as a best-of-breed cache (supporting automatic and efficient repartitioning as well as on-disk persistence), where the schema you want is just blob->blob anyway. My argument then either a) treats you as "a database with a fixed schema that happens to be exactly the one people want", or b) doesn't apply to you at all, as the market you are in is more directly competing with Oracle Coherence, not Oracle RAC (and apparently you eat Oracle Coherence's lunch... apparently I should be evaluating you guys as my cache ;P).
> Can you demonstrate in these other cases that Oracle, for example, was first used by people who thought it was "cool" and insisted on it being free, and worked their way up to "people should pay $150k a year on this"?
No, I cannot do that :-). What I think you are noticing is a change in the business and adoption models. While I'm not an economist, my hypothesis is that open source was one of the main causes of these changes.
Going back to your question, the closest things that come to mind in the data space that could give us some more info are:
1. Neo4j and InfiniteGraph (graph databases)
2. Cloudera and MapR (Hadoopy)
As far as I know (unfortunately without any real data, though), both Neo4j and Cloudera are doing quite well financially. On the other hand, that's not to say that the model chosen by InfiniteGraph and MapR isn't working.
People rejecting SQL/schemas tend to fall into two camps:
1) People who don't know what they're doing. People in this camp tend to be inexperienced, and reject structured data because "it's easier to just deal with it in code". They tend not to realise that they're still maintaining a schema, just making it harder on themselves because they have to manually code all the constraints.
2) People who are trying to solve a specific problem that is easier when you weaken constraints on the data. You could be dealing with naturally unstructured data, such as the output of a web crawl. People in this camp have actually thought about what they want to build, and are making a conscious choice.
I think that people fall into the first camp more often than not. Using myself as an example, the team I was leading decided that CouchDB would be a good fit for a use-case involving "unclean" data. It turned out that the data could be cleaned manually quite trivially, and the overheads of utilising CouchDB massively outweighed the ever-shrinking benefit.
Note that I'm categorising people who choose to use NoSQL in production situations, not people experimenting or people not using NoSQL at all.
what does this mean? when i design a database i don't think about "tables and rows". i think about logical relationships and statements about "what exists". the tables and rows are just some weird implementation detail that happens to fall out of the fact that i arrange the world in terms of "when A is B then X has a Y".
i guess you are saying "structured data that isn't described by relational logic". but relational logic is such a general thing that it's hard to imagine such things (and still call them structured).
or alternatively, perhaps you're banging up against some practical issue like poor support for recursive relations? [edit:] or that the number of distinct relationships approaches the number of objects?
i'm just trying to get my head round what kind of problems are involved. because "tables and rows" is such a reductive (low level?) way of describing databases that it doesn't give much insight into what the "real problem" is - something like saying you can't use a programming language because your algorithms don't fit well with call stacks (i'm not saying it's impossible - perhaps you need cps for example...).
By my understanding of what scott_w said, your example is exactly what scott_w meant by his camp #2: a problem involving data that is easier to represent if one "relaxes" the constraints. In this case, the relaxation is the removal of the tables/rows idiom.
This is as wide a discussion as static vs. dynamic typing in programming languages. Both approaches come with their pros and cons, and their impact varies from project to project.
It's also interesting to mention that there are multiple related sub-topics:
1. schema vs no-schema
2. type checking
3. referential integrity
4. data integrity
Now in reverse order:
4. every database should guarantee data integrity (i.e. you get back what you put in).
3. referential integrity is an interesting and challenging problem in the space of distributed systems, one that has an impact on the trade-offs the system has to make
2. type checking: while I'm not aware of any non-relational database supporting this, it is a feature that could be added at some point, basically by tagging specific fields/attributes with typing information. For the time being, I think people who use non-relational databases and require type checking are implementing it in the database access layer. But there's no good reason why it couldn't or shouldn't be supported by the database itself.
1. schema vs no-schema: IMO these represent the two ends of a spectrum, where what I think would actually make sense is support for optional schema definitions; in a way this could be similar to the way Protocol Buffers/Thrift/Avro define their protocols.
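A rough Python sketch of what I mean by optional schemas (the API is invented; nothing like this exists today as far as I know):

    # Optional, per-field type declarations: declared fields are checked on
    # write, everything else stays schemaless. Purely illustrative.
    DECLARED_TYPES = {"email": str, "age": int}  # a partial schema

    def check(doc):
        for field, expected in DECLARED_TYPES.items():
            if field in doc and not isinstance(doc[field], expected):
                raise TypeError(f"{field} must be {expected.__name__}")
        return doc  # undeclared fields pass through untouched

    check({"email": "a@b.com", "age": 42, "anything": ["goes", "here"]})  # ok
    check({"age": "forty-two"})  # rejected at write time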
Tagging fields in JSON as different types is actually something we've talked about (as you may well know, given that you're my coworker). But there's another aspect of it that I think is worth mentioning. Right now RethinkDB has a very limited number of types that it can return; it's limited to JSON types, so basically just objects, arrays, strings, and numbers. There are several other types that could be encoded as these, but you're going to wind up needing to do conversions when you insert and extract those values from the database. For example, you might wind up encoding an RGB value as an array of 3 numbers, and your code will wind up looking like this:
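(A rough Python sketch; the helpers and the insert/get calls are just stand-ins for whatever your driver provides.)

    # Manually packing/unpacking an RGB color as a plain JSON array,
    # because the database only speaks JSON types.
    class Color:
        def __init__(self, r, g, b):
            self.r, self.g, self.b = r, g, b

    def color_to_json(c):
        return [c.r, c.g, c.b]  # encode before every insert

    def color_from_json(arr):
        return Color(arr[0], arr[1], arr[2])  # decode after every read

    # db.insert({"name": "red", "value": color_to_json(Color(255, 0, 0))})
    # doc = db.get("red")
    # color = color_from_json(doc["value"])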
This is obviously cumbersome. This is the type of problem that an ORM seeks to solve. However, ORMs have the big flaw of being limited to a single language and needing to be kept in sync with the database. If you did this in the database layer, by having the driver convert the color object to a tagged array, then it would know to return it as a color object rather than an array. Other languages could do it too, using the same database, since they would see the tag and could convert it to a color type in their own language.
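A rough sketch of that tagged-array idea (the tag format here is hypothetical, not anything RethinkDB actually does today):

    from collections import namedtuple

    Color = namedtuple("Color", "r g b")  # stand-in for a richer color type

    # The driver stores a tag alongside the raw values, so a driver in any
    # language can reconstruct the right type on the way out.
    def driver_encode(value):
        if isinstance(value, Color):
            return {"$type": "color", "$value": [value.r, value.g, value.b]}
        return value

    def driver_decode(value):
        if isinstance(value, dict) and value.get("$type") == "color":
            return Color(*value["$value"])
        return value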
The OP mentioned check constraints: what I would be looking for as a bare minimum, and which I'd imagine is low-hanging fruit, is being able to put a function on the server that verifies the data before letting me write it (as I often have numerous clients in numerous languages, and thereby want my "last stand" on data integrity to be a single piece of code written in one language, as close to the data as possible). It then doesn't matter that the data being stored only has four data types: I can have my verifier at least check that I am storing an array of four numbers that are all integers between 0 and 255, even if I can't express "a color".
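I.e., something as dumb as this, living on the server rather than duplicated in every client (a sketch; no particular database's API implied):

    # A single "last stand" validator the server would run before any write,
    # no matter which client or language sent the document. Purely illustrative.
    def verify(doc):
        value = doc.get("value")
        if not (isinstance(value, list) and len(value) == 4):
            raise ValueError("value must be an array of four numbers")
        if not all(isinstance(n, int) and 0 <= n <= 255 for n in value):
            raise ValueError("each number must be an integer between 0 and 255")
        return doc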
Schemaless databases move the integrity checks up into the middleware, which I dare say is a more flexible place to have them. Most ORM libraries have integrity checks, so the db checks end up being redundant and more difficult to change. I too have been saved on occasion by such 'redundant' checks, but overall I find it's easier to manage the structure of data when it's kept in the middleware. Keep the storage layer as simple as possible, with a focus on consistency, availability, and partition tolerance (CAP).
While ORM-level validations are nice to have, they are in no way substitutes for database-level constraints. Relying only on ORM-level validations is like saying "hey, I'm doing error checking in my JavaScript code, therefore checking for errors on the server side is redundant and unnecessary". It's not: DB-level "redundant checking" is worth it. All it takes is one guy "just quickly trying something" by manipulating the db directly, without the ORM layer, to leave your db in an inconsistent state. ORMs have their own conventions for how they handle validations, which may not be completely obvious to a new developer on your team who is not familiar with them, whereas everyone understands how db-level constraints operate. I'd rather write a few lines of extra code at the db level to make sure the data is safe than risk having the db end up in an inconsistent state; you are not only doing this for the customer but also for yourself. You don't want to be the developer who has to go in and figure out why certain queries are not returning the correct results after your db is in an inconsistent state.
Schemas have saved my ass a million times too, but schema migrations are a pain.
I'd love a database that's loose with types by default, but allows you to add declarative specifications for types. Documents are checked at insert/update time, but otherwise the specifications play no role in how the data is stored.
That way you get schemas, but only when you need them. And migrations don't require heavy collection-wide locking; instead, you just change the constraints you enforce and lazily update old documents.
Postgres does this to an extent with CHECK constraints, but you still have to do an ALTER TABLE if you change the underlying type of a column.
Yes, you still have to do an ALTER TABLE to change the type of a column within Postgres, because, well, that's just how it's done. This does not necessarily mean that the table will be rewritten, which I assume is what you're concerned about. See this item from the Postgres 9.2 release notes:
> Reduce need to rebuild tables and indexes for certain ALTER TABLE ... ALTER COLUMN TYPE operations (Noah Misch)
> Increasing the length limit for a varchar or varbit column, or removing the limit altogether, no longer requires a table rewrite. Similarly, increasing the allowable precision of a numeric column, or changing a column from constrained numeric to unconstrained numeric, no longer requires a table rewrite. Table rewrites are also avoided in similar cases involving the interval, timestamp, and timestamptz types.
Migrations are the main reason I like schemas! Trivial migrations (like adding columns) are trivially managed by a good library (SQLAlchemy + Alembic or Django + South) and more complex migrations are possible.
Without an explicit schema, it's very hard and sometimes impossible to change the shape (schema) of your data.
> ...you still have to do an ALTER TABLE if you change the underlying type of a column.
I don't see why this is a problem. How often do you change the data type of a really big column (hundreds of millions of rows) in practice? Most of the time you'd change the type early in development, when you don't have that much data in the table yet, which makes it pretty much instantaneous. If you really need to update the data type for a supersized table you could always make a new column instead and migrate lazily to that in the background. But all of this happens so rarely that, in my view, it's a silly excuse to dump all the great benefits you get from having an enforced database schema.
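I.e., something like this rough sketch (table and column names made up; shown via psycopg2):

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder connection string
    cur = conn.cursor()

    # Add the new column cheaply (metadata-only), then backfill in small
    # batches in the background instead of rewriting the table in one go.
    cur.execute("ALTER TABLE events ADD COLUMN amount_cents bigint")
    conn.commit()

    while True:
        cur.execute("""
            UPDATE events
               SET amount_cents = round(amount * 100)
             WHERE id IN (SELECT id FROM events
                           WHERE amount_cents IS NULL LIMIT 10000)
        """)
        conn.commit()
        if cur.rowcount == 0:
            break  # everything migrated; drop the old column whenever convenient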
> If you really need to update the data type for a supersized table you could always make a new column instead and migrate lazily to that in the background.
I do this "irritatingly often". That said, the fact that while I'm doing it I have the advantages of schemas to make certain I do it correctly is amazing ;P. Even in a world where I had to do this every day, I still wouldn't feel that "ok, let's not have a schema at all" would be a better fix to that problem.
(To be clear: I am not "disagreeing" with you: I am agreeing with a point that is even stronger than the one you made, as I think you ceded too much ground ;P.)
I'm with you. I guess it depends on what you work on. When I'm counting thousandths of a penny, I use something that provides integrity and consistency. I've worked on things that don't require precision or timely consistency, and sometimes it's fun to just jam things into a non-rel system; I like couch.
If I were to work on a general-purpose technology product, it'd be a dbms that provides these, first class: referential integrity, arbitrary precision arithmetic, interval datatypes, dependency graphs/recursive DS, and out-of-the-box reporting (formatting, grouping, serialization, etc.)
I'm looking for a middle ground: SQL/structured data extensible with unstructured data. The ability to extend SQL tables with unspecified JSON-like data. The love child of MySQL/Postgres and MongoDB.
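A sketch of what I have in mind, in Postgres terms (its json/jsonb column types get close; everything here is illustrative, shown via psycopg2):

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder connection string
    cur = conn.cursor()

    # Rigid, constrained columns for the data you understand...
    # ...plus one free-form document column for the data you don't (yet).
    cur.execute("""
        CREATE TABLE products (
            id    serial PRIMARY KEY,
            price numeric NOT NULL CHECK (price >= 0),
            extra jsonb  -- arbitrary, unspecified attributes
        )
    """)
    cur.execute(
        "INSERT INTO products (price, extra) VALUES (%s, %s)",
        (9.99, '{"color": "red", "weird_vendor_field": [1, 2, 3]}'),
    )
    conn.commit()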