Daniel Lamire has written an interesting post discussing the terminology we should use to refer to the increasingly popular non-relational databases such as BigTable, MongoDB, CouchDB, etc. He’s mostly discussing semantics, but it got my mind thinking on this issue just one more time.
However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to normalize their data. Normalization fragments databases into smaller tables which is great for data integrity and beneficial for some transactional systems. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.
In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to scale horizontally. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.
Because these databases have become very popular recently, I have tried to learn a great deal about them. I even went to a conference about MongoDB. What I have learned is that these databases are very fast, and more easily scalable than relational databases. I have also learned that not having a strict schema for each table is advantageous in many applications. However, there is still this circle of logic I can’t get out of my mind.
You’re using MySQL with MyISAM. That’s an unreliable and flawed database engine. It has no data integrity or compliance. Ok, so you switch to InnoDB or Postgres. Now you have the data integrity, but everything is slower and more difficult to scale or replicate. The solution, therefore is noSQL/noJoin databases which have none of the data integrity.
Wait, what? Why would you choose the non-relational database over MySQL with MyISAM? They have the same data integrity and transactional issues. They are both fast. They both scale and replicate easily. The only major difference I can see is that MySQL has a strict schema and the noSQL databases have no joins.
I don’t know about other people, but every database-powered application I have ever written includes many joins. There is a great deal you can do without joins, but eventually there is a feature that requires them. Let’s say you have authors and articles in two separate tables. The author gets married, and you want to change their last name. With SQL you just update the row in the author table. Without it, you have to search for every blog post with that author, and change the name. Talk about a pain.
Now, there is an alternative to this methodology. You could have an author record and put the unique ID of the author into each blog post. So now what are you doing? You’re writing a join by hand in your application. You get benefits of using the NoSQL database, but the consequence is that you have to write more code. If using a relational database, you could do a lot of work with a single line of SQL. Instead, you write an entire algorithm in your application to do that same work as an SQL statement. It’s reinventing the wheel. You’re duplicating the work of the MySQL developers.
Now what toll does this take on performance and scalability? Yes, the NoSQL databases scale so well, and are so fast. That’s because the application servers are doing all the hard work. It’s shifting the load, not removing the load. On top of that, how good a coder are you really? I personally do not think that I can write a join algorithm that is better than what the MySQL developers can write. If I have a database without joins, and I have to write a join for my application, then how can it possibly be better than if I had just used a relational database? If it is somehow faster and better, that can only mean there is a serious problem in the join algorithm in the relational database that must be fixed. I doubt it. If indeed it is better, then why don’t they just add join commands to the noSQL databases?
If anything in my logic is wrong, please point it out to me. I’m not trying to flame, I’m trying to learn. It’s just that no matter how hard I try, I can’t comprehend the supposed hueg benefit of the non-SQL database. At best it frees you of the limits of a schema while moving some load from the database server to the application server. If I had an application that would heavily benefit from being schema-free, and there were few joins, it might be worth it. For every real world application I have actually written, it makes little sense.
Then again, every application I have ever written has always had more strain on the application server than the database server. I’m sure if I worked at Facebook or some other incredibly large place, the scaling benefits would emerge. It would be worth the effort of writing that extra code and moving some of the heat over to application servers. The thing is, if that’s my problem, I’m in a good spot. It means I’ve got so much demand that I’ve probably got money to pay someone else to do that work. There’s still little reason to start out in the beginning with anything but SQL. You’re likely making more work for yourself in exchange for scaling benefits that you do not need because you are small.
I’m sure there are some applications out there that have few, or even zero, joins. I’m also well aware of many applications where being free of a schema is such a huge boon, it is worth the price of writing more code. These applications exist, but in my experience they are rare. Almost every application I can conceive of is naturally relational in some way.
If anyone out there can explain what I am missing, I will be very thankful. Until then, I’ll be using MySQL for almost everything. It’s easy. It works. It’s fast. I don’t need the super data integrity, or user unfriendliness, of Postgres. And while it may not scale as easily as others, it scales much larger than any application I have ever had to deal with. If I am ever so fortunate to have to scale large, then I’ll make a move. Until then, I’ll save myself the trouble.