SQL Server vs. NoSQL vs. Hadoop – is this the end?

Like it, love it or loathe it – there has been a huge amount of experimentation in database technology over the past few years. We have seen the rise of NoSQL, the decline of traditional RDBMS, an emerging dominance (and perhaps imminent death) of Hadoop, and entirely new in-memory databases. Not all of the new technology is necessarily good, but some of it is great and some of it is undoubtedly the future… so if you are in the Microsoft world, let’s discuss the elephant in the room – the threats to SQL Server and what Microsoft is doing to plug these gaps.


No Threats Here
Let’s start with the easy stuff – the stuff that I don’t believe is a threat to SQL Server, specifically: NoSQL and Hadoop.

“No-SQL”, “Not-only SQL”, “New SQL”, “Not-yet SQL”. Whatever you want to call this set of databases, they have become reasonably popular for small-scale use and for highly specialised large-scale deployments. Among the most widely known and popular NoSQL databases are MongoDB, Cassandra, Google Bigtable and Neo4j (I hesitate to lump Neo4j and other graph databases in here; perhaps they deserve their own category).

NoSQL databases are flexible, easy to use and incredibly easy to integrate into application stacks. Perhaps most importantly, there is a rich variety of APIs to make development in any language a breeze, whether you love Python, CoffeeScript, JavaScript or Erlang. These databases have become increasingly popular with developers, who don’t have to define a schema or detail their data requirements up front, and they are entirely suitable as document or key-value stores. However, NoSQL databases are not suitable as large-scale, operational data stores. Let’s talk about why…
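To make the schema-less point concrete, here is a minimal sketch using pymongo against a local MongoDB instance; the database and collection names are invented purely for illustration.

```python
# Minimal sketch: inserting schemaless documents with pymongo.
# Assumes a local MongoDB instance; the "demo"/"events" names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

# No schema to declare up front -- two documents with different shapes
# can live side by side in the same collection.
events.insert_one({"user": "alice", "action": "login", "ts": "2014-01-01T10:00:00Z"})
events.insert_one({"user": "bob", "action": "purchase", "items": ["sku-1", "sku-2"], "total": 42.50})

# Querying is equally ad hoc.
for doc in events.find({"user": "alice"}):
    print(doc)
```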

#1 Eventual Consistency. NoSQL databases have been designed from the outset to be highly distributed systems. Part of the problem with traditional RDBMS technology in a distributed environment is the strong requirement for ACID transactions to protect data integrity. At scale, the ACID principle is a huge performance liability. By settling for eventual consistency, NoSQL databases are able to ingest data at a much higher velocity over a distributed network. But you have to be willing to lose data, and you have to be able to manage multiple updates that may arrive out of order. There is an inherent uncertainty with these systems that you have to be able to live with. If your transactional integrity is vital (e.g. online trading, point-of-sale) then eventual consistency is a killer.
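As a rough illustration of that trade-off, MongoDB lets you dial durability up or down per write via write concerns. The sketch below assumes pymongo and a replica set behind localhost; the collection and field names are made up.

```python
# Sketch of the durability/latency trade-off using MongoDB write concerns.
# Assumes a replica set is available; names are illustrative only.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Fast path: acknowledge as soon as the primary accepts the write.
# If the primary fails before replication, the write can be lost.
fast = db.get_collection("orders", write_concern=WriteConcern(w=1))
fast.insert_one({"order_id": 1, "amount": 99.0})

# Safer path: wait until a majority of replicas have the write.
# Higher latency, but much closer to the guarantees an ACID store gives you.
safe = db.get_collection("orders", write_concern=WriteConcern(w="majority"))
safe.insert_one({"order_id": 2, "amount": 250.0})
```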

#2 Aggregate Analysis. This is my favourite one. At some point in your data stack, you are going to want to perform traditional BI involving aggregate analysis, summary statistics and general trend reporting. This is something that SQL was designed to do, and there is no way that NoSQL systems can challenge modern database systems on sheer performance in this area (and by modern database systems, I do not mean 1990s data warehouses, but more on that later). There is one simple reason why NoSQL systems cannot compete with relational systems here: the SQL compiler. Put simply, if you can write JavaScript for MongoDB that outperforms the SQL compiler, then you should be working for Oracle or Microsoft and getting paid a whole lot more. More importantly, your current employer should probably take a big-picture view and fire you, because when you leave no one else will be smart enough to maintain your code. OK, so that is a little tongue-in-cheek, but I think you see my point. High-level declarative languages like SQL are here to stay simply because they outperform custom code and are much cheaper to develop.
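To see why the declarative route is cheaper to own, compare a plain GROUP BY handed to SQL Server with the hand-rolled equivalent you would otherwise have to maintain yourself. This is only a sketch: the connection string, table and field names are hypothetical.

```python
# Sketch contrasting a declarative aggregate with hand-rolled code.
# The connection string and table/field names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=Sales;Trusted_Connection=yes"
)
cursor = conn.cursor()

# The SQL compiler decides index usage, join order and parallelism for you.
cursor.execute("""
    SELECT region, SUM(amount) AS total, COUNT(*) AS orders
    FROM dbo.Orders
    GROUP BY region
""")
for region, total, orders in cursor.fetchall():
    print(region, total, orders)

# The hand-rolled equivalent over a document store: you own the execution
# plan, the memory management and the maintenance burden.
def aggregate_by_region(docs):
    totals = {}
    for d in docs:
        region = d["region"]
        totals.setdefault(region, [0.0, 0])
        totals[region][0] += d["amount"]
        totals[region][1] += 1
    return totals
```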

The current trends in NoSQL systems are actually converging towards SQL. MongoDB now sports a SQL-like declarative query interface, and NewSQL systems like VoltDB recognise the importance of the ACID principle. NoSQL databases have their place, but that place is somewhere between the application layer and the persistent storage layer. I would suggest they are a staging zone for dynamic data.

So what about Hadoop?
Hadoop is currently the giant in the Big Data industry. There is a huge amount of discussion, development and adoption of Hadoop but it is still early days and there are some large hurdles to overcome before Hadoop establishes a secure foothold in industry.

I want to be clear about what I believe Hadoop is, which is simply an open-source implementation of Google’s MapReduce. The Hadoop stack has grown to include SQL-like interfaces (Hive and Pig) and the Hadoop Distributed File System (HDFS). But these are add-ons that have been developed to make Hadoop usable.

Hadoop is a very exciting technology that allows massively distributed processing across huge networks. Hadoop’s ability to farm out large queries and operations at enormous scale is what makes it so attractive in the world of Big Data. However, there is a trade-off, and once again it is performance. A unit of work in Hadoop is significantly more expensive than the same unit of work in a column-store, memory-optimised database. Currently, the bleeding edge of database technology is significantly faster than Hadoop when it comes to raw processing power. This is particularly important for complex analytics requiring optimised vector and matrix operations for prediction and pattern recognition. However, if your problem can truly capitalise on Hadoop’s massively distributed framework, then Hadoop is definitely your solution.
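For anyone who has not seen the MapReduce shape before, here is a rough sketch of a word-count job in the style of Hadoop Streaming. In a real deployment the mapper and reducer would be two separate scripts shipped to the cluster; the jar name and paths in the comments are illustrative only.

```python
# Sketch of a Hadoop Streaming job in the classic MapReduce shape.
# In practice mapper.py and reducer.py are separate scripts submitted with
# something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /logs/2014/ -output /out/word-counts
# (jar name and paths are illustrative).
import sys

def mapper():
    # Emit (word, 1) for every word on stdin; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives grouped by key; sum the counts per word.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # In a real job these would be two scripts; here a flag picks the role.
    mapper() if sys.argv[1:] == ["map"] else reducer()
```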

Keep in mind that the latest developments in Hadoop are actually circumventing the MapReduce framework to improve usability and performance. It is important to realise that products like Cloudera’s Impala were born out of a need to improve the performance of Hadoop and lift it to a practical level for BI and interactive reporting. At this stage, I am not sure whether products like Impala will enhance Hadoop, or whether they are an early sign of the decline of MapReduce (in the enterprise) and HDFS.

Summary
Loss of data is the quickest way to destroy any business, but traditional RDBMS technology is struggling to keep up with the latency demands of high-velocity transactional systems. Unfortunately, the eventually consistent nature of distributed NoSQL systems does not provide the level of data integrity that these systems require. NoSQL systems have their place, for example as dynamic data stores for web apps, but we are not going to see them dominate larger enterprises.

Performance is the other key feature for Big Data and Big Business. To date, we have not seen Hadoop match the performance of modern databases, and the increasing demand for real-time analytics and interactive reporting all but rules out the distributed filesystem and MapReduce framework of Hadoop. However, there are exciting companies, like Cloudera and Hortonworks, building advanced analytics platforms on top of Hadoop, and they will be worth watching in this space.

There is another area which we haven’t touched on in this blog… and that is the growing diversity of data. Data integration is an increasing concern for companies who are employing data scientists to look beyond their sales data and find correlations with social data, public data, government data and a myriad of other data sources. The growth of data mining is truly exciting and is driving rapid developments in modern databases. In a future post, I will explore some of these changes (column-store databases, array databases, distributed databases) and try to answer whether there is a future for SQL Server and, if so, what it may look like.
