For those who don’t follow the database industry closely, here is the problem: Relational database servers have clearly dominated the market for the last 15-20 years, but today they are proving inadequate for Big Data. There are sectors of the industry where relational databases have never performed well - web indexing, bio research, etc. MapReduce has emerged to serve those sectors. While the MapReduce pattern is no match for SQL in terms of functionality, the distributed storage architecture on which MapReduce is built has great potential for scalable processing.
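To make the comparison concrete, here is a minimal in-memory sketch of the MapReduce pattern - the classic word-count example, not tied to any particular framework - next to the one-line SQL that expresses the same computation:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key (handled by the framework in practice,
    # with the groups spread across many machines).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's list of values into a single result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big storage", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'storage': 1}
# Equivalent SQL: SELECT word, COUNT(*) FROM words GROUP BY word;
```

The contrast is the point: SQL states *what* to compute, while MapReduce makes the programmer spell out *how* - but that "how" partitions cleanly across thousands of machines.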
Now the question is: Which approach will produce a rich, yet scalable, data processing engine first?
- a) Enabling relational databases to operate over distributed storage, or
- b) Growing richer processing functionality on top of distributed storage?
I bet on the latter approach:
Scalable databases will emerge from distributed storage.
It may seem that I am betting against the odds, because relational database vendors already have both the code and the people for the job. However, those codebases are about 20 years old and thus barely modifiable. The assumption that data is local and exclusively available is baked in everywhere. So I don’t believe those codebases will be of much help. Even if database vendors abandon their existing codebases and try to leverage only their people to build a new data processing technology from scratch, the ecosystems grown around those legacy database engines will keep disrupting the development process and holding back progress. So those developers will be much less efficient than developers who don’t have to carry such legacy baggage.
If the above bet comes true, one consequence will follow:
New companies will emerge to play significant roles in the database industry.