Big SQL in Big Data is a Big Deal

By Dennis Duckworth

I’ve been doing some work in the area of data warehouse modernization (DWM) recently. You may have seen my previous blog post about our new DWM infographic and our view that the data warehouse is becoming a more active component in a company’s analytics process.

One of the drivers for DWM is DWA (data warehouse augmentation): adding components around the data warehouse to deliver new capabilities, such as exploration of unstructured data. Similarly, there is a lot of talk these days about data lakes, data reservoirs, data refineries, etc. One of the questions that comes up when discussing putting data in any new place outside of the data warehouse is, “How do I access the data and analyze it there?”

Business analysts are used to doing analytics on data in the data warehouse; they have been using SQL for a long time, and it is comfortable for them. But they are wary (and maybe even a little weary) of all the talk about NoSQL and Hadoop. They might see incredible value in including unstructured and semi-structured data in their analyses, but they aren’t quite sure how they would do that. They probably aren’t going to learn Java just so they can write MapReduce jobs on their company’s new Hadoop clusters, and by the time the IT guys get around to running ETL on that data and pulling it into the data warehouse, it has lost some of its relevance and, therefore, some of its value.

Nowadays, SQL access seems to be a priority for some of the NoSQL vendors, looking to give those business analysts the ability to use their beloved SQL (or some reasonable facsimile thereof) to run queries against the new NoSQL data stores. So we saw Cloudera come out with Impala, and then Hortonworks do significant work to improve the performance of Hive through its Stinger initiative.
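
To make that concrete, here is a minimal sketch of the kind of query these engines enable, written in Hive-style SQL (Impala’s dialect is similar). The table name, columns, and HDFS path are all hypothetical; the point is simply that an analyst can query raw files in Hadoop with familiar SQL:

    -- Hive-style DDL: expose raw files already sitting in HDFS as a table
    -- (table name, columns, and path are hypothetical).
    CREATE EXTERNAL TABLE weblogs (
      log_time  STRING,
      user_id   STRING,
      url       STRING,
      status    INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw/weblogs';

    -- A familiar analyst query, run directly against the files in Hadoop.
    SELECT url, COUNT(*) AS hits
    FROM weblogs
    WHERE status = 404
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;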

Business users are speaking up, saying they want their familiar SQL access to data regardless of where that data lives, and the vendors are listening; that’s a good thing. But as some of the large database/data warehouse companies started jumping on the SQL-on-Hadoop bandwagon, I noticed something a bit nonsensical, at least from a Hadoop perspective. Those large vendors created “solutions” based on their existing database/data warehouse products. So whereas the Hadoop vendors were building SQL query capabilities directly into their Hadoop offerings, the db/dw folks were building SQL-on-Hadoop into their mainstream RDBMS/SQL engines. That means that to get the “benefit” of SQL access to Hadoop, you have to use their RDBMS product.

One of the key goals of data warehouse modernization (and augmentation) is to *not* put additional load on the RDBMS/data warehouse, especially load that doesn’t belong there. Why should you need to use an Oracle Exadata or a Teradata 6750 if you are trying to run a SQL query against just your Hadoop cluster? Well, I guess Oracle and Teradata would answer “To keep people using our expensive products” – but isn’t cost reduction one of the reasons your company wants to do more in Hadoop in the first place?

IBM created Big SQL, its SQL-on-Hadoop solution, to run completely within IBM InfoSphere BigInsights for Hadoop, our distribution built on Apache-standard Hadoop. You don’t need a separate PureData System for Analytics data warehouse appliance or a separate machine running IBM DB2; everything you need to run SQL on Hadoop comes as part of BigInsights, and it runs entirely in the Hadoop cluster. In that way, IBM is more like Cloudera and Hortonworks than like Oracle and Teradata: we see Hadoop as a first-class citizen in the overall data and analytics framework rather than as an accessory to (and life support for) our RDBMS.
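
For illustration, Big SQL’s CREATE HADOOP TABLE statement defines a table whose data lives in the cluster’s distributed file system, and ordinary SQL then runs against it without any data leaving Hadoop. This is a minimal sketch; the table, columns, and delimiter are hypothetical:

    -- Big SQL DDL: the HADOOP keyword places the table's data in the
    -- cluster's distributed file system (names are hypothetical).
    CREATE HADOOP TABLE sales (
      order_id INT,
      customer VARCHAR(64),
      amount   DECIMAL(10,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Standard SQL, executed entirely on the Hadoop cluster.
    SELECT customer, SUM(amount) AS total_spend
    FROM sales
    GROUP BY customer
    ORDER BY total_spend DESC;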

IBM Big SQL v3.0 is in Technology Preview right now. You can learn more about it here or you can try it out here.

About Dennis Duckworth

Dennis Duckworth, Program Director of Product Marketing for Data Management & Data Warehousing, has been in the data game for quite a while, doing everything from Lisp programming in artificial intelligence to managing a sales territory for a database company. He has a passion for helping companies and people get real value out of cool technology. Dennis came to IBM through its acquisition of Netezza, where he was Director of Competitive and Market Intelligence. He holds a degree in Electrical Engineering from Stanford University but has spent most of his life on the East Coast. When not working, Dennis enjoys sailing on Buzzards Bay, right off his backyard, and he is relentless in his pursuit of wine enlightenment. You can follow Dennis on Twitter.
