How to get the most out of your PureData System for Analytics using Hadoop as a cost-efficient extension

By Ralf Goetz

Today’s requirements for collecting huge amounts of data are different from several years back when only relational databases satisfied the need for a system of record.

Now, new data formats need to be acquired, stored and processed in a convenient and flexible way. Customers need to integrate different systems and platforms to unify data access and acquisition without losing control and security.

The logical data warehouse

More and more relational databases and Hadoop platforms are building the core of a Logical Data Warehouse in which each system handles the workload which it can handle best. We call this using “fit for purpose” stores.

An analytical data warehouse appliance such as PureData System for Analytics is often at the core of this Logical Data Warehouse and it is efficient in many ways. It can host and process several terabytes of valuable, high-quality data enabling lightning fast analytics at scale. And it has been possible (with some effort) to move bulk data between Hadoop and relational databases using Sqoop – an open source component of Hadoop. But there was no way to query both systems using SQL – a huge disadvantage.

Two options for combining relational database and Hadoop

Why move bulk data between different systems or run cross-systems analytical queries? Well, there are several use cases for this scenario but I will only highlight two of them based on a typical business scenario in analytics.

The task: an analyst needs to find out how the stock level of the company’s products will develop throughout the year. This stock level is being updated very frequently and produces lots of data in the current data warehouse system implemented on PureData System for Analytics. Therefore the data cannot be kept in the system for more than a year (hot data). A report on this hot data indicates that the stock level is much too high and needs to be adjusted to keep stock costs low. This would normally trigger immediate sales activities (e.g. a marketing and/or sales campaign with lower prices).

“We need a report, which could analyze all stock levels for all products for the last 10+ years!”

Yet, a historical report, which could analyze all stock levels for all products for the last 10+ years would have indicated that the stock level at this time of the year is a good thing, because a high season is approaching. Therefore, the company would be able to sell most of their products and satisfy the market trend. But how can the company provide such a report with so much data?

 

The company would have 2 use case options to satisfy their needs:

  1. Replace the existing analytical data warehouse appliance with a newer and bigger one (This would cost some dollars and has been covered in another blog post.), or
  2. Use an existing Hadoop cluster as a cheap storage and processing extension for the data warehouse appliance (Note that a new, yet to be implemented Hadoop cluster would probably cost more than a bigger PureData box as measured by Total Cost of Ownership).

Option 2 would require a mature, flexible integration interface between Hadoop and PureData. Sqoop would not be able to handle this, because it requires more capabilities than just bulk data movement capabilities from Hadoop to PureData.

IBM Fluid Query for seamless cross-platform data access using standard SQL

These requirements are only two of the reasons why IBM has introduced IBM Fluid Query in March, 2015 as a no charge extension for PureData System for Analytics. Fluid Query enables bulk data movement from Hadoop to PureData and vice versa AND operational SQL query federation. With Fluid Query, data residing in Hadoop distributions from Cloudera, Hortonworks and IBM BigInsights for Apache Hadoop can be combined with the data residing in PureData using standard SQL syntax.

“Move and query all data, find the value in the data and integrate only if needed.”

This enables users to seamlessly query older, cooler data and hot data without the complexity of data integration with a more exploratory approach: move and query all data, find the value in the data and integrate only if needed.

IFQ_Goetz_graphic 2_566 x 243

IBM Fluid Query can be downloaded and installed as a free add-on for PureData System for Analytics.

Try it out today. IBM Fluid Query is technology that is available for PureData System for Analytics.  Clients can download and install this software and get started right away with these new capabilities.  Download it here on Fix Central. Doug Dailey’s “Getting Started with Fluid Query” blog for more information and documentation links to get started is highly recommended reading.  Update: Learn about Fluid Query 1.5, announced July, 2015.

IBM Fluid Query Minimum System Requirements

About Ralf,
Ralf GoetzRalf is an Expert Level Certified IT Specialist in the IBM Software Group. Ralf joined IBM trough the Netezza acquisition in early 2011. For several years, he led the Informatica tech-sales team in DACH region and the Mahindra Satyam BI competency team in Germany. He then became part of the technical pre-sales representative for Netezza and later for the PureData System for Analytics. Ralf is still focusing on PDA but is also supporting the technical sales of all IBM BigData products. Ralf holds a Master degree in computer science.

Do you want to learn more about Big Data and modern data warehousing?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s