How to get the most out of your PureData System for Analytics using Hadoop as a cost-efficient extension

By Ralf Goetz

Today’s requirements for collecting huge amounts of data differ from those of several years ago, when only relational databases satisfied the need for a system of record.

Now, new data formats need to be acquired, stored and processed in a convenient and flexible way. Customers need to integrate different systems and platforms to unify data access and acquisition without losing control and security.

The logical data warehouse

More and more, relational databases and Hadoop platforms together form the core of a Logical Data Warehouse, in which each system handles the workload it is best suited for. We call this using “fit for purpose” stores.

An analytical data warehouse appliance such as PureData System for Analytics is often at the core of this Logical Data Warehouse, and it is efficient in many ways. It can host and process several terabytes of valuable, high-quality data, enabling lightning-fast analytics at scale. It has also been possible (with some effort) to move bulk data between Hadoop and relational databases using Sqoop – an open source component of the Hadoop ecosystem. But there was no way to query both systems using SQL – a huge disadvantage.

Two options for combining relational database and Hadoop

Why move bulk data between different systems or run cross-system analytical queries? Well, there are several use cases for this scenario, but I will highlight only two of them, based on a typical business scenario in analytics.

The task: an analyst needs to find out how the stock level of the company’s products will develop throughout the year. This stock level is updated very frequently and produces lots of data in the current data warehouse system, implemented on PureData System for Analytics. Therefore, the data cannot be kept in the system for more than a year (hot data). A report on this hot data indicates that the stock level is much too high and needs to be adjusted to keep stock costs low. This would normally trigger immediate sales activities (e.g. a marketing and/or sales campaign with lower prices).

“We need a report, which could analyze all stock levels for all products for the last 10+ years!”

Yet a historical report that analyzed all stock levels for all products over the last 10+ years would have indicated that a high stock level at this time of year is a good thing, because a high season is approaching. The company would therefore be able to sell most of its products and satisfy the market trend. But how can the company provide such a report with so much data?


The company would have two options to satisfy its needs:

  1. Replace the existing analytical data warehouse appliance with a newer and bigger one (This would cost some dollars and has been covered in another blog post.), or
  2. Use an existing Hadoop cluster as a cheap storage and processing extension for the data warehouse appliance (Note that a new, yet to be implemented Hadoop cluster would probably cost more than a bigger PureData box as measured by Total Cost of Ownership).

Option 2 would require a mature, flexible integration interface between Hadoop and PureData. Sqoop alone would not be able to handle this, because the scenario requires more than just bulk data movement between Hadoop and PureData.

IBM Fluid Query for seamless cross-platform data access using standard SQL

These requirements are only two of the reasons why IBM introduced IBM Fluid Query in March 2015 as a no-charge extension for PureData System for Analytics. Fluid Query enables bulk data movement from Hadoop to PureData and vice versa AND operational SQL query federation. With Fluid Query, data residing in Hadoop distributions from Cloudera, Hortonworks and IBM BigInsights for Apache Hadoop can be combined with data residing in PureData using standard SQL syntax.

“Move and query all data, find the value in the data and integrate only if needed.”

This enables users to seamlessly query older, cooler data together with hot data, without the complexity of up-front data integration, in a more exploratory approach: move and query all data, find the value in the data and integrate only if needed.
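As a sketch of what such a federated query could look like in the stock-level scenario: hot data stays in PureData while the 10+ years of history sit in Hadoop behind a Fluid Query connector. Note that the function name fq_hadoop_read and all table and column names below are hypothetical placeholders; the actual table function is created when a Fluid Query connector is configured.

```sql
-- Illustrative sketch only: fq_hadoop_read, stock_levels and
-- stock_history are placeholder names, not a fixed Fluid Query API.
SELECT hot.product_id,
       hot.stock_level        AS current_stock,
       hist.avg_stock_10y     AS historical_avg
FROM   stock_levels hot                          -- hot data in PureData
JOIN   TABLE (fq_hadoop_read(                    -- cold data via Hadoop connector
         'SELECT product_id, AVG(stock_level) AS avg_stock_10y
          FROM stock_history
          GROUP BY product_id')) hist
  ON   hot.product_id = hist.product_id;
```

A single statement of this shape is what replaces the old two-step dance of bulk-exporting history with Sqoop and loading it back into the warehouse before analysis.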


IBM Fluid Query can be downloaded and installed as a free add-on for PureData System for Analytics.

Try it out today. IBM Fluid Query is available now for PureData System for Analytics: clients can download and install the software and get started right away with these new capabilities. Download it here on Fix Central. Doug Dailey’s “Getting Started with Fluid Query” blog is highly recommended reading for more information and links to the documentation. Update: Learn about Fluid Query 1.5, announced in July 2015.

IBM Fluid Query Minimum System Requirements

About Ralf,
Ralf is an Expert Level Certified IT Specialist in the IBM Software Group. He joined IBM through the Netezza acquisition in early 2011. For several years, he led the Informatica tech-sales team in the DACH region and the Mahindra Satyam BI competency team in Germany. He then became a technical pre-sales representative for Netezza and later for PureData System for Analytics. Ralf still focuses on PureData System for Analytics but also supports the technical sales of all IBM Big Data products. He holds a Master’s degree in computer science.

Do you want to learn more about Big Data and modern data warehousing?

IBM DB2 Analytics Accelerator: OLTP and OLAP in the same system at last! (Part two)

In my previous post, I introduced IBM DB2 Analytics Accelerator (Accelerator) and explained how it is capable of serving online analytical processing (OLAP) and online transaction processing (OLTP) queries at the same time. Now it is time to go into more detail and explain how it all works.

The DB2 optimizer is the key

The design concept is quite simple. The DB2 for z/OS optimizer is aware of DB2 Analytics Accelerator’s existence in a given environment and can execute a given query either on the DB2 Analytics Accelerator or by using the already well-known access paths within DB2 for z/OS. The DB2 optimizer decides which queries to direct to the Accelerator for hardware-accelerated parallel query processing, thus the Accelerator is essentially transparent to the applications and reporting tools querying DB2 for z/OS.

Deep DB2 Integration with z Systems

So the DB2 optimizer determines whether a query is best suited to run using symmetric multiprocessing (SMP) within DB2 for z/OS, or on the hardware-accelerated massively parallel processing (MPP) architecture delivered by PureData System for Analytics, powered by Netezza technology. It chooses the best option transparently; there are essentially no changes to the way you query the DB2 system.
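Acceleration behavior can be influenced per session through the CURRENT QUERY ACCELERATION special register in DB2 for z/OS. A minimal sketch (the register is real, but the available values and defaults depend on your DB2 version and subsystem configuration; the table below is hypothetical):

```sql
-- Let DB2 route eligible queries to the Accelerator,
-- and run everything else through normal DB2 access paths.
SET CURRENT QUERY ACCELERATION = ENABLE;

-- An analytical query like this one may now be executed
-- transparently on the Accelerator; the application is unchanged.
SELECT region, SUM(amount) AS total_sales
FROM   sales
GROUP BY region;
```

The application issues exactly the same SQL either way, which is what makes the Accelerator transparent to existing reporting tools.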

Query Execution Process Flow

After defining tables to be accelerated and then loading data into the Accelerator, you essentially have an environment that provides both an SMP and MPP “personality” to the same table. This allows you to leverage the legendary DB2 for z/OS qualities of service and performance for transactional queries, as well as Netezza performance for complex queries.

Complex queries are automatically routed by DB2 for processing by the Asymmetric Massively Parallel Processing (AMPP) engine, an architecture that uses special-purpose hardware accelerators (multi-engine field-programmable gate arrays, or FPGAs). These accelerators decompress and filter data for relevance to the query before it is loaded into memory and handed to the processor for any necessary aggregation and final processing. This entire process is transparent to the application and the user; users only notice that queries, especially analytical ones, are resolved faster. The system as a whole also benefits, because the analytical load is moved out of DB2, leaving more capacity available to serve additional transactional queries. And keep this in mind: once a query is routed to the Accelerator, its execution does not consume MIPS.

High Performance Storage Saver

Once you have this system installed, you may use it for more than accelerating analytical queries. There are some interesting use cases that will help you obtain more benefits from your hybrid system, such as the High Performance Storage Saver.

High Performance Storage Saver: Store your historical data outside DB2 for z/OS, but maintain high speed analytical access to it.

In this use case, you can save on Total Cost of Ownership (TCO) by using the Accelerator to store historical data, reducing storage costs on z Systems. This way, you are not consuming storage in your IBM z Systems. Instead, you are using the lower-cost storage in the DB2 Analytics Accelerator, while still being able to access that historical data and perform analysis on it. And this is the most interesting point: the historical data that you would typically take out of your mainframe due to the cost of storage can now be explored online. You no longer need to recover the tapes where the historical data is stored in order to access it.
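Archiving is driven from the DB2 side through the Accelerator’s administrative stored procedures. As an illustrative sketch only: SYSPROC.ACCEL_ARCHIVE_TABLES is the stored procedure associated with the High Performance Storage Saver, but its exact parameter list varies by product version, and the accelerator name and table specification below are placeholders, not working values.

```sql
-- Hypothetical sketch: move older partitions of a history table
-- to accelerator-only storage. 'MYACCEL' and the XML table
-- specification are placeholders; consult the product reference
-- for the exact parameters of your Accelerator version.
CALL SYSPROC.ACCEL_ARCHIVE_TABLES(
  'MYACCEL',                                        -- accelerator name
  '<tableSpecifications>...</tableSpecifications>', -- tables/partitions to archive
  ?);                                               -- output message parameter
```

After archiving, the affected partitions no longer occupy z Systems storage, yet they remain queryable through the same transparent routing described above.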

Simple is still better

Meanwhile, I encourage you to learn more about DB2 Analytics Accelerator, which leverages the “simple is still better” design tenets of PureData System for Analytics. And of course, I am more than happy to read your opinions and see you share your experiences in the comments area below, or to join you in a conversation on Twitter.

See additional posts

IBM DB2 Analytics Accelerator: OLTP and OLAP in the same system at last! (Part one)

About Isaac,

Isaac works as a data warehouse and Big Data technical pre-sales professional for IBM, covering customers in Spain and Portugal, with a special focus on PureData System for Analytics. He joined IBM in 2011 through the Netezza acquisition. Before that, he held several positions in pre-sales and professional services at companies such as Oracle, Sun Microsystems and Netezza, as well as other Spanish companies. In the years before joining IBM, he acquired diverse experience with different software tools (databases, identity management products, geographical information systems, manufacturing systems…) across a very diverse set of projects. He also holds a Master of Science degree in Computer Science.