How to get the most out of your PureData System for Analytics using Hadoop as a cost-efficient extension

By Ralf Goetz

Today’s requirements for collecting huge amounts of data are different from several years back when only relational databases satisfied the need for a system of record.

Now, new data formats need to be acquired, stored and processed in a convenient and flexible way. Customers need to integrate different systems and platforms to unify data access and acquisition without losing control and security.

The logical data warehouse

Increasingly, relational databases and Hadoop platforms together form the core of a Logical Data Warehouse, in which each system handles the workload it handles best. We call this using “fit for purpose” stores.

An analytical data warehouse appliance such as PureData System for Analytics is often at the core of this Logical Data Warehouse and it is efficient in many ways. It can host and process several terabytes of valuable, high-quality data enabling lightning fast analytics at scale. And it has been possible (with some effort) to move bulk data between Hadoop and relational databases using Sqoop – an open source component of Hadoop. But there was no way to query both systems using SQL – a huge disadvantage.

Two options for combining relational database and Hadoop

Why move bulk data between different systems or run cross-systems analytical queries? Well, there are several use cases for this scenario but I will only highlight two of them based on a typical business scenario in analytics.

The task: an analyst needs to find out how the stock level of the company’s products will develop throughout the year. This stock level is being updated very frequently and produces lots of data in the current data warehouse system implemented on PureData System for Analytics. Therefore the data cannot be kept in the system for more than a year (hot data). A report on this hot data indicates that the stock level is much too high and needs to be adjusted to keep stock costs low. This would normally trigger immediate sales activities (e.g. a marketing and/or sales campaign with lower prices).

Yet a historical report analyzing all stock levels for all products over the last 10+ years would have indicated that a high stock level at this time of the year is a good thing, because a high season is approaching. The company would therefore be able to sell most of its products and satisfy the market trend. But how can the company provide such a report with so much data?

The company would have two options to satisfy its needs:

  1. Replace the existing analytical data warehouse appliance with a newer and bigger one (This would cost some dollars and has been covered in another blog post.), or
  2. Use an existing Hadoop cluster as a cheap storage and processing extension for the data warehouse appliance (Note that a new, yet to be implemented Hadoop cluster would probably cost more than a bigger PureData box as measured by Total Cost of Ownership).

Option 2 would require a mature, flexible integration interface between Hadoop and PureData. Sqoop alone could not handle this, because the scenario requires more than bulk data movement from Hadoop to PureData.

IBM Fluid Query for seamless cross-platform data access using standard SQL

These requirements are only two of the reasons why IBM introduced IBM Fluid Query in March 2015 as a no-charge extension for PureData System for Analytics. Fluid Query enables bulk data movement from Hadoop to PureData and vice versa, as well as operational SQL query federation. With Fluid Query, data residing in Hadoop distributions from Cloudera, Hortonworks and IBM BigInsights for Apache Hadoop can be combined with the data residing in PureData using standard SQL syntax.

This enables users to seamlessly query older, cooler data together with hot data, without the complexity of data integration, and supports a more exploratory approach: move and query all data, find the value in the data and integrate only if needed.
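To make the idea of querying hot and cooler data together more concrete, here is a minimal sketch built around the stock-level scenario above. It uses SQLite as a stand-in for both systems and is not Fluid Query syntax; the table names (stock_hot, stock_cold), the view name (stock_all) and all figures are made up for illustration.

```python
import sqlite3

# SQLite stands in for both the warehouse and the Hadoop archive.
# All table names, the view name and the data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock_hot  (product TEXT, year INTEGER, month INTEGER, level INTEGER);
    CREATE TABLE stock_cold (product TEXT, year INTEGER, month INTEGER, level INTEGER);
    -- One logical table over hot and cold data, queried with plain SQL.
    CREATE VIEW stock_all AS
        SELECT * FROM stock_hot
        UNION ALL
        SELECT * FROM stock_cold;
""")

# Ten archived years with a high level before the season and a low one after.
cold_rows = [("widget", year, month, level)
             for year in range(2005, 2015)
             for month, level in ((9, 900), (12, 200))]
# The current year only exists in the hot table.
hot_rows = [("widget", 2015, 9, 950)]
conn.executemany("INSERT INTO stock_cold VALUES (?, ?, ?, ?)", cold_rows)
conn.executemany("INSERT INTO stock_hot VALUES (?, ?, ?, ?)", hot_rows)

# Average stock level per calendar month across all years, hot and cold.
seasonal = dict(conn.execute(
    "SELECT month, AVG(level) FROM stock_all WHERE product = ? GROUP BY month",
    ("widget",),
).fetchall())
print(seasonal)
```

In this toy data set the current September level looks high on its own but is close to the long-term September average, which is the kind of insight the historical report in the scenario is meant to provide.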

IBM Fluid Query can be downloaded and installed as a free add-on for PureData System for Analytics.

Try it out today. IBM Fluid Query is available for PureData System for Analytics; clients can download and install it from Fix Central and get started right away with these new capabilities. Doug Dailey’s “Getting Started with Fluid Query” blog, which includes documentation links, is highly recommended reading. Update: Learn about Fluid Query 1.5, announced in July 2015.

IBM Fluid Query Minimum System Requirements

About Ralf,
Ralf is an Expert Level Certified IT Specialist in the IBM Software Group. Ralf joined IBM through the Netezza acquisition in early 2011. For several years, he led the Informatica tech-sales team in the DACH region and the Mahindra Satyam BI competency team in Germany. He then became a technical pre-sales representative for Netezza and later for the PureData System for Analytics. Ralf is still focusing on PDA but is also supporting the technical sales of all IBM Big Data products. Ralf holds a Master’s degree in computer science.

Getting Started with IBM Fluid Query 1.0 for IBM PureData System for Analytics

By Doug Dailey

As Big Data concepts continue to mature and evolve, so does the technology that encourages their adoption. Enterprises are looking at ways to better leverage their data by reducing costs and positioning data for success based on its relevance. The payoff of this exercise is optimum insights for the business at the right time.

Many are finding that Hadoop is not the answer for all of their data needs. They want to have access to various systems, rather than choosing a “one size fits all” mentality. Enterprise Data Warehouse (EDW), relational, content stores, real-time in-memory processing and more all have their place. We have seen an increasing number of software tools, specialized hardware products and services that work to bridge the gap between approaches to store or analyze the data.

Fluid Query Strengths – Query access and Data Movement with Hadoop

IBM introduced Fluid Query 1.0 for use on PureData System for Analytics in March. The capability allows PureData users to turn their EDW on its end and have it work as a client. Traditionally, EDW environments served as landing zones for high-value data to explore, analyze and gain speed-of-thought insights from complex in-database algorithms. Now, IBM Fluid Query allows PureData users to access data residing on Hadoop distributions as if the appliance were a client. This does not move and store data locally; instead, it pushes SQL down to Hadoop, offloading processing via MapReduce jobs. You can query directly from Hadoop and move data natively between PureData and Hadoop in parallel.
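The difference between pulling data to the client and pushing the query down to where the data lives can be sketched in a few lines. This is a toy model only, not how Fluid Query is implemented: the RemoteStore class, its methods and the sample rows are all hypothetical, and the "remote" side is just an in-process list that counts how many rows it hands over.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class RemoteStore:
    """Hypothetical stand-in for a remote store; counts rows sent to the client."""
    rows: List[Row]
    rows_sent: int = 0

    def fetch_all(self) -> List[Row]:
        # Pull model: every row crosses the wire, the client filters afterwards.
        self.rows_sent += len(self.rows)
        return list(self.rows)

    def run_filter(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # Push-down model: the filter runs remotely, only matches are returned.
        result = [row for row in self.rows if predicate(row)]
        self.rows_sent += len(result)
        return result


def is_2010(row: Row) -> bool:
    return row["year"] == 2010


store = RemoteStore([{"year": 2005 + i % 10, "level": i} for i in range(1000)])

pulled = [row for row in store.fetch_all() if is_2010(row)]
rows_sent_pull = store.rows_sent

store.rows_sent = 0
pushed = store.run_filter(is_2010)
rows_sent_push = store.rows_sent

print(rows_sent_pull, rows_sent_push)
```

Both approaches return the same rows; the push-down variant simply transfers far fewer of them, which is the motivation for running the work where the data resides.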

Are you interested in doing any of the following?

● Query Hadoop data from your PureData System for Analytics
● Bi-directional data transfer between PureData and Hadoop (BigInsights, Hortonworks or Cloudera)
● Move data between PureData and Hadoop in parallel
● Full control over tables and data ranges queried or transferred
● Automatic registration with Hive meta-store

How to Get Started

Customers have been able to download, install, configure and test Fluid Query in less than 30 minutes. This is a perfect lunch hour activity for inquiring minds. Just be sure that your Hadoop and PureData environments have the needed prerequisites in place. This will run on PureData System for Analytics N100x, N2001, N2002, and N3001.

IBM Fluid Query Minimum System Requirements

Tools needed for installation:

(1) Supported Hadoop distribution installed, up & running

supported hadoop providers

(2) Active network connection and user access/authentication between PureData and Hadoop

(3) PureData installed with Netezza Analytics

(4) Data available for use

Downloading and installing Fluid Query:

1. Download the FLUIDQUERY_1.0 tar package from Fix Central
http://www-933.ibm.com/support/fixcentral/

2. The IBM Fluid Query User Guide can be found here for more details on setup and configuration.

3. Unpack the FluidQuery_1.0 bundle and run the fluidquery_install.pl script.

4. Configure Fluid Query for use, then query and move data to your heart’s content. Configuration consists of a lightweight setup, registration of user-defined table functions, and view creation.

Finally, use your favorite tool to execute your Hadoop query and view results.

In keeping with the simplicity and ease of use of Netezza technology, we have delivered a very lightweight set of capabilities that pack a load of value for your Logical Data Warehouse ecosystem. Whether you are trudging through a data swamp, or swimming in a data lake or reservoir, you can very easily reel in results important to your business.

Go to the IBM Fluid Query Solution Brief to learn more.

Update: Learn about Fluid Query 1.5 announced in July, 2015.

About Doug,
Doug has over 20 years of combined technical and management experience in the software industry, with emphasis on customer service and, more recently, product management. He is currently part of a highly motivated product management team that is both inspired by and passionate about the IBM PureData System for Analytics product portfolio.

IBM Fluid Query 1.0: Efficiently Connecting Users to Data

by Rich Hughes

Launched on March 27th, IBM Fluid Query 1.0 opens doors of “insight opportunity” for IBM PureData System for Analytics clients. In the evolving data ecosystem, users want and need accessibility to a variety of data stores in different locations. This only makes sense, as newer technologies like Apache Hadoop have broadened analytic possibilities to include unstructured data. Hadoop is the data source that accounts for most of the increase in data volume. By observation, the world’s data is doubling about every 18 months, with some estimates putting the 2020 data volume at 40 zettabytes, or 40 × 10^21 bytes. This increase by decade’s end would represent a 20-fold growth over the 2011 world data total of 1.8 × 10^21 bytes (see footnote 1). IT professionals as well as the general public can intuitively feel the weight and rapidity of data’s prominence in our daily lives. But how can we cope with, and not be overrun by, relentless data growth? The answer lies, in part, with better data access paths.
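The growth figures quoted above can be checked with a little arithmetic. The sketch below assumes the 2011 and 2020 volumes are 1.8 and 40 zettabytes respectively (1 zettabyte = 10^21 bytes) and that growth is a smooth exponential; the variable names are illustrative only.

```python
import math

ZETTABYTE = 10**21                      # bytes in one zettabyte
volume_2011 = 1.8 * ZETTABYTE           # assumed 2011 world data total
volume_2020 = 40 * ZETTABYTE            # assumed 2020 estimate
months = (2020 - 2011) * 12             # 108 months between the two figures

# Growth factor implied by the two volumes.
fold_growth = volume_2020 / volume_2011

# Doubling time that would produce exactly this growth over 108 months.
implied_doubling_months = months / math.log2(fold_growth)

# Growth factor if data really doubled every 18 months over the same period.
growth_if_doubling_every_18_months = 2 ** (months / 18)

print(f"fold growth 2011-2020: {fold_growth:.1f}")
print(f"implied doubling time: {implied_doubling_months:.1f} months")
print(f"growth at 18-month doubling: {growth_if_doubling_every_18_months:.0f}x")
```

Under these assumptions the two volumes correspond to roughly a 22-fold increase, in line with the quoted 20-fold figure, and to a doubling time of about two years, so the "every 18 months" rule of thumb is best read as an approximation.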


IBM Fluid Query 1.0 – What is it?

IBM Fluid Query 1.0 is a specific software feature in PureData that provides access to data in Hadoop from PureData appliances. Fluid Query also promotes the fast movement of data between Big Data ecosystems and PureData warehouses.  Enabling query and data movement, this new technology connects PureData appliances with common Hadoop systems: IBM BigInsights, Cloudera, and Hortonworks. Fluid Query allows results from PureData database tables and Hadoop data sources to be merged, thus creating powerful analytic combinations.


IBM® Fluid Query Benefits

Fluid Query makes practical use of existing SQL developer skills. Workbench tools yield productivity gains because SQL remains the query language of choice when PureData and Hadoop schemas logically merge. Fluid Query is the physical bridge whereby a query is pushed efficiently to where the data resides, whether it is in your data warehouse or in your Hadoop environment. Other benefits made possible by Fluid Query include:

  • better exploitation of Hadoop as a “Day 0” archive, that is queryable with conventional SQL;
  • combining hot data from PureData with colder data from Hadoop; and
  • archiving colder data from PureData to Hadoop to relieve resources on the data warehouse.
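The last benefit in the list, archiving colder data to relieve the warehouse, follows a simple pattern: copy rows older than a cutoff to the archive, then delete them from the warehouse. The sketch below shows that pattern with SQLite standing in for both systems; it is not Fluid Query's data movement interface, and the table names, the archive_older_than helper and the sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for the warehouse and the archive; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_stock (product TEXT, snapshot_date TEXT, level INTEGER);
    CREATE TABLE archive_stock   (product TEXT, snapshot_date TEXT, level INTEGER);
""")
conn.executemany("INSERT INTO warehouse_stock VALUES (?, ?, ?)", [
    ("widget", "2013-06-30", 400),
    ("widget", "2014-06-30", 420),
    ("widget", "2015-06-30", 950),
])


def archive_older_than(connection: sqlite3.Connection, cutoff: str) -> int:
    """Copy rows older than cutoff to the archive, then delete them.

    Both statements run in one transaction, so a failure leaves the
    warehouse table untouched. ISO dates compare correctly as text.
    """
    with connection:
        connection.execute(
            "INSERT INTO archive_stock "
            "SELECT * FROM warehouse_stock WHERE snapshot_date < ?",
            (cutoff,),
        )
        deleted = connection.execute(
            "DELETE FROM warehouse_stock WHERE snapshot_date < ?",
            (cutoff,),
        ).rowcount
    return deleted


moved = archive_older_than(conn, "2015-01-01")
print(moved)
```

Running the copy and the delete in a single transaction is a deliberate choice here: the warehouse only gives up rows once they are safely in the archive.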

Managing your share of Big Data Growth

Fluid Query provides data access between Hadoop and PureData appliances. Your current data warehouse, the PureData System for Analytics, can be extended in several important ways over this bridge to additional Hadoop capabilities. The coexistence of PureData appliances alongside Hadoop’s beneficial features is a best-of-breed approach where tasks are performed on the platform best suited for that workload. Use the PureData warehouse for production quality analytics where performance is critical to the success of your business, while simultaneously using Hadoop to discover the inherent value of full-volume data sources.

How does Fluid Query differ from IBM Big SQL technology?

Just as IBM PureData System for Analytics innovated by moving analytics to the data, IBM Big SQL moves queries to the correct data store. IBM Big SQL supports query federation to many data sources, including (but not limited to) IBM PureData System for Analytics; DB2 for Linux, UNIX and Windows database software; IBM PureData System for Operational Analytics; dashDB; Teradata; and Oracle. This allows users to send distributed requests to multiple data sources within a single SQL statement. IBM Big SQL is a feature included with IBM BigInsights for Apache Hadoop, which is an included software entitlement with IBM PureData System for Analytics. By contrast, many Hadoop and database vendors rely on significant data movement just to resolve query requests, a practice that can be time consuming and inefficient.
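To illustrate what a single SQL statement spanning multiple data sources looks like, the following sketch joins tables from two separate databases in one query. It uses SQLite's ATTACH DATABASE as a stand-in and is not Big SQL syntax; the schema name hadoop, the table names and the data are hypothetical.

```python
import sqlite3

# Two separate in-memory SQLite databases stand in for two data sources.
conn = sqlite3.connect(":memory:")                   # "main": the warehouse
conn.execute("ATTACH DATABASE ':memory:' AS hadoop")  # second source

conn.executescript("""
    CREATE TABLE main.products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE hadoop.clickstream (product_id INTEGER, clicks INTEGER);
    INSERT INTO main.products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO hadoop.clickstream VALUES (1, 10), (1, 15), (2, 7);
""")

# One statement that reads from both sources and joins the results.
clicks_by_product = conn.execute("""
    SELECT p.name, SUM(c.clicks)
    FROM main.products AS p
    JOIN hadoop.clickstream AS c ON c.product_id = p.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(clicks_by_product)
```

The point of the sketch is the shape of the query: the analyst writes ordinary SQL with qualified table names, and the engine decides how to reach each source.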

Learn more

Since March 27, 2015, IBM® Fluid Query 1.0 has been generally available as a software addition to PureData System for Analytics customers. If you want to understand how to take advantage of IBM® Fluid Query 1.0, check out these two sources: the on-demand webcast, Virtual Enzee – The Logical Data Warehouse, Hadoop and PureData System for Analytics, and the IBM Fluid Query solution brief. Update: Learn about Fluid Query 1.5, announced in July 2015.

About Rich,

Rich Hughes is an IBM Marketing Program Manager for Data Warehousing. Hughes has worked in a variety of Information Technology, Data Warehousing, and Big Data jobs, and has been with IBM since 2004. Hughes earned a Bachelor’s degree from Kansas University, and a Master’s degree in Computer Science from Kansas State University. Writing about the original Dream Team, Hughes authored a book on the 1936 US Olympic basketball team, a squad composed of oil refinery laborers and film industry stage hands. You can follow him on Twitter: @rhughes134

Footnote:
1 “How Much Data is Out There” by Webopedia Staff, Webopedia.com, March 3, 2014.

Big SQL in Big Data is a Big Deal

By Dennis Duckworth

I’ve been doing some work in the area of data warehouse modernization (DWM) recently. You may have seen my previous blog about our new DWM infographic and our view of the data warehouse becoming a more active component in a company’s analytics process.

One of the drivers for DWM is DWA — data warehouse augmentation — adding components around the data warehouse to address new capabilities like exploration of unstructured data. Similarly, there is a lot of talk these days about data lakes, data reservoirs, data refineries, etc. One of the questions that comes up when discussing putting data in any new place outside of the data warehouse is, “How do I access the data and analyze it there?”

Business analysts are used to doing analytics on data in the data warehouse – they have been using SQL for a long time and it is comfortable for them. But they are wary (and maybe even a little weary) of all the talk about NoSQL and Hadoop. They might see incredible value in including unstructured/semi-structured data in their analyses but they aren’t quite sure how they would do that. They probably aren’t going to learn Java so they can use MapReduce on their company’s new Hadoop clusters and by the time the IT guys get around to doing ETL on that data and pulling it into the data warehouse, it has lost some relevance and, therefore, value.

Nowadays, SQL access seems to be a priority for some of the NoSQL vendors, looking to give those business analysts the ability to use their beloved SQL (or some reasonable facsimile thereof) to do their queries against the new NoSQL data stores. So we saw Cloudera come out with Impala and then Hortonworks do significant work to improve the performance of Hive through their Stinger initiative.

Business users are speaking up, saying they want their familiar SQL access to data regardless of where it is, and the vendors are listening, which is a good thing. But as some of the large database/data warehouse companies started jumping on the SQL-on-Hadoop bandwagon, I noticed something a bit nonsensical, at least from a Hadoop perspective. Those large vendors created “solutions” that were based on using their database/data warehouse products. So whereas the Hadoop vendors were building SQL query capabilities directly into their Hadoop offerings, the db/dw folks were building SQL-on-Hadoop into their mainstream RDBMS/SQL engines. That means to get the “benefit” of SQL access to Hadoop, you need to use their RDBMS product.

One of the key goals of data warehouse modernization (and augmentation) is to *not* put additional load on the RDBMS/data warehouse, especially load that doesn’t belong there. Why should you need to use an Oracle Exadata or a Teradata 6750 if you are trying to run a SQL query against just your Hadoop cluster? Well, I guess Oracle and Teradata would answer “To keep people using our expensive products” – but isn’t cost reduction one of the reasons your company wants to do more in Hadoop in the first place?

IBM created Big SQL, its SQL-on-Hadoop solution, to work completely within our Hadoop distribution (built on Apache-standard Hadoop), IBM InfoSphere BigInsights for Hadoop. You don’t need to have a separate PureData System for Analytics data warehouse appliance or a separate machine running IBM DB2 – everything you need to run SQL on Hadoop comes as part of BigInsights and it runs entirely in the Hadoop cluster. In that way, IBM is more like Cloudera and Hortonworks than like Oracle and Teradata – we see Hadoop as a first class citizen in the overall data and analytics framework rather than as an accessory to (and life support for) our RDBMS.

IBM Big SQL v3.0 is in Technology Preview right now. You can learn more about it here or you can try it out here.

About Dennis Duckworth

Dennis Duckworth, Program Director of Product Marketing for Data Management & Data Warehousing, has been in the data game for quite a while, doing everything from Lisp programming in artificial intelligence to managing a sales territory for a database company. He has a passion for helping companies and people get real value out of cool technology. Dennis came to IBM through its acquisition of Netezza, where he was Director of Competitive and Market Intelligence. He holds a degree in Electrical Engineering from Stanford University but has spent most of his life on the East Coast. When not working, Dennis enjoys sailing off his backyard on Buzzards Bay and he is relentless in his pursuit of wine enlightenment. You can follow Dennis on Twitter.