One Cloud Data Warehouse, Three Ways

by Mona Patel

There’s something very satisfying about using a single cloud database solution to solve many business problems. This is exactly what BPM Northwest experiences with IBM dashDB when delivering data and analytics solutions to clients worldwide.

That success with dashDB compelled BPM Northwest to share its implementations and best practices with IDC.

In the webcast, the two teams discuss the value and realities of moving analytical workloads to the cloud. They also cover the challenges around governance, data integration, and skills that organizations face as they move to seize the opportunities of a cloud data warehouse.

In the webcast, you will hear three ways that you can use IBM dashDB:

  • New applications, with some integration with on-premises systems
  • Self-service, business-driven sandbox
  • Migrating existing data warehouse workloads

After watching the webcast, consider how the IBM dashDB use cases discussed might apply to your own challenges, and whether a hybrid data warehouse is the right solution for you.

Want to give IBM dashDB on Bluemix a try? Before you sign up for a free trial, take a tutorial tour on the IBM dashDB YouTube channel to learn how to load data from your desktop, enterprise, and internet data sources, and then see how to run SQL queries, from simple to complex, with your favorite BI tool or with integrated R/RStudio. You can also watch how IBM dashDB integrates with other value-added Bluemix services such as Dataworks Lift and Watson Analytics, so that you can bring together all relevant data sources for new insights.
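
If you want a feel for the kind of SQL those tutorials run, here is a minimal sketch against a hypothetical SALES table – the table and column names are illustrative only, not part of the dashDB tutorials:

  -- Hypothetical example: total revenue by region for the last 90 days,
  -- the sort of aggregate a BI tool would issue against dashDB.
  SELECT region,
         SUM(order_amount) AS total_revenue
  FROM   sales
  WHERE  order_date >= CURRENT DATE - 90 DAYS
  GROUP BY region
  ORDER BY total_revenue DESC;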


About Mona,

Mona Patel is currently the Portfolio Marketing Manager for IBM dashDB, the future of data warehousing. With over 20 years of experience analyzing data at The Department of Water and Power, AirTouch Communications, Oracle, and MicroStrategy, Mona decided to grow her career at IBM, a leader in data warehousing and analytics. Mona received her Bachelor of Science degree in Electrical Engineering from UCLA.

Start Small and Move Fast: The Hybrid Data Warehouse

by Mona Patel

In the world of cutting-edge big data analytics, the same obstacles to gaining meaningful insight still exist – getting data in and getting data out. Addressing these long-standing issues requires the utmost flexibility, especially when layered with the agile needs of the business.

Why spend millions of dollars replacing your data and analytics environment with the latest technology that promises to address these issues, when you can leverage existing investments, resources, and skills to achieve the same – and sometimes better – insight?

Consider a hybrid data warehouse.  This approach allows you to start small and move fast. It provides the best of both worlds – flexibility and agility without breaking the bank.  You can RAPIDLY serve up quality data managed by your data warehouse, blended with newer data sources and data types in the cloud, and apply integrated analytics such as Spark or R – all without additional IT resources and expertise.  How is this possible?  IBM dashDB.

Read Aberdeen’s latest report on The Hybrid Data Warehouse.


Watch Aberdeen Group’s Webcast on The Hybrid Data Warehouse.

Let me give you an example. We live in a digital world, and organizations are now keen to improve customer data capture across mobile, web, IoT, social media, and more in pursuit of new insights. A telecommunications client facing heavy competition wanted to quickly deliver unique mobile services for an upcoming event, acquiring new customers by collecting and analyzing mobile and social media data. Taking a hybrid data warehouse approach, the client was able to start small and move fast, uncovering new mobile service options.

Customer information generated from these newer data sources was blended with existing customer data managed in the data warehouse to deliver new insights. IBM dashDB provided a high-performing public cloud data warehouse service that was up and running in minutes. Automatic transformation of unstructured geospatial data into structured data, in-memory columnar processing, in-database geospatial analytics, integration with Tableau, and pricing were some of the key reasons IBM dashDB was chosen.

This brings me back to my first point – you don’t have to spend millions of dollars to capitalize on getting data in and getting data out. For example, clients like the one described above took advantage of Cloudant JSON document store integration, enabling them to get data into IBM dashDB rapidly and with ease – no ETL processing required. Automatic schema discovery loads and replicates unstructured JSON documents that capture IoT, web, and mobile-based data into a structured format. Getting data or information out was just as simple, as IBM dashDB provides in-database analytics and works with familiar, integrated SQL-based tools such as Cognos, Watson Analytics, Tableau, and MicroStrategy. I can only conclude that IBM dashDB is a great example of how a highly compatible cloud database can extend or modernize your on-premises data warehouse into a hybrid one to meet time-sensitive business initiatives.
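
To make the “getting data out” half concrete, here is a hedged sketch of the kind of blended query described above. It assumes a CLOUDANT_EVENTS table produced by schema discovery from replicated JSON documents and an existing CUSTOMER warehouse table; both names and columns are hypothetical, not the client’s actual schema.

  -- Illustrative blend of JSON-sourced mobile events with warehouse data.
  -- Table and column names are placeholders.
  SELECT c.customer_segment,
         COUNT(DISTINCT e.customer_id) AS active_mobile_users,
         COUNT(*)                      AS mobile_events
  FROM   cloudant_events e
  JOIN   customer        c ON c.customer_id = e.customer_id
  WHERE  e.event_type = 'mobile_checkin'
  GROUP BY c.customer_segment;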

What exactly is a hybrid data warehouse?  A hybrid data warehouse introduces technologies that extend the traditional data warehouse to provide key functionality required to meet new combinations of data, analytics and location, while addressing the following IT challenges:

  • Deliver new analytic services and data sets to meet time-sensitive business initiatives
  • Manage escalating costs due to massive growth in new data sources, analytic capabilities, and users
  • Achieve data warehouse elasticity and agility for ALL business data


Still not convinced of the power of a hybrid data warehouse? Hear what Aberdeen Group expert Michael Lock has to say in this 30-minute webcast.

About Mona,


Mona Patel is currently the Portfolio Marketing Manager for IBM dashDB, the future of data warehousing. With over 20 years of experience analyzing data at The Department of Water and Power, AirTouch Communications, Oracle, and MicroStrategy, Mona decided to grow her career at IBM, a leader in data warehousing and analytics. Mona received her Bachelor of Science degree in Electrical Engineering from UCLA.

Performance – Getting There and Staying There with PureData System for Analytics

by David Birmingham, Brightlight Business Analytics, a division of Sirius Computer Solutions, and IBM Champion

Many years ago in a cartoon dialogue, Dilbert’s boss expressed concern for the theft of their desktop computers, but Dilbert assured him, to his boss’ satisfaction, that if he loaded them with data they would be too heavy to move. Hold that thought.

Co-location: Getting durable performance from queries

Many shops will migrate to a new PureData System for Analytics appliance, Powered by Netezza Technology, simply by copying old data structures into the new data warehouse appliance. They then point their BI tools at it and voila, a 10x performance boost just for moving the data. Life is good.

The shop moves on by hooking up the ETL tools, backups and other infrastructure, not noticing that queries that ran in 5 seconds the week before now run in 5.1 seconds. As the weeks wear on, 5.1 seconds becomes 6, then 7, then 10. Nobody is really watching, because 10 seconds is a phenomenal turnaround compared to their prior system’s 10-minute turnaround.

But six months to a year down the line, when the query takes 30 seconds or longer to run, someone may raise a flag of concern. By this time, we’ve built many new applications on these data structures, and far more data has been added to storage. In true Dilbert-esque terms, loading more data makes the system go slower.

PureData has many layers of high-performance hardware, each one more powerful than the one above it. Leveraging these layers over time helps maintain durable performance.

The better news is that a PureData machine has the power to address this, provided we adhere to a few simple rules. When simply migrating point-to-point onto a PureData appliance, we’re likely not taking advantage of the core power-centers in Netezza technology. The point-to-point migration starts out in first gear and never shifts up to access more power. That is, PureData has many layers of high-performance hardware, each one more powerful than the one above it. Leveraging these layers over time helps maintain durable performance. The system may eventually need an upgrade for storage reasons, but not for performance reasons.

PureData is a physical machine with data stored on its physical “real estate”, but unlike buying a house with “location-location-location!” we want “co-location-co-location-co-location!” Two flavors of data co-location exist: zone maps and data distribution. The use of these (or lack thereof) either enables or constrains performance. These factors are physical, because performance is in the physics. It’s not enough to migrate or maintain a logical representation of the data. Physical trumps logical.

Zone maps, a powerful form of co-location in PureData

The most powerful form of co-location is zone maps, optimized through the Organize-On and Groom functions. Think of transaction_date as an Organize-On optimization key. The objective is to regroup the physical records so that those with like-valued keys are co-located on as few disk pages as possible; Groom does this for us. Now when a query is issued against the table that filters transaction_date on a date value or date range, the filter is applied to the zone maps to derive the known physical disk locations and exclude all others. This is Netezza’s principle of using the query to tell it “where-not-to-look”.

The additional caveat is that the physical co-location of records by Organize-On keys is only valuable if those keys are actually used in the query. When they are, they radically reduce data reads – for example, from 5,000 pages down to 5 pages to get the same information. That’s a 1000x boost! The zone maps, enabled by Organize-On and Groom, are what achieve these dramatic performance gains. If we do not use them, queries will fall back to a full table scan, which naturally takes more time.
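
As an illustration (the TRANSACTIONS table and its columns are hypothetical), the first query below filters on the Organize-On key, so the zone maps can exclude nearly every disk page before the scan begins; the second gives the zone maps nothing to work with and falls back to a full table scan:

  -- Filters on the Organize-On key (transaction_date): zone maps
  -- exclude almost every disk page before the scan begins.
  SELECT SUM(amount)
  FROM   transactions
  WHERE  transaction_date BETWEEN '2015-01-01' AND '2015-01-31';

  -- No filter on the organizing key: every page must be scanned.
  SELECT SUM(amount)
  FROM   transactions
  WHERE  product_code = 'ABC123';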

The reason this is so important is that disk read is the number one penalty of the query, with no close second. A PureData System N200x or N3001 can read over 1,100 pages per second on a given data slice. So if a query scans 5,000 pages on each data slice, it’s easily a 4-second query. But it won’t stay a 4-second query. As the data grows from 5,000 pages to 10,000 pages, it becomes a 10-second query. If the query leverages the zone maps and consistently reduces the read to, say, 100 pages per query, it will achieve a sub-second duration and remain there for the life of the solution.

Does this sound like too much physical detail to know for certain what to do? That’s why the Organize-On and Groom functions make it easy. Just use the Query History’s column access statistics, locate the largest tables, and find the most-often-accessed columns in where-clause filters (just don’t Organize-On join-only columns or distribution keys!). Add them to the Organize-On, Groom the table, and watch this single action boost the most common queries into the stratosphere.
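
In practice that workflow comes down to two statements. Here is a minimal sketch, assuming a large, hypothetical TRANSACTIONS table whose most common where-clause filter is on transaction_date:

  -- Declare the organizing key(s) that the zone maps should exploit...
  ALTER TABLE transactions ORGANIZE ON (transaction_date);

  -- ...then let Groom physically regroup the records so that rows with
  -- like-valued keys are co-located on as few disk pages as possible.
  GROOM TABLE transactions RECORDS ALL;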

Data Distribution, co-location through “data slices”

Data distribution is another form of co-location. On a PureData system, every table is automatically divided across disks, each representing a “data slice”. When a distribution key (e.g. Customer_ID) is used, the machine hashes the key values to guarantee that records with the same key value always land on the same data slice. If several tables are distributed on the same key, their like-keyed records will also be co-located on the same data slice. This means joining on those keys initiates a parallel join, or what is called a co-located read.
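
A hedged sketch of the DDL, using hypothetical tables: both are distributed on customer_id, so their like-keyed records land on the same data slice and the join runs as a co-located read.

  -- Both tables hash-distributed on customer_id, so matching rows
  -- live on the same data slice.
  CREATE TABLE customers
  ( customer_id   BIGINT
  , customer_name VARCHAR(100)
  ) DISTRIBUTE ON (customer_id);

  CREATE TABLE orders
  ( order_id     BIGINT
  , customer_id  BIGINT
  , order_amount NUMERIC(12,2)
  ) DISTRIBUTE ON (customer_id);

  -- Joining on the shared distribution key is a co-located,
  -- fully parallel join with no data movement between slices.
  SELECT c.customer_name, SUM(o.order_amount)
  FROM   customers c
  JOIN   orders    o ON o.customer_id = c.customer_id
  GROUP BY c.customer_name;

As a design note, pick a distribution key with high cardinality and an even spread of values; a skewed key piles records onto a few data slices and undoes the co-location benefit.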

Another of the most powerful aspects of Netezza technology is its ability to process data in parallel. Using the same distribution key to make an intermediate table, an insert-select styled query will perform a co-located read and a co-located write, effectively performing the operation in massively parallel form and at very fast speeds. Netezza technology can eclipse a mainframe in both its processing speed and its ability to move and position large quantities of data for immediate consumption.
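
For example, here is a minimal sketch of that insert-select pattern, carrying the same hypothetical distribution key through to the intermediate table so both the read and the write stay co-located:

  -- Intermediate table built on the same distribution key:
  -- a co-located read from ORDERS and a co-located write to ORDER_SUMMARY.
  CREATE TABLE order_summary AS
  SELECT customer_id,
         COUNT(*)          AS order_count,
         SUM(order_amount) AS total_amount
  FROM   orders
  GROUP BY customer_id
  DISTRIBUTE ON (customer_id);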

A few tweaks to tables and queries, however, can yield a 100x or 1000x boost…

The caveat of data distribution is that only a good distribution model preserves capacity for the long term. A distribution model that does not leverage co-located joining will chew up the machine’s more limited resources, such as memory and the inter-process network fabric. If enough of these queries run simultaneously, the degradation becomes extremely pronounced. A few tweaks to tables and queries, however, can yield a 100x or 1000x boost; without them the solution is using 10x or 100x more machine capacity than necessary. This is why some machines appear very stressed even though they are doing and storing so little.

Accessing the machine’s “deep metal”

Back to the notion of a “simple migration”. Does it sound like a simple point-to-point migration will leverage the power of the machine? Do the legacy queries use where-clause filters that can consistently invoke the zone maps, or are the tables configured to depend heavily on indexes to support performance? If the latter, the initial solution will be stuck in first gear.

But wait, maybe the migration happened a year or so ago and now the machine is “under stress” for no apparent reason. Where did all the capacity go? It’s still waiting to be used, in the deep metal of the machine, the metal that the migrated solution never touches. It’s easy to fix that, and voila, all this “extra” capacity seemingly appears from nowhere, like magic! It was always there. The solution was ignoring it and grinding the engines in first gear.

Enable business users to explore deep data detail

When Steven Spielberg made Jurassic Park, he mentioned that the first dinosaur scene with the giant Brachiosaurus required over a hundred hours of film and CGI crunched into fifteen seconds of movie magic.

This represents a typical analytic flow model, where tons of data points are summarized into smaller form for fast consumption by business analysts. PureData System changes this because it is fast and easy to expose deep detail to users. Business analysts like having access to deep data detail, because on other systems summary structures throw away useful detail in an effort to boost performance.

The performance is built into the machine. It’s an appliance, after all.

Architects and developers alike can see how “co-location, co-location, co-location!” is easy to configure and maintain, offering a durable performance experience that also adapts as business needs change over time. Getting there and staying there doesn’t require a high wall of engineering activities or a gang of administrators on roller skates to keep it running smoothly. The performance is built into the machine. It’s an appliance, after all.

About David,

David is a Senior Solutions Architect with Brightlight Consulting, a division of Sirius Computer Solutions, and an IBM Champion since 2011. He has over 30 years of extensive experience across the entire BI/DW lifecycle. David is one of the world’s top experts in PureData System for Analytics (Netezza), the author of Netezza Underground and Netezza Transformation (both on Amazon.com), and the author of various essays on IBM developerWorks’ Netezza Underground blog. He is also a five-year IBM Champion, a designation that recognizes the contributions of IBM customers and partners. Catch David each year at the Sunday IBM Insight Enzee Universe for new insights on best practices and solutions with the machine.

Making the Insurance Policyholder Experience Better

How AOK Niedersachsen leverages IBM Cognos BI reporting and PureData System for Analytics to improve policyholder service

AOK Niedersachsen, a German health insurance provider, was facing increased competition, a stricter regulatory environment, and the need for more transparency. With 2.4 million members in 2013 and over 40,000 health care provider partners, AOK Niedersachsen realized it had both a big data problem and an opportunity to improve service: corporate data had to be accurate, delivered more quickly, and easily accessible to a wider range of corporate decision makers.

How AOK changed their business practices

AOK Niedersachsen aligned with an IBM business partner, novem business applications GmbH (novem), to reorganize AOK Niedersachsen’s data infrastructure and build a one-version-of-the-truth data repository. The requirements included retaining the existing corporate performance metrics for continuity, while extending the reporting system’s graphical and visualization capabilities. The analytics and reporting also needed to be rolled out to more than 700 AOK Niedersachsen decision makers. To implement these requirements, the designers decided that IBM Cognos Business Intelligence for reporting and IBM PureData System for Analytics for the analytical data warehouse were the right combination.

Technology Solution Description

Before sorting through the solution details, it is interesting to note that AOK Niedersachsen used a video podcast to educate its user base on the new applications. This enablement tactic promoted a high degree of user acceptance for the new system. The 750 decision-making users (up to and including the CEO) query a database of over 500 tables and roughly 1.5 billion records. IBM Cognos produces the dashboards that transform the raw numbers into effective visualizations. These graphical representations reflect high-level aggregations that an inquisitive business person can then drill into for lower-level, component data. Simulation algorithms show possible outcomes before a manager commits to a decision. IBM Cognos Mobile provides another access option via an employee’s Apple iPad.

The resulting installation delivered these business benefits: queries 100 times faster than the previous system, significantly decreased operational costs, and more informed, fact-based decisions. And the best part is optimized treatment services for AOK Niedersachsen’s policyholders.

The much faster performance delivered by the PureData appliance helps in many ways, including acceptance by line-of-business users. There’s nothing like dropping one query from 25 hours down to three minutes of elapsed time (500x faster) to get buy-in. Adding more detail is AOK Niedersachsen’s Elke Stump: “With the IBM PureData System for Analytics, we finally have the right degree of performance for our IBM Cognos Business Intelligence system. We have even been able to reduce the time taken for administration tasks significantly. Performance optimization is no longer an issue.”1

Summary

Reducing risk and playing the odds are sound business practices in the insurance industry. Keeping customers satisfied with good products and thoughtful services works well in any industry. AOK Niedersachsen contracted with IT partner novem to design and install a business intelligence (Cognos) and data warehouse (PureData System for Analytics) solution based on IBM technology. The installation delivered queries 100 times faster than the previous system, significantly decreased operational costs, and more informed, fact-based decisions – and, best of all, optimized treatment services for AOK Niedersachsen’s policyholders.

 

Please share your thoughts or questions in the comments.

For more details in German, visit: AOK German PDF
For more details in English, visit: AOK English PDF

About Rich,

Rich Hughes is an IBM Marketing Program Manager for Data Warehousing. Hughes has worked in a variety of information technology, data warehousing, and big data jobs, and has been with IBM since 2004. Hughes earned a Bachelor’s degree from Kansas University and a Master’s degree in Computer Science from Kansas State University. Writing about the original Dream Team, Hughes authored a book on the 1936 US Olympic basketball team, a squad composed of oil refinery laborers and film industry stage hands.


1 “AOK Niedersachsen significantly improves policyholder service,” by IBM Staff, April 2015.