IBM Fluid Query 1.7 is Here!

by Doug Dailey

IBM Fluid Query offers a wide range of capabilities to help your business adapt to a hybrid data architecture and, more importantly, it helps you bridge “data silos” for deeper insights that leverage more data. Fluid Query is a standard entitlement included with the Netezza Platform Software suite for PureData System for Analytics (formerly Netezza). Fluid Query release 1.7 is now available, and you can learn more about its features below.

Why should you consider Fluid Query?

It offers many possible uses for solving problems in your business. Here are a few ideas:
• Discover and explore “Day Zero” data landing in your Hadoop environment
• Query data from multiple cross-enterprise repositories to understand relationships
• Access structured data from common sources like Oracle, SQL Server, MySQL, and PostgreSQL
• Query historical data on Hadoop via Hive, BigInsights Big SQL or Impala
• Derive relationships between data residing on Hadoop, the cloud and on-premises
• Offload colder data from PureData System for Analytics to Hadoop to free capacity
• Drive business continuity through a low-fidelity disaster recovery solution on Hadoop
• Back up your database or a subset of data to Hadoop in an immutable format
• Incrementally feed analytics side-cars residing on Hadoop with dimensional data

By far, the most prominent uses of Fluid Query for a data warehouse administrator are warehouse augmentation, capacity relief and replication of analytics side-cars for analysts and data scientists.

New: Hadoop connector support for Hadoop file formats to increase flexibility

IBM Fluid Query 1.7 ushers in greater flexibility for Hadoop users with support for popular file formats typically used with HDFS. These include popular data storage formats like AVRO, Parquet, ORC and RC that are often used to manage big data in a Hadoop environment.

Choosing the best format and compression mode can result in drastic differences in performance and storage on disk. A file format that doesn’t support flexible schema evolution can result in a processing penalty when making simple changes to a table. Let’s just say that if you live in the Hadoop domain, you know exactly what I am speaking of. For instance, if you want to use AVRO, do your tools have readers and writers that are compatible? If you are using Impala, do you know that it doesn’t support ORC, or that Hortonworks and Hive Stinger don’t play well with Parquet? Double-check your needs and tool sets before diving into these popular format types.

By providing support for these popular formats, Fluid Query allows you to import, store, and access this data through local tools and utilities on HDFS. But here is where it gets interesting in Fluid Query 1.7: you can also query data in these formats through the Hadoop connector provided with IBM Fluid Query, without any change to your SQL!
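To make that concrete, here is a minimal HiveQL sketch of the Hadoop side of the picture (the table, column and path names are hypothetical, not taken from the product documentation). Data landed on HDFS in a format such as Parquet is exposed as a Hive table; the Fluid Query Hadoop connector is then what lets your existing SQL reach that table without modification.

```sql
-- Hypothetical example: expose Parquet files on HDFS as a Hive table.
-- Table, column and path names are illustrative only.
CREATE EXTERNAL TABLE sales_archive (
  order_id     BIGINT,
  customer_id  BIGINT,
  order_total  DECIMAL(12,2),
  order_date   STRING
)
STORED AS PARQUET
LOCATION '/warehouse/archive/sales_archive';

-- The same kind of SQL you already run against the warehouse can be
-- issued against this data through the Fluid Query Hadoop connector:
SELECT customer_id, SUM(order_total) AS total_spend
FROM sales_archive
WHERE order_date >= '2015-01-01'
GROUP BY customer_id;
```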

New: Robust connector templates

In addition, Fluid Query 1.7 now makes available a more robust set of connector templates designed to help you jump-start your use of Fluid Query. You may recall that our prior release provided a generic connector that allows you to configure and connect to any structured data store via JDBC. With the 1.7 release we are offering pre-defined templates so you can get up and running more quickly. In cases where there are differences in user data type mapping, we also provide mapping files to simplify access. If you have your own favorite database, you can use our generic connector, along with any of the provided templates, as a basis for building a new connector for your specific needs. There are templates for Oracle, Teradata, SQL Server, MySQL, PostgreSQL, Informix, and MapR for Hive.

Again, the primary focus for Fluid Query is to deliver open data access across your ecosystem. Whether the data resides on disk, in memory, in the cloud or on Hadoop, we strive to enable your business to be open for data. We recognize that you are up against significant challenges in meeting the demands of the business and the marketplace, with access and federation among the top priorities.

New: Data movement advances

Moving data is rarely the best choice. Businesses spend a great deal of effort ingesting data, staging it, and scrubbing, prepping and scoring it for consumption by business users. This is a costly process. As we move closer and closer to virtualization, the goal is to move the smallest amount of data possible, while accessing and querying only the data you need. So not only is access paramount, but your knowledge of the data in your environment is crucial to using it efficiently.

Fluid Query does offer data movement capability through what we call Fast Data Movement. Focusing on the pipe between PDA and Hadoop, we offer a high-speed transfer tool that allows you to move data between these two environments very efficiently and securely. You have control over the security, compression, format and scope (database, table, or filtered data via a where clause). A key benefit is our ability to transfer data in our proprietary binary format, which enables orders-of-magnitude better performance than Sqoop when you do have to move data.

Fluid Query 1.7 also offers some additional benefits:
• Kerberos support for our generic database connector
• Support for BigInsights Big SQL during import (automatically synchronizes Hive and Big SQL on import)
• Varchar and String mapping improvements
• The nz.fq.table import parameter now supports a combination of multiple schemas and tables
• Improved date handling
• Improved validation for NPS and Hadoop environment (connectors and import/export)
• Support for BigInsights 4.1 and Cloudera 5.5.1
• A new Best Practices User Guide, plus two new Tutorials

You can download this release from IBM Fix Central, or from the Netezza Developer Network as non-warranted software for use with the Netezza Emulator.


Take a test drive today!

About Doug,
Doug has over 20 years of combined technical and management experience in the software industry, with emphasis on customer service and, more recently, product management. He is currently part of a highly motivated product management team that is both inspired by and passionate about the IBM PureData System for Analytics product portfolio.


What’s new: IBM Fluid Query 1.6

by Doug Dailey

Editorial Note: IBM Fluid Query 1.7 became available in May 2016. You can read about features in release 1.6 here, but we also recommend reading the release 1.7 blog here.

The IBM PureData System for Analytics team has assembled a value-added set of enhancements over current software versions of Netezza Platform Software (NPS), INZA software and Fluid Query. We have enhanced integration, security, real-time analytics for System z and usability features with our latest software suite arriving on Fix Central today.

There will be something here for everyone, whether you are looking to integrate your PureData System (Netezza) into a Logical Data Warehouse, improve security, gain more leverage with DB2 Analytics Accelerator for z/OS, or simply improve your day-to-day experience. This post covers the IBM Fluid Query 1.6 technology.  Refer to my NPS and INZA post (link) for more information on the enhancements that are now available in these other areas.

Integrating with the Logical Data Warehouse: Fluid Query overview

Are you struggling with building out your data reservoir, lake or lagoon? Feeling stuck in a swamp? Or, are you surfing effortlessly through an organized Logical Data Warehouse (LDW)?

Fluid Query offers a nice baseline of capability to get your PureData footprint plugged into your broader data environment or tethered directly to your IBM BigInsights Apache Hadoop distribution. Opening access across your broader ecosystem of on-premises, cloud, commodity hardware and Hadoop platforms gets you ever closer to capturing value throughout “systems of engagement” and “systems of record” so you can reveal new insights across the enterprise.

Now is the time to be fluid in your business, whether it is ease of data integration, access to key data for discovery/exploration, monetizing data, or sizing fit-for-purpose stores for different data types.  IBM Fluid Query opens these conversations and offers some valuable flexibility to connect the PureData System with other PureData Systems, Hadoop, DB2, Oracle and virtually any structured data source that supports JDBC drivers.

The value of content and the ability to tap into new insights is a must have to compete in any market. Fluid Query allows you to provision data for better use by application developers, data scientists and business users. We provide the tools to build the capability to enable any user group.


What’s new in Fluid Query 1.6?

Fluid Query was first released earlier this year and is now in its third “agile” release of the year. As part of NPS software, it is available at no charge to existing PureData clients, and you will find information on how to access Fluid Query 1.6 below.

This capability enables you to query more data for deeper analytics from PureData. For example, you can query data in the PureData System together with:

  • Data in IBM BigInsights or other Hadoop implementations
  • Relational data stores (DB2, 3rd party and open source databases like Postgres, MySQL, etc.)
  • Multiple generations of PureData System for Analytics appliances (“TwinFin”, “Striper”, “Mako”)

The following is a summary of some new features in the release that all help to support your needs for insights across a range of data types and stores:

  • Generic connector for access to structured data stores that support JDBC
    This generic connector enables you to select the database of your choice. Database servers and engines like Teradata, SQL Server, Informix, MemSQL and MapR can now be tapped for insight. We’ve also provided a capability to handle any data type mismatches between differing source and target systems.
  • Support for compressed read from Big SQL on IBM BigInsights
    Using the Big SQL capability in IBM BigInsights, you are now able to read compressed data in Hadoop distributions such as BigInsights, Cloudera and Hortonworks. This adds flexibility and efficiency in storage, data protection and access.
  • Ability to import databases to Hadoop and append to tables in Hadoop
    New capabilities now enable you to import databases to Hadoop, as well as append data in existing tables in Hadoop. One use case for this is backing up historical data to a queryable archive to help manage capacity on the data warehouse. This may include incremental backups, for example from a specific date for speed and efficiency.
  • Support for the latest Hadoop distributions
    Fluid Query v. 1.6 now supports the latest Hadoop distributions, including BigInsights 4.1, Hortonworks 2.5 and Cloudera 5.4.5. For Netezza software, support is now available for NPS 7.2.1 and INZA 3.2.1.

Fluid Query 1.6 can be easily downloaded from IBM Support Fix Central. I encourage you to refer to my “Getting Started” post that was written for Fluid Query 1.5 for additional tips and instructions. Note that this link is for existing PureData clients. Refer to the section below if you are not a current client.


Packaging and distribution

From a packaging perspective, we refreshed IBM Netezza Platform Development Software to the latest NPS 7.2.1 release to ensure the software suite available from IBM Passport Advantage is current.

Supported Appliances
  • N3001
  • N2002
  • N2001
  • N100x
  • C1000

Supported Software
  • Netezza Platform Software v7.2.1
  • Netezza Client Kits v7.2.1
  • Netezza SQL Extension Toolkit v7.2.1
  • Netezza Analytics v3.2.1
  • IBM Fluid Query v1.6
  • Netezza Performance Portal v2.1.1
  • IBM Netezza Platform Development Software v7.2.1

On the Netezza Developer Network, we continue to make it easy to pick up and work with non-warranted products for basic evaluation by refreshing the Netezza Emulator to NPS 7.2.1 with INZA 3.2.1. You will also find a refresh of our non-warranted version of Fluid Query 1.6 and the complete set of Client Kits that support NPS 7.2.1.


Feel free to download and experiment with these as a prelude to a PureData System for Analytics purchase or as a quick way to validate new software functionality with your application. We remain committed to helping partners who work with our systems by keeping the latest systems and software available for you to access. Bring your application or solution and work to certify, qualify and validate it.

For more information on NPS 7.2.1 and INZA 3.2.1 software, refer to my post.

About Doug,
Doug has over 20 years of combined technical and management experience in the software industry, with emphasis on customer service and, more recently, product management. He is currently part of a highly motivated product management team that is both inspired by and passionate about the IBM PureData System for Analytics product portfolio.

Performance – Getting There and Staying There with PureData System for Analytics

by David Birmingham, Brightlight Business Analytics, A division of Sirius Computer Solutions and IBM Champion

Many years ago in a cartoon dialogue, Dilbert’s boss expressed concern about the theft of their desktop computers, but Dilbert assured him, to his boss’s satisfaction, that if he loaded them with data they would be too heavy to move. Hold that thought.

Co-location: Getting durable performance from queries

Many shops will migrate to a new PureData System for Analytics appliance, powered by Netezza technology, simply by copying old data structures into the new data warehouse appliance. They then point their BI tools at it and voilà, a 10x performance boost just for moving the data. Life is good.

The shop moves on by hooking up the ETL tools, backups and other infrastructure, not noticing that queries that ran in 5 seconds the week before now run in 5.1 seconds. As the weeks wear on, 5.1 seconds becomes 6, then 7, then 10 seconds. Nobody is really watching, because 10 seconds is a phenomenal turnaround compared to their prior system’s 10-minute turnaround.

But six months to a year down the line, when the query takes 30 seconds or longer to run, someone may raise a flag of concern. By this time, we’ve built many new applications on these data structures and added far more data to storage. In true Dilbert-esque terms, loading more data makes the system go slower.

PureData has many layers of high-performance hardware, each one more powerful than the one above it. Adhering to this leverage over time helps maintain durable performance.

The best part about a PureData machine is that it has the power to address this by adhering to a few simple rules. When simply migrating point-to-point onto a PureData appliance, we’re likely not taking advantage of the core power-centers in Netezza technology. The point-to-point migration starts out in first gear and never shifts up to access more power. That is, PureData has many layers of high-performance hardware, each one more powerful than the one above it. Adhering to this leverage over time helps maintain durable performance. The system may eventually need an upgrade for storage reasons, but not for performance reasons.

PureData is a physical machine with data stored on its physical “real estate”, but unlike buying a house with “location-location-location!” we want “co-location-co-location-co-location!” Two flavors of data co-location exist: zone maps and data distribution. The use of these (or lack thereof) either enable or constrain performance. These factors are physical, because performance is in the physics. It’s not enough to migrate or maintain a logical representation of the data. Physical trumps logical.

Zone maps, a powerful form of co-location in PureData

The most powerful form of co-location is zone maps, optimized through the Organize-On and Groom functions. Think of transaction_date as an Organize-On optimization key. The objective is to regroup the physical records so that those with like-valued keys are co-located on as few disk pages as possible. Groom will do this for us. Now when a query is issued against the table, filtering transaction_date on a date value or date range, the filter is applied to the zone maps to derive the known physical disk locations and exclude all others. This is Netezza’s principle of using the query to tell it “where-not-to-look”.
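As a rough sketch of what this looks like in SQL (assuming a hypothetical sales table; confirm the exact GROOM options against the Netezza documentation for your release), the organizing key is declared on the table and Groom does the physical reordering:

```sql
-- Illustrative only: declare transaction_date as the organizing key,
-- then let GROOM physically co-locate records with like-valued keys.
ALTER TABLE sales ORGANIZE ON (transaction_date);
GROOM TABLE sales RECORDS ALL;

-- A query that filters on the organizing key lets the zone maps
-- exclude most disk pages ("where-not-to-look"):
SELECT SUM(sale_amount)
FROM sales
WHERE transaction_date BETWEEN '2015-01-01' AND '2015-01-31';
```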

The additional caveat is that the physical co-location of records by Organize-On keys is only valuable if they are actually used in the query. They radically reduce data reads, for example from 5 thousand pages down to 5 pages to get the same information. That’s a 1000x boost! The zone maps, enabled by Organize-On and Groom, are what achieve these dramatic performance boosts. If we do not use them, then queries will initiate a full table-scan which naturally takes more time.

The reason this is so important is that disk read is the number one penalty of the query, with no close second. A PureData System N200x or N3001 can read over 1,100 pages per second on a given data slice, so if the query scans 5,000 pages on each data slice, it’s easily a 4-second query. But it won’t stay a 4-second query. As the data grows from 5,000 pages to 10,000 pages, it will become a 10-second query. If the query leverages the zone maps and consistently reduces the scan to, say, 100 pages per query, the query will achieve sub-second duration and remain there for the life of the solution.

Does this sound like too much physical detail to know for certain what to do? That’s why the Organize-On and Groom functions make it easy. Just use the Query History’s column access statistics, locate the largest tables and find the most-often-accessed columns in where-clause filters (just don’t Organize-On join-only columns or distribution keys!). Add them to the Organize-On, Groom the table and watch this single action boost the most common queries into the stratosphere.

Data Distribution, co-location through “data slices”

Data distribution is another form of co-location. On a PureData system, every table is automatically divided across disks, each representing a “data slice”. Basically when a distribution key (e.g. Customer_ID) is used, the machine will hash the key values to guarantee that records with the same key value will always arrive on the same data slice. If several tables are distributed on the same key value, their like-keyed records will also be co-located on the same data slice. This means joining on those keys will initiate a parallel join, or what is called a co-located read.

Another of the most powerful aspects of Netezza technology is the ability to process data in parallel. Using the same distribution key to make an intermediate table, an insert-select styled query will perform a co-located read and a co-located write, effectively performing the operation in massively parallel form and at very fast speeds. Netezza technology can eclipse a mainframe in both its processing speed and ability to move and position large quantities of data for immediate consumption.
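A minimal sketch of the idea, using hypothetical table names: two tables distributed on the same key join data-slice by data-slice, and an insert-select (or, as here, a CREATE TABLE AS) into a table that uses the same distribution key performs a co-located read and a co-located write.

```sql
-- Illustrative only: distribute both tables on the same key so that
-- like-keyed records always land on the same data slice.
CREATE TABLE customers (
  customer_id BIGINT NOT NULL,
  region      VARCHAR(30)
) DISTRIBUTE ON (customer_id);

CREATE TABLE orders (
  order_id    BIGINT NOT NULL,
  customer_id BIGINT NOT NULL,
  order_total NUMERIC(12,2)
) DISTRIBUTE ON (customer_id);

-- Joining on the distribution key is a co-located (fully parallel) join,
-- and writing into a table with the same key is a co-located write.
CREATE TABLE customer_totals AS
SELECT c.customer_id, c.region, SUM(o.order_total) AS lifetime_total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region
DISTRIBUTE ON (customer_id);
```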

A few tweaks to tables and queries, however, can yield a 100x or 1000x boost…

The caveat of data distribution is that while a good distribution model can preserve capacity for the long term, a model that does not leverage co-located joins will chew up the machine’s more limited resources, such as memory and the inter-process network fabric. If enough of these queries run simultaneously, the degradation becomes extremely pronounced. A few tweaks to tables and queries, however, can yield a 100x or 1000x boost; without them, the solution uses 10x or 100x more machine capacity than necessary. This is why some machines appear very stressed even though they are doing and storing so little.

Accessing the machine’s “deep metal”

Back to the notion of a “simple migration”. Does it sound like a simple point-to-point migration will leverage the power of the machine? Do the legacy queries use where-clause filters that can consistently invoke the zone maps? Are the tables configured to be heavily dependent upon indexes to support performance? If so, then the initial solution will be in first-gear.

But wait, maybe the migration happened a year or so ago and now the machine is “under stress” for no apparent reason. Where did all the capacity go? It’s still waiting to be used, in the deep-metal of the machine, the metal that the migrated solution doesn’t regard. It’s easy to fix that and voila, all this “extra” capacity seemingly appears from nowhere, like magic! It was always there. The solution was ignoring it and grinding the engines in first gear.

Enable business users to explore deep data detail

When Steven Spielberg made Jurassic Park, he mentioned that the first dinosaur scene with the giant Brachiosaurus required over a hundred hours of film and CGI crunched into fifteen seconds of movie magic.

This represents a typical analytic flow model, where tons of data points are summarized into smaller form for fast consumption by business analysts. PureData System changes this because it is fast and easy to expose deep detail to users. Business analysts like to have access to the detail of deep data because summary structures will throw away useful details in an effort to boost performance on other systems.

The performance is built into the machine. It’s an appliance, after all.

Architects and developers alike can see how “co-location, co-location, co-location!” is easy to configure and maintain, offering a durable performance experience that is also adaptable as business needs change over time. Getting there and staying there doesn’t require a high wall of engineering activities or a gang of administrators on roller skates to keep it running smoothly. The performance is built into the machine. It’s an appliance, after all.

About David,

David is a Senior Solutions Architect with Brightlight Consulting, a division of Sirius Computer Solutions, and an IBM Champion since 2011. He has over 30 years of extensive experience across the entire BI/DW lifecycle. David is one of the world’s top experts in PureData for Analytics (Netezza), is the author of Netezza Underground and Netezza Transformation (both on Amazon.com) and has written various essays on the IBM developerWorks Netezza Underground blog. He is a five-year IBM Champion, a designation that recognizes the contributions of IBM customers and partners. Catch David each year at the Sunday IBM Insight Enzee Universe for new insights on best practices and solutions with the machine.

Is the Data Warehouse Dead? Is Hadoop trying to kill it?

By Dennis Duckworth

I attended the Strata + Hadoop World Conference in San Jose a few weeks ago, which I enjoyed immensely. I found that this conference had a slightly different “feel” than previous Hadoop conferences in terms of how Hadoop was being positioned. Since I am from the data warehouse world, I have been sensitive to Hadoop being promoted as a replacement for the data warehouse.

In previous conferences, sponsors and presenters seemed almost giddy in their prognostication that Hadoop would become the main data storage and analytics platform in the enterprise, taking more and more load from the data warehouse and eventually replacing it completely. This year, there didn’t seem to be much negative talk about data warehouses. Cloudera, for example, clearly showed its Hadoop-based “Enterprise Data Hub” as being complementary to the Enterprise Data Warehouse rather than as a replacement, reiterating the clarification of their positioning and strategy that they made last year. Maybe this was an indication that the Hadoop market was maturing even more, with companies having more Hadoop projects in production and, thus, having more real experience with what Hadoop did well and, as importantly, what it didn’t do well. Perhaps, too, the data warehouse escaped being the villain (or victim) because the “us against them” camp was distracted by the emergence and perceived threat of some other technologies like Spark and Mesos.

The conference was just another data point supporting my hypothesis that Hadoop and other Big Data technologies are complementing existing data warehouses in enterprises rather than replacing them. Another data point (actually a collection of many data points) can be seen in the survey results of The Information Difference Company as reported in the paper “Is the Data Warehouse Dead?”, sponsored by IBM. You can download a copy here.

Reading through this report, I found myself recalling many of the conversations I have had with customers and prospects over the last few years. If you have read some of my previous blogs, you will know that IBM is a big believer in the power of Big Data. We have solutions that help enterprises deal with the new challenges they are facing with the increasing size, speed and diversity of data. But we continue to offer and recommend relational database and data warehouse solutions because they are essential for deriving business value from data – they have done so in the past, and they continue to do so today.

We believe that they will continue doing so going forward. Structured data doesn’t go away, nor does the need for doing analytics (descriptive, predictive, or prescriptive) on that data. An analytics engine that was created and tuned for structured data will continue to be the best place to do such analytics. Sure, you can do some really neat data exploration and visualizations on all sorts of data in Hadoop, but you still need your daily, weekly, and monthly reports and your executive dashboards, all fueled by structured data and all needing to be produced within shrinking time windows.

About Dennis Duckworth

Dennis Duckworth, Program Director of Product Marketing for Data Management & Data Warehousing, has been in the data game for quite a while, doing everything from Lisp programming in artificial intelligence to managing a sales territory for a database company. He has a passion for helping companies and people get real value out of cool technology. Dennis came to IBM through its acquisition of Netezza, where he was Director of Competitive and Market Intelligence. He holds a degree in Electrical Engineering from Stanford University but has spent most of his life on the East Coast. When not working, Dennis enjoys sailing off his backyard on Buzzards Bay, and he is relentless in his pursuit of wine enlightenment.

See also: New Fluid Query for PureData and Hadoop by Wendy Lucas

Safety Insurance Company Gains a Better View of its Data and its Customers with IBM PureData System for Analytics and Cognos

By Louis T. Cherian,

The success of a firm in the highly competitive insurance industry depends not only on its ability to win new customers, but also on retaining its most valuable ones. One way to do this is to offer these customers the most suitable policies at the best rates. So how can a company the size of Safety Insurance, which has been in existence since 1979, identify its most valuable customers?

And how could it maintain consistency when offering multi-policy incentives, given that it deals in dozens of types of policies to millions of policyholders and its customer data is fragmented across numerous policy systems? With data scattered in this way, the actuaries were spending all their time building new databases instead of analyzing them, ending up with multiple versions of the truth and making it difficult for the business to make informed decisions.

This is where the combination of IBM PureData System for Analytics and Cognos opened up a whole new world of analytics possibilities, enabling Safety Insurance to run its business more wisely and more efficiently.

How did they do it?

  • Switching to a powerful analytics solution

Safety Insurance teamed up with New England Systems, an IBM Business Partner, and decided to deploy the IBM PureData System for Analytics to provide a high-performance data warehouse platform that would unite data from all of its policy and claims systems. They also implemented IBM Cognos Business Intelligence to provide sophisticated automated analysis and reporting tools.

  • Accelerating delivery of mission-critical information

Harnessing the immense computing power of IBM PureData System for Analytics enables Safety Insurance to generate reports in a fraction of the time previously needed. Moreover, automating report generation enables actuaries to focus on their actual job, which is analyzing figures rather than building and compiling them. Automation also standardizes the reporting process, which improves consistency and reduces the company’s reliance on any particular analyst’s individual knowledge.

  • Identifying and retaining high-value customers

Providing a “single view” of the customer across all types of insurance gives a new level of insight into customer relationships and total customer value. By revealing how valuable a particular policyholder is to the overall business, the company will be able to provide more comprehensive service, better combinations of products, and consistent application of multi-policy discounts.

To learn more about this success story, watch this video, in which Christopher Morkunas (Data Governance Manager) of Safety Insurance Company talks about how the company gained a better view of its existing data and its customers with the combination of IBM PureData System for Analytics and Cognos.

About Louis T. Cherian,

Louis T. Cherian is currently a member of the worldwide product marketing team at IBM that focuses on data warehouse and database technology. Prior to this role, Louis held a variety of product marketing roles within IBM and, before joining IBM, at Tata Consultancy Services. Louis holds a PGDBM from the Xavier Institute of Management and Entrepreneurship and an engineering degree in computer science from VTU Bangalore.

 

Data Warehousing – No assembly required

By Wendy Lucas,

In my last blog, I wrote about how big things come in small packages when talking about the great value that comes in the new PureData System for Analytics Mini Appliance.  I must be in the holiday spirit early because I’m going to stick with the holiday theme for this discussion.

Did you ever receive a gift that had multiple components to it, maybe one that required a bunch of assembly before you got to truly enjoy it? I’m not talking about Lincoln Logs (do they still sell those?) or Legos, where the assembly is half the fun.

I’m talking about things like a child’s bicycle that comes with the frame, handle bar, wheels, tires, kickstand, seat, nuts and bolts as a bunch of parts inside a box.

What is more exciting? Receiving a box of parts or receiving the shiny red bicycle already assembled and ready to take for an immediate ride?


In this world where we require instant satisfaction and immediate results, we don’t have time to assemble the bike. Do your system administrators have time to custom build a solution of hardware and software for your data warehouse?  Forget about that hardware and software being truly designed, integrated and optimized for analytic workloads.  What value are your users getting while the IT staff are doing that?  Do your DBAs have enough time to tune the system for every new data source that’s added or every new report requirement that one of your users needs?  We live in a world that demands agile response to changing requirements and immediate results.

Simple is still better for faster deployment

In this very complex world, simple solutions are better. Just like the child preferring the bike that is already assembled and ready to go, the IBM PureData System for Analytics, powered by Netezza technology, has been delivering on the promise of simplicity and speed for over a decade. Don’t just take my word for it. In a recent study, International Technology Group compared costs and time to value for PureData against both Teradata and Oracle.[i] They researched customers deploying all three solutions and reported some notable findings. While over 75% of PureData customers deployed their appliances in under three weeks, not a single Teradata customer deployed in that time frame, and only one Oracle customer achieved that window.

Simple is still better for lower costs

Not only is the data warehouse appliance simple to deploy, but it is architected for speed with minimal tuning or administration. The same studies found that Teradata has 3.8x and Oracle 3.5x higher deployment costs than PureData System for Analytics, and that both use more DBA resources to maintain the system.

Simple is still better, and now even more secure

The PureData System for Analytics N3001 series that was just announced has the same speed and simplicity as its predecessors, but adds improved performance, self-encrypting drives, and big data and business intelligence starter kits. The self-encrypting drives encrypt all user and temp data for added security without any performance overhead or incremental cost to the appliance.

For more anecdotal examples of why simple is still better, watch this video, read this white paper, or visit ibm.com/software/data/puredata/analytics/ for more information.

[i] ITG: Comparing Costs and Time to Value with Teradata Data Warehouse Appliance, May 2014.

ITG: Comparing Costs and Time to Value with Oracle Exadata Database Machine X3, June 2014.

About Wendy,

Wendy Lucas is a Program Director for IBM Data Warehouse Marketing. Wendy has over 20 years of experience in data warehousing and business intelligence solutions, including 12 years at IBM. She has helped clients in a variety of roles, including application development, management consulting, project management, technical sales management and marketing. Wendy holds a Bachelor of Science in Computer Science from Capital University, and you can follow her on Twitter at @wlucas001.

Governance, Stewardship, and Quality of Temporal Data in a Data Warehousing Context

By James Kobielus, 

Organizations must hold people accountable for their actions, and that depends on having the right data, tools, and processes for keeping track of the precise sequence of events over time.

Timing is everything when you’re trying to pinpoint the parties who are personally responsible in any business context. Consequently, time-series discovery is the core task of any good investigator, be they Sherlock Holmes or his real-world counterparts in the hunt for perps and other parties of interest.

Audit trails are essential for time-series discovery in legal proceedings, and they support equivalent functions in compliance, security, and other business application contexts. Audit trails must describe correctly and consistently the prior sequence of events, so that organizations can identify precisely who took what actions when under which circumstances.

To help identify the responsible parties for specific actions, decisions, and outcomes, the best audit trails should, at minimum, support longitudinal analysis, which rolls up records into a comprehensive view of the entire sequence of events. But the databases where the audit trails are stored should also support time-validity analysis, which rolls back time itself to show the exact state of all the valid data available to each responsible party at the time they made their decisions. Without the former, you can’t fit each event into the larger narrative of what transpired. Without the latter, you can’t fit each event into the narrative of who should be punished or exonerated.

All of that requires strong data quality, which relies, in turn, on having access to databases and tools that facilitate the requisite governance and stewardship procedures. Data warehouses are where you should be keeping your system-of-record data to support time-series analyses. Consequently, temporal data management is an intrinsic feature of any mature data warehousing, governance, and stewardship practice. Indeed, the ability to traverse data over time is at the very heart of the concept of data warehousing, as defined long ago by Bill Inmon: “a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management’s decisions.”

Many organizations have deployed transactional databases such as IBM DB2 for data warehousing and temporal data management. If they use a high-performance implementation, such as DB2 with BLU Acceleration software running on IBM POWER8 processors, they can do in-memory time-series analyses of large audit trails with astonishing speed. If you want further depth on DB2’s native temporal data management features, I strongly recommend this technical article.

Temporal data management concepts may be unfamiliar to some data warehousing professionals. Here’s a recent article that provides a good primer on temporal database computing. As the author states, “A temporal database will show you the actual value back then, as it was known back then, and the actual value back then, as it is known now.”

These concepts are a bit tricky to explain clearly, but I’ll take a shot. The “actual value back then, as it is known now” is the “valid time” view, and it may be updated or corrected within a temporal database if the previously recorded value is found to have been in error. The “actual value back then, as it was known back then” is the “transaction time” view; it remains unchanged and may diverge from the valid-time view as the latter is corrected.
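For readers who want to see how this maps onto DB2’s native temporal support mentioned earlier, here is a hedged, minimal sketch, assuming a DB2 release with temporal table support; the table and column names are hypothetical. The BUSINESS_TIME period plays the role of valid time, the SYSTEM_TIME period plays the role of transaction time, and the two can be combined in a single “time travel” query.

```sql
-- Illustrative bitemporal table: BUSINESS_TIME ~ valid time,
-- SYSTEM_TIME ~ transaction time.
CREATE TABLE policy (
  policy_id  INT NOT NULL,
  coverage   INT NOT NULL,
  bus_start  DATE NOT NULL,
  bus_end    DATE NOT NULL,
  sys_start  TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
  sys_end    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
  trans_id   TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
  PERIOD BUSINESS_TIME (bus_start, bus_end),
  PERIOD SYSTEM_TIME (sys_start, sys_end)
);
CREATE TABLE policy_history LIKE policy;
ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_history;

-- "The actual value back then, as it is known now":
SELECT * FROM policy
  FOR BUSINESS_TIME AS OF '2015-06-01'
WHERE policy_id = 1234;

-- "The actual value back then, as it was known back then":
SELECT * FROM policy
  FOR BUSINESS_TIME AS OF '2015-06-01'
  FOR SYSTEM_TIME AS OF TIMESTAMP('2015-06-01-00.00.00')
WHERE policy_id = 1234;
```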

Essentially, this arrangement enables the record of any historical data to be corrected at any time in the future. It also preserves, for each point in the past, that moment’s own erroneous picture of the even deeper past. This gets to the “what they knew and when they knew it” heart of personal responsibility.

As I was reading this recent article that discusses time-series data in an Internet of Things (IoT) context, the association of temporality with personal responsibility came into new focus. What if, through IoT, we were able to save every last datum that each individual person produced, accessed, viewed, owned, or otherwise came into contact with at each point in time? And what if we could roll it back to infer what they “knew” and “when they knew it” on a second-by-second basis?

This is not a far-fetched scenario. As the IoT gains ubiquity in our lives, it will make this a very realistic scenario (for the moment, let’s overlook the staggering big-data management and analytics challenges that this would entail). And as this temporal data gets correlated with geospatial, social, and other data sources–and mined through data lineage tools–it will make it possible to roll up high-resolution, 360-degree portraits of personal responsibility. We’ll have a full audit trail of exactly who knew (individually and collectively) what, when, where, how, why, and with what consequences.

Whether you’re a prosecuting attorney building a case, a law-enforcement official trying to uncover terrorist plots in the nick of time, or an IT security administrator trying to finger the shadowy perpetrators of a hack attack, these IoT-infused discovery tools will prove addictive.

The effectiveness of governance in the modern world will depend on our ability to maintain the requisite audit trails in whatever data warehouse or other well-governed repository best suits our operational requirements.

About James, 

James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next-best-action technologies. Follow James on Twitter: @jameskobielus