Three session guides get you started with data warehousing at IBM Insight at World of Watson

Join us October 24 to 27, 2016 in Las Vegas!

by Cindy Russell, IBM Data Warehouse marketing

IBM Insight has long been the premier data management and analytics event for IBM analytics technologies, and 2016 is no exception. This year, IBM Insight is being hosted along with World of Watson and runs from October 24 to 27, 2016 at the Mandalay Bay in Las Vegas, Nevada. It includes 1,500 sessions across a range of technologies and features keynotes by IBM President and CEO Ginni Rometty; Senior Vice President of IBM Analytics Bob Picciano; and other IBM Analytics and industry leaders. Every year we include a little fun as well, and this year the band is Imagine Dragons.

IBM data warehousing sessions will be available across the event, as well as in the PureData System for Analytics Enzee Universe (Sunday, October 23). Below are product-specific quick reference guides that let you see key sessions and activities at a glance, then plan your schedule. Print these guides and take them with you, or put the links to them on your phone for reference during the conference.

This year, the Expo floor is called the Cognitive Concourse, and we are located in the Monetizing Data section, Cognitive Cuisine experience area.  We’ll take you on a tour across our data warehousing products and will have some fun as we do it, so please stop by.  There is also a demo room where you can see live demos and engage with our technical experts, as well as a series of hands-on labs that let you experience our products directly.

The IBM Insight at World of Watson main web page is located here.  You can register and then use the agenda builder to create your personalized schedule.

IBM PureData System for Analytics session reference guide

Please find the session quick reference guide for PureData System for Analytics here: ibm.biz/wow_enzee

Enzee Universe is a full day of dedicated PureData System for Analytics / Netezza sessions that is held on Sunday, October 23, 2016.  To register for Enzee Universe, select sessions 3459 and 3461 in the agenda builder tool.  This event is open to any full conference pass holder.

During the regular conference, there are also more than 35 technical sessions on PureData, Netezza, and IBM DB2 Analytics Accelerator for z/OS (IDAA) across all the conference tracks, as well as hands-on labs. Several sessions are being presented by IBM clients, so you can see how they put PureData System for Analytics to use. Click the link above to see the details.

IBM dashDB Family session reference guide

Please find the session quick reference guide for the dashDB family here: ibm.biz/wow_dashDB

There are more than 40 sessions for dashDB, including a “Meet the Family” session that will help you become familiar with the new products in this family of modern data management and data warehousing tools. There is also a “Birds of a Feather” panel discussion on hybrid data warehousing, and another session that describes some key use cases for dashDB. You can also see a demo, take in a short theatre session, or try out a hands-on lab.

IBM BigInsights, Hadoop and Spark session reference guide

Please find the session quick reference guide for BigInsights, Hadoop and Spark topics here: ibm.biz/wow_biginsights

There are more than 65 sessions related to IBM BigInsights, Hadoop and Spark, along with several hands-on labs and theatre sessions. There is everything from an Introduction to Data Science, to Using Spark for Customer Intelligence Analytics, to hybrid cloud data lakes, to client stories of how they use these technologies.

Overall, it is an exciting time to be in the data warehousing and analytics space.  This conference represents a great opportunity to build depth on IBM products you already use, learn new data warehousing products, and look across IBM to learn completely new ways to employ analytics—from Watson to Internet of Things and much more.  I hope to see you there.

IBM Fluid Query 1.7 is Here!

by Doug Dailey

IBM Fluid Query offers a wide range of capabilities to help your business adapt to a hybrid data architecture and, more importantly, it helps you bridge “data silos” for deeper insights that leverage more data. Fluid Query is a standard entitlement included with the Netezza Platform Software suite for PureData System for Analytics (formerly Netezza). Fluid Query release 1.7 is now available, and you can learn more about its features below.

Why should you consider Fluid Query?

It offers many possible uses for solving problems in your business. Here are a few ideas:
• Discover and explore “Day Zero” data landing in your Hadoop environment
• Query data from multiple cross-enterprise repositories to understand relationships
• Access structured data from common sources like Oracle, SQL Server, MySQL, and PostgreSQL
• Query historical data on Hadoop via Hive, BigInsights Big SQL or Impala
• Derive relationships between data residing on Hadoop, the cloud and on-premises
• Offload colder data from PureData System for Analytics to Hadoop to free capacity
• Drive business continuity through a low-fidelity disaster recovery solution on Hadoop
• Back up your database or a subset of data to Hadoop in an immutable format
• Incrementally feed analytics side-cars residing on Hadoop with dimensional data

By far the most prominent uses of Fluid Query for a data warehouse administrator are warehouse augmentation, capacity relief, and replicating analytics side-cars for analysts and data scientists.

New: Hadoop connector support for Hadoop file formats to increase flexibility

IBM Fluid Query 1.7 ushers in greater flexibility for Hadoop users with support for popular file formats typically used with HDFS. These include popular data storage formats like Avro, Parquet, ORC and RC that are often used to manage big data in a Hadoop environment.

Choosing the best format and compression mode can make a drastic difference in performance and on-disk storage. A file format that doesn’t support flexible schema evolution can impose a processing penalty when you make simple changes to a table. Let’s just say that if you live in the Hadoop domain, you know exactly what I am speaking of. For instance, if you want to use Avro, do your tools have compatible readers and writers? If you are using Impala, do you know that it doesn’t support ORC, or that Hortonworks and Hive-Stinger don’t play well with Parquet? Double-check your needs and tool sets before diving into these popular format types.

By providing support for these popular formats, Fluid Query allows you to import, store, and access this data through local tools and utilities on HDFS. But here is where it gets interesting in Fluid Query 1.7: you can also query data in these formats through the Hadoop connector provided with IBM Fluid Query, without any change to your SQL!
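To make this concrete, below is a minimal sketch of what such a query could look like from a client application, assuming the Parquet (or Avro, ORC, RC) data on HDFS has already been exposed to the warehouse through the Fluid Query Hadoop connector. The JDBC driver path, connection details, and the SALES_ARCHIVE and CUSTOMERS names are illustrative assumptions, not product documentation.

```python
# Minimal sketch: one SQL dialect over warehouse and Hadoop-backed data.
# Assumes the Netezza JDBC driver jar and a Fluid Query-exposed object named
# SALES_ARCHIVE; all hosts, credentials, and table names are illustrative.
import jaydebeapi

conn = jaydebeapi.connect(
    "org.netezza.Driver",                     # Netezza/PDA JDBC driver class
    "jdbc:netezza://pda-host:5480/SALESDB",   # hypothetical appliance and database
    ["admin", "password"],
    "/opt/jdbc/nzjdbc.jar",                   # illustrative path to the driver jar
)
try:
    cur = conn.cursor()
    # The join reads like an ordinary local query: CUSTOMERS lives in the
    # warehouse, while SALES_ARCHIVE is assumed to resolve to Parquet files
    # on HDFS via the Hadoop connector -- no change to the SQL itself.
    cur.execute("""
        SELECT c.region, SUM(a.amount) AS archived_revenue
        FROM   SALES_ARCHIVE a
        JOIN   CUSTOMERS c ON c.customer_id = a.customer_id
        GROUP  BY c.region
    """)
    for region, revenue in cur.fetchall():
        print(region, revenue)
finally:
    conn.close()
```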

New: Robust connector templates

In addition, Fluid Query 1.7 makes available a more robust set of connector templates designed to help you jump-start your use of Fluid Query. You may recall that the prior release provided a generic connector that allows you to configure and connect to any structured data store via JDBC. With the 1.7 release we are offering pre-defined templates so you can get up and running more quickly. In cases where user data type mappings differ, we also provide mapping files to simplify access. If you have your own favorite database, you can use our generic connector, along with any of the provided templates, as a basis for building a new connector for your specific needs. There are templates for Oracle, Teradata, SQL Server, MySQL, PostgreSQL, Informix, and MapR for Hive.
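To see what the templates abstract away, here is a hedged, stand-alone illustration of the underlying idea: any store reachable over JDBC can serve rows back for analysis. Plain Python stands in for the appliance-side connector here, and the PostgreSQL driver path, URL, credentials, and table are all invented for the sketch.

```python
# Hedged illustration of the generic JDBC idea behind the connector templates:
# a vanilla JDBC round trip to a hypothetical PostgreSQL source. The templates
# described above pre-package this kind of configuration (driver class, URL,
# type mappings) per source database.
import jaydebeapi

pg = jaydebeapi.connect(
    "org.postgresql.Driver",
    "jdbc:postgresql://pg-host:5432/inventory",   # hypothetical source system
    ["reporting", "password"],
    "/opt/jdbc/postgresql.jar",                   # illustrative driver jar path
)
try:
    cur = pg.cursor()
    cur.execute("SELECT sku, qty_on_hand FROM stock WHERE qty_on_hand < 10")
    low_stock = cur.fetchall()   # rows that would flow back to the warehouse tier
    print(low_stock)
finally:
    pg.close()
```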

Again, the primary focus of Fluid Query is to deliver open data access across your ecosystem. Whether the data resides on disk, in memory, in the cloud or on Hadoop, we strive to enable your business to be open for data. We recognize that you are up against significant challenges in meeting the demands of the business and the marketplace, with access and federation among the top priorities.

New: Data movement advances

Moving data is not always the best choice. Businesses spend quite a bit of effort ingesting data, staging it, and scrubbing, prepping and scoring it for consumption by business users. This is a costly process. As we move closer and closer to virtualization, the goal is to move the smallest amount of data possible while you access and query only the data you need. So not only is access paramount, but your knowledge of the data in your environment is crucial to using it efficiently.

Fluid Query does offer data movement capability through what we call Fast Data Movement. Focusing on the pipe between PDA and Hadoop, we offer a high-speed transfer tool that allows you to move data between these two environments very efficiently and securely. You have control over the security, compression, format and WHERE clause (database, table, filtered data). A key benefit is our ability to transfer data in our proprietary binary format, which delivers orders-of-magnitude better performance than Sqoop when you do have to move data.

Fluid Query 1.7 also offers some additional benefits:
• Kerberos support for our generic database connector
• Support for BigInsights Big SQL during import (automatically synchronizes Hive and Big SQL on import)
• Varchar and String mapping improvements
• Import via the nz.fq.table parameter now supports a combination of multiple schemas and tables
• Improved date handling
• Improved validation for NPS and Hadoop environments (connectors and import/export)
• Support for BigInsights 4.1 and Cloudera 5.5.1
• A new Best Practices User Guide, plus two new Tutorials

You can download this release from IBM’s Fix Central, or from the Netezza Developer’s Network for use with the Netezza Emulator through our non-warranted software.


Take a test drive today!

About Doug
Doug has over 20 years of combined technical and management experience in the software industry, with an emphasis on customer service and, more recently, product management. He is currently part of a highly motivated product management team that is both inspired by and passionate about the IBM PureData System for Analytics product portfolio.

What is the fundamental difference between “ETL” and “ELT” in the world of big data?

By Ralf Goetz

Initially, it seems like just a different sequence of the two characters “T” and “L”. But this difference often separates successful big data projects from failed ones. Why is that? And how can you avoid falling into the most common data management traps around mastering big data? Let’s examine this topic in more detail.

Why are big data projects different from traditional data warehouse projects?

Big data projects are mostly characterized by one or a combination of these four (or five) data requirements:

  • Volume: the volume of (raw) data
  • Variety: the variety (e.g. structured, unstructured, semi-structured) of data
  • Velocity: the speed of data processing, consumption or analysis of data
  • Veracity: the level of trust in the data
  • (Value): the value behind the data

For big data, each of the “V”s is bigger by orders of magnitude. For example, a traditional data warehouse usually holds several hundred gigabytes to a few terabytes of data, while big data projects typically handle data volumes of hundreds or even thousands of terabytes. Another example: traditional data warehouse systems manage and process only structured data, whereas typical big data projects need to manage and process both structured and unstructured data.

With this in mind, it is obvious that traditional technologies and methodologies for data warehousing may not be sufficient to handle these big data requirements.

Mastering the data and information supply chain using traditional ETL

This brings us to a widely adopted methodology for data integration called “Extraction, Transformation and Load” (ETL). ETL is a very common methodology in data warehousing and business analytics projects and can be performed by custom programming (e.g. scripts or custom ETL applications) or with the help of state-of-the-art ETL platforms such as IBM InfoSphere Information Server.
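To illustrate the restriction at the heart of ETL, here is a deliberately simplified, self-contained sketch in plain Python (file names and fields are invented): because the transform runs before the load, anything filtered out here never reaches the warehouse and cannot be recovered later.

```python
# Deliberately simplified ETL sketch: transform happens BEFORE load, so only
# the rows and columns selected up front ever reach the staging area.
# File and field names are hypothetical.
import csv

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        # The restriction: "unimportant" rows and columns are dropped for good.
        if row["status"] != "cancelled":
            yield {"order_id": row["order_id"],
                   "amount": round(float(row["amount"]), 2)}

def load(rows, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("orders_raw.csv")), "orders_staged.csv")
```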


The fundamental concept behind most ETL implementations is the restriction of the data in the supply chain. Only data that is presumably important is identified, extracted and loaded into a staging area inside a database and, later, into the data warehouse. “Presumably” is the weakness in this concept. Who really knows which data will be required for which analytic insight and requirement, now and tomorrow? Who knows which legal or regulatory requirements must be followed in the months and years to come?

Each change in the definition and scope of the information and data supply chain requires a considerable amount of effort, time and budget, and is a risk for any production system. There must be a resolution for this dilemma – and here it comes.

A new “must follow” paradigm for big data: ELT

Just a little change in the sequence of two letters will mean everything to the success of your big data project: ELT (Extraction, Load and Transform). This change seems small, but the difference lies in the overall concept of data management.  Instead of restricting the data sources to only “presumably” important data (and all the steps this entails), what if we take all available data, and put it into a flexible, powerful big data platform such as the Hadoop-based IBM InfoSphere BigInsights system?


Data storage in Hadoop is flexible, powerful, almost unlimited, and cost efficient since it can use commodity hardware and scales across many computing nodes and local storage.

Hadoop is a schema-on-read system. It allows the storage of all kinds of data without knowing its format or definition (e.g. JSON, images, movies, text files, spreadsheets, log files and many more). Without the limitation on the amount of extracted data that the ETL methodology imposes, we can be sure that we have all the data we need today and may need in the future. This also reduces the effort required to identify the “important” data – that step can literally be skipped: we take all we can get and keep it!
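As a hedged illustration of schema-on-read, the following PySpark sketch reads raw JSON events straight off HDFS with no predefined table definition and derives a structure only at read time. The HDFS path and field names are invented for the example.

```python
# Schema-on-read sketch: raw JSON events were landed on HDFS as-is; a schema
# is inferred only when the data is read. Paths and fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-schema-on-read").getOrCreate()

# No CREATE TABLE, no up-front modeling: Spark derives the structure from the
# files themselves at read time.
events = spark.read.json("hdfs:///landing/clickstream/2016/*/*.json")
events.printSchema()

# The transform happens AFTER the load, and only for the question at hand.
daily_visits = (events
                .where(events.event_type == "page_view")
                .groupBy("visit_date")
                .count())
daily_visits.show()
```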


Since Hadoop offers a scalable data storage and processing platform, we can use it as a replacement for the traditional staging area inside a database. From here we can take only the data that is required today and analyze it, either directly with a business intelligence platform such as IBM Cognos or IBM SPSS, or through an intermediate layer with deep and powerful analytic capabilities such as IBM PureData System for Analytics.
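Continuing the sketch above, only the small, curated aggregate – not the raw event store – would move on to the analytics layer. Here daily_visits is the DataFrame computed in the previous sketch, and the JDBC URL, driver class, credentials and target table are purely illustrative.

```python
# Continuation of the schema-on-read sketch: only the curated, "valued"
# aggregate leaves Hadoop for the downstream warehouse/analytics layer.
(daily_visits.write
    .format("jdbc")
    .option("url", "jdbc:netezza://pda-host:5480/ANALYTICS")  # hypothetical target
    .option("driver", "org.netezza.Driver")
    .option("dbtable", "DAILY_VISITS")
    .option("user", "admin")
    .option("password", "password")
    .mode("append")
    .save())
```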

Refining raw data and gaining valuable insights

Hadoop is great for storing and processing raw data, but applying powerful, lightning-fast, complex analytic queries is not its strength, so another analytics layer makes sense. PureData System for Analytics is the perfect place for the subsequent in-database analytic processing of “valued” data because of its massively parallel processing (MPP) architecture and its rich set of analytic functions. PureData can resolve even the most complex analytic queries in a fraction of the time needed by traditional relational databases. And it scales – from a big data starter project with only a couple of terabytes of data to a petabyte-sized PureData cluster.


IBM offers everything you need to master your big data challenges. You can start very small and scale with your growing requirements. Big data projects can be fun with the right technology and services!

About Ralf Goetz 
Ralf is an Expert Level Certified IT Specialist in the IBM Software Group. Ralf joined IBM through the Netezza acquisition in early 2011. For several years he led the Informatica tech-sales team in the DACH region and the Mahindra Satyam BI competency team in Germany. He then became a technical pre-sales representative for Netezza and later for the PureData System for Analytics. Ralf still focuses on PDA but also supports the technical sales of all IBM big data products. Ralf holds a Master’s degree in computer science.

IBM Fluid Query 1.0: Efficiently Connecting Users to Data

by Rich Hughes

Launched on March 27th, IBM Fluid Query 1.0 opens doors of “insight opportunity” for IBM PureData System for Analytics clients. In the evolving data ecosystem, users want and need accessibility to a variety of data stores in different locations. This only makes sense, as newer technologies like Apache Hadoop have broadened analytic possibilities to include unstructured data. Hadoop is the data source that accounts for most of the increase in data volume. By observation, the world’s data is doubling about every 18 months, with some estimates putting the 2020 data volume at 40 zettabytes, or 40 × 10^21 bytes. This would represent roughly a 20-fold growth by decade’s end over the 2011 world data total of 1.8 × 10^21 bytes.¹ IT professionals as well as the general public can intuitively feel the weight and rapidity of data’s prominence in our daily lives. But how can we cope with, and not be overrun by, relentless data growth? The answer lies, in part, with better data access paths.



IBM Fluid Query 1.0 – What is it?

IBM Fluid Query 1.0 is a software feature of PureData that provides access to data in Hadoop from PureData appliances. Fluid Query also promotes the fast movement of data between big data ecosystems and PureData warehouses. Enabling both query and data movement, this new technology connects PureData appliances with common Hadoop systems: IBM BigInsights, Cloudera, and Hortonworks. Fluid Query allows results from PureData database tables and Hadoop data sources to be merged, creating powerful analytic combinations.



IBM® Fluid Query Benefits

Fluid Query makes practical use of existing SQL developer skills. Workbench tools yield productivity gains because SQL remains the query language of choice when PureData and Hadoop schemas logically merge. Fluid Query is the physical bridge whereby a query is pushed efficiently to where the data resides, whether that is in your data warehouse or in your Hadoop environment (see the sketch after the list below). Other benefits made possible by Fluid Query include:

  • better exploitation of Hadoop as a “Day 0” archive that is queryable with conventional SQL;
  • combining hot data from PureData with colder data from Hadoop; and
  • archiving colder data from PureData to Hadoop to relieve resources on the data warehouse.
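As a sketch of what the hot-plus-cold combination can look like, the query below unions current warehouse rows with archived rows assumed to sit on Hadoop behind a Fluid Query-exposed object; every table and column name is hypothetical, and the statement could be submitted over any standard client connection.

```python
# Hypothetical hot + cold query: ORDERS is a regular PureData table, while
# ORDERS_ARCHIVE is assumed to be a "Day 0" archive on Hadoop exposed through
# Fluid Query. One statement, one result set, two storage tiers.
HOT_PLUS_COLD = """
    SELECT order_id, order_date, amount
    FROM   ORDERS                 -- hot data in the warehouse
    WHERE  order_date >= '2015-01-01'
    UNION ALL
    SELECT order_id, order_date, amount
    FROM   ORDERS_ARCHIVE         -- colder data archived to Hadoop
    WHERE  order_date <  '2015-01-01'
"""
print(HOT_PLUS_COLD)
```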

Managing your share of Big Data Growth

Fluid Query provides data access between Hadoop and PureData appliances. Your current data warehouse, the PureData System for Analytics, can be extended in several important ways over this bridge to additional Hadoop capabilities. The coexistence of PureData appliances alongside Hadoop’s beneficial features is a best-of-breed approach where tasks are performed on the platform best suited for that workload. Use the PureData warehouse for production quality analytics where performance is critical to the success of your business, while simultaneously using Hadoop to discover the inherent value of full-volume data sources.

How does Fluid Query differ from IBM Big SQL technology?

Just as IBM PureData System for Analytics innovated by moving analytics to the data, IBM Big SQL moves queries to the correct data store. IBM Big SQL supports query federation to many data sources, including (but not limited to) IBM PureData System for Analytics; DB2 for Linux, UNIX and Windows database software; IBM PureData System for Operational Analytics; dashDB; Teradata; and Oracle. This allows users to send distributed requests to multiple data sources within a single SQL statement. IBM Big SQL is a feature included with IBM BigInsights for Apache Hadoop, which is itself an included software entitlement with IBM PureData System for Analytics. By contrast, many Hadoop and database vendors rely on significant data movement just to resolve query requests – a practice that can be time consuming and inefficient.

Learn more

Since March 27, 2015, IBM Fluid Query 1.0 has been generally available as a software addition for PureData System for Analytics customers. If you want to understand how to take advantage of IBM Fluid Query 1.0, check out these two sources: the on-demand webcast, Virtual Enzee – The Logical Data Warehouse, Hadoop and PureData System for Analytics, and the IBM Fluid Query solution brief. Update: Learn about Fluid Query 1.5, announced July 2015.

About Rich

Rich Hughes is an IBM Marketing Program Manager for Data Warehousing. Hughes has worked in a variety of information technology, data warehousing, and big data jobs, and has been with IBM since 2004. Hughes earned a Bachelor’s degree from Kansas University and a Master’s degree in Computer Science from Kansas State University. Writing about the original Dream Team, Hughes authored a book on the 1936 US Olympic basketball team, a squad composed of oil refinery laborers and film industry stage hands. You can follow him on Twitter: @rhughes134

Footnote:
1 “How Much Data is Out There” by Webopedia Staff, Webopedia.com, March 3, 2014.

Hybrid Data Warehousing – The Best of All Worlds

By Wendy Lucas

When it comes to data warehousing, organizations are progressing along the maturity curve at their own individual pace.  Today, most organizations have some form of warehouse and business intelligence in place, or recognize the need for it and the benefits it can drive.  But we all know that technology doesn’t stand still.  And so, you are now faced with a new step in your progression towards data warehouse maturity – the move to cloud.

Building Momentum

Cloud applications started with a fairly narrow focus.  A few years ago, you may have viewed the cloud as a viable platform for mobile applications or just a way to keep your contacts synchronized between your devices (by the way, that is still my favorite cloud use case).  IT organizations have begun looking to the cloud as a way to cut costs, but the strong momentum behind cloud adoption indicates there is more to it than that!

According to a recent IBM Tech Trends study, cloud adoption is up 92% since 2012. The same study shows that organizations identified as pacesetters are 10x more likely to increase workforce efficiency with the cloud, 5x more likely to enhance communication and collaboration, and report a 4x better customer experience. Pick your research outlet and you will find similar statistics.

One of Forrester’s top cloud computing predictions for 2015 is that “hybrid cloud management gets real” in terms of having the tools to allow you to manage across multiple on-premise and cloud platforms.

I believe that cloud use cases are the driving factor behind the growth and momentum of cloud technologies. Data warehousing on the cloud is no exception; the general need is to deliver analytics to the organization faster. Let’s explore specific data warehouse use cases.

Use case 1: development, testing, prototyping and sandboxing

A safe place to start might be establishing a cloud environment for warehouse development and testing. Do you need the ability to test key functions like ETL processes or analytic applications without setting up more costly infrastructure on-premise? Why not consider testing in the cloud? Perhaps you need an environment in which to do quick prototyping or sandboxing? Whether the environment is temporary or persistent, a cloud data warehouse instance can be stood up quickly and used for prototyping and sandboxing at very minimal cost.

Use case 2: do more with less when you are at capacity

Organizations are also considering cloud as a way to expand capacity of their existing data warehouse.  In the context of the logical data warehouse, data assets can reside on the cloud to serve up specific types of data to specific applications.

Use case 3: self-service analytics

Organizations can use the cloud as a data layer for self-service business intelligence and analytic capability, especially for applications that need data that is already in the cloud – for example, if your marketing organization needs to analyze unstructured social media data.

Both IT and the line of business can benefit from these and other use cases. IT organizations are able to reduce infrastructure costs and simplify budgets by shifting capital expense to an operational expense model. Perhaps most importantly, the flexibility and agility of a cloud option provide faster time to insight for end users who need answers immediately.

What should I move to the cloud?

If cloud is so great, why not move everything to the cloud? The reality is that some applications will remain on-premise for some time to come (or forever). Systems that rely on large amounts of on-premise or sensitive data, or that generate large volumes of data, may not be easily moved to the cloud. It may make sense to leave these in the on-premise data warehouse systems that have matured over decades and are fulfilling the needs of those applications quite well. But as discussed above, you may not want to incur capital expenditure or longer deployment times for things like data marts, development and test environments, or analytics on data already in the cloud – these represent ideal opportunities to use a cloud data warehouse.

There isn’t a one-size-fits-all answer, which is why hybrid environments make the most sense. A hybrid environment can provide the best of all worlds – the ability to keep your large, on-premise warehouses in place, comply with security and regulatory reporting requirements, and fulfill the needs of traditional reporting and analysis, all while continuing to reduce costs and increase flexibility and speed of deployment for new applications in the cloud. Just like most things, it’s best to pick the right tool for the job.

What tools can help me get there?

IBM data warehouse solutions offer the breadth and depth of capabilities required to effectively support a hybrid environment. On cloud, IBM dashDB is our exciting new data warehouse and analytics service, which concluded its beta program on December 18th and is now generally available, including the Cloudant-integrated and Enterprise plans. It pulls together the lightning-fast performance of DB2 with BLU Acceleration and the market-leading in-database analytic capabilities of Netezza. Think of it as the combination of the fastest data warehouse and analytic platform with the flexibility and agility of the cloud. dashDB will continue to evolve in a way that preserves analytic and application portability between the cloud and on-premise systems. Most importantly, as you modernize with a hybrid cloud approach, the Enterprise plan is available to support you at scale.

And of course, on premises, IBM offers DB2 with BLU Acceleration as a software-only solution and the IBM PureData System for Analytics as a ready-to-go data warehouse appliance. Putting these pieces together, you can support your hybrid data warehousing needs with proven technologies that offer the best of all worlds.

For more information, please visit dashDB.com.

About Wendy

Wendy Lucas is a Program Director for IBM Data Warehouse Marketing. Wendy has over 20 years of experience in data warehousing and business intelligence solutions, including 12 years at IBM. She has helped clients in a variety of roles, including application development, management consulting, project management, technical sales management and marketing. Wendy holds a Bachelor of Science in Computer Science from Capital University and you can follow her on Twitter at @wlucas001

dashDB grows and improves in a flash!

By Dennis Duckworth

IBM® dashDB™ continues to grow and improve, with several announcements made on December 18, 2014. Additional plans are now available to everyone, and new features have been added to dashDB.

New Deployment Options

  • Enterprise Plan available to all – The Enterprise Plan for dashDB is a dedicated cloud infrastructure with tera-scale capacity. This offering is now available to anyone. Contact your IBM Information Management Sales representative to get started!
  • Cloudant deployments of dashDB now offer higher capacity, paid usage plans – We are adding an Entry plan that is fully integrated with Cloudant supporting up to 50 GB of uncompressed data, for $50/month. The freemium offering for data usage below 1 GB will remain.
  • Expanded Geographic Presence – dashDB can now be deployed in our UK availability region in addition to our existing North American region.

New Cool and Useful Features

As a result of input from beta program participants, we have added new features to dashDB to make it even more useful:

  • Improved SQL Editor – The SQL query and editor capabilities have been expanded to allow a full range of SQL to be submitted via the web browser, including the ability to load and save SQL scripts. SQL validation and error checking is also included.


  • Better Workload Monitoring – Get a much better idea of what’s running in your dashDB instance, including specific statements and connections. Set it to monitor in near real-time, drill into the details of a session, or terminate a session if needed.


  • Command Line Support – Sometimes you need a command line interface for scripting and automation. dashDB now includes CLPPlus support, giving you a command line user interface that lets you connect to databases and define, edit, and run statements, scripts, and commands.
  • UDX Support – Some applications or algorithms require user-defined functions (UDFs) and user-defined aggregates (UDAs). dashDB now supports these out of the box, so you can implement and run your own algorithms right inside the database (a minimal sketch follows this list).
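As a minimal sketch of one flavor of UDX, the snippet below registers a simple SQL scalar UDF over a dashDB connection and calls it; the hostname, credentials, and the function itself are illustrative assumptions, not shipped samples.

```python
# Hedged sketch: create and call a simple SQL scalar UDF in dashDB using the
# ibm_db driver. Host, port, credentials and the function are illustrative.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=dashdb-host.example.com;PORT=50000;"
    "PROTOCOL=TCPIP;UID=user;PWD=password;", "", "")

ibm_db.exec_immediate(conn, """
    CREATE OR REPLACE FUNCTION FAHRENHEIT_TO_CELSIUS(F DOUBLE)
    RETURNS DOUBLE
    LANGUAGE SQL
    RETURN (F - 32) * 5.0 / 9.0
""")

stmt = ibm_db.exec_immediate(
    conn, "SELECT FAHRENHEIT_TO_CELSIUS(98.6) AS C FROM SYSIBM.SYSDUMMY1")
print(ibm_db.fetch_assoc(stmt))   # e.g. {'C': 37.0}
ibm_db.close(conn)
```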

New Security Features and Capabilities

Data security is always a consideration, and dashDB now includes new security features:

  • SSL Support for all Connections – It’s not enough to automatically encrypt data at rest; we need to encrypt it in motion too. dashDB now supports SSL for all connections to the database (a connection sketch follows this list).


  • Select Guardium Reports for all Plans – dashDB now has bundled Guardium reports for all plans, including the Enterprise and Cloudant-integrated plans. This allows for automatic discovery of sensitive data, as well as access reports and details of SQL statements that were run against that data.
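As a minimal connection sketch for the SSL point above, the snippet below opens an encrypted session with the ibm_db driver; Security=SSL is the DB2 connection keyword that requests encryption, and the host, port, and credentials are illustrative.

```python
# Hedged sketch: an SSL-protected dashDB connection via ibm_db. The
# Security=SSL keyword asks the driver to encrypt the session; host, port
# and credentials are illustrative.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=dashdb-host.example.com;PORT=50001;"
    "PROTOCOL=TCPIP;UID=user;PWD=password;Security=SSL;", "", "")
print(ibm_db.server_info(conn).DBMS_NAME)   # confirm we are connected
ibm_db.close(conn)
```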


We will continue to add new features and capabilities to dashDB over the coming months, so watch this space!

If you have not started analyzing your data in dashDB, what are you waiting for? Get started with dashDB on Bluemix or Cloudant at dashDB.com.

About Dennis Duckworth

Dennis Duckworth, Program Director of Product Marketing for Data Management & Data Warehousing, has been in the data game for quite a while, doing everything from Lisp programming in artificial intelligence to managing a sales territory for a database company. He has a passion for helping companies and people get real value out of cool technology. Dennis came to IBM through its acquisition of Netezza, where he was Director of Competitive and Market Intelligence. He holds a degree in Electrical Engineering from Stanford University but has spent most of his life on the East Coast. When not working, Dennis enjoys sailing off his backyard on Buzzards Bay and he is relentless in his pursuit of wine enlightenment. You can follow Dennis on Twitter.

The Logical Data Warehouse: Two Easy Pieces (DW+Hadoop)

By Dennis Duckworth

In some of our recent blogs, we have described our Data Warehouse Point of View and our Zone Architecture for Big Data. We developed these from our experiences with our customers, seeing what worked (and what didn’t). Our goal is to encourage those who are just starting out on their analytics journeys, or those who are disappointed by the performance or rigidity of their existing data warehouse environments, to at least consider the advantages of separating data (and the corresponding analytics) into different zones based on the characteristics of both. We have been using the term Data Warehouse Modernization to describe the renovation of old, traditional, monolithic data warehouses (along with other data silos) into hybrid, integrated, or logical data warehouse models.

In a sort of modernization of our own, we have reexamined how we go to market with our data warehouse and data management products to see how we might make it easier for our customers to implement the best practices that we actively promote. With the recent release of our latest data warehouse appliance, the PureData System for Analytics (PDA) N3001 (codename Mako), we had the chance to make some changes. Now, for example, every PDA appliance we ship (every configuration, from the smallest, the “Mako-mini” 2-server rack-mountable appliance, all the way up to our largest, the 8-rack system) includes license entitlements for other IBM software products that we firmly believe can help our customers create a modern, flexible, high-performance logical data warehouse environment. One of those entitlements is for IBM InfoSphere BigInsights for Hadoop.

Studies are proving out our opinion that the logical data warehouse is a critical contributor to analytic success for enterprises. In the recently released 2014 IBM Institute for Business Value analytics study, companies were analyzed and categorized by the extent and effectiveness of their use of analytics. Those in the top category, the “front runners”, use data to the greatest benefit. They have been successful in “blending” their traditional business intelligence infrastructures with big data technologies to create agility and flexibility in the way they ingest, manage and use data. Quite interestingly, and consistent with our guidance in these blogs, almost all of the front runners (92 percent) have an integrated (or hybrid) data warehouse and, as part of that, they are 10 times more likely than other organizations to have a big data landing platform. In practice, they have implemented what we have called zone architecture to allow them to collect and analyze a wider variety of data, empowering their employees to make full use of their traditional data and new types of data together.


Our customers are also providing proof that data warehouse modernization works. How are these customers using BigInsights and these big data landing platforms? Many are creating what we have been calling data reservoirs. As you may recall from our blogs here and from the hundreds/thousands of other posts on the topic, Hadoop is finding a home in the enterprise as the preferred technology for data reservoirs. These are landing areas for all the data you think may be useful in your company, whether it is structured, unstructured, or semi-structured. Some more specific examples: One of our customers is using BigInsights in combination with the PureData System for Analytics to help it convert users of its free cloud service to customers for their paid service, using predictive analytics on user behavior (structured and unstructured data) to target them more accurately with offers. Another, a telco, is using BigInsights with PDA along with InfoSphere Streams to get a 360° view of its customers and to enable them to react in real-time to customer satisfaction issues. (The InfoSphere Streams entitlement with PDA will be the topic of a future blog.)

The BigInsights entitlement that comes with the N3001 PureData System for Analytics is for 5 virtual nodes which, by our calculations, gives you the ability to manage about 100 TB of data. So this is not a useless little demo version – this license gives you the ability to create and use a full-blown Hadoop cluster with all of the advantages that BigInsights has to offer: Big SQL for SQL access to the data in BigInsights, BigSheets (which enables Excel-like spreadsheet exploration of the data), the text analytics accelerator, Big R (which allows you to explore, visualize, transform, and model big data using familiar R syntax), and a long list of other features and capabilities. You get all of this (and much more) with every N3001 PureData System for Analytics. With software entitlements like this, we allow you to practice what we preach: modernize your data management environment by putting data and the corresponding analytics on the proper platform.

About Dennis Duckworth

Dennis Duckworth, Program Director of Product Marketing for Data Management & Data Warehousing, has been in the data game for quite a while, doing everything from Lisp programming in artificial intelligence to managing a sales territory for a database company. He has a passion for helping companies and people get real value out of cool technology. Dennis came to IBM through its acquisition of Netezza, where he was Director of Competitive and Market Intelligence. He holds a degree in Electrical Engineering from Stanford University but has spent most of his life on the East Coast. When not working, Dennis enjoys sailing off his backyard on Buzzards Bay and he is relentless in his pursuit of wine enlightenment. You can follow Dennis on Twitter.