Three session guides get you started with data warehousing at IBM Insight at World of Watson

Join us October 24 to 27, 2016 in Las Vegas!

by Cindy Russell, IBM Data Warehouse marketing

IBM Insight has been the premier data management and analytics event for IBM analytics technologies, and 2016 is no exception.  This year, IBM Insight is being hosted along with World of Watson and runs from October 24 to 27, 2016 at the Mandalay Bay in Las Vegas, Nevada.  It includes 1,500 sessions across a range of technologies and features keynotes by IBM President and CEO, Ginni Rometty; Senior Vice President of IBM Analytics, Bob Picciano; and other IBM Analytics and industry leaders.  Every year, we include a little fun as well, and this year the band is Imagine Dragons.

IBM data warehousing sessions will be available across the event as well as in the PureData System for Analytics Enzee Universe (Sunday, October 23).  Below are product-specific quick reference guides that enable you to see at a glance key sessions and activities, then plan your schedule.  Print these guides and take them with you or put the links to them on your phone for reference during the conference.

This year, the Expo floor is called the Cognitive Concourse, and we are located in the Monetizing Data section, Cognitive Cuisine experience area.  We’ll take you on a tour across our data warehousing products and will have some fun as we do it, so please stop by.  There is also a demo room where you can see live demos and engage with our technical experts, as well as a series of hands-on labs that let you experience our products directly.

The IBM Insight at World of Watson main web page is located here.  You can register and then use the agenda builder to create your personalized schedule.

IBM PureData System for Analytics session reference guide

Please find the session quick reference guide for PureData System for Analytics here: ibm.biz/wow_enzee

Enzee Universe is a full day of dedicated PureData System for Analytics / Netezza sessions that is held on Sunday, October 23, 2016.  To register for Enzee Universe, select sessions 3459 and 3461 in the agenda builder tool.  This event is open to any full conference pass holder.

During the regular conference, there are also more than 35 PureData, Netezza, and IBM DB2 Analytics Accelerator for z/OS (IDAA) technical sessions across all the conference tracks, as well as hands-on labs.  Several sessions are being presented by IBM clients so you can see how they put PureData System for Analytics to use.  Click the link above to see the details.

IBM dashDB Family session reference guide

Please find the session quick reference guide for the dashDB family here: ibm.biz/wow_dashDB

There are more than 40 sessions for dashDB, including a “Meet the Family” session that will help you become familiar with new products in this family of modern data management and data warehousing tools.  There is also a “Birds of a Feather” panel discussion on Hybrid Data Warehousing, and one that describes some key use cases for dashDB.  You can also see a demo, take in a short theatre session or try out a hands-on lab.

IBM BigInsights, Hadoop and Spark session reference guide

Please find the session quick reference guide for BigInsights, Hadoop and Spark topics here: ibm.biz/wow_biginsights

There are more than 65 sessions related to IBM BigInsights, Hadoop and Spark, with several hands-on labs and theatre sessions. There is everything from an Introduction to Data Science to Using Spark for Customer Intelligence Analytics to hybrid cloud data lakes to client stories of how they use these technologies.

Overall, it is an exciting time to be in the data warehousing and analytics space.  This conference represents a great opportunity to build depth on IBM products you already use, learn new data warehousing products, and look across IBM to learn completely new ways to employ analytics—from Watson to Internet of Things and much more.  I hope to see you there.

IBM Fluid Query 1.7 is Here!

by Doug Dailey

IBM Fluid Query offers a wide range of capabilities to help your business adapt to a hybrid data architecture and, more importantly, helps you bridge “data silos” for deeper insights that leverage more data. Fluid Query is a standard entitlement included with the Netezza Platform Software suite for PureData System for Analytics (formerly Netezza). Fluid Query release 1.7 is now available, and you can learn more about its features below.

Why should you consider Fluid Query?

It offers many ways to solve business problems. Here are a few ideas:
• Discover and explore “Day Zero” data landing in your Hadoop environment
• Query data from multiple cross-enterprise repositories to understand relationships
• Access structured data from common sources like Oracle, SQL Server, MySQL, and PostgreSQL
• Query historical data on Hadoop via Hive, BigInsights Big SQL or Impala
• Derive relationships between data residing on Hadoop, the cloud and on-premises
• Offload colder data from PureData System for Analytics to Hadoop to free capacity
• Drive business continuity through a low-fidelity disaster recovery solution on Hadoop
• Back up your database or a subset of data to Hadoop in an immutable format
• Incrementally feed analytics side-cars residing on Hadoop with dimensional data

By far, the most prominent uses of Fluid Query for a data warehouse administrator are warehouse augmentation, capacity relief and replicating analytics side-cars for analysts and scientists.

New: Hadoop connector support for Hadoop file formats to increase flexibility

IBM Fluid Query 1.7 ushers in greater flexibility for Hadoop users with support for popular file formats typically used with HDFS. These include popular data storage formats like Avro, Parquet, ORC and RC that are often used to manage big data in a Hadoop environment.

Choosing the best format and compression mode can result in drastic differences in performance and storage on disk. A file format that doesn’t support flexible schema evolution can result in a processing penalty when making simple changes to a table. Let’s just say that if you live in the Hadoop domain, you know exactly what I am speaking of. For instance, if you want to use Avro, do your tools have readers and writers that are compatible? If you are using Impala, do you know that it doesn’t support ORC, or that Hortonworks and Hive-Stinger don’t play well with Parquet? Double-check your needs and tool sets before diving into these popular format types.

By providing support for these popular formats,  Fluid Query allows you to import, store, and access this data through local tools and utilities on HDFS. But here is where it gets interesting in Fluid Query 1.7: you can also query data in these formats through the Hadoop connector provided with IBM Fluid Query, without any change to your SQL!

New: Robust connector templates

In addition, Fluid Query 1.7 now makes available a more robust set of connector templates that are designed to help you jump start use of Fluid Query. You may recall we provided support for a generic connector in our prior release that allows you to configure and connect to any structured data store via JDBC. We are offering pre-defined templates with the 1.7 release so you can get up and running more quickly. In cases where there are differences in user data type mapping, we also provide mapping files to simplify access.  If you have your own favorite database, you can use our generic connector, along with any of the provided templates as a basis for building a new connector for your specific needs. There are templates for Oracle, Teradata, SQL Server, MySQL, PostgreSQL, Informix, and MapR for Hive.

Again, the primary focus for Fluid Query is to deliver open data access across your ecosystem. Whether the data resides on disk, in-memory, in the Cloud or on Hadoop, we strive to enable your business to be open for data. We recognize that you are up against significant challenges in meeting demands of the business and marketplace, with one of the top priorities around access and federation.

New: Data movement advances

Moving data is not the best choice. Businesses spend quite a bit of effort ingesting data, staging the data, and scrubbing, prepping and scoring the data for consumption by business users. This is a costly process. As we move closer and closer to virtualization, the goal is to move the smallest amount of data possible, while you access and query only the data you need. So not only is access paramount, but your knowledge of the data in your environment is crucial to using it efficiently.

Fluid Query does offer data movement capability through what we call Fast Data Movement. Focusing on the pipe between PDA and Hadoop, we offer a high-speed transfer tool that allows you to transfer data between these two environments very efficiently and securely. You have control over the security, compression, format and WHERE clause (DB, table, filtered data). A key benefit is our ability to transfer data in our proprietary binary format. When you do have to move data, this enables performance orders of magnitude better than Sqoop.

Fluid Query 1.7 also offers some additional benefits:
• Kerberos support for our generic database connector
• Support for BigInsights Big SQL during import (automatically synchronizes Hive and Big SQL on import)
• Varchar and String mapping improvements
• Import of nz.fq.table parameter now supports a combination of multiple schemas and tables
• Improved date handling
• Improved validation for NPS and Hadoop environment (connectors and import/export)
• Support for BigInsights 4.1 and Cloudera 5.5.1
• A new Best Practices User Guide, plus two new Tutorials

You can download this from IBM’s Fix Central or the Netezza Developer’s Network for use with the Netezza Emulator through our non-warranted software.

Take a test drive today!

About Doug,
Doug has over 20 years of combined technical and management experience in the software industry, with emphasis on customer service and, more recently, product management. He is currently part of a highly motivated product management team that is both inspired by and passionate about the IBM PureData System for Analytics product portfolio.

Things you need to know when switching from Oracle database to Netezza (Part 3)

by Andrey Vykhodtsev

In my previous two posts I covered the differences in architecture between IBM PureData System for Analytics and Oracle Database, as well as differences in SQL. (See below for links.) In this post, I am going to cover another important topic – additional structures that speed up data access.

Partitions, Indexes, Materialized Views

Oracle database relies on indexes, partitions and materialized views for performance. In Oracle, indexes are designed to speed up point searches or range searches that touch a very small percentage of the data. Because of the B-Tree index structure, if you touch a large percentage of the data, using the index will be much slower than a full scan of the whole table. If you have this problem, then you probably have decided to use partitioning. In Oracle, partitioning is a paid feature that comes only with certain editions. You also have materialized views, with which you can put the results of complex queries on disk for later re-use. These structures are designed with general-purpose use (analytical processing + transactional processing) in mind, and can be complex and unwieldy to maintain.

By contrast, with PureData you have fewer worries. The trade-off, as I said in my first post, is that PureData is not a general-purpose system, but rather an analytical-processing system.

We use ZoneMaps in PureData instead of indexes. In essence, a ZoneMap is just a table of minimum and maximum values for all columns that have certain types. ZoneMaps are extremely compact, and they don’t need to be created or maintained. But this is not all. ZoneMap filtering takes place at the hardware level. (Remember mention of FPGA, Field Programmable Gate Arrays in my first post?) The system will not scan data that does not need to be scanned for a particular query. Therefore I/O is greatly reduced. If you update data or delete data based on a condition, ZoneMaps also are taken into account.

Because of ZoneMaps, you don’t need to partition your data. ZoneMaps take advantage of the natural ordering of data. For example, if you insert data daily, ZoneMap on the date field will become completely sorted. Range searches on this field will be extremely fast.
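To make the idea concrete, here is a minimal conceptual sketch in Python. It is not how PureData implements ZoneMaps (the real filtering happens in the appliance, not in application code); the `Extent` class, the extent size and the sample data are all assumptions made up for illustration. It only shows how per-chunk minimum and maximum values let a range search skip chunks that cannot contain matching rows when data arrives in date order.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Extent:
    """A fixed-size chunk of rows plus the min/max of the date column (hypothetical)."""
    rows: list
    min_day: date
    max_day: date


def build_extents(rows, rows_per_extent):
    """Split rows into extents and record the min/max date seen in each one."""
    extents = []
    for start in range(0, len(rows), rows_per_extent):
        chunk = rows[start:start + rows_per_extent]
        days = [r[0] for r in chunk]
        extents.append(Extent(chunk, min(days), max(days)))
    return extents


def range_scan(extents, lo, hi):
    """Return matching rows and the number of extents that had to be read."""
    hits, scanned = [], 0
    for ext in extents:
        # Zone-map check: skip the extent if its range cannot overlap [lo, hi].
        if ext.max_day < lo or ext.min_day > hi:
            continue
        scanned += 1
        hits.extend(r for r in ext.rows if lo <= r[0] <= hi)
    return hits, scanned


# Hypothetical daily load: 10 rows per day for 100 days, inserted in date order.
first_day = date(2016, 1, 1)
rows = [(first_day + timedelta(days=d), d * 10 + i) for d in range(100) for i in range(10)]
extents = build_extents(rows, rows_per_extent=50)

hits, scanned = range_scan(extents, date(2016, 2, 1), date(2016, 2, 3))
print(f"{len(hits)} rows found, {scanned} of {len(extents)} extents scanned")
```

Because the rows were loaded in date order, each extent covers a narrow, non-overlapping date range, so the three-day search reads only one extent out of twenty in this toy data set.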

In addition to ZoneMaps, there are a couple of other techniques you can use to optimize query access to a certain table. The first is called CBT, Clustered Base Table. This is not a separate structure that needs to be maintained, but rather an internal table organization method. If you choose to make a table a CBT, you can specify up to four fields on which you will have extremely fast searches.

The only additional structure that PureData has is called a “Materialized View”, but this is a somewhat different concept from the one in Oracle. In PureData, a materialized view is a subset of columns from one table that can be sorted differently than the base table, therefore speeding up access on the sorted columns. Because materialized views are ZoneMapped, they have some properties of indexes, but they are not actually indexes. Materialized views might be needed if you have “tactical queries”, that is, queries that require fast and frequent access to small portions of data. Otherwise, you don’t usually need them.
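The sorted-projection idea can be sketched in a few lines of Python. This is a conceptual illustration only, not PureData's implementation: the table contents, the column names and the choice to store a position pointing back to the base row are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical base table in load order: (order_id, customer_id, amount, channel)
base = [
    (1, 42, 19.99, "web"),
    (2, 7, 5.00, "store"),
    (3, 42, 120.50, "web"),
    (4, 13, 64.10, "phone"),
    (5, 7, 33.30, "web"),
]

# The "materialized view": a subset of columns (customer_id, amount) from one
# table, kept sorted on customer_id, plus the position of the base row.
view = sorted((row[1], row[2], pos) for pos, row in enumerate(base))
keys = [entry[0] for entry in view]


def lookup(customer_id):
    """Find one customer's entries with a binary search on the sorted projection."""
    lo = bisect_left(keys, customer_id)
    hi = bisect_right(keys, customer_id)
    return view[lo:hi]


matches = lookup(42)
print([amount for _, amount, _ in matches])     # amounts for customer 42
print([base[pos][0] for _, _, pos in matches])  # order ids fetched from the base table
```

The base table stays in load order, while the narrow sorted copy serves the small, frequent "tactical" lookups.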

In Conclusion

As you see, in PureData it is much simpler to maintain efficient data access. Instead of creating and maintaining indexes for a subset of columns on each table, PureData automatically creates ZoneMaps for you. I know from experience what a nightmare index maintenance in a large data warehouse can be. Partitioning is another technique that is not needed in PureData. Instead of indexes and partitions, we use much simpler structures that are automatically maintained and applied at the hardware level (in the FPGA), at the speed of streaming data.
In my next posts, I am going to cover a few more topics that you need to be aware of when migrating from Oracle to PDA. Please stay tuned, and follow me on Twitter: @vykhand

Other posts in this series

About Andrey,
Andrey Vykhodtsev is a Big Data Technical Sales Expert covering the Central and Eastern Europe region at IBM. He has more than 12 years of experience in data warehousing and analytics, and has worked as a senior data warehouse developer, analyst, architect and consultant in multiple industries, including the financial sector and telecommunications.

IBM Fluid Query 1.0: Efficiently Connecting Users to Data

by Rich Hughes

Launched on March 27th, IBM Fluid Query 1.0 opens doors of “insight opportunity” for IBM PureData System for Analytics clients. In the evolving data ecosystem, users want and need accessibility to a variety of data stores in different locations. This only makes sense, as newer technologies like Apache Hadoop have broadened analytic possibilities to include unstructured data. Hadoop is the data source that accounts for most of the increase in data volume.  By observation, the world’s data is doubling about every 18 months, with some estimates putting the 2020 data volume at 40 zettabytes, or 40 × 10^21 bytes. This increase by decade’s end would represent a 20-fold growth over the 2011 world data total of 1.8 × 10^21 bytes (see footnote 1). IT professionals as well as the general public can intuitively feel the weight and rapidity of data’s prominence in our daily lives. But how can we cope with, and not be overrun by, relentless data growth? The answer lies, in part, with better data access paths.


IBM Fluid Query 1.0 – What is it?

IBM Fluid Query 1.0 is a specific software feature in PureData that provides access to data in Hadoop from PureData appliances. Fluid Query also promotes the fast movement of data between Big Data ecosystems and PureData warehouses.  Enabling query and data movement, this new technology connects PureData appliances with common Hadoop systems: IBM BigInsights, Cloudera, and Hortonworks. Fluid Query allows results from PureData database tables and Hadoop data sources to be merged, thus creating powerful analytic combinations.


IBM® Fluid Query Benefits

Fluid Query makes practical use of existing SQL developer skills. Workbench tools yield productivity gains because SQL remains the query language of choice when PureData and Hadoop schemas logically merge. Fluid Query is the physical bridge whereby a query is pushed efficiently to where the data resides, whether it is in your data warehouse or in your Hadoop environment. Other benefits made possible by Fluid Query include:

  • better exploitation of Hadoop as a “Day 0” archive that is queryable with conventional SQL;
  • combining hot data from PureData with colder data from Hadoop; and
  • archiving colder data from PureData to Hadoop to relieve resources on the data warehouse.

Managing your share of Big Data Growth

Fluid Query provides data access between Hadoop and PureData appliances. Your current data warehouse, the PureData System for Analytics, can be extended in several important ways over this bridge to additional Hadoop capabilities. The coexistence of PureData appliances alongside Hadoop’s beneficial features is a best-of-breed approach where tasks are performed on the platform best suited for that workload. Use the PureData warehouse for production quality analytics where performance is critical to the success of your business, while simultaneously using Hadoop to discover the inherent value of full-volume data sources.

How does Fluid Query differ from IBM Big SQL technology?

Just as IBM PureData System for Analytics innovated by moving analytics to the data, IBM Big SQL moves queries to the correct data store. IBM Big SQL supports query federation to many data sources, including (but not limited to) IBM PureData System for Analytics; DB2 for Linux, UNIX and Windows database software; IBM PureData System for Operational Analytics; dashDB; Teradata; and Oracle. This allows users to send distributed requests to multiple data sources within a single SQL statement. IBM Big SQL is a feature included with IBM BigInsights for Apache Hadoop, which is an included software entitlement with IBM PureData System for Analytics. By contrast, many Hadoop and database vendors rely on significant data movement just to resolve query requests, a practice that can be time consuming and inefficient.

Learn more

Since March 27, 2015, IBM® Fluid Query 1.0 has been generally available as a software addition for PureData System for Analytics customers. If you want to understand how to take advantage of IBM® Fluid Query 1.0, check out these two sources: the on-demand webcast, Virtual Enzee – The Logical Data Warehouse, Hadoop and PureData System for Analytics, and the IBM Fluid Query solution brief. Update: Learn about Fluid Query 1.5, announced July, 2015.

About Rich,

Rich Hughes is an IBM Marketing Program Manager for Data Warehousing.  Hughes has worked in a variety of Information Technology, Data Warehousing, and Big Data jobs, and has been with IBM since 2004.  Hughes earned a Bachelor’s degree from Kansas University, and a Master’s degree in Computer Science from Kansas State University.  Writing about the original Dream Team, Hughes authored a book on the 1936 US Olympic basketball team, a squad composed of oil refinery laborers and film industry stage hands. You can follow him on Twitter: @rhughes134

Footnote:
1 “How Much Data is Out There” by Webopedia Staff, Webopedia.com, March 3, 2014.

Fluid doesn’t just describe your coffee anymore … Introducing IBM Fluid Query 1.0

by Wendy Lucas

Having grown up in the world of data and analytics, I long for the days when our goal was to create a single version of the truth. Remember  when data architecture diagrams showed source systems flowing through ETL, into a centralized data warehouse and then out to business intelligence applications? Wow, that was nice and simple, right – at least conceptually? As a consultant, I can still remember advising clients and helping them to pictorially represent this reference architecture. It was a pretty simple picture, but that was also a long time ago.

While IT organizations struggled with data integration, enterprise data models and producing the single source of the truth, the lines of business grew impatient and would build their own data marts (or data silos).  We can think of this as the first signs of the requirement for user self-service. The goal behind building the consolidated, enterprise, single version of the truth never went away. Sure, we still want the ability to drive more accurate decision-making, deliver consistent reporting, meet regulatory requirements, etc. However, the ability to achieve this goal became very difficult as requirements for user self-service, increased agility, new data types, lower cost solutions, better business insight and faster time to value became more important.

Recognizing the Logical Data Warehouse

Enterprises have developed collections of data assets that each provide value for specific workloads and purposes. This includes data warehouses, data marts, operational data stores and Hadoop data stores to name a few. It is really this collection of data assets that now serves as the foundation for driving analytics, fulfilling the purpose of the data warehouse within the architecture. The Logical Data Warehouse or LDW is a term we use to describe the collection of data assets that make up the data warehouse environment, recognizing that the data warehouse is no longer just a single entity. Each data store within the Logical Data Warehouse can be built on a different platform, fit for the purpose of the workload and analytic requirements it serves.


But doesn’t this go against the single version of the truth? The LDW will still struggle to deliver on the goal behind the single version of the truth, if it doesn’t have information governance, common metadata and data integration practices in place. This is a key concept. If you’re interested in more on this topic, check out a recent webcast by some of my colleagues on the “Five Pitfalls to Avoid in Your Data Warehouse Modernization Project: Making Data Work for You.”

Unifying data across the Logical Data Warehouse

Logically grouping separate data stores into the LDW does not necessarily make our lives easier. Assuming you have followed good information governance practices, you still have data stores in different places, perhaps on different platforms. Haven’t you just made life infinitely more difficult for your application developers and for users who want self-service? Users need the ability to leverage data across these various data stores without having to worry about the complexity of where to find it, or re-writing their applications. And let’s not forget about the needs of IT. DBAs struggle to manage capacity and performance on data warehouses while listening to Hadoop administrators brag about the seemingly endless, lower cost storage and ability to manage new data types that they can provide. What if we could have the best of all worlds? Provide seamless access to data across a variety of stores, formats, and platforms. Provide capability for IT to manage Hadoop and data warehouses alongside each other in a way that leverages the strengths of both.

Introducing IBM Fluid Query

IBM Fluid Query is the capability to unify data across the Logical Data Warehouse, providing the ability to seamlessly access data in its various forms and locations. No matter where a user connects within the logical data warehouse, users have access to all data through the same, standard API/SQL/Analytics access. IBM Fluid Query powers the Logical Data Warehouse, giving users the ability to combine numerous types of data from various sources in a fast and agile manner to drive analytics and deeper insight, without worrying about connecting to multiple data stores, using different syntaxes or APIs or changing their application.

In its first release, IBM Fluid Query 1.0 will provide users of the IBM PureData System for Analytics the capability to access Hadoop data from their data warehouse and move data between Hadoop and PureData if needed. High performance is about moving the query to the data, not the data to the query. This provides extreme value to PureData users who want the ability to merge data from their structured data warehouse with Hadoop for powerful analytic combinations, or more in-depth analysis. IBM Fluid Query 1.0 is part of a toolkit within Netezza Platform Software (NPS) on the appliance so it’s free for all PureData System for Analytics customers.


For Hadoop users, IBM also provides IBM Big SQL which delivers Fluid Query capability. Big SQL provides the ability to run queries on a variety of data stores, including PureData System for Analytics, DB2 and many others from your IBM BigInsights Hadoop environment. Big SQL has the ability to push the query to the data store and return the result to Hadoop without moving all the data across the network. Other Hadoop vendors provide the ability to write queries like this but they move all the data back to Hadoop before filtering, applying predicates, joining, etc. In the world of big data, can you really afford to move lots of data around to meet the queries that need it?
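The difference between moving the query to the data and moving the data to the query can be illustrated with a small, self-contained Python sketch. It does not use Big SQL or Fluid Query: an in-memory SQLite database stands in for the remote data store, and the `clicks` table, its columns and the row counts are hypothetical. The only point is to compare how many rows cross the "network" in each approach.

```python
import sqlite3

# Stand-in for a remote data store; sqlite3 is used only so the sketch runs anywhere.
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE clicks (user_id INTEGER, country TEXT, ms INTEGER)")
remote.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [(i, "US" if i % 10 == 0 else "DE", i * 3) for i in range(1000)],
)


def move_data_then_filter(conn):
    """Pull every row out of the store, then apply the predicate locally."""
    moved = conn.execute("SELECT user_id, country, ms FROM clicks").fetchall()
    result = [row for row in moved if row[1] == "US"]
    return result, len(moved)


def push_query_to_data(conn):
    """Send the predicate to the store so only matching rows come back."""
    moved = conn.execute(
        "SELECT user_id, country, ms FROM clicks WHERE country = ?", ("US",)
    ).fetchall()
    return moved, len(moved)


slow_result, slow_moved = move_data_then_filter(remote)
fast_result, fast_moved = push_query_to_data(remote)

# Both approaches produce the same answer; only the transfer volume differs.
assert sorted(slow_result) == sorted(fast_result)
print(f"move-then-filter transferred {slow_moved} rows")
print(f"pushdown transferred {fast_moved} rows")
```

In this toy data set the pushed-down query transfers a tenth of the rows, and the gap widens as the predicate becomes more selective.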

IBM Fluid Query 1.0 is generally available on March 27 as a software addition for PureData System for Analytics customers. If you are an existing customer and want to understand how to take advantage of IBM Fluid Query 1.0, or if you would just like more information, I encourage you to listen to this on-demand webcast: Virtual Enzee – The Logical Data Warehouse, Hadoop and PureData System for Analytics, and check out the solution brief. Or if you are an existing PureData System for Analytics customer, download this software. Update: Learn about Fluid Query 1.5, announced July, 2015.

About Wendy,

Wendy Lucas is a Program Director for IBM Data Warehouse Marketing. Wendy has over 20 years of experience in data warehousing and business intelligence solutions, including 12 years at IBM. She has helped clients in a variety of roles, including application development, management consulting, project management, technical sales management and marketing. Wendy holds a Bachelor of Science in Computer Science from Capital University, and you can follow her on Twitter at @wlucas001

Join IBM at the TDWI Orlando Conference, Dec 7-12, 2014

By Amit Patel

You are invited to join IBM at the TDWI Orlando Conference, Dec 7-12 to learn how IBM’s next generation data management and BI solutions can advance your business. TDWI events are focused on helping attendees get the best business value from their data. For data and business professionals who are looking for a week of focused education and interaction around organizing and visualizing data, I encourage you to attend this event.

We would love to see you in the IBM Booth (#205) in the Expo to learn more about your particular requirements around data management and BI, and discuss how IBM solutions can help you gain quick and actionable insight from your data. You can enter a raffle to win a Kindle by visiting the IBM booth during the Expo Partner Member Reception on Monday, Dec 8, 5:15-7:15 PM.

On Wednesday, December 10, from 12:10-1:45 PM you are invited to attend special educational sessions from IBM:

  • Big Data News cases…What in the world are people doing with Hadoop, presented by Rola Shaar
  • Taking a more refined approach to Big Data – Why you need a Data Refinery, presented by Brian Vile
  • What’s new in the IBM PureData System for Analytics, presented by Rich Hughes
  • From Insight to Foresight with BI and Predictive Analysis, presented by David Clement

IBM is hosting a special luncheon on Monday, December 8 at 12:15 PM where I am presenting a session on dashDB, a brand-new, fully-managed data warehouse service in the cloud. You will get to learn why the cloud offers unique advantages for analytics and data warehousing, and what’s involved in moving analytics to the cloud. This is an invitation-only luncheon with very limited capacity, so please let us know if you’d like to join us for this special event.

I look forward to seeing you in Orlando!