Fluid doesn’t just describe your coffee anymore … Introducing IBM Fluid Query 1.0

by Wendy Lucas

Having grown up in the world of data and analytics, I long for the days when our goal was to create a single version of the truth. Remember when data architecture diagrams showed source systems flowing through ETL into a centralized data warehouse and then out to business intelligence applications? Wow, that was nice and simple, right – at least conceptually? As a consultant, I can still remember advising clients and helping them represent this reference architecture pictorially. It was a pretty simple picture, but that was also a long time ago.

While IT organizations struggled with data integration, enterprise data models and producing the single source of the truth, the lines of business grew impatient and built their own data marts (or data silos). We can think of this as the first sign of the demand for user self-service. The goal behind building the consolidated, enterprise-wide, single version of the truth never went away. Sure, we still want the ability to drive more accurate decision-making, deliver consistent reporting, meet regulatory requirements and so on. But achieving this goal became very difficult as user self-service, increased agility, new data types, lower-cost solutions, better business insight and faster time to value became more important.

Recognizing the Logical Data Warehouse

Enterprises have developed collections of data assets that each provide value for specific workloads and purposes. This includes data warehouses, data marts, operational data stores and Hadoop data stores to name a few. It is really this collection of data assets that now serves as the foundation for driving analytics, fulfilling the purpose of the data warehouse within the architecture. The Logical Data Warehouse or LDW is a term we use to describe the collection of data assets that make up the data warehouse environment, recognizing that the data warehouse is no longer just a single entity. Each data store within the Logical Data Warehouse can be built on a different platform, fit for the purpose of the workload and analytic requirements it serves.


But doesn’t this go against the single version of the truth? The LDW will still struggle to deliver on the goal behind the single version of the truth if it doesn’t have information governance, common metadata and data integration practices in place. This is a key concept. If you’re interested in more on this topic, check out a recent webcast by some of my colleagues on the “Five Pitfalls to Avoid in Your Data Warehouse Modernization Project: Making Data Work for You.”

Unifying data across the Logical Data Warehouse

Logically grouping separate data stores into the LDW does not necessarily make our lives easier. Assuming you have followed good information governance practices, you still have data stores in different places, perhaps on different platforms. Haven’t you just made life infinitely more difficult for the application developers and users who want self-service? Users need the ability to leverage data across these various data stores without having to worry about the complexity of where to find it, or re-writing their applications. And let’s not forget the needs of IT. DBAs struggle to manage capacity and performance on data warehouses while listening to Hadoop administrators brag about the seemingly endless, lower-cost storage and the new data types they can manage. What if we could have the best of all worlds: seamless access to data across a variety of stores, formats and platforms, and the capability for IT to manage Hadoop and data warehouses alongside each other in a way that leverages the strengths of both?

Introducing IBM Fluid Query

IBM Fluid Query is the capability to unify data across the Logical Data Warehouse, providing seamless access to data in its various forms and locations. No matter where users connect within the logical data warehouse, they have access to all data through the same standard API/SQL/analytics access. IBM Fluid Query powers the Logical Data Warehouse, giving users the ability to combine numerous types of data from various sources in a fast and agile manner to drive analytics and deeper insight, without worrying about connecting to multiple data stores, using different syntaxes or APIs, or changing their applications.

In its first release, IBM Fluid Query 1.0 will provide users of the IBM PureData System for Analytics the capability to access Hadoop data from their data warehouse and move data between Hadoop and PureData if needed. High performance is about moving the query to the data, not the data to the query. This provides extreme value to PureData users who want the ability to merge data from their structured data warehouse with Hadoop for powerful analytic combinations, or more in-depth analysis. IBM Fluid Query 1.0 is part of a toolkit within Netezza Platform Software (NPS) on the appliance so it’s free for all PureData System for Analytics customers.
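As a sketch of what this looks like in practice, the query below combines warehouse data with Hadoop data from the PureData side. The table names and the idea of the Hadoop data appearing as an ordinary queryable table are illustrative assumptions, not the product's actual syntax; the point is simply that the two sources can be joined in one statement:

```sql
-- Hypothetical sketch: "sales" is a local PureData table, while
-- "hdp_clickstream" stands for Hadoop data surfaced through a Fluid Query
-- connector. All names are illustrative, not the product's real syntax.
SELECT s.customer_id,
       SUM(s.order_total) AS lifetime_value,
       COUNT(c.page_url)  AS pages_viewed
FROM sales s
JOIN hdp_clickstream c
  ON c.customer_id = s.customer_id
GROUP BY s.customer_id;
```

Because the query is moved to where the data lives, only the rows needed for the result travel across the network, rather than the full Hadoop data set.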


For Hadoop users, IBM also provides IBM Big SQL which delivers Fluid Query capability. Big SQL provides the ability to run queries on a variety of data stores, including PureData System for Analytics, DB2 and many others from your IBM BigInsights Hadoop environment. Big SQL has the ability to push the query to the data store and return the result to Hadoop without moving all the data across the network. Other Hadoop vendors provide the ability to write queries like this but they move all the data back to Hadoop before filtering, applying predicates, joining, etc. In the world of big data, can you really afford to move lots of data around to meet the queries that need it?
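To make the pushdown point concrete, here is a hedged sketch of a federated query run from the Hadoop side; the schema and table names are assumptions for illustration, not verified Big SQL catalog paths:

```sql
-- Illustrative sketch of a federated query from BigInsights. "pda.accounts"
-- stands for a remote table on PureData System for Analytics; "hive.events"
-- for a local Hive table. Names are assumptions, not real catalog paths.
SELECT w.account_id, w.balance, h.event_type
FROM pda.accounts w
JOIN hive.events  h
  ON h.account_id = w.account_id
WHERE w.balance > 100000;
```

With pushdown, the balance predicate is evaluated on PureData and only qualifying rows return to Hadoop; without it, the entire accounts table would cross the network before any filtering or joining takes place.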

IBM Fluid Query 1.0 is generally available on March 27 as a software addition for PureData System for Analytics customers. If you are an existing customer and want to understand how to take advantage of IBM Fluid Query 1.0, or if you would just like more information, I encourage you to listen to this on-demand webcast: Virtual Enzee – The Logical Data Warehouse, Hadoop and PureData System for Analytics, and check out the solution brief. Or if you are an existing PureData System for Analytics customer, download this software. Update: Learn about Fluid Query 1.5, announced in July 2015.

About Wendy

Wendy Lucas is a Program Director for IBM Data Warehouse Marketing. Wendy has over 20 years of experience in data warehousing and business intelligence solutions, including 12 years at IBM. She has helped clients in a variety of roles, including application development, management consulting, project management, technical sales management and marketing. Wendy holds a Bachelor of Science in Computer Science from Capital University, and you can follow her on Twitter at @wlucas001.

One Query Drives It All: IBM Fluid Query Is The Foundation Of The Logical Data Warehouse

by James Kobielus

Data warehousing (DW) should flow smoothly as an enterprise decision-support asset. For this to happen, the back-end DW infrastructure should enable a seamless flow of data acquisition, transformation, loading, access, query, and analysis functions all the way from sources to the end users trying to make evidence-driven decisions.

For an enterprise DW to support fluid delivery of data-driven insights, the enabling infrastructure needs to be engineered with simplicity, scale, speed, interoperability, and usability in order to eliminate any obstacles to maximum value. In the drive to modernize their DWs and address emerging requirements, enterprises may risk adding complexity that inadvertently impacts the productivity of DW users, administrators, and other stakeholders.

In a growing number of enterprise DW modernization initiatives, Hadoop is starting to play important supplementary roles such as supporting data refinement on unstructured sources and providing a low-cost, scalable, and queryable data archive. As Hadoop platforms such as IBM InfoSphere BigInsights take their place within “logical” or “hybrid” DW architectures alongside DW platforms such as IBM PureData System for Analytics, the underlying complexities grow, but the simplicity and fluidity of the overall end-to-end infrastructure needn’t suffer.

The fluidity of the Logical Data Warehouse (LDW) depends on core interfaces, infrastructure, and tooling that span the entire architecture, no matter how complex the underlying hybrid assortment of relational, Hadoop, NoSQL, and other data platforms. Chief among these enablers of LDW fluidity is SQL, the data access, query, and manipulation lingua franca of databases everywhere. SQL now pervades the Hadoop market thanks to initiatives and interfaces such as IBM Big SQL.


However, SQL-over-Hadoop standards alone can’t achieve the promise of LDWs that remain seamlessly fluid and interoperable no matter how complex they grow under the covers. For that dream to come to fruition, the SQL dialects of the relational, Hadoop, and other platforms that comprise the LDW need to be accessible through a “fluid query” abstraction layer. This would enable all BI, reporting, dashboarding, statistical modeling, and other applications that query any data provided by any underlying platform within the LDW to speak one simple SQL dialect that spans it all.

A fluid query layer that spans the entire LDW would eliminate several obstacles to user and administrator productivity. It would avoid the need for users to query two or more separate data platforms and then either manually combine the results or have someone in IT implement a “data munging” tool to do that in a more automated fashion. If the unified query interface is combined with a fluid ability to move data back and forth between relational and Hadoop platforms to ensure optimal utilization of available LDW capacity, queries and all the supporting back-end data movement and transformation processes can operate much faster and more efficiently.

That’s the power of DW fluidity: simplicity, speed, throughput, scalability, and cost-effectiveness. The recent launch of IBM Fluid Query demonstrates that this dream is now a reality. Users that have invested in IBM PureData System for Analytics and the leading Hadoop distributions can now enable fast, unified, efficient queries across their hybrid DW environments like never before. This new solution gives DW administrators new power to choose the underlying data platform, PureData or Hadoop, that is best suited for each type of query, data, and workload.


IBM Fluid Query 1.0 is available starting March 27. It includes connectors for routing PureData queries to the supported Hadoop platforms, which are the most widely adopted distributions in the marketplace. The solution, which comes at no additional charge with Netezza Platform Software 7.0.2 and Netezza Analytics 2.5 and higher, includes simple-to-install loaders for PureData System for Analytics (PDA) and the file systems of the supported Hadoop platforms. Features include the ability to:

  • Query the supported Hadoop distributions’ data from PDA;
  • Perform queries of unstructured data in Hadoop landing zones from PDA;
  • Run multi-temperature queries and advanced analytics that use data from PDA and/or Hadoop;
  • Use multithreaded parallel transfers to move data efficiently, either in compressed or uncompressed form, to and from PDA and Hadoop file systems;
  • Retain properly vetted Hadoop file system data in PDA (a feature that was already available in PDA prior to IBM Fluid Query);
  • Deploy Hadoop as an alternate platform for ETL and ELT in conjunction with PDA;
  • Persist cold, archival, and exploratory data from PDA to Hadoop file systems;
  • Use Hadoop file systems for backup, disaster recovery, and capacity relief of data stored in PDA;
  • Use Hadoop platforms to better manage capacity, resource utilization, and workloads on PDA within the LDW;
  • Use PDA for production quality analytics where SLA performance times matter, while simultaneously utilizing Hadoop for advanced analytics and exploration of multistructured data.
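Several of the items above follow one archival pattern: persist cold rows to a Hadoop file system, then reclaim warehouse capacity. A minimal sketch of that pattern, using generic external-table syntax (the clause and object names are assumptions, not the exact Fluid Query commands):

```sql
-- Assumed sketch of the "persist cold data to Hadoop" pattern. The external
-- table form and all names are illustrative, not the tool's real syntax.
CREATE EXTERNAL TABLE archive_orders_2013 AS
SELECT * FROM orders
WHERE order_date < '2014-01-01';   -- cold rows land in a Hadoop file system

DELETE FROM orders
WHERE order_date < '2014-01-01';   -- reclaim capacity on PDA
```

The archived rows remain reachable from PDA through the Hadoop connectors, which is what makes Hadoop viable as the low-cost, queryable archive tier described above.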

Fluid queries that flow in both directions, from Hadoop to PDA as well as from PDA to Hadoop, require Big SQL, which is available only with IBM BigInsights; BigInsights versions 2.1 and higher are supported by IBM Fluid Query. Other supported Hadoop distributions include Cloudera (4.7 and higher) and Hortonworks (2.2 and higher).

For more information about IBM Fluid Query, click this link to view an informational Virtual Enzee webcast. Or if you are an existing PureData System for Analytics customer, download this software.

About James

James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next best action technologies. Follow James on Twitter: @jameskobielus.

How can an appliance the size of a pizza box be your new big data nucleus?

Are you on the verge of starting your first big data project? Are you still unsure which technology to use because of the skill sets required? Do you have only a limited budget but need to address the most common big data challenges at once? If you answered these three questions with a “YES,” then this blog could be an eye-opener for you.

Big data is a challenge for every industry – no matter how big or how small a company may be. The challenges are always very similar: Volume, Variety, Velocity and Veracity. These are the four classic indicators of big data requirements. However, most of the time only a subset of these requirements may apply – at least at the beginning of the “big data journey”. Personally, I would add another “V”, which is often not so obvious from the beginning: Value. Value in terms of: what are the expected costs related to big data projects, and what is the most probable outcome? Nobody will invest huge amounts of money in new hardware and software if the outcome is very unpredictable.

That’s why most companies start with a “sandbox” big data project: experimenting with trial and open source software on virtual machines and existing hardware in order to keep the initial investment small. But sooner or later, important decisions need to be made: will this be the next-generation architecture for big data and analytics? How much will it cost to move from a sandbox to a mature production environment? What about enterprise support for the new big data platform?

PureData for Analytics N3001-001

IBM has acknowledged these challenges and the requirement for an entry-level big data platform. Have you heard of the new Ultra Lite PureData N3001-001? Introduced at the end of 2014, this big data appliance is an optimized, powerful combination of hardware and software that is the size of a family pizza box. It is able to process and store up to 16 terabytes of structured data and can serve as the center and hub for other required big data products – thus covering the four or five “V’s” of big data.


The IBM PureData System for Analytics N3001-001 is a factory-configured, highly available big data appliance for the processing and storage of structured data at rest. It has a shared-nothing, massively parallel processing (MPP) architecture consisting of:

  • A server
  • A database
  • Storage on standard, cost-efficient SATA self-encrypting drives (SED)
  • Networking fabric (10 GBit)
  • Analytic software (Netezza technology)

PureData for Analytics comes with production licenses for a suite of other IBM big data products and integrates with them through well-defined industry-standard interfaces (SQL, ODBC, JDBC, OLE DB) for maximum data throughput and reliability. So you get a factory-configured, highly available MPP platform for today’s big data analytic requirements.

But not even PureData for Analytics can deal with all the “V”s mentioned above. Big data analytics is a team game, and that’s the reason why it comes with production licenses for these additional IBM big data products:

  • IBM InfoSphere BigInsights: PureData refines the raw and unstructured data from IBM InfoSphere BigInsights, processing huge amounts of data with its patented and industry-leading Netezza technology. PureData reads and writes data to and from Hadoop using state-of-the-art integration technology, as well as running MapReduce™ programs within its database.
  • IBM InfoSphere Information Server: Information Server pushes transformations down to PureData, using its MPP architecture so that transformations are processed in-database rather than on a separate server platform. This helps reduce network traffic and data movement, as well as the cost of a more powerful server platform for Information Server. Information Server can use PureData analytic and transformational functions and utilize its shared-nothing architecture to process terabytes of structured data per hour.
  • IBM Cognos: Cognos is the Business Intelligence platform that is optimized to work with PureData. It supports in-database analytics, pushdown SQL, OLAP over relational and many more features, utilizing the shared-nothing MPP architecture of PureData. Cognos adds in-memory features to the disk-based PureData architecture, making it able to analyze huge amounts of data.
  • IBM InfoSphere Streams: PureData integrates well with Streams and can be a data source as well as a data sink (target) for Streams. Since Streams is able to process and analyze huge amounts of data and events per second (millions of data packages per second), it needs a resourceful target to offload the analyzed data – one able to store the terabytes of data required for further, deeper analytics. This is a non-production single license for the Streams product.

Not included but highly recommended

With this big data nucleus you can start your journey with more confidence – with the right basis to grow and scale from the beginning. For an optimal user experience, I recommend the following optional products to maximize the results:

  • IBM SPSS: PureData is able to act as a powerful scoring platform for IBM SPSS, supporting data mining and predictive use-cases with built-in analytics functions and its massive parallel processing power. With PureData, SPSS does not need an extra scoring server and can even run programs written in R, C, C++, Fortran, Java, Python and NZ-LUA in the core database.
  • Watson Explorer: PureData is a supported metadata crawler source for Watson Explorer.  It supplies a big data inventory for all structured data stored within the PureData 16 Terabyte capacity.

Conclusion

IBM has made it possible to start the big data journey with small investments, using highly mature, industry leading software and an analytic big data appliance as its core. This helps you make a smooth transition from sandbox to production without disruption. Why not give it a try?

Connect with me on Twitter (@striple66) and meet me during CeBIT 2015 in Hanover, Germany.

 

About Ralf Goetz 
Ralf is an Expert Level Certified IT Specialist in the IBM Software Group. Ralf joined IBM through the Netezza acquisition in early 2011. For several years, he led the Informatica tech-sales team in the DACH region and the Mahindra Satyam BI competency team in Germany. He then became a technical pre-sales representative for Netezza and later for the PureData System for Analytics. Ralf still focuses on PDA but also supports the technical sales of all IBM big data products. Ralf holds a Master’s degree in computer science.

Is the Data Warehouse Dead? Is Hadoop trying to kill it?

By Dennis Duckworth

I attended the Strata + Hadoop World Conference in San Jose a few weeks ago, which I enjoyed immensely. I found that this conference had a slightly different “feel” than previous Hadoop conferences in terms of how Hadoop was being positioned. Since I am from the data warehouse world, I have been sensitive to Hadoop being promoted as a replacement for the data warehouse.

In previous conferences, sponsors and presenters seemed almost giddy in their prognostication that Hadoop would become the main data storage and analytics platform in the enterprise, taking more and more load from the data warehouse and eventually replacing it completely. This year, there didn’t seem to be much negative talk about data warehouses. Cloudera, for example, clearly showed its Hadoop-based “Enterprise Data Hub” as being complementary to the Enterprise Data Warehouse rather than as a replacement, reiterating the clarification of their positioning and strategy that they made last year. Maybe this was an indication that the Hadoop market was maturing even more, with companies having more Hadoop projects in production and, thus, having more real experience with what Hadoop did well and, as importantly, what it didn’t do well. Perhaps, too, the data warehouse escaped being the villain (or victim) because the “us against them” camp was distracted by the emergence and perceived threat of some other technologies like Spark and Mesos.

The conference was just another data point supporting my hypothesis that Hadoop and other Big Data technologies are complementing existing data warehouses in enterprises rather than replacing them. Another data point (actually a collection of many data points) can be seen in the survey results of The Information Difference Company as reported in the paper “Is the Data Warehouse Dead?”, sponsored by IBM. You can download a copy here.

Reading through this report, I found myself recalling many of the conversations I have had with customers and prospects over the last few years. If you have read some of my previous blogs, you will know that IBM is a big believer in the power of big data. We have solutions that help enterprises deal with the new challenges they are facing with the increasing size, speed and diversity of data. But we continue to offer and recommend relational database and data warehouse solutions because they are essential for deriving business value from data – they have done so in the past and they continue to do so today.

We believe that they will continue doing so going forward. Structured data doesn’t go away, nor does the need for doing analytics (descriptive, predictive, or prescriptive) on the data. An analytics engine that was created and tuned for structured data will continue to be the best place to do such analytics. Sure, you can do some really neat data exploration and visualizations on all sorts of data in Hadoop, but you still need your daily/weekly/monthly reports and your executive dashboards, all needing to be produced within shrinking time windows, that are all fueled by structured data.

About Dennis Duckworth

Dennis Duckworth, Program Director of Product Marketing for Data Management & Data Warehousing has been in the data game for quite a while, doing everything from Lisp programming in artificial intelligence to managing a sales territory for a database company. He has a passion for helping companies and people get real value out of cool technology. Dennis came to IBM through its acquisition of Netezza, where he was Director of Competitive and Market Intelligence. He holds a degree in Electrical Engineering from Stanford University but has spent most of his life on the East Coast. When not working, Dennis enjoys sailing off his backyard on Buzzards Bay and he is relentless in his pursuit of wine enlightenment.

See also: New Fluid Query for PureData and Hadoop by Wendy Lucas