Enterprise Data Warehouse Beyond SQL with Apache Spark

By Torsten Steinbach, Lead Architect for IBM Data Warehousing Advanced Analytics

Enterprise IT infrastructure is often based heavily on relational data warehouses, with all other applications communicating through the data warehouse for analytics. Line-of-business departments are pressing to use open source analytics and big data technologies such as R, Python and Spark for analytical projects, and to deploy them continuously without having to wait for IT provisioning. Failing to serve these requests can lead to a proliferation of analytic silos and loss of control over data. For this reason, the new IBM dashDB Local for software-defined environments (SDEs) and private clouds now integrates a complete Apache Spark stack. You can continue to operate established data warehouses and leverage their proven operational quality of service while running Spark-based workloads out of the box on the same data.

This tightly embedded Apache Spark environment can use the entire set of resources of the dashDB system, including its MPP scale-out. Each dashDB Local node, with its own data partition, is overlaid with a local Apache Spark executor process. Data frames in Spark implicitly inherit the existing data partitions of the dashDB cluster, and so does any distributed parallel processing in Spark on this data.









The co-location of the Spark execution capabilities with the database engine minimizes latency in accessing the data and leverages optimized local IPC mechanisms for data transfer. The benefits of this architecture become apparent when we apply standard machine learning algorithms on Spark to data in dashDB Local. Comparing a remote Spark cluster setup with a co-located setup, we found that these algorithms run significantly faster when co-located. This holds even when the remote setup is optimized to read data in parallel tasks, one for each database partition in dashDB. So there is indeed a performance advantage provided by the integrated architecture.
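The partition-parallel remote read mentioned above can be sketched in Python. This is a minimal illustration rather than the dashDB implementation: it assumes the DB2 scalar function DBPARTITIONNUM is available to identify the database partition of a row, and the helper names, the column name CUSTOMER_ID and the partition numbers are hypothetical. It relies on the `predicates` argument of PySpark's `DataFrameReader.jdbc`, which produces one Spark partition per predicate.

```python
from typing import Dict, List


def partition_predicates(distribution_column: str,
                         partition_numbers: List[int]) -> List[str]:
    """Build one WHERE-clause predicate per database partition.

    Assumption: DBPARTITIONNUM(column) returns the number of the
    database partition that holds the row, as it does in DB2.
    """
    return [
        f"DBPARTITIONNUM({distribution_column}) = {number}"
        for number in partition_numbers
    ]


def read_table_in_parallel(spark, jdbc_url: str, table: str,
                           distribution_column: str,
                           partition_numbers: List[int],
                           properties: Dict[str, str]):
    """Read `table` as a DataFrame with one Spark partition per predicate.

    `spark` is an existing SparkSession; it is passed in so that this
    sketch does not need to import pyspark itself.
    """
    predicates = partition_predicates(distribution_column, partition_numbers)
    # Each predicate becomes its own Spark partition, and so its own
    # read task, against the database.
    return spark.read.jdbc(url=jdbc_url, table=table,
                           predicates=predicates, properties=properties)


if __name__ == "__main__":
    # Hypothetical column name and partition numbers, for illustration only.
    for predicate in partition_predicates("CUSTOMER_ID", [0, 1, 2]):
        print(predicate)
```

In the co-located setup described above this explicit step is not needed, since the Spark data frames take over the dashDB partitioning implicitly.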

In addition, the Spark-enabled data warehouse engine can do a lot of things out of the box that were not possible before:

1. Out of the box data exploration & visualization


2. Interactive Machine Learning


3. One-click deployment – Turning interactive notebooks into deployed Spark applications



4. dashDB as hosting environment to run your Spark applications

Once a Spark application has been deployed to dashDB, it can be invoked in three different ways.

Using spark-submit.sh from command line and scripts:



Using dashDB REST API:


Using SPARK_SUBMIT stored procedure:


5. Out of the box machine learning


In addition, this implementation of Spark capabilities in dashDB Local provides you with a high degree of flexibility in ELT and ETL activities and lets you process and land data in motion in dashDB Local.

Let’s summarize the key benefits that dashDB with integrated Apache Spark provides:

  1. dashDB Local lets you dramatically modernize your data warehouse solutions with advanced analytics based on Spark.
  2. Spark applications processing relational data gain significant performance and operational QoS benefits from being deployed and running inside dashDB Local.
  3. dashDB Local enables analytic solution creation end-to-end, from interactive exploration and machine learning experiments, verification of analytic flows, easy operationalization by creating deployed Spark applications, up to hosting Spark applications in a multi-tenant enterprise warehouse system and integrating them with other applications via various invocation APIs.
  4. dashDB Local allows you to invoke Spark logic via SQL connections.
  5. dashDB Local can land streaming data directly into tables via deployed Spark applications.
  6. dashDB Local can run complex data transformations and feature extractions that cannot be expressed with SQL using integrated Spark.
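To make the last point concrete, here is a minimal Python sketch of a feature extraction step that is awkward to express in SQL: hashing the character trigrams of a text value into a fixed-length count vector. The function name, the vector size and the choice of trigrams are assumptions made for this example and are not part of dashDB. In a deployed Spark application, a function like this could be applied to each row, for example through a Spark user-defined function.

```python
import hashlib
from typing import List


def hashed_trigram_features(text: str, num_buckets: int = 16) -> List[int]:
    """Map a string to a fixed-length vector of hashed character-trigram counts."""
    normalized = text.lower()
    features = [0] * num_buckets
    # Slide a window of three characters over the text.
    for start in range(len(normalized) - 2):
        trigram = normalized[start:start + 3]
        # SHA-256 serves only as a stable hash, so that the same trigram
        # always lands in the same bucket across runs and machines.
        digest = hashlib.sha256(trigram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % num_buckets
        features[bucket] += 1
    return features


if __name__ == "__main__":
    vector = hashed_trigram_features("Data Warehouse")
    print(len(vector), sum(vector))
```

The resulting vectors all have the same length regardless of the input text, which is what most machine learning algorithms expect as input.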

Please also check out the tutorial playlist for dashDB with Spark here: ibm.biz/BdrLNG.  You can also download a free trial version of dashDB Local at ibm.biz/dashDBLocal to see these Spark features in action for yourself.


About Torsten

Torsten has worked for many years as an IBM software architect for IBM’s database software offerings with particular focus on performance monitoring, application integration and workload management. Today, Torsten is the lead architect for advanced analytics in IBM’s data warehouse products and cloud services.


Three session guides get you started with data warehousing at IBM Insight at World of Watson

Join us October 24 to 27, 2016 in Las Vegas!

by Cindy Russell, IBM Data Warehouse marketing

IBM Insight has been the premier data management and analytics event for IBM analytics technologies, and 2016 is no exception.  This year, IBM Insight is being hosted along with World of Watson and runs from October 24 to 27, 2016 at the Mandalay Bay in Las Vegas, Nevada.  It includes 1,500 sessions across a range of technologies and features keynotes by IBM President and CEO, Ginni Rometty; Senior Vice President of IBM Analytics, Bob Picciano; and other IBM Analytics and industry leaders.  Every year, we include a little fun as well, and this year the band is Imagine Dragons.

IBM data warehousing sessions will be available across the event as well as in the PureData System for Analytics Enzee Universe (Sunday, October 23).  Below are product-specific quick reference guides that enable you to see at a glance key sessions and activities, then plan your schedule.  Print these guides and take them with you or put the links to them on your phone for reference during the conference.

This year, the Expo floor is called the Cognitive Concourse, and we are located in the Monetizing Data section, Cognitive Cuisine experience area.  We’ll take you on a tour across our data warehousing products and will have some fun as we do it, so please stop by.  There is also a demo room where you can see live demos and engage with our technical experts, as well as a series of hands-on labs that let you experience our products directly.

The IBM Insight at World of Watson main web page is located here.  You can register and then use the agenda builder to create your personalized schedule.

IBM PureData System for Analytics session reference guide

Please find the session quick reference guide for PureData System for Analytics here: ibm.biz/wow_enzee

Enzee Universe is a full day of dedicated PureData System for Analytics / Netezza sessions that is held on Sunday, October 23, 2016.  To register for Enzee Universe, select sessions 3459 and 3461 in the agenda builder tool.  This event is open to any full conference pass holder.

During the regular conference, there are also more than 35 PureData, Netezza, IBM DB2 Analytics Accelerator for z/OS (IDAA) technical sessions across all the conference tracks, as well as hands-on labs.  Several sessions are presented by IBM clients, so you can see how they put PureData System for Analytics to use.  Click the link above to see the details.

IBM dashDB Family session reference guide

Please find the session quick reference guide for the dashDB family here: ibm.biz/wow_dashDB

There are more than 40 sessions for dashDB, including a “Meet the Family” session that will help you become familiar with new products in this family of modern data management and data warehousing tools.  There is also a “Birds of a Feather” panel discussion on Hybrid Data Warehousing, and one that describes some key use cases for dashDB.  You can also see a demo, take in a short theatre session or try out a hands-on lab.

IBM BigInsights, Hadoop and Spark session reference guide

Please find the session quick reference guide for BigInsights, Hadoop and Spark topics here: ibm.biz/wow_biginsights

There are more than 65 sessions related to IBM BigInsights, Hadoop and Spark, with several hands-on labs and theatre sessions. Topics range from an Introduction to Data Science and Using Spark for Customer Intelligence Analytics to hybrid cloud data lakes and client stories of how they use these technologies.

Overall, it is an exciting time to be in the data warehousing and analytics space.  This conference represents a great opportunity to build depth on IBM products you already use, learn new data warehousing products, and look across IBM to learn completely new ways to employ analytics—from Watson to Internet of Things and much more.  I hope to see you there.

Build skills for 2016 and Beyond: Data Warehousing and Analytics Top 10 Resources

by Cindy Russell, IBM Data Warehouse Marketing

Skills are always an essential consideration in technical careers and it is important for data warehousing professionals to expand their knowledge to handle the proliferation of data types and volumes in 2016 and beyond.

These are my “top 10” resource picks that you may want to explore. I am choosing these because of their popularity and also because they represent new technologies you may face in 2016 as you modernize your data warehouse and extend it beyond its traditional realm to meet new analytics needs.

  1. Gartner Magic Quadrant for Data Warehouse and Data Management Solutions for Analytics – I am recommending this report because it provides an overview of the trends, issues and marketplace leaders in data warehousing. It calls out the need for the Logical Data Warehouse, which is a key element of a modernization strategy. I believe the Logical Data Warehouse will be of increasing importance to your operations in the coming months. Read a summary of the report.
  2. Logical Data Warehouse – Due to the massive and rapid growth of data volumes and types, a single centralized data warehouse cannot meet all of the new needs for analytics by itself. The data warehouse now becomes part of a Logical Data Warehouse in which a set of “fit for purpose” stores are used to house a range of data. This blog by Wendy Lucas was published in 2014, but is still a good primer on the concept if you need one.
  3. IBM Fluid Query information and entitlement for PureData clients – In 2015, we released a series of “agile” announcements of IBM Fluid Query. This is a tool that PureData System for Analytics clients can use to query more data sources for deeper insights. This tool is a key element when you have a Logical Data Warehouse where data stores include Hadoop, databases, other data warehouses and more. PureData clients can take advantage of this technology as part of the entitlements. Start learning with our blog series and webcast.
  4. dashDB, data warehousing on the cloud – dashDB was launched in 2014 as the IBM fully managed data warehouse in the cloud. Some initial use cases could be: setting up self-service data science sandboxes, establishing test environments or cost-effectively housing data that is already external, such as social media feeds. dashDB is based on the Netezza and BLU Acceleration in-memory computing technologies. If you have workloads you want to place on the cloud, dashDB is a good solution. This webcast and a TDWI Checklist for cloud get you started.
  5. Hadoop and Big SQL – Hadoop is a scalable, cost-effective, open source file system that can store a range of structured or unstructured data as part of a Logical Data Warehouse. It can also be used to help you manage capacity on the data warehouse, for example as a queryable historical archive. Read this blog by our expert to learn the basics. IBM provides a free open source distribution, IBM Open Platform with Apache Hadoop. For those looking to augment the IBM Open Platform, IBM BigInsights adds enterprise-grade features including visualization, exploration and advanced analytics. Within the family is an implementation that includes Big SQL—enabling you to use familiar SQL skills to query data in Hadoop. Explore the above content options, then get started with a no charge trial.
  6. Apache Spark – IBM announced a major commitment to Apache Spark in June 2015 and has already made available a series of Spark-based products and cloud services. You will be seeing more of Spark across the IBM Analytics portfolio, so it is a good technology to learn. Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that require low latency processing that a typical MapReduce program cannot provide, Spark is the alternative. It performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for speed, and offers Java, Scala, and Python APIs for ease of development. I recommend this no charge Big Data University course on Spark fundamentals.
  7. Update to IBM Netezza Analytics software – For those of you who are PureData System for Analytics clients, there is an update to the Netezza Analytics software. Doug Daily is one of our experts in this area, and he created an announcement blog to help you understand what new capabilities you can leverage.
  8. Virtual Enzee on demand webcasts – IBM offers webcasts on topics related to data warehousing and PureData System for Analytics. Browse the “Virtual Enzee” webcast library to stay up to date on PureData through these on demand webcasts.
  9. Learn Cognos Analytics for user self-service applications – Some of our clients use Cognos BI in conjunction with their data warehouses for super-fast reporting. Cognos Analytics was announced at IBM Insight as a guided, self-service capability that provides a personal approach to analytics. As your users are demanding more insights, self-service may be a sound solution to some of their needs. Browse the blog and web site to learn more.
  10. IBMGo on demand keynotes from IBM Insight – If you were unable to attend IBM Insight 2015, IBMGo brings some of the main sessions to you! It is a great way to learn about the bigger IBM Analytics solutions and points of view. Start here.
