Saying Goodbye to Messy Data with Data Warehousing and Analytics with Cloudant and dashDB

December 18, 2104 update: dashDB on Cloudant has concluded its current beta program and is now officially available!  Read the details.

By Alan Hoffman and Adam Ronthal

At Cloudant, we  aim to help you get more from your live data. Our JSON document store provides a simple and intuitive operational data store for powering live applications. It’s flexible and scalable and has many different APIs for finding and managing data.

We are excited to announce the integration between Cloudant and dashDB (beta). dashDB is the new IBM hosted data warehousing solution for the Cloud. Starting today, you can easily replicate data from Cloudant to dashDB for deeper offline reporting and analytics.

This post walks through how to quickly and easily you can create a data warehouse from one or more of your Cloudant databases. Below, we’ll show you the working of the warehousing functionality and the new Schema Discovery Process. And we’ll follow up with a couple of examples of what you can do with your newly warehoused data.

Finally, we should say that this new feature is very much in beta. That is, we are deliberately releasing a feature that is not-fully-baked with the hope of gathering feedback from early users. The feature has restricted functionality and no guarantees for data integrity. Just the same, there is a great deal of awesome stuff in here, but it’s by no means production ready.

Schema Discovery Process

Before getting too deep into the tutorial, we should pause briefly to discuss what we call the Schema Discovery Process (SDP.) This is where the magic happens. dashDB is built on a relational database, where the data is stored in structured relational tables. Cloudant stores JSON documents, where all the data is encapsulated in a single record. To move data between these two systems we need to be able to translate our JSON docs into a schema (or set of tables) that dashDB understands.

This is exactly what the SDP does. It scans your Cloudant database and intuits the implicit structures in your data. It then creates that proper schema in dashDB and copies the data over. Is it perfect? Of course not. But it does work, especially for relatively simple and homogenous Cloudant databases. The SDP can help you discover how your data is organized, and that can power a whole suite of functionality within Cloudant. (More on this later.)

Prerequisites

● You’ll need a Cloudant account. You access your account is through the Cloudant dashboard. If you don’t have an account, go to http://cloudant.com and sign up. Go ahead, we’ll wait.

● You’ll need some data as well. For this post we are going to use a small set of New York City taxi data. If you want to replicate this to your account, you can find the data set at https://examples.cloudant.com/nyctaxi.

Getting started

Log into your Cloudant account and you will land on the ‘Databases’ tab. Notice that there is a new tab on the main left-hand menu labeled ‘Warehouse.’

dashDB

If you click on the ‘Warehouse’ tab, you’ll find an overview with information about what you can do with this new functionality and how to use it. You should peruse the information accessed from that tab at your leisure. But if you click on the ‘New Warehouse’ button at the top, you will see a simple panel for setting up your first dashDB warehouse.

dashDB

First, you’ll want to select the data you want to move to the warehouse in dashDB. You can add up to 10 Cloudant databases to your warehouse. The type-ahead search box should help you find databases even if you have hundreds. After selecting your databases, click “Create Warehouse” and the SDP magic is kicked off.

dashDB

 

A brief aside on limitations

While the SDP is doing its thing, we should talk about the restrictions of this beta offering. There are a couple of big gotchas that you need to be aware of. The dashDB beta is primarily a tool for development, testing new workflows for your data, and helping you figure out if this functionality works with your application. First and foremost is the limitation on the number of warehouses you can have at one time. During the beta, you can only have one. You can include any number of Cloudant databases in this one warehouse, but there is a 1GB (compressed) size limit, which corresponds to about 10GB of Cloudant data. If you try to create a second warehouse, you will be forced to delete the first one before continuing. Secondly, you cannot update a warehouse. You have the option to ‘rescan’ a warehouse, which essentially deletes and re-creates the warehouse from scratch. Finally, you cannot connect to an existing dashDB warehouse or account. You have to launch dashDB from Cloudant to take advantage of this integration.

On we go

When you click the ‘Create Warehouse’ button, the SDP begins to scan your data and build a schema for dashDB. The SDP communicates with Cloudant via the _warehouser database, which is automatically created when you run your first job. You will see a document in that database that holds all of the metadata about the warehousing job. For those of you familiar with the _replicator database, this process works in a similar fashion. You can check the status and other metadata for your job with this document.

dashDB

When the SDP has finished doing its thing, you’ll be notified on the dashboard and your brand-new warehouse is ready for you to work with. When you click the warehouse name, it launches the dashDB console in a new window. You are ‘leaving’ Cloudant at this point and entering dashDB. From here, you will be able to analyze your data in the rows, columns, and tables that the SDP has created for you.

Exploring dashDB

The dashDB interface allows you to work with your data and get insights quickly.

dashDB

The first thing you’ll probably want to do is inspect your data, which you can do by clicking ‘Inspect Data.’ Take a look at the table or tables that were created, the data types assigned to each column, and the contents of the tables. (The database also includes some sample data that is pre-loaded, so don’t be surprised if there is more there than your Cloudant data!)   Do a quick inspection to make sure the data is what you expect it to be and that it was properly brought over from its JSON origins!

We should also mention the ‘overflow table.’ When you drill into your warehouse you should see a table with the _overflow suffix. You’ll find two types of records there. First are any documents that don’t exactly fit into the schema built by the SDP. Non-conforming documents are put here in their entirety. The second type of record is an error record. If any piece of the data transfer pipeline fails for a particular document, we’ll put that document’s ID along with the error message in this table. During the beta phase for dashDB and when you are first putting your data into a warehouse, you should keep an eye on this table. Having many entries here indicates that something has gone wrong.

With dashDB you can run SQL queries and view the results directly in the console. Most people will likely connect a separate BI tool like IBM Cognos, SPSS, Watson Analytics, or a third-party tool from one of our large ecosystem of supported partners. That said, it’s nice to be able to jump right in with standard SQL to get a feel for what dashDB enables! From the pull-down menu on the left, go to “Manage > Run Query.”

dashDB dashDB

You can also perform other standard data warehousing tasks like loading data (either directly from the console or via a separate ETL tool like IBM InfoSphere DataStage). Loading other data is useful if you want to bring in other data sources to further your analysis of your Cloudant data. For example, with the NYC taxi dataset, you might want to bring in weather information to see if there is a correlation between rain and the average time of a trip. dashDB makes it easy to import data from external sources. To load data by using the console, go to “Manage > Load Data”.

Finally, dashDB has tight integration with R and built-in, in-database analytic algorithms that allow you to perform everything from linear regression and k-means clustering, to geospatial analytics.

To get started with R, simply go to “Analyze > Develop R Scripts” from the pull-down menu at the top left. From there you can launch RStudio. You’ll have to provide your dashDB login credentials, but they are easily obtainable from the “Set Up > Connect Applications > Connection settings menu in the console.

dashDB

With RStudio, you can develop and import sample R scripts, and generate plots and visualizations of your data easily. You can download the sample R script for this data here, (and the weather data it uses here), and walk through some analysis of the taxi trip data. From RStudio, simply upload the taxi_trip_demo.R file, and you can get started!

dashDB

We’ve put together some initial analysis for you to show you how it’s done:

dashDB

What next?

This is a beta product. We have a long list of features that we want to add before we are production ready. There are the obvious ones: giving you the ability to create more than one warehouse, raising the data volume cap on warehouses, improving performance, etc. We also plan to give you the ability to do incremental updates to warehouses and even schema evolution through the SDP. Speaking of which, we’d like to make the SDP more configurable. For now, the SDP is a coarse tool, but we’ll be adding more fine-grained control for advanced users.

Most importantly, we want to hear from you, the users of Cloudant and dashDB. What can we do to make Cloudant and dashDB work better for you? What features do you want or need? What use cases are you exploring? We would love to hear from you. Please email us, Alan and John (ahoffman@us.ibm.com and jjpark@ca.ibm.com) with all your feedback. We hope to hear from you soon.

One thought on “Saying Goodbye to Messy Data with Data Warehousing and Analytics with Cloudant and dashDB

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s