Turn Up The Power For Software-Defined Data Warehousing

by Mona Patel

Interview with Mukta Singh

As big data analytics technologies such as Spark and Hadoop continue their move into the mainstream, you might think that the traditional data warehouse is becoming less important.

Actually, nothing could be further from the truth.

To enable data of all types to be ingested, transformed, processed and analyzed efficiently, many companies are choosing to build hybrid analytics architectures that plug cloud and open source technologies such as Spark and Hadoop into on-premises environments. At the heart of these hybrid architectures lies the data warehouse – a highly reliable resource that provides a single source of truth for enterprise reporting and analytics.

This raises an important question: since the data warehouse is so central to the hybrid analytics architecture, how can we make sure it performs well and cost-effectively?

Traditional wisdom is that the infrastructure doesn’t matter – that running these vital systems of record on commodity hardware is perfectly adequate. But when you look at the numbers, you may begin to question that view.

To understand why the right hardware – in this case, IBM Power Systems – can make a real difference, I spoke with Mukta Singh, Director of Data Warehousing at IBM. In my conversation with Mukta, we take a deeper dive into why IBM’s software-defined data warehouse – IBM dashDB Local – on IBM Power Systems offers a better price/performance ratio compared to commodity hardware.


Mona Patel: Can you tell our readers a little bit about the Power Architecture? What is so unique about it?

Mukta Singh: IBM Power Systems is the dominant server platform in today’s Unix market, with over 50 percent market share. It has also become a leading platform for Linux systems, and we have seen tremendous growth in that area in recent years.

Unlike commodity servers, which typically use x86 processors, Power servers use IBM’s Power Architecture, a unique processor architecture that has been designed specifically for big data and analytics workloads.

Mona Patel: How does IBM dashDB Local integrate with Power Systems?

Mukta Singh: dashDB Local is a software-defined data warehouse offering that has been optimized for rapid deployment and ease of management. Essentially, the system runs in a Docker container, which means it can be flexibly deployed on different types of hardware either on-premises or in a private or public cloud environment.

One of the options today is to deploy your dashDB Local container on IBM Power Systems – it runs completely transparently, and it’s optimized to allow the dashDB engine to take advantage of the unique features of the Power Architecture.

If you want to move an existing dashDB Local environment from x86 to Power Systems, that’s easy too. The latest-generation POWER8 processors can operate in little-endian (LE) mode, which is the same byte order that x86 processors use. That means that you can move a dashDB container from one platform to the other without making any changes to your applications or data.

At a higher level, we have also ensured that running dashDB on Power Systems offers the same user experience as it does on x86, so the database and OS management, monitoring and integration aspects are exactly the same. The skills are completely transferable from one platform to another, so it’s a free choice and users don’t have to worry about being locked in.

Mona Patel: Can you tell us about the benefits that the Power Architecture provides for dashDB Local?

Mukta Singh: Well, for example, dashDB’s analytics engine is built on IBM BLU Acceleration – a columnar, in-memory technology that cuts query run-times from hours or minutes to just seconds.

BLU Acceleration is designed to take advantage of multi-threaded cores, and Power processors have more threads per core than most current x86 processors. In fact, an IBM POWER8 processor has four times as many threads per core as an Intel Broadwell EX. That means that if you have a query that BLU can parallelize, you will get much better performance from Power Systems.

Similarly, because dashDB’s BLU Acceleration does all the processing in-memory, the bandwidth between the processor and the memory is very important. Again, Power Systems has a huge advantage here, with four times as much memory bandwidth as the x86 equivalent.

Finally, the processor’s cache size is important. BLU is engineered to do the majority of its processing in the CPU cache, which means it doesn’t need to repeatedly fetch data from RAM, a much slower process. Power processors offer four times as much cache as x86, so they deliver lower latency and reduce the need to access RAM even further. They play to the strengths of dashDB’s query engine.

Mona Patel: So how do those numbers translate in terms of performance and cost-efficiency?

Mukta Singh: We’ve benchmarked dashDB Local on a 24-core POWER8 server against a 44-core x86 server.

The Power server delivered 1.2 times the throughput, despite having 45 percent fewer cores. Or, to look at it another way, each POWER8 core offered 2.2 times the throughput of its x86 equivalent. Combined with competitive pricing for Power scale-out servers, that performance makes dashDB Local on Power a very compelling price/performance proposition.
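The two figures are consistent; a quick Python sketch of the arithmetic, using only the numbers quoted above:

# Per-core comparison derived from the benchmark figures quoted above.
power_cores, x86_cores = 24, 44
system_throughput_ratio = 1.2                     # POWER8 server vs. x86 server

core_reduction = 1 - power_cores / x86_cores      # ~0.45 -> "45 percent fewer cores"
per_core_ratio = system_throughput_ratio * x86_cores / power_cores  # ~2.2x per core

print(f"{core_reduction:.0%} fewer cores, {per_core_ratio:.1f}x throughput per core")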

Mona Patel: How do you see the market for dashDB Local on Power Systems? Is this something that customers have been asking for?

Mukta Singh: Even when we started bringing dashDB Local to market last year, there were Power clients who were interested. As I mentioned earlier, Power has a dominant share of the Unix market, and there are thousands of companies whose businesses are built on DB2 or Oracle databases running on Power Systems. For companies that rely on Power Systems already, the idea of running dashDB Local on their existing infrastructure is very attractive.

But the results of our benchmark suggest that this isn’t just a good idea for existing Power clients – it’s also an opportunity for new clients to start out running dashDB on a hardware platform that is tailor-made for high-performance analytics.

And for any client who currently runs dashDB on x86 servers, the message we’d like to get across is that it’s easy to move to Power Systems. It’s faster, it’s more cost-effective, and you still get all the ease of use and ease of management that you’re used to with your existing dashDB environment.

Mona Patel: OK, last question: where can our readers go to learn more about dashDB Local on Power? Can they try out dashDB Local on Power Systems before they buy?

Mukta Singh: Yes, we offer a free trial with a Docker ID – please visit dashDB.com to learn more and access the trial.

About Mona,

Mona Patel is currently the Portfolio Marketing Manager for IBM dashDB, the future of data warehousing. With over 20 years of analyzing data at the Department of Water and Power, AirTouch Communications, Oracle, and MicroStrategy, Mona decided to grow her career at IBM, a leader in data warehousing and analytics. Mona received her Bachelor of Science degree in Electrical Engineering from UCLA.

Increased Speed, More Options for dashDB for Analytics with Pay-As-You-Go and Bluemix Lift

by Ben Hudson

Harnessing the power of IBM dashDB for Analytics just got quicker and easier. We’re excited to introduce two new and improved ways to connect to the cloud for in-memory processing, RStudio and Cloudant integrations, in-database analytics, and other powerful features that will reduce your time to market:

  1. Pay-As-You-Go (PayGo) provisioning: Starting today, you can purchase dashDB for Analytics directly in Bluemix using your credit card*.  We’ll start provisioning your system right away, accelerating your time to value.
  2. Bluemix Lift: Now you can move your on-premises data stores into a dashDB instance even faster. Bluemix Lift, IBM’s newest data movement solution, accelerates data migration by up to 10 times versus traditional options, with the flexibility of both PayGo and subscription plans to meet your data needs.  Check the details out here.

You can also purchase dashDB for Analytics through a Bluemix subscription.  Try it out today!

About Ben,

Ben Hudson is an Advisory Offering Manager for IBM dashDB for Analytics. He recently obtained his Master’s degree in Computer Science from Wesleyan University in Middletown, CT.

 

*Note: dashDB for Analytics MPP Small for AWS is not available as a PayGo plan.

 

Enterprise Data Warehouse Beyond SQL with Apache Spark

By Torsten Steinbach, Lead Architect for IBM Data Warehousing Advanced Analytics

Enterprise IT infrastructure is often based heavily on relational data warehouses, with other applications relying on the data warehouse for analytics. Line-of-business departments are pressing to use open source analytics and big data technology such as R, Python and Spark for analytical projects, and to deploy them continuously without waiting for IT provisioning. Failing to serve these requests can lead to a proliferation of analytic silos and a loss of control over data. For this reason, the new IBM dashDB Local for software-defined environments (SDEs) and private clouds now integrates a complete Apache Spark stack, so you can continue to operate established data warehouses and leverage their proven operational quality of service while also running Spark-based workloads out of the box on the same data.

This tightly embedded Apache Spark environment can use the full set of resources of the dashDB system, including its MPP scale-out. Each dashDB Local node, with its own data partition, is overlaid with a local Apache Spark executor process, and the existing data partitions of the dashDB cluster are implicitly carried over to Spark data frames, and thus to any distributed parallel processing Spark performs on that data.
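To make the Spark side of this concrete, here is a minimal PySpark sketch that reads a dashDB Local table into a Spark DataFrame over a plain JDBC connection; the host, credentials, table and column names are placeholders, and the integrated deployment can also use its own co-located access path rather than generic JDBC:

from pyspark.sql import SparkSession

# Minimal sketch: read a dashDB Local table into a Spark DataFrame over JDBC.
# Host, credentials, table and column names are placeholders; the embedded
# Spark runtime can also reach the same data through its co-located executors.
spark = SparkSession.builder.appName("dashdb-read").getOrCreate()

sales = (spark.read.format("jdbc")
         .option("url", "jdbc:db2://dashdb-host:50000/BLUDB")   # placeholder host
         .option("driver", "com.ibm.db2.jcc.DB2Driver")         # DB2/dashDB JDBC driver on the classpath
         .option("dbtable", "SALES")                            # placeholder table
         .option("user", "bluadmin")
         .option("password", "********")
         # split the read into parallel tasks, mirroring the partitioned layout
         .option("partitionColumn", "ORDER_ID")                 # placeholder numeric column
         .option("lowerBound", "1")
         .option("upperBound", "1000000")
         .option("numPartitions", "24")
         .load())

sales.groupBy("REGION").count().show()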

Co-locating the Spark execution capabilities with the database engine minimizes latency in accessing the data and leverages optimized local IPC mechanisms for data transfer. The benefits of this architecture become apparent when we apply standard machine learning algorithms on Spark to data in dashDB Local. Comparing a remote Spark cluster setup with a co-located setup, we found that those algorithms run significantly faster in the co-located configuration, even when the remote access path is optimized to read data in parallel tasks, one per database partition in dashDB. So there is indeed a performance advantage provided by the integrated architecture.
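As an illustration of the kind of workload that benefits, here is a hedged PySpark sketch that clusters rows from a dashDB table with Spark MLlib; the table, columns and connection details are invented for the example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Hedged sketch: K-means clustering on data read from dashDB. Table, column
# and connection details are illustrative placeholders.
spark = SparkSession.builder.appName("dashdb-kmeans").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://dashdb-host:50000/BLUDB")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "CUSTOMER_FEATURES")      # placeholder table
      .option("user", "bluadmin")
      .option("password", "********")
      .load())

features = VectorAssembler(
    inputCols=["RECENCY", "FREQUENCY", "MONETARY"],  # placeholder feature columns
    outputCol="features").transform(df)

model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
model.transform(features).groupBy("prediction").count().show()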

In addition, the Spark-enabled data warehouse engine can do a lot of things out of the box that were not possible before:

1. Out of the box data exploration & visualization


2. Interactive Machine Learning


3. One-click deployment – Turning interactive notebooks into deployed Spark applications


4. dashDB as hosting environment to run your Spark applications

Once a Spark application has been deployed to dashDB, it can be invoked in three different ways.

Using spark-submit.sh from command line and scripts:
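A hedged sketch of driving the wrapper from a Python script follows; the application name and arguments are placeholders, and the wrapper’s exact option set should be taken from the dashDB Local documentation:

import subprocess

# Hedged sketch: call the spark-submit.sh wrapper shipped with dashDB Local.
# The application name and arguments below are placeholders; consult the
# product documentation for the wrapper's full option set.
subprocess.run(["./spark-submit.sh", "my_spark_app.py", "2017-03-01"], check=True)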


Using dashDB REST API:

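A hedged Python sketch of such a call using the requests library; the endpoint path, port and payload fields are assumptions for illustration, not the documented API contract:

import requests

# Hedged sketch: submit a deployed Spark application through the dashDB REST
# interface. The endpoint path, port and JSON fields are assumptions for
# illustration -- check the dashDB Local API reference for the actual contract.
DASHDB = "https://dashdb-host:8443"              # placeholder host/port

payload = {
    "appResource": "my_spark_app.py",            # application previously deployed to dashDB
    "args": ["2017-03-01"],                      # example application arguments
}

resp = requests.post(f"{DASHDB}/dashdb-api/analytics/public/apps/submit",  # assumed path
                     json=payload,
                     auth=("bluadmin", "********"),
                     verify=False)               # self-signed certificates are common on local installs
resp.raise_for_status()
print(resp.json())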

Using SPARK_SUBMIT stored procedure:

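And a hedged sketch of the stored-procedure route over a normal SQL connection with the ibm_db driver; the procedure schema, name and argument format are assumptions, so treat this as the shape of the call rather than its exact signature:

import ibm_db

# Hedged sketch: invoke the Spark submission stored procedure over SQL.
# The procedure name/schema and its argument format are assumptions.
conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=dashdb-host;PORT=50000;PROTOCOL=TCPIP;"
    "UID=bluadmin;PWD=********;", "", "")

sql = "CALL IDAX.SPARK_SUBMIT('my_spark_app.py', '2017-03-01')"  # assumed name and arguments
stmt = ibm_db.exec_immediate(conn, sql)
ibm_db.close(conn)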

5. Out of the box machine learning


In addition, this implementation of Spark capabilities in dashDB Local gives you a high degree of flexibility in ELT and ETL activities and lets you process data in motion and land it in dashDB Local.

Let’s summarize the key benefits that dashDB with integrated Apache Spark provides:

  1. dashDB Local lets you dramatically modernize your data warehouse solutions with advanced analytics based on Spark.
  2. Spark applications processing relational data gain significant performance and operational QoS benefits from being deployed and running inside dashDB Local.
  3. dashDB Local enables end-to-end analytic solution creation: from interactive exploration and machine learning experiments, through verification of analytic flows and easy operationalization as deployed Spark applications, up to hosting Spark applications in a multi-tenant enterprise warehouse system and integrating them with other applications via various invocation APIs.
  4. dashDB Local allows you to invoke Spark logic via SQL connections.
  5. dashDB Local can land streaming data directly into tables via deployed Spark applications.
  6. Using integrated Spark, dashDB Local can run complex data transformations and feature extractions that cannot be expressed in SQL.

Please also check out the tutorial playlist for dashDB with Spark here: ibm.biz/BdrLNG.  You can also download a free trial version of dashDB Local at ibm.biz/dashDBLocal to see these Spark features in action for yourself.

 

About Torsten,

Torsten has worked for many years as an IBM software architect on IBM’s database software offerings, with a particular focus on performance monitoring, application integration and workload management. Today, Torsten is the lead architect for advanced analytics in IBM’s data warehouse products and cloud services.

IBM Fluid Query 1.7 is Here!

by Doug Dailey

IBM Fluid Query offers a wide range of capabilities to help your business adapt to a hybrid data architecture and, more importantly, helps you bridge “data silos” for deeper insights that leverage more data. Fluid Query is a standard entitlement included with the Netezza Platform Software suite for PureData System for Analytics (formerly Netezza). Fluid Query release 1.7 is now available, and you can learn more about its features below.

Why should you consider Fluid Query?

It offers many possible uses for solving problems in your business. Here are a few ideas:
• Discover and explore “Day Zero” data landing in your Hadoop environment
• Query data from multiple cross-enterprise repositories to understand relationships
• Access structured data from common sources like Oracle, SQL Server, MySQL, and PostgreSQL
• Query historical data on Hadoop via Hive, BigInsights Big SQL or Impala
• Derive relationships between data residing on Hadoop, the cloud and on-premises
• Offload colder data from PureData System for Analytics to Hadoop to free capacity
• Drive business continuity through a low-fidelity disaster recovery solution on Hadoop
• Back up your database or a subset of data to Hadoop in an immutable format
• Incrementally feed analytics side-cars residing on Hadoop with dimensional data

By far the most prominent uses of Fluid Query for a data warehouse administrator are warehouse augmentation, capacity relief, and replicating analytics side-cars for analysts and data scientists.

New: Hadoop connector support for Hadoop file formats to increase flexibility

IBM Fluid Query 1.7 ushers in greater flexibility for Hadoop users with support for popular file formats typically used with HDFS. These include data storage formats like AVRO, Parquet, ORC and RC that are often used to manage big data in a Hadoop environment.

Choosing the best format and compression mode can result in drastic differences in performance and storage on disk. A file format that doesn’t support flexible schema evolution can result in a processing penalty when making simple changes to a table. Let’s just  say that if you live in the Hadoop domain, you know exactly what I am speaking of. For instance, if you want to use AVRO, do your tools have readers and writers that are compatible? If you are using IMPALA, do you know that it doesn’t support ORC, or that Hortonworks and Hive-Stinger don’t play well with Parquet? Double check your needs and tool sets before diving into these popular format types.

By providing support for these popular formats,  Fluid Query allows you to import, store, and access this data through local tools and utilities on HDFS. But here is where it gets interesting in Fluid Query 1.7: you can also query data in these formats through the Hadoop connector provided with IBM Fluid Query, without any change to your SQL!
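To make the “no change to your SQL” point concrete, here is a hedged Python sketch that queries the appliance over JDBC with jaydebeapi; the connection details and table name are placeholders, and the assumption is that the Hadoop-resident table has already been exposed through a Fluid Query connector:

import jaydebeapi

# Hedged sketch: the same SQL runs against the appliance whether the table is
# local or surfaced from Hadoop through a Fluid Query connector. Connection
# details and table name are placeholders.
conn = jaydebeapi.connect(
    "org.netezza.Driver",                       # Netezza/PDA JDBC driver class
    "jdbc:netezza://pda-host:5480/SALESDB",     # placeholder host and database
    ["admin", "********"],
    "nzjdbc.jar")                               # path to the JDBC driver jar

cur = conn.cursor()
cur.execute(
    "SELECT REGION, COUNT(*) AS ORDER_COUNT "
    "FROM ORDER_HISTORY "    # assumed to be Hadoop-resident data exposed via a connector
    "GROUP BY REGION")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()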

New: Robust connector templates

In addition, Fluid Query 1.7 now makes available a more robust set of connector templates that are designed to help you jump start use of Fluid Query. You may recall we provided support for a generic connector in our prior release that allows you to configure and connect to any structured data store via JDBC. We are offering pre-defined templates with the 1.7 release so you can get up and running more quickly. In cases where there are differences in user data type mapping, we also provide mapping files to simplify access.  If you have your own favorite database, you can use our generic connector, along with any of the provided templates as a basis for building a new connector for your specific needs. There are templates for Oracle, Teradata, SQL Server, MySQL, PostgreSQL, Informix, and MapR for Hive.

Again, the primary focus for Fluid Query is to deliver open data access across your ecosystem. Whether the data resides on disk, in-memory, in the Cloud or on Hadoop, we strive to enable your business to be open for data. We recognize that you are up against significant challenges in meeting demands of the business and marketplace, with one of the top priorities around access and federation.

New: Data movement advances

Moving data is not the best choice. Businesses spend quite a bit of effort ingesting data, staging it, scrubbing it, and prepping and scoring it for consumption by business users. This is a costly process. As we move closer and closer to virtualization, the goal is to move the smallest amount of data possible, and to access and query only the data you need. So not only is access paramount, but your knowledge of the data in your environment is crucial to using it efficiently.

Fluid Query does offer data movement capability through what we call Fast Data Movement. Focusing on the pipe between PDA and Hadoop, we offer a high-speed transfer tool that allows you to move data between these two environments very efficiently and securely. You have control over security, compression, format and the WHERE clause (database, table, filtered data). A key benefit is our ability to transfer data in our proprietary binary format, which enables orders-of-magnitude better performance than Sqoop when you do have to move data.

Fluid Query 1.7 also offers some additional benefits:
• Kerberos support for our generic database connector
• Support for BigInsights Big SQL during import (automatically synchronizes Hive and Big SQL on import)
• Varchar and String mapping improvements
• Import of nz.fq.table parameter now supports a combination of multiple schemas and tables
• Improved date handling
• Improved validation for NPS and Hadoop environment (connectors and import/export)
• Support for BigInsights 4.1 and Cloudera 5.5.1
• A new Best Practices User Guide, plus two new Tutorials

You can download Fluid Query 1.7 from IBM’s Fix Central, or from the Netezza Developer Network for use with the Netezza Emulator as non-warranted software.


Take a test drive today!

About Doug,
Doug has over 20 years of combined technical and management experience in the software industry, with an emphasis on customer service and, more recently, product management. He is currently part of a highly motivated product management team that is both inspired by and passionate about the IBM PureData System for Analytics product portfolio.

Using Docker containers for software-defined environments or private cloud implementations

by Mitesh Shah

Data warehousing architectures have evolved considerably over recent years. As businesses try to derive insight as the basis of value creation, all roles must participate by leveraging new insights. As a result, analytics needs are expanding, markets are transforming and new business models are being created. This ushers in increased requirements for self-service analytics and alternative infrastructure solutions. Read on to learn how a “software-defined environment” (SDE) that utilizes container technology can help you meet expanded analytics needs.

Adaptability delivered through software-defined environments

From an avalanche of new data, to mobile computing and cloud-based platforms, new technologies must move into the IT infrastructure very quickly. Traditional IT systems—hampered by labor-intensive management and high costs—are struggling to keep up. IT organizations are caught between complex security requirements, extreme data volumes and the need for rapid deployment of new services. A simpler, more adaptive and more responsive IT infrastructure is required.

One of the key solutions on the horizon is the SDE, which optimizes the entire computing infrastructure – compute, storage and network resources – so that IT staff can adapt to different types of workloads very quickly. For example, without an SDE, resources are assigned to workloads manually; within an SDE, the same assignments happen automatically.


By dynamically assigning workloads to IT resources based on a variety of factors, including the characteristics of specific applications, the best-available resources, and service-level policies, a software-defined environment can deliver continuous, dynamic optimization and reconfiguration to address infrastructure issues.

Software-defined environment benefits

A software defined environment framework can help to:

  • Simplify operations with automated infrastructure tuning and configuration
  • Reduce time to value with a simple, pluggable and rich API-supported architecture
  • Sense and respond to workload demands automatically
  • Optimize resources by assigning assets without manual intervention
  • Maintain security and manage privacy through a common platform
  • Facilitate better business outcomes through advanced analytics and cognitive capabilities

A software-defined environment fits well into the private cloud ecosystem so that IT staff can deliver flexibility and ease of consumption, as well as maximize the use of commodity or virtualized hardware. An SDE is now easily achievable by leveraging container technology, where Docker is one of the leaders.

Docker containers provide application portability

Docker containers “wrap up” a piece of software in a complete file system that contains everything the software needs to run: code, run-times, system tools, system libraries and other components that can be installed on a server. This guarantees that the software will always run the same, regardless of the environment in which it is running.

Docker provides true application portability and ease of consumption by alleviating the complex process of software setup and installation that often can require multiple skills across multiple hours or days. It provides OS-level abstraction without disrupting the standards on the host operating system, which makes it even more attractive.

One key point to keep in mind is that Docker is not the same as VMware. Docker provides process isolation at the operating system level, whereas VMware provides a hardware abstraction layer. Unlike VMware, Docker does not create an entire virtual operating system. Instead, the host operating system kernel is shared across multiple Docker containers. This makes containers very lightweight to deploy and faster to start than virtual machines. There is no looking back: container technology is being embraced very quickly as part of a hybrid solution that meets business user needs fast.
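As a small illustration of that lightweight deployment model, here is a hedged Python sketch using the Docker SDK; the image name, volume path and options are assumptions rather than the actual dashDB Local install procedure:

import docker

# Hedged sketch: pull and start a containerized warehouse engine with the
# Docker SDK for Python. Image name, volume path and options below are
# illustrative assumptions, not the documented dashDB Local procedure.
client = docker.from_env()

container = client.containers.run(
    "ibmdashdb/local:latest",                    # assumed image name
    name="dashdb-local",
    detach=True,
    privileged=True,                             # typical for database containers needing host resources
    network_mode="host",
    volumes={"/mnt/clusterfs": {"bind": "/mnt/bludata0", "mode": "rw"}},  # assumed data volume
)

print(container.status)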

dashDB Local: data warehousing delivered via Docker container

Coming full circle, the data warehouse is the foundation of all analytics and must be fast and agile to serve new analytics needs. Software-defined environments make this easy, enabling deployment of the warehousing engine in minutes rather than hours or days.

IBM dashDB is the data warehousing technology that delivers high-speed insights through in-memory computing and in-database analytics at massively parallel processing (MPP) scale. It has been available as a fully managed service on the IBM cloud. Now, dashDB Local is available as an early access client preview for private clouds and other software-defined infrastructures. I hope you will test this new technology and provide us valuable feedback. Learn more, then request access: ibm.biz/dashDBLocal

About Mitesh,

Mitesh Shah is the product manager for the new dashDB data warehousing solution as a software-defined environment (SDE) that can be used on private clouds and other implementations that support Docker container technology. He has broad experience around various facets of software development revolving around database and data warehousing technologies. Throughout his career, Mitesh has enjoyed a focus on helping clients address their data management and solution architecture needs.

How To Make Good Decisions in Deploying the Logical Data Warehouse

By Rich Hughes

A recent article addresses the challenges facing businesses trying to improve their results by analyzing data. As Hadoop’s ability to process large data volumes continues to gain acceptance, Dwaine Snow provides a reasonable method to examine when and under what circumstances to deploy Hadoop alongside your PureData System for Analytics (PDA).   Snow makes the case that traditional data warehouses, like PDA, are not going away because of the continued value they provide. Additionally, Hadoop distributions also are playing a valuable role in meeting some of the challenges in this evolving data ecosystem.

The valuable synergy between Hadoop and PDA is illustrated conceptually as the logical data warehouse in Snow’s December 2014 paper (Link to Snow’s Paper).

The logical data warehouse diagrams the enterprise’s body of data stores, connective tissue such as APIs, and cognitive features such as analytical functions. It documents the traditional data warehouse, which began around 1990, and its use of structured databases. Pushed by the widespread use of the Internet and its unstructured data exhaust, the Apache Hadoop community was founded as a means to store, evaluate, and make sense of unstructured data. Hadoop thus imitated the traditional data warehouse in assessing the value of the data available, then retaining the most valuable data sources from that investigation. As well, the discovery, analytics, and trusted data zone architecture of today’s logical data warehouse resembles the layered architecture of yesterday’s data warehouse.

Since its advent some 10 years ago, Hadoop has branched out to servicing SQL statements against structured data types, which brings us back to the business challenge: where can we most effectively deploy our data assets and analytic capabilities? In answering this question, Snow discusses fit-for-purpose repositories which, to succeed, require interoperability across the various zones and data stores. Each data zone is evaluated for cost, value gained, and the performance required by service level agreements.

Looking at this problem as a manufacturing sequence, the raw material (data) is first acquired, then manipulated into a higher-valued product, with the value assessed by the business consumer based on insights gained and speed of delivery. The Hadoop distributed file environment shows its worth in storing relatively large data volumes and accessing both structured and unstructured data. Traditional data warehouses like IBM’s PureData System for Analytics display their value in being the system of record where advanced analytics are delivered in a timely fashion.

In an elegant cost-benefit analysis, Snow provides the tools necessary to weigh where best to deploy the different, but complementary, data insight technologies. A listing of the Total Cost of Ownership (TCO) for Hadoop includes four line items:

  1. Initial system cost (hardware and software)
  2. Annual system maintenance cost
  3. Setup costs to get the system ‘up and running’
  4. Costs for humans managing the ongoing system administration

Looking at just the first cost item, which is sometimes reduced to a per-terabyte price like $1,000 per TB, tells only part of the story. The article documents the other unavoidable tasks for deploying and maintaining a Hadoop cluster. Yes, $200,000 might be the price of the hardware and software for a 200TB system, but over a five-year ownership period, the industry studies cited attribute significant additional budget expenses. Adding up the total costs, the conclusion is that the final amount could very well be in excess of $2,000,000.
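As a back-of-the-envelope illustration of how the gap between sticker price and five-year TCO opens up, here is a Python sketch; apart from the $200,000 starting point, the line items are invented placeholders, not figures from Snow’s paper:

# Back-of-the-envelope five-year TCO sketch for a 200 TB Hadoop cluster.
# Only the $200,000 initial figure comes from the article; the remaining
# line items are illustrative placeholders.
YEARS = 5

initial_hw_sw = 200_000                          # 200 TB at roughly $1,000/TB
annual_maintenance = 40_000 * YEARS              # assumed ~20% of list price per year
setup = 150_000                                  # assumed one-time install/configuration effort
administration = 2 * 160_000 * YEARS             # assumed 2 full-time administrators

total_tco = initial_hw_sw + annual_maintenance + setup + administration
print(f"Five-year TCO: ${total_tco:,}")          # well in excess of $2,000,000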

The accurate TCO number is then subtracted from the business benefits of using the system, which determines the net value gained. And business benefits accrue, Snow notes, from query activity. Only 1% of the queries in today’s data analytic systems require all of the data, which makes that activity a perfect fit for the lower-cost, lower-performance Hadoop model. Conversely, 90% of current queries require only 20% of the data, which matches well with the characteristics of the PureData System for Analytics: reliability with faster analytic performance. What Snow has shown is the best-of-breed nature of the logical data warehouse and, as the old slogan goes, how to get more “bang for the buck”.

About Rich Hughes,

Rich Hughes is an IBM Marketing Program Manager for Data Warehousing.  Hughes has worked in a variety of Information Technology, Data Warehousing, and Big Data jobs, and has been with IBM since 2004.  Hughes earned a Bachelor’s degree from Kansas University, and a Master’s degree in Computer Science from Kansas State University.  Writing about the original Dream Team, Hughes authored a book on the 1936 US Olympic basketball team, a squad composed of oil refinery laborers and film industry stage hands. You can follow him on @rhughes134

What the Future Holds for the Database Administrator (DBA)

By Rich Hughes

Scanning the archives as far back as 2000 reveals articles speculating on the future of the DBA. With operational costs mounting from the day-to-day maintenance of data warehouses, this was a fair question to ask even 15 years ago. The overhead of creating indexes and tuning individual queries, on top of the necessary nurturing of the infrastructure, had many organizations looking for more cost-effective alternatives.

The data warehouse appliance was born out of the drive to fix the I/O bottleneck that traditionally handicapped data warehouses, and out of the design goals of reduced administration and easy data access for users. Netezza built the original data warehouse appliance, which, by brilliantly combining hardware and software, brought the query processing much closer to the data. This breakthrough paved the way for lower administrative costs and forced others in the data warehouse market to think of additional ways to solve the I/O problem.

To be sure, Netezza’s disruptive technology of no indexing, great performance, and ease of administration left many DBAs feeling threatened. But what was really threatened was the frustrating and never-ending search for data warehouse performance via indexing. Netezza DBAs got their nights and weekends back, and made themselves more valuable to their organizations by using the time that no-indexing saved to get closer to the business. Higher-level skills taken on by DBAs included data stewardship and data modeling, and in this freer development environment, advanced analytics took root. In the data warehouse appliance world, much more DBA emphasis was placed on the business applications because the infrastructure was designed to run, for the most part, unassisted.

Fast forward to the current day, where the relentless pursuit of IT cost efficiencies while providing more business value continues. Disruptive technologies invented in the past decade have filled this demand, such as the Hadoop ecosystem and the maturing cloud computing environment. Hardware advances have pushed in-memory computing, solid state drives are phasing out spinning disk storage, and 128-bit CPUs and operating systems are on the drawing boards. Databases like IBM’s dashDB have benefitted by incorporating several of these newer hardware and software advances.

So, 15 years into the new millennium, what’s a DBA to do? Embrace change and realize there is plenty of good news and much data to administer. While the cloud’s infrastructure and platform services will decrease on-premises DBA work over time, the added complexity will demand new solutions for determining the right mixture of on-premises, off-premises, and hybrid platforms. Juggling the organization’s data warehouse workload requires different approaches if the cloud’s elasticity and cheaper off-hour rates are to be leveraged.

Capacity planning and data retention take on new meaning in a world where it is now possible to store and access everything; the question becomes what return all that information provides. The DBA will be involved in cataloging the many new data sources as well as getting a handle on the unstructured data provided by the Internet of Things. When to move data, whether to persist it, and how it interacts with existing schemas are all good questions for the thoughtful DBA to consider. And that is just on the ingest side of the ledger. Who gets access, what the security levels are, how applications can be rapidly developed, how to re-use SQL in a NoSQL world, and how best to federate all this wonderful data are worthwhile areas of study.

In summary, the role of the database administrator has always been evolving, forced by technology advances and rising business demands. The role requires general knowledge of several IT disciplines, with the opportunity to specialize. Historically, by keeping current, the DBA can go deeper into a particular technology, a move that benefits both their career and their organization’s needs. The DBA can logically move into an architecture or data scientist position, the higher skill sets for today’s world. What has not changed is the demand to deliver reliable, affordable, and valuable information.

About Rich Hughes,

Rich Hughes is an IBM Marketing Program Manager for Data Warehousing.  Hughes has worked in a variety of Information Technology, Data Warehousing, and Big Data jobs, and has been with IBM since 2004.  Hughes earned a Bachelor’s degree from Kansas University, and a Master’s degree in Computer Science from Kansas State University.  Writing about the original Dream Team, Hughes authored a book on the 1936 US Olympic basketball team, a squad composed of oil refinery laborers and film industry stage hands. You can follow him on @rhughes134