By Rich Hughes
A recent article addresses the challenges facing businesses trying to improve their results by analyzing data. As Hadoop’s ability to process large data volumes continues to gain acceptance, Dwaine Snow provides a reasonable method for deciding when and under what circumstances to deploy Hadoop alongside your PureData System for Analytics (PDA). Snow makes the case that traditional data warehouses, like PDA, are not going away because of the continued value they provide. Hadoop distributions are also playing a valuable role in meeting some of the challenges in this evolving data ecosystem.
The valuable synergy between Hadoop and PDA is illustrated conceptually as the logical data warehouse in Snow’s December 2014 paper (Link to Snow’s Paper).
The logical data warehouse diagrams the enterprise body of data stores, connective tissue like APIs, and cognitive features like analytical functions. It documents the traditional data warehouse, which began about 1990, and its use of structured databases. Pushed by the widespread use of the Internet and its unstructured data exhaust, the Apache Hadoop community was founded as a means to store, evaluate, and make sense of unstructured data. Hadoop thus imitated the traditional data warehouse in assessing the value of the available data, then retaining the most valuable data sources from that investigation. Likewise, the discovery, analytics, and trusted data zone architecture of today’s logical data warehouse resembles the layered architecture of yesterday’s data warehouse.
Since its advent some 10 years ago, Hadoop has branched out to servicing SQL statements against structured data types, which brings us back to the business challenge: where can we most effectively deploy our data assets and analytic capabilities? In answering this question, Snow discusses fit-for-purpose repositories, which require interoperability across the various zones and data stores to succeed. Each data zone is evaluated for cost, value gained, and the performance required to meet service level agreements.
Viewed as a manufacturing sequence, the raw material (data) is first acquired, then manipulated into a higher-valued product. Here, value is assessed by the business consumer based on insights gained and speed of delivery. Hadoop distributed file environments show their worth in storing relatively large data volumes and accessing both structured and unstructured data. Traditional data warehouses like IBM’s PureData System for Analytics display their value as the system of record where advanced analytics are delivered in a timely fashion.
In an elegant cost-benefit analysis, Snow provides the tools necessary to weigh where best to deploy the different but complementary data insight technologies. The Total Cost of Ownership (TCO) for Hadoop includes four line items:
- Initial system cost (hardware and software)
- Annual system maintenance cost
- Setup costs to get the system ‘up and running’
- Costs for humans managing the ongoing system administration
Looking at just the first cost item, which is sometimes reduced to a per-terabyte price like $1,000 per TB, tells only part of the story. The article documents the other unavoidable tasks for deploying and maintaining a Hadoop cluster. Yes, $200,000 might be the price for the hardware and software for a 200TB system, but the article cites industry studies to account for the other significant budget expenses over a five-year ownership period. Adding up the total costs, the conclusion is that the final amount could very well exceed $2,000,000.
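To make the arithmetic concrete, here is a minimal Python sketch that adds up the four line items listed above over a five-year period. Only the $1,000 per TB price and the 200TB system size come from the article; the maintenance, setup, and administration amounts are hypothetical placeholders chosen for illustration, not figures from Snow's paper, and the `HadoopTco` class is likewise an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class HadoopTco:
    """The four TCO line items, in US dollars."""
    initial_system: float      # hardware and software (one-time)
    annual_maintenance: float  # system maintenance, per year
    setup: float               # getting the system up and running (one-time)
    annual_admin: float        # staff managing administration, per year

    def total(self, years: int) -> float:
        """Sum one-time costs plus recurring costs over the ownership period."""
        one_time = self.initial_system + self.setup
        recurring = (self.annual_maintenance + self.annual_admin) * years
        return one_time + recurring


# From the article: a 200 TB system priced at $1,000 per TB.
initial = 200 * 1_000

# HYPOTHETICAL placeholder amounts; substitute your own estimates.
tco = HadoopTco(
    initial_system=initial,
    annual_maintenance=40_000,
    setup=100_000,
    annual_admin=320_000,
)

five_year_total = tco.total(years=5)
print(f"Initial system cost: ${initial:,}")
print(f"Five-year TCO:       ${five_year_total:,}")
```

With these placeholder inputs, the recurring items dominate the total, which is the point of the passage: the per-terabyte purchase price is only a small share of what the cluster costs to own.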
The accurate TCO number is then subtracted from the business benefits of using the system to determine the net value gained. Business benefits accrue, Snow notes, from query activity. Only 1% of the queries in today’s data analytic systems require all of the data, which makes that activity a good fit for the lower-cost, lower-performance Hadoop model. Conversely, 90% of current queries require only 20% of the data, which matches well with the characteristics of the PureData System for Analytics: reliability with faster analytic performance. What Snow has shown is the best-of-breed nature of the logical data warehouse and, as the old slogan suggests, how to get more “bang for the buck”.
About Rich Hughes
Rich Hughes is an IBM Marketing Program Manager for Data Warehousing. Hughes has worked in a variety of Information Technology, Data Warehousing, and Big Data jobs, and has been with IBM since 2004. Hughes earned a Bachelor’s degree from Kansas University and a Master’s degree in Computer Science from Kansas State University. Writing about the original Dream Team, Hughes authored a book on the 1936 US Olympic basketball team, a squad composed of oil refinery laborers and film industry stage hands. You can follow him at @rhughes134.