By Ralf Goetz
Initially, it seems like just a different sequence of the two characters “T” and “L”. But this difference often separates successful big data projects from failed ones. Why is that? And how can you avoid falling into the most common data management traps around mastering big data? Let’s examine this topic in more detail.
Why are big data projects different from traditional data warehouse projects?
Big data projects are mostly characterized by one, or a combination, of these four (or five) data requirements:
- Volume: the volume of (raw) data
- Variety: the variety (e.g. structured, unstructured, semi-structured) of data
- Velocity: the speed at which data is processed, consumed or analyzed
- Veracity: the level of trust in the data
- (Value): the value behind the data
For big data, each of the “V”s is orders of magnitude bigger. For example, a traditional data warehouse usually holds several hundred gigabytes or a low number of terabytes, while big data projects typically handle data volumes of hundreds or even thousands of terabytes. Another example: traditional data warehouse systems only manage and process structured data, whereas typical big data projects need to manage and process both structured and unstructured data.
With this in mind, it is obvious that traditional data warehousing technologies and methodologies may not be sufficient to handle these big data requirements.
Mastering the data and information supply chain using traditional ETL
This brings us to a widely adopted methodology for data integration called “Extraction, Transformation and Load” (ETL). ETL is a very common methodology in data warehousing and business analytics projects and can be performed by custom programming (e.g. scripts, or custom ETL applications) or with the help of state-of-the-art ETL platforms such as IBM InfoSphere Information Server.
The fundamental concept behind most ETL implementations is the restriction of the data in the supply chain. Only data that is presumably important is identified, extracted and loaded into a staging area inside a database, and later into the data warehouse. “Presumably” is the weakness in this concept. Who really knows which data is required for which analytic insight and requirement, today and tomorrow? Who knows which legal or regulatory requirements must be followed in the months and years to come?
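The restriction happens right at the start of the pipeline. A minimal sketch of this classic ETL pattern in Python (the field names and records are purely illustrative, not from any real system) makes the weakness visible:

```python
# Hypothetical ETL sketch: only fields deemed important up front survive.
RAW_SOURCE = [
    {"id": 1, "amount": "19.99", "currency": "EUR", "clickstream": "...", "agent": "..."},
    {"id": 2, "amount": "5.00",  "currency": "USD", "clickstream": "...", "agent": "..."},
]

IMPORTANT_FIELDS = ["id", "amount", "currency"]  # decided *before* loading anything

def etl(records):
    staged = []
    for rec in records:
        extracted = {k: rec[k] for k in IMPORTANT_FIELDS}  # Extract: restrict early
        extracted["amount"] = float(extracted["amount"])   # Transform
        staged.append(extracted)                           # Load into staging
    return staged

warehouse = etl(RAW_SOURCE)
# "clickstream" and "agent" never reach the warehouse -- if tomorrow's
# analysis needs them, the pipeline must change and the data be re-extracted.
```

Everything outside `IMPORTANT_FIELDS` is simply discarded, which is exactly the dilemma described above.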
Each change in the definition and scope of the information and data supply chain requires a considerable amount of effort, time and budget and is a risk for any production system. There must be a resolution for this dilemma – and here it comes.
A new “must follow” paradigm for big data: ELT
Just a little change in the sequence of two letters will mean everything to the success of your big data project: ELT (Extraction, Load and Transform). This change seems small, but the difference lies in the overall concept of data management. Instead of restricting the data sources to only “presumably” important data (and all the steps this entails), what if we take all available data, and put it into a flexible, powerful big data platform such as the Hadoop-based IBM InfoSphere BigInsights system?
Data storage in Hadoop is flexible, powerful, almost unlimited, and cost efficient – since it can use commodity hardware and scales across many computing nodes and local storage.
Hadoop is a schema-on-read system. It allows the storage of all kinds of data without knowing its format or definition (e.g. JSON, images, movies, text files, spreadsheets, log files and many more). Without the previously discussed limitation in the amount of data which will be extracted in the ETL methodology, we can be sure that we have all data we need today and may need in the future. This also reduces the required effort for the identification of “important” data – this step can literally be skipped: we take all we can get and keep it!
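The schema-on-read idea can be sketched in a few lines of Python. Here a plain list stands in for Hadoop storage, and the names `load_raw` and `read_with_schema` are illustrative, not part of any real API:

```python
import json

data_lake = []  # stands in for HDFS storage

def load_raw(line: str):
    data_lake.append(line)  # Load: no parsing, no schema, nothing discarded

def read_with_schema():
    # Transform happens only at read time ("schema on read"):
    # parse what we can, keep everything else raw for future use.
    for line in data_lake:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            yield {"raw": line}

load_raw('{"user": "a", "amount": 19.99}')
load_raw("2014-01-01 GET /index.html 200")  # a log line, stored just the same

records = list(read_with_schema())
```

The load step accepts anything; the interpretation of the data is deferred until someone actually reads it.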
Since Hadoop offers a scalable data storage and processing platform, we can utilize these features as a replacement for the traditional staging area inside a database. From here we can take only the data that is required today and analyze it either directly with a business intelligence platform such as IBM Cognos or IBM SPSS, or use an intermediate layer with deep and powerful analytic capabilities such as IBM PureData System for Analytics.
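A hedged sketch of this transform-on-demand step, with the lake again modeled as a plain list and an invented record layout: only the slice that today's report needs leaves the lake, while everything else stays put for tomorrow's (yet unknown) questions.

```python
# Illustrative records -- the layout is an assumption for this sketch.
data_lake = [
    {"type": "order", "amount": 19.99, "clickstream": "..."},
    {"type": "pageview", "url": "/index.html"},
    {"type": "order", "amount": 5.00, "clickstream": "..."},
]

def transform_for_bi(lake):
    # Transform on demand: extract and reshape only what today's
    # analysis requires; the rest remains untouched in the lake.
    return [{"amount": rec["amount"]} for rec in lake if rec["type"] == "order"]

bi_feed = transform_for_bi(data_lake)  # hand this to the BI/analytics layer
```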
Refining raw data and gaining valuable insights
Hadoop is great for storage and processing of raw data, but applying powerful and lightning-fast complex analytic queries is not its strength, so another analytics layer makes sense. PureData System for Analytics is the perfect place for the subsequent in-database analytic processing of “valued” data because of its massively parallel processing (MPP) architecture and its rich set of analytic functions. PureData can resolve even the most complex analytic queries in only a fraction of the time compared to traditional relational databases. And it scales – from a big data starter project with only a couple of terabytes of data to a petabyte-sized PureData cluster.
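The core of in-database processing is pushing the computation to the engine and moving only the small result set, not the raw rows. As a rough sketch, SQLite stands in here for an MPP appliance such as PureData System for Analytics; the principle is the same, the scale obviously is not, and the table is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 100.0), ("EMEA", 50.0), ("AMER", 70.0)])

# The aggregation runs inside the database; only two summary rows
# travel back to the client instead of every raw sales record.
result = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
# result == {"AMER": 70.0, "EMEA": 150.0}
```

An MPP system applies the same idea across many nodes in parallel, which is why complex queries finish in a fraction of the time.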
IBM offers everything you need to master your big data challenges. You can start very small and scale with your growing requirements. Big data projects can be fun with the right technology and services!
About Ralf Goetz
Ralf is an Expert Level Certified IT Specialist in the IBM Software Group. Ralf joined IBM through the Netezza acquisition in early 2011. For several years, he led the Informatica tech-sales team in the DACH region and the Mahindra Satyam BI competency team in Germany. He then became a technical pre-sales representative for Netezza and later for the PureData System for Analytics. Ralf is still focusing on PureData System for Analytics but also supports the technical sales of all IBM big data products. Ralf holds a Master's degree in computer science.