by Andrey Vykhodtsev
In my previous two posts I covered the differences in architecture between IBM PureData System for Analytics and Oracle Database, as well as differences in SQL. (See below for links.) In this post, I am going to cover another important topic – additional structures that speed-up data access.
Partitions, Indexes, Materialized Views
Oracle database relies on Indexes, Partitions and Materialized views for performance. In Oracle, indexes are designed to speed-up point searches or range searches that touch a very small percentage of the data. Because of the B-Tree index structure, if you touch a large percentage of the data, using the index will be much slower than the full scan of the whole table. If you have this problem, then you probably have decided to use partitioning. In Oracle, Partitioning is a paid feature that goes only with certain editions. You also have Materialized views with which you can put results of the complex queries on disk for later re-use. These structures are designed with general purpose (analytical processing + transactional processing ) in mind, and can be complex and unwieldy to maintain.
By contrast, with PureData you have fewer worries. The trade-off, as I said in my first post, is that PureData is not a general-purpose system, but rather an analytical-processing system.
We use ZoneMaps in PureData instead of indexes. In essence, a ZoneMap is just a table of minimum and maximum values for all columns that have certain types. ZoneMaps are extremely compact, and they don’t need to be created or maintained. But this is not all. ZoneMap filtering takes place at the hardware level. (Remember mention of FPGA, Field Programmable Gate Arrays in my first post?) The system will not scan data that does not need to be scanned for a particular query. Therefore I/O is greatly reduced. If you update data or delete data based on a condition, ZoneMaps also are taken into account.
Because of ZoneMaps, you don’t need to partition your data. ZoneMaps take advantage of the natural ordering of data. For example, if you insert data daily, ZoneMap on the date field will become completely sorted. Range searches on this field will be extremely fast.
In addition to ZoneMaps, there are couple of other techniques you can use to optimize query access to a certain table. First is called CBT, Clustered Based Table. This is not a separate structure that needs to be maintained, but rather an internal table organization method. If you choose a table to be CBT, you can provide up to 4 fields, on which you will have extremely fast searches.
The only additional structure that PureData has is called “Materialized View”, but this is a bit different concept than in Oracle. In PureData, materialized view is a subset of columns from one table that can be sorted differently than the base table, therefore speeding up access on the sorted columns. Because materialized views are ZoneMapped, they have some properties of the indices, but they are not actually indices. Materialized views might be needed if you have “tactical queries”, queries that require fast and frequent access to small portions of data. Otherwise, you don’t usually need them.
As you see, in PureData it is much simpler to maintain efficient data access. Instead of creating and maintaining indexes for the subset of columns on each table, PureData automatically creates ZoneMaps for you. I know from experience what a nightmare index maintenance in a large data warehouse might be. Partitioning is another technique that is not needed in PureData. Instead of indexes and partitions, we use much simpler structures, that are automatically maintained, and applied on hardware level (in FPGA), with the speed of streaming data.
In my next posts, I am going to cover a few more topics that you need to be aware of when migrating from Oracle to PDA. Please stay tuned, and follow me on Twitter: @vykhand
Other posts in this series
Andrey Vykhodtsev is Big Data Technical Sales Expert covering Central and Eastern Europe Region in IBM. He has more than 12 years of experience in Data Warehousing and Analytics, and has worked as senior data warehouse developer, analyst, architect, consultant in multiple industries, including Financial sector and Telecommunications.