Hi All,
For those who are interested, I've made an initial attempt at benchmarking HANA and HADOOP Impala against each other.
My PowerPoint slide comparing them is publicly shared on Google Docs at:
https://docs.google.com/file/d/0Bxydpie8Km_fWTd3RmJTbjVHd00/edit?usp=sharing
As most of you are aware, there is a revolution taking place in Big Data Analytics, with many new solutions appearing on the market, including open source solutions running on HADOOP. For a brief explanation of HADOOP please read http://blogs.sap.com/innovation/big-data/what-is-hadoop-018605
HADOOP is designed to handle very large datasets. Large volumes of data can be processed, but jobs need to be scheduled.
The key benefits of HADOOP are that it is open source and operates on affordable, scalable infrastructure.
Real-time reporting has been a weakness as reports may take minutes instead of seconds.
Recently Cloudera have released a new open source real-time reporting solution that runs on HADOOP, called Impala.
It also has the option to use column store tables (PARQUET) to optimise query run times.
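To illustrate why a column store layout such as PARQUET can shorten query run times, here is a toy Python sketch. It is not Impala or PARQUET code, and the table and column names are made up for illustration. The point is only that an aggregate over one column needs to read just that column in a columnar layout, whereas a row layout has to walk every full record.

```python
# Toy illustration only: hypothetical data, not how Impala or PARQUET work internally.

# Row store: the fields of each record are kept together.
rows = [
    {"order_id": 1, "customer": "A", "amount": 10.0},
    {"order_id": 2, "customer": "B", "amount": 25.5},
    {"order_id": 3, "customer": "A", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column store: the values of each column are kept together.
columns = to_columns(rows)


def sum_amount_row_store(records):
    # Every full record is visited even though only one field is needed.
    return sum(record["amount"] for record in records)


def sum_amount_column_store(cols):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(cols["amount"])


row_total = sum_amount_row_store(rows)
col_total = sum_amount_column_store(columns)
print(row_total, col_total)
```

Both functions return the same total. The difference is in how much data each one has to touch, which is what matters once a table has millions of rows and many columns.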
Cloudera Impala 1.0 GA was released on the 29th April 2013.
With the advent of cloud computing it's now easier than ever to test new products.
I've been using HANA for almost a year now and I love it. To get your own HANA box see "Get your own SAP HANA, developer edition on Amazon Web Services".
Over the past couple of months I've also used AWS to set up a small HADOOP cluster to test out Impala (starting with the earlier BETA releases).
I've tested Impala on clusters of 1, 3, 9 and 18 nodes (each node is a separate cloud machine). [Companies such as Yahoo, Twitter and Facebook may run clusters with many thousands of nodes.]
By contrast, HANA on AWS runs only on a single machine.
I don’t consider HANA & HADOOP/IMPALA rival products, just different tools for different purposes, though there is an overlap.
I focused on SQL read-times, row limits and costs between the two solutions, both running on cloud machines hosted by Amazon Web Services (AWS).
To benchmark them I used sample SAP SPL data and TPC-H data, each loaded with 60 million records.
For details on TPC-H see http://www.tpc.org/tpch/
At this point the analysis only covers queries running on a single table. Depending on feedback I may broaden the scope of the comparison to include more complex queries with joins.
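For anyone who wants to repeat a single-table read-time measurement, here is a rough sketch of a timing harness in Python. It is only an illustration under assumptions, not the setup used for the slides: it runs against an in-memory SQLite table as a stand-in, and the table name, columns and row count are made up. For HANA or Impala you would swap in a connection from whichever Python DB-API driver you use.

```python
import sqlite3
import time


def time_query(conn, sql, repeats=3):
    """Run sql `repeats` times; return (best elapsed seconds, rows of the last run)."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        cur = conn.cursor()
        start = time.perf_counter()
        cur.execute(sql)
        rows = cur.fetchall()  # include fetch time, since that is what a report waits for
        best = min(best, time.perf_counter() - start)
        cur.close()
    return best, rows


# Stand-in for a real HANA or Impala connection (assumption: in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem_sample (returnflag TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO lineitem_sample VALUES (?, ?)",
    [("A" if i % 2 == 0 else "R", i % 50) for i in range(10_000)],
)
conn.commit()

# A simple single-table aggregate, similar in spirit to the queries compared here.
elapsed, result = time_query(
    conn,
    "SELECT returnflag, SUM(quantity) FROM lineitem_sample "
    "GROUP BY returnflag ORDER BY returnflag",
)
print(f"best of 3 runs: {elapsed:.4f}s, groups: {result}")
```

Taking the best of several runs reduces noise from caching and other activity on the machine. Whether you report the best, the first (cold) run or an average is a choice worth stating alongside any numbers.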
If you notice any glaring inaccuracies or omissions then please feel free to let me know. Where possible I'm happy to update my slides accordingly.
All the best
Aron