Big Data Analytics: HANA vs HADOOP IMPALA on AWS

Hi All,

For those who are interested, I've made an initial attempt at benchmarking HANA and HADOOP Impala against each other.

My PowerPoint slides comparing them are publicly shared on Google Docs at:

https://docs.google.com/file/d/0Bxydpie8Km_fWTd3RmJTbjVHd00/edit?usp=sharing

As most of you are aware, there is a revolution taking place in Big Data Analytics, with many new solutions appearing on the market, including open source solutions running on HADOOP. For a brief explanation of HADOOP please read http://blogs.sap.com/innovation/big-data/what-is-hadoop-018605

HADOOP is designed to handle very large datasets. Large volumes of data can be processed, but the work has to be run as scheduled jobs.

The key benefits of HADOOP are that it is open source and operates on affordable, scalable infrastructure.

Real-time reporting has been a weakness, as reports may take minutes instead of seconds.

Recently Cloudera released Impala, a new open source real-time reporting solution that runs on HADOOP.

It also has the option to use column store tables (Parquet) to optimise query run times.
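To illustrate why a column store can help this kind of reporting query, here is a minimal Python sketch. It is not how Parquet or Impala work internally; the table, the column names (order_id, region, amount) and the values are all made up for illustration. It only shows the basic idea: an aggregate over one column needs to touch just that column when the data is laid out by column.

```python
# Hypothetical three-column table, used only for illustration.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 30.0},
]


def total_amount_row_store(records):
    """Row layout: every whole record is visited, although only 'amount' is needed."""
    return sum(record["amount"] for record in records)


# Column layout: the values of each column are kept together.
columns = {
    "order_id": [record["order_id"] for record in rows],
    "region": [record["region"] for record in rows],
    "amount": [record["amount"] for record in rows],
}


def total_amount_column_store(cols):
    """Column layout: only the 'amount' column is read; the others are never touched."""
    return sum(cols["amount"])


print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
```

Both functions return the same total. On a real system the difference shows up in how much data has to be read from disk or memory, which this toy example does not measure.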

Cloudera Impala 1.0 GA was released on the 29th April 2013.

http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/

With the advent of Cloud computing it’s now easier than ever to test new products.

I’ve been using HANA for almost a year now and I love it. To get your own HANA box, see "Get your own SAP HANA, developer edition on Amazon Web Services".

Over the past couple of months I’ve also used AWS to set up a small HADOOP cluster to test out Impala (starting from the earlier BETA releases):

http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

I’ve tested Impala on clusters of 1, 3, 9 and 18 nodes (each node is a separate cloud machine). [Companies such as Yahoo, Twitter & Facebook may use clusters of many thousands of nodes.]

By contrast, HANA on AWS runs only on a single machine.

I don’t consider HANA & HADOOP/IMPALA rival products, just different tools for different purposes, though there is an overlap.

I focused on SQL read times, row limits and costs for the two solutions, both running on cloud machines hosted by Amazon Web Services (AWS).

To benchmark them I used sample SAP SPL Data and TPC-H data, both loaded with 60 million records.

For details on TPC-H see http://www.tpc.org/tpch/

At this point the analysis only covers queries running on a single table. Depending on feedback I may broaden the scope of the comparison to include more complex queries with joins.
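For anyone who wants to try a similar single-table timing test, here is a rough sketch of a timing harness in Python. It is not the setup used for my slides. It uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; to test HANA or Impala you would swap in a connection from a suitable database driver. The helper name time_query, the tiny synthetic data set and the query are assumptions for illustration; only the column names are borrowed from the TPC-H lineitem table.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, runs=5):
    """Run the query several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch everything so the query fully completes
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in database: SQLite in memory, NOT HANA or Impala.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity REAL, l_returnflag TEXT)"
)

# Tiny synthetic data set (10,000 rows), nowhere near the 60 million used in the benchmark.
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?)",
    [(i, float(i % 50 + 1), "ANR"[i % 3]) for i in range(10_000)],
)

# A simple single-table aggregate, similar in spirit to a reporting query.
query = """
    SELECT l_returnflag, COUNT(*), SUM(l_quantity)
    FROM lineitem
    GROUP BY l_returnflag
"""

result = conn.execute(query).fetchall()
elapsed = time_query(conn, query)
print(f"median of 5 runs: {elapsed:.4f} s")
```

Taking the median of several runs helps smooth out one-off effects such as a cold cache on the first execution.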

If you notice any glaring inaccuracies or omissions then please feel free to let me know. Where possible I'm happy to update my slides accordingly.

All the best

Aron
