Channel: SCN : All Content - SAP HANA Developer Center

SAP HANA PAL - How to use the Apriori Algorithm


PAL is an optional component of SAP HANA whose main purpose is to enable modelers to perform predictive analysis on large volumes of data. Currently, PAL includes classic and universal predictive analysis algorithms in six data-mining categories:

 

  • Clustering
  • Classification
  • Association
  • Time Series
  • Preprocessing
  • Miscellaneous

 

To be able to call PAL functions from SQLScript you need to be on SPS05 and install the Application Function Library (AFL) which includes PAL and BFL (Business Functions Library).

 

In this post I’m going to focus on how to use the Apriori Algorithm. Let’s take a look at what the PAL user guide says about it:

 

“…Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis. Association analysis uncovers the hidden patterns, correlations or causal structures among a set of items or objects. For example, association analysis enables you to understand what products and services customers tend to purchase at the same time. By analyzing the purchasing trends of your customers with association analysis, you can predict their future behavior…”

 

In other words, this algorithm finds frequent itemsets. It is designed to run on transactional tables, which means that you can, for example, use your Sales Order Items table to find patterns in how your customers consume your products or services without having to pre-process the information stored in the table. Running this algorithm is quite simple. The easiest way to understand its purpose is to go to Amazon and browse any product: you’ll notice a section called “Frequently Bought Together”. The content of that section is created using an association algorithm like the one we are discussing in this blog (if not the same one).

 

When you run the Apriori procedure in SAP HANA, the results will be stored in a column table that contains all the associations uncovered by the algorithm, along with three measures: Support, Confidence and Lift.


Support

Support is the proportion of transactions in the data set that contain a particular itemset. Let’s say we have a Sales Order Items table that looks like this:

 

Transaction ID | Item
-------------- | ------
1              | Milk
1              | Bread
2              | Butter
3              | Beer
4              | Milk
4              | Bread
4              | Butter
5              | Milk
5              | Butter
 

In this example, the Itemset {Milk, Bread, Butter} will have a Support = 1/5, because only 1 transaction out of 5 contains that particular Itemset. If we use the Itemset {Milk, Bread}, the Support will be 2/5, because 2 transactions out of 5 contain that combination.
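As a sanity check, support can be computed directly from a list of transactions. The sketch below is plain Python (not PAL), using the sample basket data from the table above:

```python
# Support of an itemset: the fraction of transactions that contain it.
# Sample data mirrors the Sales Order Items table above.
transactions = [
    {"Milk", "Bread"},            # transaction 1
    {"Butter"},                   # transaction 2
    {"Beer"},                     # transaction 3
    {"Milk", "Bread", "Butter"},  # transaction 4
    {"Milk", "Butter"},           # transaction 5
]

def support(itemset, transactions):
    # Count transactions of which the itemset is a subset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Milk", "Bread", "Butter"}, transactions))  # 0.2 (1/5)
print(support({"Milk", "Bread"}, transactions))            # 0.4 (2/5)
```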

 

Confidence

This measure tells you how often a customer buys Item B when they buy Item A. For example, if we use the table above, the Confidence of {Milk, Bread} will be 2/3, because out of the 3 transactions that contain Milk, only 2 contain the Itemset {Milk, Bread}. So the Confidence is defined as:

 

Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)

Lift

In general, if we have a high Support and a high Confidence, this means we have a strong association. However, this can be deceptive in some cases because if Item A and Item B have a high support, we can have a high value for confidence even when they are independent. A better measure to judge the strength of an association rule is the Lift. The Lift ratio is the confidence of the rule divided by the confidence assuming the consequent (Item B) and the antecedent (Item A) are independent:

 

Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B) = Support(A ∪ B) / (Support(A) × Support(B))

A Lift ratio greater than 1.0 suggests that we have a strong rule; the larger the Lift ratio, the greater the strength of the association. In our previous example, the Lift ratio for {Milk, Bread} will be:

Lift(Milk ⇒ Bread) = (2/5) / ((3/5) × (2/5)) = (2/3) / (2/5) = 5/3 ≈ 1.67
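Both ratios can be checked by hand against the sample table. The sketch below is plain Python (not PAL), reusing the same five transactions:

```python
# Verify Confidence and Lift for the rule {Milk} => {Bread} using the
# sample Sales Order Items data from the table above.
transactions = [
    {"Milk", "Bread"},            # 1
    {"Butter"},                   # 2
    {"Beer"},                     # 3
    {"Milk", "Bread", "Butter"},  # 4
    {"Milk", "Butter"},           # 5
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # Confidence(A => B) = Support(A ∪ B) / Support(A)
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # Lift(A => B) = Confidence(A => B) / Support(B)
    return confidence(antecedent, consequent) / support(consequent)

print(confidence({"Milk"}, {"Bread"}))  # 2/3 ≈ 0.667
print(lift({"Milk"}, {"Bread"}))        # 5/3 ≈ 1.67
```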

Ok, now that we understand the different concepts associated with this algorithm, let the coding begin!

 

Code

For my example, I would like to know which #HASHTAGS are frequently tweeted together on Twitter when tweeting about SAP HANA, so I downloaded a number of Tweets (15,000 to be precise) that contain the words SAP and HANA. After downloading the data, I created a column table with the following structure to store the Tweets:

 

 

CREATE COLUMN TABLE TWT (
       "ID" SMALLINT,
       "DATE" INTEGER, --> I use INTEGER for the date because I'm using the UNIX timestamp format
       "CONTENT" NVARCHAR(200),
  PRIMARY KEY ("ID"));
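Since the "DATE" column holds a UNIX timestamp as an INTEGER, each Tweet's creation time must be converted to seconds since the epoch before insertion. A minimal sketch in plain Python (the sample date is made up for illustration):

```python
from datetime import datetime, timezone

# Convert a tweet's creation time to a UNIX timestamp (seconds since
# 1970-01-01 UTC) suitable for the INTEGER "DATE" column above.
created_at = datetime(2013, 2, 4, 15, 30, 0, tzinfo=timezone.utc)  # hypothetical date
unix_ts = int(created_at.timestamp())
print(unix_ts)  # 1359991800
```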

 

After uploading the Tweets to SAP HANA, I ran a Text Analysis process by executing the following statement:

 

CREATE FULLTEXT INDEX TWT_FTI ON TWT("CONTENT")
TEXT ANALYSIS ON CONFIGURATION 'EXTRACTION_CORE';

 

I ran this Text Analysis process because I want to identify all the #HASHTAGS in my Tweets. After creating the Full Text Index, a new table called $TA_TWT_FTI is created. Let’s take a look at its content:

cap1.png

As you can see, in this table I can easily identify all the different #HASHTAGS simply by filtering on column TYPE = SOCIAL_MEDIA/TOPIC_TWITTER. If you would like to know more about the Text Analysis process, take a look at my previous post, where I describe in detail how to use it. My next step is to create a SQL View on this table that shows only #HASHTAGS:

 

CREATE VIEW "PAL_TRANS_TAB" AS
SELECT ID, TOKEN FROM "$TA_TWT_FTI"
WHERE TYPE = 'SOCIAL_MEDIA/TOPIC_TWITTER';

 

The resulting view looks like this:

cap2.png

This looks pretty similar to a Sales Order Items table, doesn’t it? Just replace the ID with the Sales Order ID and the Token with the Product or Service ID included in the Sales Order. Now I’m ready to run the Apriori algorithm:

 

SET SCHEMA _SYS_AFL;

 

/* CREATE TABLE TYPE FOR MY INPUT DATA */

DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
"ID" INT,
"TOKEN" VARCHAR(100)
);

 

/* CREATE TABLE TYPE FOR THE OUTPUT TABLE */

DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
"PRERULE" VARCHAR(500),
"POSTRULE" VARCHAR(500),
"SUPPORT" DOUBLE,
"CONFIDENCE" DOUBLE,
"LIFT" DOUBLE
);

 

/* CREATE TABLE TYPE FOR THE OUTPUT PMML MODEL */

DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
"ID" INT,
"PMMLMODEL" VARCHAR(5000)
);

 

/* CREATE TABLE TYPE FOR THE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */

DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
"NAME" VARCHAR(50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR(100)
);

 

/* CREATE TABLE THAT WILL POINT TO THE DIFFERENT TYPES I'M USING TO RUN THE ALGORITHM */

DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
"ID" INT,
"TYPENAME" VARCHAR(100),
"DIRECTION" VARCHAR(100) );

 

/* FILL THE TABLE */

INSERT INTO PDATA VALUES (1, '_SYS_AFL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, '_SYS_AFL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, '_SYS_AFL.PAL_RESULT_T', 'out');
INSERT INTO PDATA VALUES (4, '_SYS_AFL.PAL_PMMLMODEL_T', 'out');

 

/* GENERATE THE APRIORI PROCEDURE */

DROP PROCEDURE PAL_APRIORI_RULE;
DROP TYPE PAL_APRIORI_RULE__TT_P1;
DROP TYPE PAL_APRIORI_RULE__TT_P2;
DROP TYPE PAL_APRIORI_RULE__TT_P3;
DROP TYPE PAL_APRIORI_RULE__TT_P4;
CALL SYSTEM.afl_wrapper_generator('PAL_APRIORI_RULE', 'AFLPAL', 'APRIORIRULE', PDATA);

 

/* CREATE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */

DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
"NAME" VARCHAR(50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR(100)
);

 

/* FILL THE TABLE */

/* NUMBER OF THREADS TO BE USED */
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
/* MINIMUM SUPPORT THRESHOLD */
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_SUPPORT', null, 0.02, null);
/* MINIMUM CONFIDENCE THRESHOLD */
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_CONFIDENCE', null, 0.02, null);

 

/* CREATE THE OUTPUT TABLE */

DROP TABLE PAL_RESULT_TAB;
CREATE COLUMN TABLE PAL_RESULT_TAB(
"PRERULE" VARCHAR(500),
"POSTRULE" VARCHAR(500),
"SUPPORT" DOUBLE,
"CONFIDENCE" DOUBLE,
"LIFT" DOUBLE
);

 

/* CREATE THE OUTPUT TABLE WITH THE PMML MODEL */

DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB (
"ID" INT,
"PMMLMODEL" VARCHAR(5000)
);

 

/* CALL THE APRIORI ALGORITHM */

CALL PAL_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) WITH OVERVIEW;

 

/* SHOW THE RESULTS */

SELECT * FROM PAL_RESULT_TAB;

SELECT * FROM PAL_PMMLMODEL_TAB;

 

This is what the output table looks like:

cap3.png

Of course, the strongest rules are {#HANA, #SAP} and {#SAP, #HANA}, because that is the selection criteria I used to download the data from Twitter. The interesting thing about this algorithm, however, is that it does not only create rules with two items; a rule can contain multiple items. For example, people who Tweet using the #HASHTAGS #SAP and #SAPTechEd in the same Tweet usually also include the #HASHTAG #HANA. I hope you enjoyed it!
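The multi-item behavior can be reproduced in miniature. The sketch below is a toy re-implementation of Apriori's frequent-itemset step in plain Python (not the PAL code), over made-up hashtag baskets; it enumerates candidate itemsets of growing size and keeps those meeting a minimum support, stopping early once a size yields nothing frequent (the Apriori property):

```python
from itertools import combinations

# Toy frequent-itemset mining over invented hashtag baskets.
baskets = [
    {"#SAP", "#HANA"},
    {"#SAP", "#HANA", "#SAPTechEd"},
    {"#SAP", "#HANA", "#SAPTechEd"},
    {"#SAP", "#BigData"},
]

def frequent_itemsets(baskets, min_support):
    items = sorted({i for b in baskets for i in b})
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            sup = sum(1 for b in baskets if set(combo) <= b) / len(baskets)
            if sup >= min_support:
                result[combo] = sup
                found_any = True
        # Apriori property: if no itemset of this size is frequent,
        # no larger itemset can be frequent either.
        if not found_any:
            break
    return result

for itemset, sup in frequent_itemsets(baskets, 0.5).items():
    print(itemset, sup)
```

With a minimum support of 0.5, the three-item set {#HANA, #SAP, #SAPTechEd} survives, mirroring the multi-item rules PAL produces.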

 

Cheers, Lucas.

 

Information in Spanish about SAP HANA™:

www.HablemosHANA.com

 


