Channel: SCN : All Content - SAP HANA Developer Center

SAP HANA PAL - How to use the Apriori Algorithm


PAL is an optional component of SAP HANA whose main purpose is to enable modelers to perform predictive analysis on large volumes of data. Currently, PAL includes classic and universal predictive analysis algorithms in six data-mining categories:

 

  • Clustering
  • Classification
  • Association
  • Time Series
  • Preprocessing
  • Miscellaneous

 

To be able to call PAL functions from SQLScript you need to be on SPS05 and install the Application Function Library (AFL) which includes PAL and BFL (Business Functions Library).

 

In this post I’m going to focus on how to use the Apriori Algorithm. Let’s take a look at what the PAL user guide says about it:

 

“…Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis. Association analysis uncovers the hidden patterns, correlations or causal structures among a set of items or objects. For example, association analysis enables you to understand what products and services customers tend to purchase at the same time. By analyzing the purchasing trends of your customers with association analysis, you can predict their future behavior…”

 

In other words, this algorithm finds frequent itemsets. It is designed to run on transactional tables, which means that you can, for example, use your Sales Order Items table to find patterns in how your customers consume your products or services without having to pre-process the information stored in the table. Running this algorithm is quite simple. The easiest way to understand its purpose is to go to Amazon and browse any product: you’ll notice a section called “Frequently Bought Together”. The content of that section is created using an association algorithm like the one we are discussing in this blog (if not the same one).

 

When you run the Apriori procedure in SAP HANA, the results will be stored in a column table that contains all the associations uncovered by the algorithm, along with three measures: Support, Confidence and Lift.


Support

Support is the proportion of transactions in the data set that contain a particular itemset. Let’s say we have a Sales Order Items table that looks like this:

 

Transaction ID | Item
-------------- | ------
1              | Milk
1              | Bread
2              | Butter
3              | Beer
4              | Milk
4              | Bread
4              | Butter
5              | Milk
5              | Butter
 

In this example, the Itemset {Milk, Bread, Butter} will have a Support = 1/5, because only 1 transaction out of 5 contains that particular Itemset. If we use the Itemset {Milk, Bread}, the Support will be 2/5, because 2 transactions out of 5 contain that combination.
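As a sanity check, support can be computed directly from a list of transactions. The sketch below is plain Python (not PAL), using the sample basket data from the table above:

```python
# Support of an itemset: the fraction of transactions that contain it.
# Sample data mirrors the Sales Order Items table above.
transactions = [
    {"Milk", "Bread"},            # transaction 1
    {"Butter"},                   # transaction 2
    {"Beer"},                     # transaction 3
    {"Milk", "Bread", "Butter"},  # transaction 4
    {"Milk", "Butter"},           # transaction 5
]

def support(itemset, transactions):
    # Count transactions of which the itemset is a subset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Milk", "Bread", "Butter"}, transactions))  # 0.2 (1/5)
print(support({"Milk", "Bread"}, transactions))            # 0.4 (2/5)
```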

 

Confidence

This measure tells you how often a customer buys Item B when they buy Item A. For example, if we use the table above, the Confidence of {Milk, Bread} will be 2/3, because out of the 3 transactions that contain Milk, only 2 contain the Itemset {Milk, Bread}. So the Confidence is defined as:

 

Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)

Lift

In general, if we have a high Support and a high Confidence, this means we have a strong association. However, this can be deceptive in some cases because if Item A and Item B have a high support, we can have a high value for confidence even when they are independent. A better measure to judge the strength of an association rule is the Lift. The Lift ratio is the confidence of the rule divided by the confidence assuming the consequent (Item B) and the antecedent (Item A) are independent:

 

Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B) = Support(A ∪ B) / (Support(A) × Support(B))

A Lift ratio greater than 1.0 suggests that we have a strong rule; the larger the Lift ratio, the greater the strength of the association. In our previous example, the Lift ratio for {Milk, Bread} will be:

Lift(Milk ⇒ Bread) = (2/5) / ((3/5) × (2/5)) = (2/3) / (2/5) = 5/3 ≈ 1.67
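Both ratios can be checked by hand against the sample table. The sketch below is plain Python (not PAL), reusing the same five transactions:

```python
# Verify Confidence and Lift for the rule {Milk} => {Bread} using the
# sample Sales Order Items data from the table above.
transactions = [
    {"Milk", "Bread"},            # 1
    {"Butter"},                   # 2
    {"Beer"},                     # 3
    {"Milk", "Bread", "Butter"},  # 4
    {"Milk", "Butter"},           # 5
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # Confidence(A => B) = Support(A ∪ B) / Support(A)
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # Lift(A => B) = Confidence(A => B) / Support(B)
    return confidence(antecedent, consequent) / support(consequent)

print(confidence({"Milk"}, {"Bread"}))  # 2/3 ≈ 0.667
print(lift({"Milk"}, {"Bread"}))        # 5/3 ≈ 1.67
```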

Ok, now that we understand the different concepts associated with this algorithm, let the coding begin!

 

Code

For my example, I would like to know which #HASHTAGS are frequently tweeted together on Twitter when tweeting about SAP HANA, so I downloaded a number of Tweets (15,000 to be precise) that contain the words SAP and HANA. After downloading the data, I created a column table with the following structure to store the Tweets:

 

 

CREATE COLUMN TABLE TWT (
       "ID" SMALLINT,
       "DATE" INTEGER, --> I use INTEGER for the date because I'm using the UNIX timestamp format
       "CONTENT" NVARCHAR(200),
  PRIMARY KEY ("ID"));
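Since the "DATE" column holds a UNIX timestamp as an INTEGER, each Tweet's creation time must be converted to seconds since the epoch before insertion. A minimal sketch in plain Python (the sample date is made up for illustration):

```python
from datetime import datetime, timezone

# Convert a tweet's creation time to a UNIX timestamp (seconds since
# 1970-01-01 UTC) suitable for the INTEGER "DATE" column above.
created_at = datetime(2013, 2, 4, 15, 30, 0, tzinfo=timezone.utc)  # hypothetical date
unix_ts = int(created_at.timestamp())
print(unix_ts)  # 1359991800
```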

 

After uploading the Tweets to SAP HANA, I ran a Text Analysis process by executing the following statement:

 

CREATE FULLTEXT INDEX TWT_FTI ON TWT("CONTENT")
TEXT ANALYSIS ON CONFIGURATION 'EXTRACTION_CORE';

 

I ran this Text Analysis process because I want to identify all the #HASHTAGS in my Tweets. After creating the Full Text Index, a new table called $TA_TWT_FTI is created. Let’s take a look at its content:

cap1.png

As you can see, in this table I can easily identify all the different #HASHTAGS simply by filtering on column TYPE = SOCIAL_MEDIA/TOPIC_TWITTER. If you would like to know more about the Text Analysis process, take a look at my previous post, where I describe in detail how to use it. My next step is to create a SQL View on this table that shows only #HASHTAGS:

 

CREATE VIEW "PAL_TRANS_TAB" AS
SELECT ID, TOKEN FROM "$TA_TWT_FTI"
WHERE TYPE = 'SOCIAL_MEDIA/TOPIC_TWITTER';

 

The resulting view looks like this:

cap2.png

This looks pretty similar to a Sales Order Items table, doesn’t it? Just replace the ID with the Sales Order ID and the Token with the Product or Service ID included in the Sales Order. Now I’m ready to run the Apriori algorithm:

 

SET SCHEMA _SYS_AFL;

 

/* CREATE TABLE TYPE FOR MY INPUT DATA */

DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
"ID" INT,
"TOKEN" VARCHAR(100)
);

 

/* CREATE TABLE TYPE FOR THE OUTPUT TABLE */

DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
"PRERULE" VARCHAR(500),
"POSTRULE" VARCHAR(500),
"SUPPORT" DOUBLE,
"CONFIDENCE" DOUBLE,
"LIFT" DOUBLE
);

 

/* CREATE TABLE TYPE FOR THE OUTPUT PMML MODEL */

DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
"ID" INT,
"PMMLMODEL" VARCHAR(5000)
);

 

/* CREATE TABLE TYPE FOR THE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */

DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
"NAME" VARCHAR(50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR(100)
);

 

/* CREATE TABLE THAT WILL POINT TO THE DIFFERENT TYPES I'M USING TO RUN THE ALGORITHM */

DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
"ID" INT,
"TYPENAME" VARCHAR(100),
"DIRECTION" VARCHAR(100) );

 

/* FILL THE TABLE */

INSERT INTO PDATA VALUES (1, '_SYS_AFL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, '_SYS_AFL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, '_SYS_AFL.PAL_RESULT_T', 'out');
INSERT INTO PDATA VALUES (4, '_SYS_AFL.PAL_PMMLMODEL_T', 'out');

 

/* GENERATE THE APRIORI PROCEDURE */

DROP PROCEDURE PAL_APRIORI_RULE;
DROP TYPE PAL_APRIORI_RULE__TT_P1;
DROP TYPE PAL_APRIORI_RULE__TT_P2;
DROP TYPE PAL_APRIORI_RULE__TT_P3;
DROP TYPE PAL_APRIORI_RULE__TT_P4;
CALL SYSTEM.afl_wrapper_generator('PAL_APRIORI_RULE', 'AFLPAL', 'APRIORIRULE', PDATA);

 

/* CREATE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */

DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
"NAME" VARCHAR(50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR(100)
);

 

/* FILL THE TABLE */

/* NUMBER OF THREADS TO BE USED */
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
/* MINIMUM SUPPORT THRESHOLD */
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_SUPPORT', null, 0.02, null);
/* MINIMUM CONFIDENCE THRESHOLD */
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_CONFIDENCE', null, 0.02, null);

 

/* CREATE THE OUTPUT TABLE */

DROP TABLE PAL_RESULT_TAB;
CREATE COLUMN TABLE PAL_RESULT_TAB(
"PRERULE" VARCHAR(500),
"POSTRULE" VARCHAR(500),
"SUPPORT" DOUBLE,
"CONFIDENCE" DOUBLE,
"LIFT" DOUBLE
);

 

/* CREATE THE OUTPUT TABLE WITH THE PMML MODEL */

DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB (
"ID" INT,
"PMMLMODEL" VARCHAR(5000)
);

 

/* CALL THE APRIORI ALGORITHM */

CALL PAL_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) WITH OVERVIEW;

 

/* SHOW THE RESULTS */

SELECT * FROM PAL_RESULT_TAB;

SELECT * FROM PAL_PMMLMODEL_TAB;

 

This is what the output table looks like:

cap3.png

Of course, the strongest rules are {#HANA, #SAP} and {#SAP, #HANA}, because that is the selection criteria I used to download the data from Twitter. The interesting thing about this algorithm, however, is that it does not only create rules with two items; a rule can contain multiple items. For example, people who Tweet using the #HASHTAGS #SAP and #SAPTechEd in the same Tweet usually also include the #HASHTAG #HANA. I hope you enjoyed it!
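The multi-item behavior can be reproduced in miniature. The sketch below is a toy re-implementation of Apriori's frequent-itemset step in plain Python (not the PAL code), over made-up hashtag baskets; it enumerates candidate itemsets of growing size and keeps those meeting a minimum support, stopping early once a size yields nothing frequent (the Apriori property):

```python
from itertools import combinations

# Toy frequent-itemset mining over invented hashtag baskets.
baskets = [
    {"#SAP", "#HANA"},
    {"#SAP", "#HANA", "#SAPTechEd"},
    {"#SAP", "#HANA", "#SAPTechEd"},
    {"#SAP", "#BigData"},
]

def frequent_itemsets(baskets, min_support):
    items = sorted({i for b in baskets for i in b})
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            sup = sum(1 for b in baskets if set(combo) <= b) / len(baskets)
            if sup >= min_support:
                result[combo] = sup
                found_any = True
        # Apriori property: if no itemset of this size is frequent,
        # no larger itemset can be frequent either.
        if not found_any:
            break
    return result

for itemset, sup in frequent_itemsets(baskets, 0.5).items():
    print(itemset, sup)
```

With a minimum support of 0.5, the three-item set {#HANA, #SAP, #SAPTechEd} survives, mirroring the multi-item rules PAL produces.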

 

Cheers, Lucas.

 

Information in Spanish about SAP HANA™:

www.HablemosHANA.com

 


