Using Hierarchical Basket Analysis to solve Association Problems
Hierarchical basket analysis generates association rules from frequent itemsets identified by the Frequent Itemsets Mining task.
Prerequisites
you must have created a flow;
the required datasets must have been imported into the flow;
the data used for the analysis must have been well prepared, and include a categorical output and a number of inputs. Data preparation may involve discretization before running the analysis, to improve the accuracy of the model and to reduce the computational effort.
a unified model must have been created by merging all the datasets in the flow.
the Hierarchical Basket Analysis task must be connected to a preceding Frequent Itemsets Mining task, whose output is used as the input of the basket analysis.
Additional tabs
The results of the task can be viewed in two separate tabs:
The Association rules tab displays the generated association rules, where:
Rule ID is the association rule ID.
Positive/negative premise(s) distinguishes between positive and negative premises. Negative premises only appear if the Negative Rules option is selected. If negative premises are listed in the current row, NOT is printed in this column; otherwise nothing is printed.
Premise Item ID contains the Item IDs of premises.
Positive/negative consequence(s) distinguishes between positive and negative consequences. Negative consequences only appear if the Negative Rules option is selected. If negative consequences are listed in the current row, NOT is printed in this column; otherwise nothing is printed.
Consequence Item ID contains the Item ID(s) of consequences.
Support premise(s) is the percentage of orders in which premise(s) appear in the dataset.
Support # premise(s) is the number of times in which premise(s) appear in the dataset.
Support consequence(s) is the number of times consequence(s) appear in the dataset.
Support shows the relevance of the considered rule, i.e. it counts how many transactions include both all premises and all consequences, and is expressed as a percentage with respect to the total number of orders.
Support # shows the relevance of the considered rule, i.e. it counts how many transactions include both all premises and all consequences, and is expressed in absolute terms.
Confidence measures the reliability of the considered association rule. More specifically, it measures the following: if all the items in the premise of the rule are bought, how often are all the items in the consequence bought too. Confidence values range between 0 and 1.
Lift represents a relative measure of interdependence between premises and consequences. If consequences are independent of premises, lift is equal to 1. Consequently, if the lift is greater than 1 there is a direct correlation between item purchases, while a lift lower than 1 is an indicator of inverse correlation.
Cosine is a normalized interdependence measure, ranging between 0 and 1. The greater the cosine score, the stronger the interdependence between premise(s) and consequence(s).
Conviction represents a specificity measure, proportional to confidence and inversely proportional to support. The conviction value increases for reliable and rare associations and tends to infinity if confidence is maximum (i.e. equal to 1).
Leverage represents an absolute measure of interdependence between premise(s) and consequence(s). If consequence(s) are independent from premise(s), leverage is equal to 0.
Chi-square reports the value of the Chi-square test. If it is missing, the contingency table associated with the rule does not allow a reliable p-value estimate through the Chi-square test. In these cases, Fisher's exact test is preferred and its p-value estimate is upper-bounded as shown in [2].
p-value is the probability, under the null hypothesis associated with the rule (i.e. no relationship between premise and consequence), of observing an association at least as strong as the one found.
The Results tab, where details on the execution of the analysis are displayed:
Task identifier is the ID code for the task, internally used by the Rulex engine.
Task name is the name of the task.
Elapsed time is the time required for the latest computation (in seconds).
Minimum # support threshold for items is the minimum threshold for items applied during latest computation, in absolute terms.
Minimum support threshold for items (percentage) is the minimum threshold for items applied during latest computation as a percentage.
Number of different items in input is the number of distinct items which were fed to the task during latest computation.
Number of different orders in input is the number of distinct orders which were fed to the task during latest computation.
Number of generated association rules is the number of association rules displayed in the Association Rules tab.
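The measures listed for the Association rules tab can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions of support, confidence, lift, cosine, leverage and conviction; the function name `rule_metrics` and its parameters are hypothetical, and the task itself may compute these values differently.

```python
import math


def rule_metrics(n_orders, n_premise, n_consequence, n_both):
    """Compute rule measures from absolute counts.

    Hypothetical helper using textbook definitions (an assumption, not the
    task's documented implementation). All counts are assumed to be positive.
    """
    supp_premise = n_premise / n_orders          # Support premise(s)
    supp_consequence = n_consequence / n_orders  # consequence support as a fraction
    support = n_both / n_orders                  # Support of the whole rule
    confidence = n_both / n_premise              # share of premise orders that also hold the consequence
    lift = confidence / supp_consequence         # equals 1 under independence
    cosine = support / math.sqrt(supp_premise * supp_consequence)
    leverage = support - supp_premise * supp_consequence  # equals 0 under independence
    # Conviction tends to infinity when confidence reaches its maximum of 1.
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - supp_consequence) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "cosine": cosine,
        "leverage": leverage,
        "conviction": conviction,
    }


# Illustrative counts only: 1000 orders, premise in 200, consequence in 250, both in 100.
metrics = rule_metrics(n_orders=1000, n_premise=200, n_consequence=250, n_both=100)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts the lift is above 1 and the leverage is above 0, which matches the reading of a direct correlation given in the descriptions above.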
Procedure
Drag the Hierarchical Basket Analysis task onto the stage.
Connect a Frequent Itemsets Mining task, which contains the frequent itemsets from which you want to extract the associations, to the new task.
Double-click the Hierarchical Basket Analysis task. The left-hand pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.
Click on the Basic tab to configure the basic options, as described in the table below.
Click on the Advanced tab to configure the advanced options as described in the table below.
Click on the Output tab to configure the output options as described in the table below.
Save and compute the task.
Hierarchical Basket Analysis Basic options | |
Name | Description |
---|---|
Minimum item support (# samples) | All items which appear in orders fewer times than this threshold are discarded. This option is enabled only if Auto (specify #items) option is not selected. |
Auto (specify #items) | If selected, the minimum support for items is automatically computed according to the minimum number of items to take into account specified in the #Items to consider option. |
#Items to consider | The number of items to take into account (most frequent first). This option is enabled only if the Auto (specify #items) option is selected. |
Minimum association rule support (# samples) | All association rules which are verified fewer times than this threshold are discarded. This option is enabled only if Auto (above average) option is not selected. |
Auto (above average) | If selected, the minimum association rule support is set to the average support of rules with the same dimension (i.e. with the same premise(s)+consequence(s) number). |
Minimum confidence, minimum lift | The minimum confidence and lift values for association rules. |
Minimum Kulczynski index, maximum p-value | The minimum Kulczynski index and the maximum p-value for association rules. |
No maximum # of premises/consequences | If selected, no limit is set on the number of premises and consequences. |
Maximum # of premises/consequences | The maximum number of premises and consequences of the association rules. This option is enabled only if the No maximum # of premises/consequences option is not selected. |
Hierarchical Basket Analysis Advanced options | |
Attribute to filter to select rows including relevant data | Drag and drop attributes to this edit box from the Available attributes list to specify a filtering criterion. Items satisfying this criterion are not discarded, regardless of their support. Instead of manually dragging and dropping attributes, they can be defined via a filtered list. |
Attribute to filter to discard rows including irrelevant data | Drag and drop attributes to this edit box from the Available attributes list to specify a filtering criterion. Items satisfying this criterion are discarded, regardless of their support. If both the selecting and the discarding filters are specified, the discarding filter prevails. Instead of manually dragging and dropping attributes, they can be defined via a filtered list. |
Hierarchical Basket Analysis Output options | |
No maximum # of premises/consequences | If selected, no limit is set on the number of premises/consequences in the generated rules. |
Minimum number of different attributes involved in each role | Specify the minimum number of different attributes that must be included in each role. |
Negative Rules (NOT A implies B, A implies NOT B) | If selected, negative rules are also generated. Negative rules are rules for which premise(s) or consequence(s) appear in negative form. For instance: A implies NOT B or NOT A implies B. |
Maximum Kulczynski value which triggers the check for negative rules | A high value of the Kulczynski index identifies a strong and robust correlation between the premises and consequences of a rule, so the same index can also be used, from another perspective, to guide the mining of negative rules. If the Kulczynski index is low (up to the specified maximum value), the task evaluates whether the considered rule becomes strong when expressed in negative form (for instance when negating the premise). |
Maximum # of premises/consequences, negative rules | The maximum number of premises and consequences of the association rules. This option is enabled only if the Negative Rules (NOT A implies B, A implies NOT B) option is selected. |
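The role of the Kulczynski threshold in the negative rule check can be sketched as follows. This is only an illustration under stated assumptions: it uses the common definition of the Kulczynski index as the average of the two conditional probabilities P(B|A) and P(A|B), and the functions `kulczynski` and `negative_candidates` are hypothetical names, not part of the task. The procedure actually applied by the task is not documented here.

```python
def kulczynski(n_a, n_b, n_both):
    """Average of P(B|A) and P(A|B), computed from absolute counts."""
    return 0.5 * (n_both / n_a + n_both / n_b)


def negative_candidates(n_orders, n_a, n_b, n_both, max_kulc):
    """Return the Kulczynski index of the negated forms of the rule A implies B.

    Hypothetical sketch: the negated forms are evaluated only when the
    positive rule is weak, i.e. its index does not exceed max_kulc.
    Counts are assumed to satisfy 0 < n_a < n_orders and 0 < n_b < n_orders.
    """
    if kulczynski(n_a, n_b, n_both) > max_kulc:
        return {}  # positive rule is strong enough, no negative check
    n_not_a = n_orders - n_a  # orders without A
    n_not_b = n_orders - n_b  # orders without B
    return {
        # orders containing A but not B
        "A implies NOT B": kulczynski(n_a, n_not_b, n_a - n_both),
        # orders containing B but not A
        "NOT A implies B": kulczynski(n_not_a, n_b, n_b - n_both),
    }


# Illustrative counts only: A in 400 orders, B in 500, both together in just 20.
for rule, value in negative_candidates(1000, 400, 500, 20, max_kulc=0.3).items():
    print(f"{rule}: {value:.3f}")
```

In this illustrative case the positive rule has a very low index, while both negated forms score much higher, which is the situation the option is meant to detect.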
Example
The following example uses the Groceries dataset and follows on from the Frequent Itemsets Mining example.
Description | Screenshot |
---|---|
After having extracted the frequent itemsets with the Frequent Itemsets Mining task, add the Hierarchical Basket Analysis task to the flow. Set the following options:
Save and compute the task. | |
Association rules are stored in the Association Rules tab. Each association rule will be characterized by premise(s) and consequence(s). If, for instance, a rule includes tropical fruit as a premise and citrus fruit as a consequence, it means that if a transaction includes a tropical fruit, it is also likely to include a citrus fruit. | |
Different indicators qualify and quantify the strength of this cross-selling relationship. To view which rules have the highest confidence, right-click on the Confidence column in the Association Rules tab and select Sort Descending. We can now perform a few further steps to analyze the extracted rules in more detail, and perform filtering and statistical operations on the rules. | As the most reliable rules may have low support, you could try repeating the analysis after setting the Minimum association rule support to 20, and check which rule has the highest lift. |
In order to view the rules themselves and not just the dataset we can import the rules only by adding an Import From Task task to the flow. Double-click the task and select:
Save and compute the task. | |
Add a Data Manager task to the Import from Task to analyze the rules by filtering them in the Query Manager pane. In this example, we wanted to filter all the values in the Lift attribute higher than 1. | |
Alternatively, you could compute min/max or average values in the Sheets tab, for example by using the Variance option from the univariate statistics on the Confidence attribute, as in the example. |
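For readers who want to reproduce the same post-processing outside the interface, the sorting, filtering and variance steps of the example can be sketched in plain Python. The rule values below are made-up illustrative numbers, not results from the Groceries dataset, and the dictionary keys are hypothetical names for the corresponding columns. The sketch assumes the filter keeps the rules with lift higher than 1 and that the variance is the sample variance; the Variance option may use a different convention.

```python
from statistics import variance

# Hypothetical rules with illustrative values (not actual task output).
rules = [
    {"premise": "tropical fruit", "consequence": "citrus fruit", "confidence": 0.4, "lift": 2.5},
    {"premise": "item X", "consequence": "item Y", "confidence": 0.6, "lift": 0.9},
    {"premise": "item Z", "consequence": "item W", "confidence": 0.2, "lift": 1.3},
]

# Equivalent of Sort Descending on the Confidence column.
by_confidence = sorted(rules, key=lambda r: r["confidence"], reverse=True)

# Equivalent of the Data Manager filter on Lift: keep rules with lift higher than 1.
lift_above_one = [r for r in rules if r["lift"] > 1]

# Equivalent of the Variance statistic on the Confidence attribute (sample variance).
confidence_variance = variance([r["confidence"] for r in rules])

print([r["premise"] for r in by_confidence])
print([r["premise"] for r in lift_above_one])
print(round(confidence_variance, 4))
```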