Using Frequent Itemsets Mining to solve Association Problems
Frequent itemset mining extracts recurrent item associations from a dataset. Rulex uses the Equivalence Class Transformation (Eclat) algorithm to perform this task.
A typical application is identifying which items are frequently bought together in a supermarket.
The output would be a table of itemsets which are bought in the same transaction more than a specified number of times. However, the task can be used in many other scenarios, whenever it is possible to identify attributes which define groups (Order key attributes) and attributes that populate these groups with information (Item key attributes).
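The idea behind Eclat can be pictured with a small sketch. It is only an illustration of the general technique (a vertical layout plus intersection of order sets), not Rulex's implementation; the function name `eclat`, the sample rows and the "at least `min_support` orders" convention are assumptions made for this example.

```python
from collections import defaultdict


def eclat(rows, min_support):
    """Return {itemset: support count} for itemsets found in >= min_support orders.

    rows is an iterable of (order, item) pairs (hypothetical input layout).
    """
    # Vertical layout: map each item to the set of orders that contain it.
    order_sets = defaultdict(set)
    for order, item in rows:
        order_sets[item].add(order)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, orders) pairs already frequent together with prefix.
        for i, (item, orders) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(orders)
            # Intersect order sets to find larger itemsets sharing this prefix.
            next_candidates = []
            for other, other_orders in candidates[i + 1:]:
                common = orders & other_orders
                if len(common) >= min_support:
                    next_candidates.append((other, common))
            if next_candidates:
                extend(itemset, next_candidates)

    start = sorted(
        ((item, orders) for item, orders in order_sets.items()
         if len(orders) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Illustrative supermarket data: (order key, item key).
rows = [
    ("o1", "bread"), ("o1", "milk"),
    ("o2", "bread"), ("o2", "milk"), ("o2", "butter"),
    ("o3", "milk"), ("o3", "butter"),
    ("o4", "bread"), ("o4", "butter"), ("o4", "milk"),
]
frequent = eclat(rows, min_support=2)
```

Each key of `frequent` is one itemset and each value is the number of orders containing all of its items, which corresponds to one row of the output table described above.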
Rulex can handle both:
generalized frequent itemset mining, where the items refer to different attributes and consequently carry different information;
hierarchical frequent itemset mining, where the attributes carry the same information with different levels of detail.
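For the hierarchical case, the following sketch shows one way support could be counted when each item has a child value (for example an EAN code) and a parent value (for example a category). It is an assumption-laden illustration, not the Rulex engine's logic: the function name, the row layout and the sample codes are hypothetical, and every element of the hierarchy is counted once per order.

```python
from collections import defaultdict


def hierarchical_item_support(rows):
    """Count, for each child and parent value, the number of distinct orders.

    rows is an iterable of (order, child, parent) tuples; parent may be None.
    """
    orders_by_element = defaultdict(set)
    for order, child, parent in rows:
        if parent is None:
            # Assumption: an uncategorized child is used as its own parent.
            parent = child
        orders_by_element[child].add(order)
        orders_by_element[parent].add(order)
    return {element: len(orders) for element, orders in orders_by_element.items()}


# Hypothetical EAN codes and categories.
rows = [
    ("o1", "8001", "pasta"),
    ("o1", "8002", "pasta"),
    ("o2", "8001", "pasta"),
    ("o2", "9001", None),
    ("o3", "8002", "pasta"),
]
support = hierarchical_item_support(rows)
```

In this data the category `pasta` reaches a support of 3 orders even though neither of its EAN codes appears in more than 2, which is why mining at the parent level can surface associations that are invisible at the child level.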
Prerequisites
you must have created a flow;
the required datasets must have been imported into the flow;
the data used for the analysis must have been well prepared;
a unified model must have been created by merging all the datasets in the flow.
Additional tabs
The results of the task are displayed in two separate tabs:
The Frequent itemsets tab displays the generated item sets, where:
Frequent ItemsetID is the sequential ID number for frequent itemsets.
Cardinality is the cardinality of the frequent itemset.
Support is the percentage of orders in which the frequent itemset appears in the dataset.
Support# is the number of times the frequent itemset appears in the dataset.
All-confidence is the ratio between the support of the itemset and the support of the least frequent item included in the itemset.
Item ID is the ID of the items composing the frequent itemset reported in these columns.
The Results tab displays details on the execution of the analysis, where:
Task Identifier is the ID code for the task, internally used by the Rulex engine.
Task Name is simply the name of the task.
Elapsed time (sec) is the time required for the latest computation (in seconds).
Number of generated frequent itemsets is the number of itemsets which were found to be frequent, according to the support threshold.
Number of different items in input is the number of distinct items which were fed to the task during the latest computation.
Number of different orders in input is the number of distinct orders which were fed to the task during the latest computation.
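The Support, Support# and All-confidence columns can be reproduced by hand on a small dataset. The sketch below follows the definitions given for the Frequent itemsets tab, with all-confidence computed as the itemset count divided by the count of its least frequent item; the function name, the dictionary keys and the sample orders are assumptions for illustration only.

```python
def itemset_metrics(orders, itemset):
    """Compute support metrics for one itemset.

    orders maps an order key to the set of items in that order (hypothetical layout).
    """
    itemset = set(itemset)
    # Support#: number of orders containing every item of the itemset.
    count = sum(1 for items in orders.values() if itemset <= set(items))
    # Per-item counts, needed for all-confidence.
    item_counts = {
        item: sum(1 for items in orders.values() if item in items)
        for item in itemset
    }
    least_frequent = min(item_counts.values())
    return {
        "support_count": count,
        "support_pct": 100 * count / len(orders),
        # Ratio as worded in the column description above.
        "all_confidence": count / least_frequent if least_frequent else 0.0,
    }


orders = {
    "o1": {"bread", "milk"},
    "o2": {"bread", "milk", "butter"},
    "o3": {"milk", "butter"},
    "o4": {"bread"},
}
metrics = itemset_metrics(orders, {"bread", "milk"})
```

Here the itemset appears in 2 of 4 orders, so `support_count` is 2 and `support_pct` is 50.0; both items appear in 3 orders, so `all_confidence` is 2/3.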
Procedure
Drag the Frequent Itemset Mining task onto the stage.
Connect a Data Manager task, which contains the attributes from which you want to extract the associations, to the new task.
Double click the Frequent Itemset Mining task. The left-hand pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.
Configure the Basic options, as described below.
Click on the Advanced tab to configure the frequent itemsets advanced options, as described below.
Click on the Output tab to configure the output options, as described below.
Save and compute the task.
Frequent Itemsets Mining Basic options | |
Name | Description |
---|---|
Order key attributes | Drag and drop the nominal attributes which define orders from the Attributes list onto this list. Instead of manually dragging and dropping attributes, they can be defined via a filtered list. |
Item child attributes | Drag and drop the nominal attributes which characterize items from the Attributes list onto this list. Instead of manually dragging and dropping attributes, they can be defined via a filtered list. |
Item parent attributes | Drag and drop the nominal attributes which correspond to the hierarchically superior level of the attribute inserted in the Item child attributes list. For example, if the analysis involves EAN codes and categories, the EAN code is dragged and dropped onto the Item child attributes list, while the category is inserted in the same position of the Item parent attributes list. If the parent item attribute is not defined for any child instances (i.e. an EAN is not categorized), the child attribute value is repeated in the parent attribute column. This list is enabled only if the Hierarchical item attributes option is selected. Instead of manually dragging and dropping attributes, they can be defined via a filtered list. |
Hierarchical item attributes | If checked, this option denotes the existence of hierarchical attributes which characterize items, consequently enabling the Item parent attributes list. |
Support count only for top-level attributes | If checked, this option modifies support computation so that only top-level attributes in a hierarchy are taken into account. If this option is not checked, for every order, all included elements of the hierarchy increment their support by 1. |
Minimum item support (# samples) | All items which appear in orders fewer times than this threshold are discarded. This value is relevant only if the Auto (specify # items) option is unchecked. |
Auto (specify # items) | If selected, the minimum support for items is automatically computed, and the number of items to be taken into account can be specified in the Items to consider spin box. |
Items to consider | This option is enabled only if the Auto (specify # items) option is selected. The number of items to take into account (most frequent first). |
Keep auto item support threshold also for itemsets | This option is enabled only if the Auto (specify # items) option is not selected. If selected, all itemsets which occur fewer times than this threshold are discarded. |
Minimum itemset support (# samples) | This option is enabled only if the Auto (specify # items) option is not selected. All itemsets which occur fewer times than this threshold are discarded. |
Auto (above average) | If selected, the minimum itemset support value is set to the average support of itemsets with the same dimension. |
Maximum itemset cardinality | The maximum cardinality of generated itemsets. |
No maximum itemset cardinality | If selected, all itemsets with higher support than the specified threshold are generated, regardless of their cardinality. |
Minimum number of different attributes involved in each itemset | Determines the minimum number of different attributes that have to be part of an itemset in order not to discard it. |
Frequent Itemsets Mining Advanced options | |
Attribute to filter to select rows including relevant data | Drag and drop attributes to this edit box (from the Available attributes, the Order key attributes, the Item key attributes or the Auxiliary attributes list) to specify a filtering criterion. Items satisfying this criterion are not discarded, regardless of their support. |
Attribute to filter to discard rows including irrelevant data | Drag and drop attributes to this edit box (from the Available attributes, the Order key attributes, the Item key attributes or the Auxiliary attributes list) to specify a filtering criterion. Items which appear in rows satisfying this criterion are discarded, regardless of their support. If both the selecting and the discarding filters are specified, the discarding filter prevails. |
Maximum factor per auxiliary attribute adjusting support | Specify the value up to which the support of items and associations may be multiplied or divided, according to the average value of its auxiliary attribute(s). |
Auxiliary attributes | Auxiliary attributes are used to take additional criteria into account (together with support) when filtering itemsets. For instance, it is possible to keep items and itemsets whose support is low if their margin is greater than average and, symmetrically, to discard itemsets with high support if their margin is lower than average. Drag and drop attributes from the Available attributes list: those for which you want to calculate the overall quantity go in the Item quantities target list. |
Frequent Itemsets Mining Output options | |
Flag maximal frequent itemsets | If selected, a column is added to the table which specifies whether each frequent itemset is maximal, i.e. whether it is not included within any other frequent itemset. |
Rare itemsets mining | If selected, the output will display rare itemsets instead of frequent itemsets. Rare itemsets are groupings of items that are rarely found together, although they may be frequent individually. |
Maximum itemset support | This threshold value indicates the maximum number of times the items in an itemset can be found together in order to be considered rare. |
Maximum relative support for itemsets | The support value compares the number of times the item appears with and without the other item in the rare itemset. |
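The Flag maximal frequent itemsets option can be illustrated with a short sketch: an itemset is flagged as maximal when no other frequent itemset strictly contains it. The function name and the sample itemsets are hypothetical, and the quadratic comparison is chosen for readability rather than speed.

```python
def flag_maximal(frequent_itemsets):
    """Map each frequent itemset to True if no other frequent itemset contains it."""
    sets = [frozenset(s) for s in frequent_itemsets]
    # `s < other` is a strict-subset test, so an itemset never excludes itself.
    return {s: not any(s < other for other in sets) for s in sets}


flags = flag_maximal([
    {"bread"},
    {"milk"},
    {"bread", "milk"},
    {"butter"},
])
```

In this example `{"bread"}` and `{"milk"}` are not maximal because both are included in `{"bread", "milk"}`, while `{"bread", "milk"}` and `{"butter"}` are flagged as maximal.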
Example
The following example uses the Groceries dataset.
Description | Screenshot |
---|---|
The dataset can be restructured by adding a Reshape To Long task to the flow. | |
The resulting itemsets are displayed in the Frequent Itemsets tab. | |
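The restructuring step in the example can be pictured as turning a wide table (one row per order, one column per item slot) into a long table with one row per order and item pair, which is the layout expected by the Order key and Item attributes. The sketch below assumes such a wide layout; the column names, the function name and the sample rows are hypothetical and do not describe the actual dataset or the Reshape To Long task internals.

```python
def reshape_to_long(wide_rows, order_key):
    """Turn one-row-per-order records into (order, item) pairs, skipping empty cells."""
    long_rows = []
    for row in wide_rows:
        order = row[order_key]
        for column, value in row.items():
            if column == order_key or value in (None, ""):
                continue
            long_rows.append((order, value))
    return long_rows


# Hypothetical wide layout: one column per item slot.
wide_rows = [
    {"order": "o1", "item_1": "bread", "item_2": "milk", "item_3": ""},
    {"order": "o2", "item_1": "butter", "item_2": "", "item_3": ""},
]
long_rows = reshape_to_long(wide_rows, order_key="order")
```

After reshaping, the first element of each pair plays the role of the order key and the second the role of the item.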