Using Frequent Itemsets Mining to solve Association Problems

Frequent itemset mining extracts recurrent item associations from a dataset. Rulex uses the Equivalence Class Transformation (Eclat) algorithm to perform this task.

A typical application is identifying which items are frequently bought together in a supermarket.

The output is a table of itemsets whose items are bought in the same transaction at least a specified number of times. However, the task can be used in many other scenarios, whenever it is possible to identify attributes which define groups (Order key attributes) and attributes that populate these groups with information (Item key attributes).

Rulex can handle both:

  • generalized frequent itemset mining, where the items refer to different attributes and consequently carry different information

  • hierarchical frequent itemset mining, where the attributes carry the same information with different levels of detail.
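
The idea behind Eclat can be illustrated with a minimal Python sketch. This is not the Rulex implementation: the function name eclat, its parameters and the sample data are hypothetical, and it only shows the general technique of storing, for each item, the set of orders that contain it and intersecting those sets to count the support of larger itemsets.

```python
from collections import defaultdict


def eclat(rows, min_support, max_cardinality=None):
    """Illustrative Eclat sketch (hypothetical helper, not Rulex code).

    rows: iterable of (order_id, item) pairs.
    Returns {frozenset(items): number of orders containing all of them}
    for every itemset found in at least min_support orders.
    """
    # Vertical layout: each item maps to the set of orders containing it.
    tidsets = defaultdict(set)
    for order_id, item in rows:
        tidsets[item].add(order_id)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, orders containing prefix + item), all frequent
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            if max_cardinality is not None and len(itemset) >= max_cardinality:
                continue
            # Intersect with the remaining candidates to grow the itemset.
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    next_candidates.append((other, shared))
            if next_candidates:
                extend(itemset, next_candidates)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Made-up sample data: one (order, item) pair per purchase.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
    (4, "bread"), (4, "milk"),
]
result = eclat(rows, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# ['bread'] 3
# ['eggs'] 2
# ['milk'] 4
# ['bread', 'milk'] 3
# ['eggs', 'milk'] 2
```

In this sketch an itemset is kept when its count reaches the threshold, and max_cardinality stops the search from growing itemsets beyond the requested size.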


Prerequisites

Additional tabs

The results of the task are displayed in two separate tabs:

  • The Frequent itemsets tab displays the generated item sets, where:

    • Frequent ItemsetID is the sequential ID number for frequent itemsets.

    • Cardinality is the cardinality of the frequent itemset.

    • Support is the percentage of orders in which the frequent itemset appears in the dataset.

    • Support# is the number of times the frequent itemset appears in the dataset.

    • All-confidence is the ratio between the support of the itemset and the support of the least frequent item included in the itemset.

    • Item ID is the ID of each item in the frequent itemset; the items that make up the itemset are reported in these columns.

  • The Results tab displays details on the execution of the analysis, where:

    • Task Identifier is the ID code for the task, internally used by the Rulex engine.

    • Task Name is simply the name of the task.

    • Elapsed time (sec) is the time required for the latest computation (in seconds).

    • Number of generated frequent itemsets is the number of itemsets which were found to be frequent, according to the support threshold.

    • Number of different items in input is the number of distinct items which were fed to the task during the latest computation.

    • Number of different orders in input is the number of distinct orders which were fed to the task during the latest computation.
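
The Support, Support# and All-confidence columns can be reproduced by hand on a small dataset. The following Python sketch reads the column descriptions above literally; the helper name itemset_metrics and the sample orders are hypothetical, and the values Rulex actually reports may be computed differently.

```python
def itemset_metrics(orders, itemset):
    """Hypothetical helper, following the column descriptions above.

    orders: dict mapping order ID -> set of items in that order.
    Returns (support_count, support_percent, all_confidence).
    """
    itemset = set(itemset)
    if not orders or not itemset:
        return 0, 0.0, 0.0

    # Support#: number of orders that contain every item of the itemset.
    support_count = sum(1 for items in orders.values() if itemset <= items)

    # Support: the same count as a percentage of all orders.
    support_percent = 100.0 * support_count / len(orders)

    # All-confidence, as worded above: itemset support divided by the
    # support of the least frequent item in the itemset.
    least_frequent = min(
        sum(1 for items in orders.values() if item in items) for item in itemset
    )
    all_confidence = support_count / least_frequent if least_frequent else 0.0
    return support_count, support_percent, all_confidence


# Made-up sample data.
orders = {
    1: {"bread", "milk"},
    2: {"bread", "milk", "eggs"},
    3: {"milk", "eggs"},
    4: {"bread", "milk"},
}
print(itemset_metrics(orders, {"bread", "eggs"}))  # (1, 25.0, 0.5)
print(itemset_metrics(orders, {"bread", "milk"}))  # (3, 75.0, 1.0)
```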


Procedure

  1. Drag the Frequent Itemset Mining task onto the stage.

  2. Connect a Data Manager task, which contains the attributes from which you want to extract the associations, to the new task.

  3. Double click the Frequent Itemset Mining task. The left-hand pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.

  4. Configure the Basic options, as described below.

  5. Click on the Advanced tab to configure the frequent itemsets advanced options, as described below.

  6. Click on the Output tab to configure the output options, as described below.

  7. Save and compute the task.

Frequent Itemsets Mining Basic options

Name

Description

Order key attributes

Drag and drop the nominal attributes which define orders from the Attributes list onto this list. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

Item child attributes

Drag and drop the nominal attributes which characterize items from the Attributes list onto this list. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

Item parent attributes

Drag and drop the nominal attributes which correspond to the hierarchically superior level of the attribute inserted in the Item child attributes list. For example, if the analysis involves EAN codes and categories, the EAN code is dragged and dropped onto the Item child attributes list, while the category is inserted in the same position of the Item parent attributes list. If the parent item attribute is not defined for any child instances (i.e. an EAN is not categorized), the child attribute value is repeated in the parent attribute column. This list is enabled only if the Hierarchical item attributes option is selected. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

Hierarchical item attributes

If checked, this option denotes the existence of hierarchical attributes which characterize items, consequently enabling the Item parent attributes list.

Support count only for top-level attributes

If checked, this option modifies support computation so that only top-level attributes in a hierarchy are taken into account. If this option is not checked, for every order, all included elements of the hierarchy increment their support by 1.

Minimum item support (# samples)

All items which appear in orders fewer times than this threshold are discarded.

This value is relevant only if the Auto (specify # items) option is unchecked.

Auto (specify # items)

If selected, the minimum support for items is automatically computed, and the number of items to be taken into account can be specified in the Items to consider spin box.

Items to consider

This option is enabled only if the Auto (specify # items) option is selected.

The number of items to take into account (most frequent first). 

Keep auto item support threshold also for itemsets

This option is enabled only if the Auto (specify # items) option is not selected.

If selected, all itemsets which occur fewer times than this threshold are discarded. 

Minimum itemset support (# samples)

This option is enabled only if the Auto (specify # items) option is not selected.

All itemsets which occur fewer times than this threshold are discarded.

Auto (above average)

If selected, the minimum itemset support value is set to the average support of itemsets with the same dimension.

Maximum itemset cardinality

The maximum cardinality of generated itemsets.

No maximum itemset cardinality

If selected, all itemsets with higher support than the specified threshold are generated, regardless of their cardinality.

Minimum number of different attributes involved in each itemset

Determines the minimum number of different attributes that have to be part of an itemset in order not to discard it.
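
As an illustration of the Auto (specify # items) and Items to consider options, the sketch below keeps the N most frequent items and reports the item support threshold this implies. The function name top_items, the tie-breaking rule (alphabetical order among items with equal support) and the sample data are assumptions made for this example only; Rulex may resolve ties differently.

```python
from collections import Counter


def top_items(rows, n_items):
    """Hypothetical helper: keep the n_items most frequent items.

    rows: iterable of (order_id, item) pairs.
    Returns (kept_items, implied_minimum_item_support).
    """
    # Count each item once per order, even if the pair is repeated.
    counts = Counter(item for _, item in set(rows))

    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_items]

    # The implied threshold is the support of the last item that is kept.
    threshold = ranked[-1][1] if ranked else 0
    return [item for item, _ in ranked], threshold


# Made-up sample data.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
    (4, "bread"), (4, "milk"),
]
print(top_items(rows, 2))  # (['milk', 'bread'], 3)
```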

Frequent Itemsets Mining Advanced options

Attribute to filter to select rows including relevant data

Drag and drop attributes to this edit box (from the Available attributes, the Order key attributes, the Item key attributes or the Auxiliary attributes list) to specify a filtering criterion.

Items satisfying this criterion are not discarded, regardless of their support.

Attribute to filter to discard rows including irrelevant data

Drag and drop attributes to this edit box (from the Available attributes, the Order key attributes, the Item key attributes or the Auxiliary attributes list) to specify a filtering criterion.

Items which appear in rows satisfying this criterion are discarded, regardless of their support. If both the selecting and the discarding filters are specified, the discarding filter prevails.

Maximum factor per auxiliary attribute adjusting support

Specify the maximum factor by which the support of items and associations may be multiplied or divided, according to the average value of their auxiliary attribute(s).

Auxiliary attributes

Auxiliary attributes are used to take additional criteria into account (together with support) when filtering itemsets. For instance, it is possible to keep items and itemsets whose support is low if their margin is greater than average and, symmetrically, to discard itemsets with high support if their margin is lower than average.

Drag and drop attributes (from the Available attributes list):

  • In the first target list, those attributes where a high value is more relevant, for example the margin.

  • In the second target list, those attributes where a low value is more relevant, for example an attribute containing the number of days elapsed since the target transactions, when the most recent transactions are of primary interest.

Drag and drop the attributes whose overall quantity you want to calculate onto the Item quantities target list.

Frequent Itemsets Mining Output options

Flag maximal frequent itemsets

If selected, a column is added to the table which specifies whether each frequent itemset is maximal, i.e. whether it is not included within any other frequent itemset.

Rare itemsets mining

If selected, the output will display rare itemsets instead of frequent itemsets. Rare itemsets are groupings of items that are rarely found together, although they may be frequent individually.

Maximum itemset support

This threshold value indicates the maximum number of times the items in an itemset can be found together in order to be considered rare.

Maximum relative support for itemsets

The support value compares the number of times the item appears with and without the other item in the rare itemset.
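
To make the Flag maximal frequent itemsets option concrete, the sketch below marks an itemset as maximal when no other frequent itemset strictly contains it. The helper name flag_maximal and the sample itemsets are hypothetical; this is a plain illustration of the definition, not the Rulex implementation.

```python
def flag_maximal(frequent_itemsets):
    """Hypothetical helper: flag the maximal frequent itemsets.

    frequent_itemsets: iterable of item collections, all assumed frequent.
    Returns {frozenset(items): True if no other frequent itemset
    strictly contains it, else False}.
    """
    sets = [frozenset(s) for s in frequent_itemsets]
    # "s < other" is True only when s is a proper subset of other.
    return {s: not any(s < other for other in sets) for s in sets}


# Made-up sample data.
frequent = [
    {"bread"}, {"eggs"}, {"milk"},
    {"bread", "milk"}, {"eggs", "milk"},
]
flags = flag_maximal(frequent)
for itemset in sorted(flags, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), flags[itemset])
# ['bread'] False
# ['eggs'] False
# ['milk'] False
# ['bread', 'milk'] True
# ['eggs', 'milk'] True
```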


Example

The following example uses the Groceries dataset.

  • After importing the dataset (remember to set the Get names from line option to 0), link a Data Manager task to the import task and open it.

  • Add a new attribute column to the dataset, called ORDER_ID, then select it and populate it with the values resulting from the enum() function.

  • Then, set the type to nominal. Save and compute the task.

  • The current format of the dataset is not suitable for the Frequent Itemsets Mining task, as each row represents a full transaction and not a single purchase. The dataset must be restructured so that the information concerning a purchase of n items is distributed over n rows, each one including an Order ID/Item ID pairing.

The dataset can be restructured by adding a Reshape To Long task to the flow.

  • Double-click the Reshape To Long task and drag all the attributes from the left, apart from the ORDER_ID attribute, onto the Attributes to be transformed in long format target list.

  • Save and compute the task.

  • Right-click the Reshape To Long task and select Take a look to check the new structure.

  • The dataset is now structured with a row for every single purchase.

  • Now add a Frequent Itemsets Mining task, and configure the task as follows:

    • Drag the ORDER_ID attribute onto the Order key attributes target list.

    • Drag the Wide_1 attribute onto the Item child attributes target list.

    • Select the Auto (specify # items) checkbox.

    • Set the Items to consider to 50, to evaluate the 50 top-selling items.

    • Set the Maximum itemset cardinality to 3.

  • Save and compute the task.

The resulting itemsets are displayed in the Frequent itemsets tab.
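
The restructuring performed in this example, from one row per transaction to one row per purchase, can be sketched outside Rulex in a few lines of Python. The function name reshape_to_long, the numbering of orders from 1 and the skipping of empty cells are assumptions made for illustration; they are not a description of how the enum() function or the Reshape To Long task behave internally.

```python
def reshape_to_long(wide_rows):
    """Hypothetical helper: one (ORDER_ID, item) pair per purchased item.

    wide_rows: list of rows, each row listing the items of one transaction
    (possibly padded with empty strings).
    """
    long_rows = []
    # Number the transactions, starting from 1 (an assumption).
    for order_id, row in enumerate(wide_rows, start=1):
        for cell in row:
            if cell:  # skip empty padding cells (an assumption)
                # The order ID is stored as text, like a nominal attribute.
                long_rows.append((str(order_id), cell))
    return long_rows


# Made-up sample data in wide format: one row per transaction.
wide = [
    ["citrus fruit", "margarine", ""],
    ["yogurt", "coffee", "tropical fruit"],
]
for pair in reshape_to_long(wide):
    print(pair)
# ('1', 'citrus fruit')
# ('1', 'margarine')
# ('2', 'yogurt')
# ('2', 'coffee')
# ('2', 'tropical fruit')
```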