Using Label Clustering to Cluster Data
The Label Clustering task performs a clustering process according to the k-means approach after having aggregated and filtered data according to a subset of label variables.
The output of the task is a collection of clusters characterized by:
a (positive integer) index,
a central vector (centroid) and
a dispersion value measuring the normalized average distance of cluster members from the centroid.
In the following description of the three phases of the analysis, the list of values assumed by the label variables in a given pattern of the dataset is called a tag:
Data grouping: Examples in the training set characterized by the same tag are grouped together and considered as a single representative record. The mean (or median, or medoid, according to the option specified by the user) among the values of each profile attribute is computed and assigned to the corresponding variable in the unified record.
Data filtering: Representative records with profile variables which have undesired properties are discarded. Two filter conditions are presently implemented:
Minimum number q of occurrences: Records that do not derive from a group of at least q patterns of the training set with the same tag are removed, since they are not statistically representative.
Maximum dispersion coefficient σ: If the values of the profile variables of the group of patterns leading to a representative record present a dispersion coefficient (computed with respect to the desired central value) greater than σ, that record presents an irregular behavior that can deteriorate the results of the clustering procedure, and it is consequently discarded.
Data clustering: A k-means (or k-medians, or k-medoids, according to the option specified by the user) clustering algorithm is employed to aggregate representative records with similar profiles. The centroid of each cluster provides the values of the profile variables to be used in a subsequent Apply Model task when a new pattern is assigned to that cluster.
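The grouping and filtering phases can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the task's actual implementation: the function names, the dictionary layout and the sample records are hypothetical, the mean is used as the central value, and the dispersion is taken as the plain (non-normalized) average Euclidean distance from the group center, since the exact normalization is not specified here.

```python
from collections import defaultdict
from statistics import mean

def group_by_tag(records, label_keys, profile_keys):
    """Group records sharing the same tag into one representative record."""
    groups = defaultdict(list)
    for rec in records:
        tag = tuple(rec[k] for k in label_keys)           # values of the label variables
        groups[tag].append([rec[k] for k in profile_keys])
    representatives = {}
    for tag, rows in groups.items():
        # Central value: mean of each profile attribute (median/medoid not shown).
        center = [mean(col) for col in zip(*rows)]
        # Assumed dispersion: average Euclidean distance of members from the center.
        disp = mean(
            sum((x - c) ** 2 for x, c in zip(row, center)) ** 0.5 for row in rows
        )
        representatives[tag] = {"center": center, "count": len(rows), "disp": disp}
    return representatives

def filter_representatives(representatives, min_occurrences, max_dispersion):
    """Keep records built from at least q patterns with dispersion at most sigma."""
    return {
        tag: rep
        for tag, rep in representatives.items()
        if rep["count"] >= min_occurrences and rep["disp"] <= max_dispersion
    }

# Hypothetical data: one label variable ("day") and two profile variables.
records = [
    {"day": "Mon", "h1": 1.0, "h2": 2.0},
    {"day": "Mon", "h1": 3.0, "h2": 4.0},
    {"day": "Tue", "h1": 10.0, "h2": 10.0},
]
reps = group_by_tag(records, ["day"], ["h1", "h2"])
kept = filter_representatives(reps, min_occurrences=2, max_dispersion=5.0)
```

In this sketch the tag ("Tue",) is discarded because it derives from a single pattern, while ("Mon",) survives with center [2.0, 3.0]. The surviving representative records are what the clustering phase would then receive as input.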
Label clustering can also be employed for the solution of signal prediction problems, where the behavior of a mono-dimensional output signal (described by the profile variables) has to be predicted starting from a set of label attributes.
you must have created a flow;
the required datasets must have been imported into the flow;
the data used for the analysis must have been well prepared, and the final dataset must contain profile and label variables.
To preserve generality a profile attribute can also be a label variable.
If nominal profile attributes are considered, the k-modes variant is adopted to allow their treatment. Optionally, a variable with the cluster id role can be included in the dataset, providing the initial cluster assignment for each pattern.
If a weight attribute is present, its values are employed as a measure of relevance for each example, thus affecting the position of the cluster centroid.
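One plausible way in which weights can affect the centroid position is a weighted mean, sketched below. This is an assumption made for illustration: the text above only states that weights act as a measure of relevance, and the function name weighted_centroid is hypothetical.

```python
def weighted_centroid(points, weights):
    """Weighted mean of the points, one coordinate at a time."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dims = len(points[0])
    return [
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dims)
    ]

points = [[0.0, 0.0], [4.0, 2.0]]
unweighted = weighted_centroid(points, [1.0, 1.0])  # equal relevance
weighted = weighted_centroid(points, [1.0, 3.0])    # second example counts three times
```

With equal weights the centroid lies midway between the two points; raising the weight of the second point pulls the centroid toward it.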
The results of the task can be viewed in three separate tabs:
The Clusters tab displays a spreadsheet with the values of the profile attributes for the centroids of created clusters, together with the number of elements and the dispersion coefficient (given by the normalized average distance of cluster members from the centroid) for each of them. In particular, the cluster, nelem and disp columns respectively contain the index of the cluster, the number of elements and the dispersion coefficient. Since several tags may be included in the same cluster, the characteristic values of each cluster may appear in more than one row of the spreadsheet displayed in the Clusters tab. By sorting on the cluster column we can easily retrieve all the tags belonging to each cluster.
The last row, characterized by a null index in the column cluster, reports the values pertaining to the default cluster, obtained by including all the elements of the training set in a single group. To point out the generality of this special cluster, all the values in its tag are set to missing.
The Results tab displays a summary of the performed calculation, including:
the execution time,
the number of valid training samples,
the average weight of training samples,
the number of distinct tags in the training set,
the average, minimum and maximum dispersion coefficient for these tags,
the number of tags present in only one training sample and their average weight,
the number of clusters built,
the average dispersion of clusters,
the dispersion coefficient of the default cluster,
the minimum and the maximum number of points in clusters,
the number of singleton clusters, i.e. clusters containing only one point of the training set.
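To make the quantities reported in the two tabs more concrete, the following sketch computes the cluster, nelem and disp values for a given assignment of points to clusters, plus the default cluster and the number of singleton clusters. It is a simplified illustration under stated assumptions: the dispersion is the plain average Euclidean distance from the centroid (the normalization applied by the task is not reproduced), the null index of the default cluster is represented as 0, and the function name summarize is hypothetical.

```python
from math import dist
from statistics import mean

def summarize(points, assignment):
    """Build one summary row per cluster; assignment[i] is the index of points[i]."""
    clusters = {}
    for point, index in zip(points, assignment):
        clusters.setdefault(index, []).append(point)
    rows = []
    for index in sorted(clusters):
        members = clusters[index]
        centroid = [mean(col) for col in zip(*members)]
        rows.append({
            "cluster": index,
            "nelem": len(members),
            "disp": mean(dist(m, centroid) for m in members),
            "centroid": centroid,
        })
    # Default cluster: all points in a single group (index 0 is an assumption).
    centroid = [mean(col) for col in zip(*points)]
    rows.append({
        "cluster": 0,
        "nelem": len(points),
        "disp": mean(dist(p, centroid) for p in points),
        "centroid": centroid,
    })
    return rows

points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
rows = summarize(points, [1, 1, 2])
singletons = sum(1 for r in rows if r["cluster"] != 0 and r["nelem"] == 1)
```

Comparing the dispersion of the default cluster with the average dispersion of the built clusters gives a rough idea of how much the clustering has reduced the spread of the data.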
Drag the Label Clustering task onto the stage.
Connect a Split Data task, which contains the attributes you want to cluster, to the new task.
Double click the Label Clustering task.
Configure the attributes described in the table below.
Save and compute the task.
Label K-Means Clustering options
Attributes to consider for clustering
Drag and drop the attributes that will be used as profile attributes in the clustering computation.
Drag and drop the attributes that will be considered as labels in the clustering computation.
Three different approaches for computing cluster centroids are available:
Three different clustering algorithms are available:
Distance method for clustering
The method employed for computing distances between examples.
Possible methods are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized), Pearson.
Details on these methods are provided in the Distance parameter of the Managing Attribute Properties page.
Distance method for evaluation
Select the method used for computing distances, from the possible values: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized), Pearson.
For details on these methods see the Managing Attribute Properties page.
Normalization for ordered variables
Type of normalization adopted when treating ordered (discrete or continuous) variables.
Every attribute can have its own value for this option, which can be set in the Data Manager. Details on these options are provided in the Distance parameter of the Managing Attribute Properties page.
These choices are preserved if Attribute is selected in the present menu; every other value (e.g. Normal) supersedes previous selections for the current task.
Initial assignment for clusters
Select the procedure to be adopted for the initial assignment of points to clusters; it may be one of the following:
(Optional) attribute for initial cluster assignment
Optionally select a specific attribute from the drop-down list, which will be used as an initial cluster assignment.
(Optional) attribute for weights
Optionally select an attribute from the drop-down list, which will be used as a weight in the clustering process.
Number of clusters to be generated
The required number of clusters. The number of clusters cannot exceed the number of different examples in the training set.
Number of executions
Number of subsequent executions of the clustering process (to be used in conjunction with Random as the Initial assignment for clusters option); the best result among them is retained.
Maximum number of iterations
Maximum number of iterations of the k-means algorithm inside each execution of the clustering process.
Minimum decrease in error value
The error value corresponds to the average distance of each point from the respective centroid.
This value, measured at each iteration, should gradually decrease. When the error decrease value (i.e. the difference in error between the current and previous iteration) falls below the threshold specified here, the clustering process stops immediately since it is supposed that no further significant changes in error will occur.
Minimum number of occurrences
Minimum number of examples in the training set that must be characterized by a given tag so that it passes the filtering phase.
Maximum dispersion coefficient
If the profile attribute values present a dispersion coefficient (computed with respect to the desired central value) greater than the value entered here, the record presents an irregular behavior that can deteriorate the results of the clustering procedure, and is consequently discarded.
Initialize random generator with seed
If checked, the positive integer shown in the box is used as an initial seed for the random generator; with this choice two runs of the task with the same options produce identical results.
Keep attribute roles after clustering
If selected, roles defined in the clustering task (such as profile, labels, weight and cluster id) will be maintained in subsequent tasks in the process.
Filter patterns before clustering
If selected, data is filtered according to the minimum number of occurrences and the maximum dispersion coefficient; otherwise, all the representative records are considered in the clustering process.
Aggregate data before processing
If checked, identical patterns in the training set are considered as a single point in the clustering process.
Additional attributes produced by previous tasks are maintained at the end of the present one, rather than being overwritten.
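The interplay between several of the options above (number of clusters, number of executions, maximum number of iterations, minimum decrease in error value and random seed) can be illustrated with a minimal k-means sketch. It is not the task's implementation: the function names and default values are hypothetical, only the Euclidean distance and a random initial choice of centroids are shown, and weights, filtering and the k-medians and k-medoids variants are omitted.

```python
import random
from math import dist
from statistics import mean

def kmeans_once(points, k, max_iter, min_decrease, rng):
    """One execution: returns (error, assignment, centroids)."""
    centroids = rng.sample(points, k)  # random initial centroids (one possible choice)
    prev_error = float("inf")
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
        # Error: average distance of each point from its centroid.
        error = mean(dist(p, centroids[a]) for p, a in zip(points, assign))
        if prev_error - error < min_decrease:
            break  # the error no longer decreases significantly
        prev_error = error
    return error, assign, centroids

def kmeans(points, k, executions=5, max_iter=100, min_decrease=1e-6, seed=None):
    """Repeat the clustering and retain the execution with the lowest error."""
    if k > len({tuple(p) for p in points}):
        raise ValueError("k cannot exceed the number of distinct points")
    rng = random.Random(seed)  # a fixed seed makes runs reproducible
    runs = (kmeans_once(points, k, max_iter, min_decrease, rng) for _ in range(executions))
    return min(runs, key=lambda run: run[0])

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
error, assign, centroids = kmeans(points, k=2, seed=42)
```

Calling kmeans twice with the same seed and options returns the same result, which mirrors the behavior described for the Initialize random generator with seed option; without a seed, different executions may start from different initial centroids.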
The following example uses the Adult dataset.