Identifying the Principal Components in Datasets with PCA
The Principal Component Analysis task identifies the most important components in a dataset and consequently reduces the number of attributes the dataset contains. These components correspond to linear combinations of the original attributes that capture most of the variance in the data. Principal Component Analysis essentially compresses a large amount of data into a smaller number of attributes that capture the essence of the original data. To put it simply, think of our TVs, which show us 3D people and places flattened into 2D viewing. Although a dimension is missing, we don’t lose much detail.
The first new “attribute” (called an eigenvector) represents the maximum variation in the data, the second eigenvector represents the second largest amount of variation, and so on. In the Principal Component Analysis task in Rulex you can select how many eigenvectors you want to create in your new compressed dataset.
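The idea can be sketched in plain NumPy: center the data, compute the covariance matrix, take its eigenvectors sorted by decreasing eigenvalue, and project the data onto the top few. This is a generic illustration of PCA, not Rulex's internal implementation, and the array values are made up.

```python
import numpy as np

# Toy dataset: 6 samples, 3 correlated attributes (values are illustrative).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.7],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 0.4],
              [2.3, 2.7, 0.8]])

Xc = X - X.mean(axis=0)                  # center each attribute
cov = np.cov(Xc, rowvar=False)           # covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-decomposition
order = np.argsort(eigvals)[::-1]        # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top 2 eigenvectors: 3 attributes compressed into 2 components.
Z = Xc @ eigvecs[:, :2]
print(Z.shape)                           # (6, 2)
print(eigvals / eigvals.sum())           # fraction of variance per component
```

The printed variance fractions show how much of the original variation each new component retains, which is what guides the choice of how many eigenvectors to keep.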
This function is extremely useful when dealing with datasets that have a very high number of attributes, for example to prepare large datasets for tasks such as clustering, neural networks and linear fit. The technique can also help to avoid overfitting in rules: when there are very many attributes, rules can become overly precise and articulated on the training set, and consequently fail to produce good results when applied to new data.
However, eigenvectors do not represent a single aspect of the original dataset, such as age or occupation, but a combination of these. Consequently, the task does not generate rules that are immediately understandable and explainable to humans. It is possible to analyze how much each original attribute influenced the eigenvectors in the rules, but this method is rather approximate and not particularly reliable. It would not make much sense, for example, to use this task with the Logic Learning Machine algorithm in Rulex.
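The approximate attribute-influence analysis mentioned above is typically done by inspecting the eigenvector coefficients (the "loadings"). The sketch below assumes standardized data and hypothetical attribute names; it illustrates the general technique rather than Rulex's own analysis.

```python
import numpy as np

# Hypothetical attribute names and toy data (illustrative only).
names = ["age", "income", "tenure"]
X = np.array([[25, 30_000, 1.0],
              [40, 52_000, 8.0],
              [35, 48_000, 5.0],
              [50, 70_000, 12.0],
              [28, 35_000, 2.0]], dtype=float)

# Standardize so attributes on very different scales are comparable.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
first = eigvecs[:, np.argmax(eigvals)]   # eigenvector with largest variance

# The absolute coefficients ("loadings") suggest each attribute's influence
# on the first component -- indicative only, as noted above.
for name, w in zip(names, np.abs(first)):
    print(f"{name}: {w:.2f}")
```

Because every component mixes all attributes, even a dominant loading only hints at influence, which is why the text warns against relying on this for explainability.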
Consequently, if you need to explain decisions, avoid using the Principal Component Analysis task in your flow.
Prerequisites
you must have created a flow;
the required datasets must have been imported into the flow.
Procedure
Drag the Principal Component Analysis task onto the stage.
Connect a task containing the data you want to analyze to the new task.
Double click the Principal Component Analysis task.
Configure the task options as described in the table below.
Save and compute the task.
| Parameter Name | Description |
|---|---|
| Use previous eigenvectors for Principal Component Analysis execution | If selected, the eigenvectors defined in the upstream PCA task will be used to create the required number of principal components. |
| Attributes for principal component analysis | Drag and drop here the attributes you want to use in the principal component analysis. Principal Component Analysis cannot be performed on nominal values. |
| Method for distance evaluation | The method you want to use to compute distances between samples. The distance is computed as the combination of the distances for each attribute. Possible options are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized) and Pearson. |
| Normalization | The type of normalization you want to use with ordered variables. Possible options are: None, Attribute, Normal, Minmax [0,1] and Minmax [-1,1]. |
| Aggregate data before processing | If selected, identical patterns will be aggregated and considered as a single pattern during the principal component analysis. |
| Minimum number of final components (0 means no minimum) | The minimum number of final components the resulting dataset must contain. |
| Minimum level of confidence for the resulting dataset (0 means no minimum) | The minimum level of confidence the resulting dataset must have. If reaching this confidence level yields fewer components than specified in the Minimum number of final components option, further components are added (raising the confidence level) until the minimum number of components is also reached. |
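The interplay between the last two options can be pictured as choosing the smallest number of components whose cumulative explained variance reaches a threshold, and then enforcing a floor on that count. This is a hypothetical sketch of the logic; Rulex's exact semantics for "confidence" may differ, and the variance fractions and thresholds below are made up.

```python
import numpy as np

# Illustrative per-component variance fractions (descending), summing to 1.
explained = np.array([0.55, 0.25, 0.10, 0.06, 0.04])

def components_needed(explained, min_confidence, min_components=0):
    """Smallest k whose cumulative variance reaches min_confidence,
    but never fewer than min_components (mirrors the two options above)."""
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, min_confidence) + 1)
    return max(k, min_components)

# Cumulative variance: 0.55, 0.80, 0.90, 0.96, 1.00.
print(components_needed(explained, 0.95))                    # 4
print(components_needed(explained, 0.50, min_components=3))  # 3
```

In the second call, a confidence of 0.50 would be satisfied by a single component, but the minimum-components floor forces three, which in turn raises the achieved confidence, matching the behavior described in the table.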