Cleaning Datasets
The Fill/Clean task improves the quality of datasets by managing missing data in the Fill tab, and automatically removing unnecessary attributes in the Clean tab.
The task allows you to perform operations quickly on a data table, for example by dragging and dropping or filtering attributes to delete. This is particularly helpful when working with large amounts of data. These operations can also be performed in the Data Manager task, attribute by attribute, but it is more time-consuming.
Typical operations performed in this task are:
filling in missing values using:
fixed values,
min/max values, or
averages.
ignoring attributes:
which have too many missing values
where the mode ratio is too high (i.e. where a high percentage of values are identical)
which have too many different values, for example a different id for every row.
Prerequisites
Procedure
Drag the Fill/Clean task onto the stage.
Connect the task to the one which contains the data to work on.
Double click the Fill/Clean task. The left-hand pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.
Select the attributes that you want to keep in the dataset in the left hand list (keepnames); unselected tasks will be removed from the dataset after computation. Attributes can be checked/unchecked one by one or in groups by pressing the Ctrl or SHIFT key. Right-clicking this panel gives access to some additional options, such as checking/unchecking all attributes, inverting your selection, hiding ignored attributes, etc.
Configure the fill and/or clean options as described in the table below.
Save and compute the task.
Fill/Clean options | |
Option name | Description |
---|---|
Fill options | |
Attributes subject to filling operations | Drag and drop the attributes that can be modified by the selected filling operations. |
Attributes to define calculation groups | Drag and drop here the attributes that define sub-groups which will be used to calculate missing values. |
Impute missing values for NOMINAL attributes with: | If selected, you can define how missing nominal attributes will be filled, either by specifying a fixed value (nomimputetype = 0), and manually entering it (nomvalue), or by selecting how the value can be calculated, by calculating the mode of each attribute (nomimputetype = 1) according to the specified calculation groups. |
Impute missing values for ORDERED attributes with: | If selected, you can define how missing ordered attributes will be filled, either by specifying a fixed value (ordimptype = 0), and manually entering it (ordvalue), or by selecting how the value can be calculated, by selecting one of the available calculation types (nomimputetype) according to the specified calculation groups. Possible methods are mean (1), median (2), mode (3), maximum (4) or minimum (5). |
Impute missing values for TEMPORAL attributes with: | If selected, you can define how missing temporal attributes will be filled, either by specifying a fixed value (timeimptype = 0), and manually entering it (timevalue), or by selecting how the value can be calculated, by selecting one of the available calculation types (timeimputetype) according to the specified calculation groups. Possible methods are mean (1), median (2), mode (3), maximum (4) or minimum (5). |
Consider the whole row to compute MODE | If selected, all attributes, and not just those added to the Key attributes list, will be used to calculate the mode for missing attributes. This applies only to missing value calculations that use the mode method, as selected above in the Impute missing attributes options. |
Clean options | |
Attributes subject to cleaning operations | Drag and drop the attributes that can be modified by the selected cleaning operations. |
Attributes to be deleted | Drag and drop attributes you want to delete here. If you change you mind, just select the attribute and click delete and it will be removed from the edit box and go back to the Available attributes list. |
Remove attributes with missing ratio above (%) | If selected, removes attributes whose percentage of missing values exceeds the percentage specified for this option. If you want to ignore, but not delete these attributes, select the Ignore attributes instead of deleting them option (selected by default). |
Remove attributes with number of mode values above (%) | If selected, attributes where the mode ratio exceeds the specified value are removed. This may be useful, for example, where there is a high percentage of values are identical). If you want to ignore, but not delete these attributes, select the Ignore attributes instead of deleting them option (selected by default). |
Remove nominal attributes with number of values above: | If selected, attributes who have a higher number of values than the specified limit are removed. This limit removes attributes which have, for example, a different id for every row. If you want to ignore, but not delete these attributes, select the Ignore attributes instead of deleting them option. |
Standardize ordered attributes | If selected, ordered attributes will be normalized using the formula (x-mean)/stdev. For further information see https://en.wikipedia.org/wiki/Normalization_(statistics) → Standard Score. |
Ignore attributes instead of deleting them | If selected, attributes you chose to remove in the following options are ignored, and consequently not used to create the model, but will remain in the dataset (grayed out):
|
Example
The following example uses the melbourne-houses dataset.
Description | Screenshot |
---|---|
| |
| |
| |
|