Identifying and Managing Outliers

Outliers are anomalous data samples whose values differ markedly from the typical values in the data set.

Outliers can seriously degrade predictive models by distorting the final result. In other situations the outliers themselves are the focus of the analysis, for example in anomaly detection.


Identifying outliers

Outliers can be identified in various ways:

  • by plotting box plots in the Data Manager, which give a visual representation of the values. The box displays the middle values of a variable, while the whiskers extend to three standard deviations above and below the mean. Any points beyond the whiskers are considered outliers and are shown individually.

  • by calculating single statistics in the Data Manager, where you can calculate a threshold for detecting outliers. The threshold is three standard deviations below or above the mean (the Lower whisker for boxplot and Upper whisker for boxplot parameters, respectively).
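
Outside the Data Manager, the same threshold can be reproduced in a few lines. The following is a minimal Python sketch, not the product's own implementation: the function names are hypothetical, and it assumes the sample standard deviation is used, which the text above does not specify.

```python
import statistics

def three_sigma_bounds(values):
    """Return (lower, upper) thresholds at mean -/+ 3 standard deviations.

    Assumption: sample standard deviation (statistics.stdev).
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean - 3 * std, mean + 3 * std

def find_outliers(values):
    """Return the values that fall outside the three-sigma thresholds."""
    lower, upper = three_sigma_bounds(values)
    return [v for v in values if v < lower or v > upper]

# Illustrative data: 19 values close to 10 and one extreme value.
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 100]
lower, upper = three_sigma_bounds(values)
outliers = find_outliers(values)
```

Note that in a very small sample a single extreme value inflates the standard deviation so much that it may never exceed the three-sigma threshold, so this check is more reliable on larger data sets.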

 

Dealing with outliers

How identified outliers are handled can have a big impact on the model, so the choice must be considered carefully.

There are two main ways to manage outliers:

  • the entire row containing the outlier can be removed

  • the specific value can be substituted with a normal value.
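
The two options above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the column name and the thresholds are hypothetical, and the median is used as the substitute value, which is one possible choice of "normal" value rather than a prescribed one.

```python
import statistics

def remove_outlier_rows(rows, column, lower, upper):
    """Option 1: drop every row whose value in `column` is outside the thresholds."""
    return [dict(row) for row in rows if lower <= row[column] <= upper]

def replace_outliers(rows, column, lower, upper):
    """Option 2: keep every row, but replace outlying values in `column`.

    Assumption: the median of the column is used as the substitute value.
    """
    typical = statistics.median(row[column] for row in rows)
    return [
        dict(row) if lower <= row[column] <= upper else {**row, column: typical}
        for row in rows
    ]

# Illustrative data set with one implausible age.
rows = [
    {"id": 1, "age": 31},
    {"id": 2, "age": 29},
    {"id": 3, "age": 35},
    {"id": 4, "age": 240},
]
removed = remove_outlier_rows(rows, "age", lower=0, upper=120)
replaced = replace_outliers(rows, "age", lower=0, upper=120)
```

Removing rows discards the other attributes of the affected samples, while substitution keeps them at the cost of introducing an artificial value. Both functions return new lists and leave the original rows unchanged.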

 

When creating a predictive model with the Classification Logic Learning Machine task you can decide whether or not to include outliers in the creation of rules.