Discretizing Data

Discretization transforms continuous data by defining a set of cutoffs that subdivide a continuous domain into a finite set of homogeneous intervals.

The points in each interval should have a high probability of belonging to the same class. These intervals increase the effectiveness of data in the creation of predictive models.


Prerequisites


Procedure

  1. Drag the Discretize task onto the stage.

  2. Connect a task that contains the attributes you want to transform to the Discretize task.

  3. Double-click the Discretize task. The left-hand side of the pane lists all the available attributes in the dataset, which can be ordered and searched as required.

  4. Configure the options, as described in the table below.

  5. Save and compute the task.

Discretize options

Name

Description

Use previous cutoffs to discretize dataset

If selected, the cutoffs defined in an upstream Discretize task will be used to discretize the new data, instead of defining new cutoffs. This is useful when you want data to be discretized in the same way at various points of the workflow.
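The idea of reusing cutoffs can be sketched outside the product as follows. This is only an illustration, not the task's actual implementation: the function name `apply_cutoffs`, the cutoff values, and the rule that a value equal to a cutoff falls into the lower interval are all assumptions.

```python
from bisect import bisect_left


def apply_cutoffs(values, cutoffs):
    """Map each value to the index of the interval it falls into.

    Assumption: a value equal to a cutoff belongs to the lower interval.
    """
    ordered = sorted(cutoffs)
    return [bisect_left(ordered, v) for v in values]


# Hypothetical cutoffs, computed once on the training data ...
saved_cutoffs = [25.5, 40.5, 60.5]

# ... and reused unchanged on new data, so both sets share the same intervals.
training_bins = apply_cutoffs([18, 30, 45, 70], saved_cutoffs)
new_bins = apply_cutoffs([22, 41, 65], saved_cutoffs)
```

Because the cutoffs are fixed, a given value is always mapped to the same interval, whichever dataset it appears in.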

Attributes to discretize

Drag and drop the ordered attributes you want to transform from the Available attributes list.

Method for discretization

Select the method you want to use from the Method for discretization drop-down list. Possible values are:

  • Attribute Driven Incremental Discretization (0) (default choice): it is a top-down method that recursively adds separation points (cutoffs) for each discrete or continuous attribute. The method is designed to obtain a complete separation of the points of the training set, i.e. the discretization process must not generate ambiguities. This method is supervised and requires an output attribute.

  • ChiMerge (2): a bottom-up, chi-square-based technique that iteratively merges adjacent intervals according to a statistical measure of their similarity. This method is supervised and requires an output attribute.

  • Entropy (1): this top-down method recursively adds cutoffs according to a measure, based on entropy, of the information gain achieved by splitting an interval in two. This method is supervised and requires an output attribute.

  • Equal width (3): creates intervals of the same amplitude regardless of the output value. This method is unsupervised, and does not require an output value.

  • Equal frequency (4): creates intervals containing the same number of patterns regardless of the output value. This method is unsupervised, and does not require an output value.

  • Roc Curve (5): uses the ROC curve to find the best cutoff. This method is supervised and requires an output attribute.
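To make the entropy criterion in the list above more concrete, here is a minimal sketch of a single top-down step: choosing the one cutoff that maximizes information gain. It is not the product's implementation. The function names are hypothetical, candidate cutoffs are assumed to be midpoints between consecutive distinct values, and the recursion and stopping rule of a full method are omitted.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_entropy_cutoff(values, labels):
    """Return (cutoff, information_gain) for the best single binary split.

    Candidate cutoffs are midpoints between consecutive distinct values.
    Returns (None, 0.0) if no split improves on the unsplit interval.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    all_labels = [label for _, label in pairs]
    base = entropy(all_labels)
    best_cutoff, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cutoff cannot separate equal values
        left, right = all_labels[:i], all_labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cutoff, best_gain


# Toy data: the two classes are perfectly separated between 3 and 10.
cutoff, gain = best_entropy_cutoff([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

A recursive method would apply the same step again to each of the two resulting intervals until some stopping condition is met.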

The Attribute Driven Incremental Discretization method usually gives the best performance but may be quite time-consuming on large training sets. The Entropy method is usually faster but may generate some ambiguities, which can compromise the accuracy of any subsequent analysis.
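The two unsupervised methods are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions, not the product's code: the function names are hypothetical, and placing each equal-frequency cutoff at the midpoint between two neighbouring sorted values is one possible convention among several.

```python
def equal_width_cutoffs(values, n_intervals):
    """Cutoffs that split [min, max] into n_intervals of equal amplitude."""
    low, high = min(values), max(values)
    step = (high - low) / n_intervals
    return [low + step * i for i in range(1, n_intervals)]


def equal_frequency_cutoffs(values, n_intervals):
    """Cutoffs that put roughly the same number of patterns in each interval.

    Each cutoff is the midpoint between the two sorted values on either side
    of a quantile position; repeated values can make group sizes uneven.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n_intervals > n:
        raise ValueError("more intervals requested than values available")
    cutoffs = []
    for i in range(1, n_intervals):
        pos = i * n // n_intervals
        cutoffs.append((ordered[pos - 1] + ordered[pos]) / 2)
    return cutoffs


# Toy data with one outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
width = equal_width_cutoffs(data, 3)
freq = equal_frequency_cutoffs(data, 2)
```

On this toy data the outlier stretches the equal-width intervals so that nine of the ten values fall into the first one, whereas the single equal-frequency cutoff splits the values five and five.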

 

Minimum distance between different classes

Specifies the minimum distance that must be kept between two patterns of different classes, as a percentage of the total number of attributes. This distance is computed as the number of attributes whose values differ in the two patterns. The minimum and default distance is one. If you select 100%, all the attributes of each pair of patterns from different classes must differ. This is not always possible, since many attributes can have the same value in the starting data; in this case the method uses the available separations.
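The distance described here is a count of differing attributes, which can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the patterns are made-up discretized values, and rounding the percentage up to a whole number of attributes is an assumption, since the rounding rule is not stated.

```python
from math import ceil


def pattern_distance(a, b):
    """Number of attributes whose (discretized) values differ in a and b."""
    return sum(1 for x, y in zip(a, b) if x != y)


def required_distance(percentage, n_attributes):
    """Translate a percentage of attributes into an attribute count (at least 1).

    Assumption: the result is rounded up.
    """
    return max(1, ceil(percentage / 100 * n_attributes))


def violating_pairs(patterns, labels, min_distance):
    """Index pairs of different-class patterns closer than min_distance."""
    return [
        (i, j)
        for i in range(len(patterns))
        for j in range(i + 1, len(patterns))
        if labels[i] != labels[j]
        and pattern_distance(patterns[i], patterns[j]) < min_distance
    ]


# Made-up patterns, already expressed as interval indexes per attribute.
patterns = [(0, 1, 2), (0, 1, 3), (1, 0, 2)]
labels = ["yes", "no", "no"]
pairs = violating_pairs(patterns, labels, min_distance=2)
```

Here only the first two patterns violate a minimum distance of two: they belong to different classes but differ in a single attribute, so an additional cutoff would be needed to separate them further.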

Output attribute

Select the output attribute to be used for discretization from the drop-down list. Output attributes are mandatory for supervised methods.

Number of patterns used for discretization

Specifies how many patterns will be used. This option allows you to use only a randomly selected subset of the training set, which is particularly useful when there is a large amount of data, as a high number of patterns considerably slows down the discretization process. The default value of -1 means that all patterns will be used.

Number of values for ordered variables

Specifies the number of cutoffs to be inserted for each variable, which must not exceed the number of values available in the training set. The number of cutoffs must at least ensure that the minimum distance between different classes can be guaranteed.

Preselect best cutoffs

If selected, the most promising cutoffs will be selected and employed in the subsequent phase. This consequently reduces the number of possible cutoffs to be analyzed. This works particularly well coupled with the Attribute Driven Incremental Discretization method.

Aggregate data before processing

If selected, identical patterns will be aggregated and considered as a single pattern during the discretization phase.
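Aggregation of identical patterns can be pictured with the short sketch below. It is an illustration rather than the product's behaviour: the function name and the sample rows are hypothetical, and keeping a count next to each distinct pattern is an assumption about how the aggregated data might be represented.

```python
from collections import Counter


def aggregate_patterns(patterns):
    """Collapse identical patterns, keeping a count for each distinct one."""
    counts = Counter(tuple(p) for p in patterns)
    return list(counts.items())


# Made-up rows: three of the four are identical.
rows = [(39, "Private"), (50, "Self-emp"), (39, "Private"), (39, "Private")]
aggregated = aggregate_patterns(rows)
```

The four rows collapse into two distinct patterns, so any later step that compares patterns pairwise has fewer items to process.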

Discretize output

If selected, the output attribute will be discretized. This option is available if you selected a discrete (e.g. integer) or continuous output attribute. You can then set the following in the Discretization method for output and Number of cutoffs for output options:

  • the discretization method, which can either be Equal Frequency, to create intervals that contain the same number of patterns (up to border effects), or Equal Width, to create intervals of the same amplitude.

  • the number of cutoffs to be created for the output. The default is 10, whereas 0 means that all possible cutoffs will be inserted.

Discretization method for output

Select the method you want to use to discretize the output. This option is available only if you have selected the Discretize output option.

Possible methods are:

  • Equal Frequency (1), to create intervals that contain the same number of patterns (up to border effects).

  • Equal Width (0), to create intervals of the same amplitude.

Number of cutoffs for output

Select the number of cutoffs to be created when discretizing output values. The default is 10, whereas 0 means that all possible cutoffs will be inserted.

This option is available only if you have selected the Discretize output option.


Example

The following example uses the Adult dataset.

Description

  • After importing the dataset, right-click the task and select Take a look to check that all the data has been imported correctly.

  • The original dataset is made up of 32561 records, and the age attribute includes almost all the possible integer values between 17 and 90.

  • We want to group all these possible values into 5 groups of equal frequency.

  • Drag a Discretize task onto the stage and link it to the import task. Double-click it to open it. Specify the following:

    • Attributes to discretize: age

    • Method for discretization: Equal frequency

    • Number of values for ordered variables: 5 (in order to create 5 separate groups).

  • Save and compute the task.

  • To visualize the discretization results, add a Data Manager to the flow and link it to the Discretize task.

  • In the Sheets tab, drag the age attribute onto the Var_1 area and choose the Values, frequencies and quantiles statistic from the drop-down list.

  • Here you can see the five groups that have been created, with their assigned average values, and the number of rows belonging to each group.
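For readers who want to reproduce the idea of this example outside the product, the sketch below builds five equal-frequency groups and reports the number of rows and the average value of each. It makes several assumptions: the ages are randomly generated stand-ins, not the real Adult data, and the cut points come from Python's `statistics.quantiles`, whose interpolation rule may differ from the one the Discretize task uses.

```python
import random
from bisect import bisect_left
from statistics import mean, quantiles

random.seed(0)
# Synthetic stand-in for the age column: NOT the real Adult data.
ages = [random.randint(17, 90) for _ in range(1000)]

# quantiles(..., n=5) returns the 4 cut points that separate 5 groups
# of (approximately) equal frequency.
cutoffs = quantiles(ages, n=5)

# Assign each age to the group whose interval contains it.
groups = {}
for age in ages:
    groups.setdefault(bisect_left(cutoffs, age), []).append(age)

# For each group, print its index, its number of rows and its average age.
for index in sorted(groups):
    members = groups[index]
    print(index, len(members), round(mean(members), 1))
```

Because age takes only integer values, many rows share the same value, so the groups are close to equal in size but not necessarily identical.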