discretize function

The discretize function groups the values of a selected attribute into bins that have equal width, that contain the same number of values, or that are delimited by cutoff values you supply.

For more details on each specific discretization method, check out the specific topics:


Parameters

discretize(column, nvalue, cutoffs, mode, rank, quantile, min, max)

Parameter

Description

column

The attribute whose values we want to discretize. The column parameter is mandatory.

nvalue

The number of bins you want to create. This value can be any number, up to the number of rows in the dataset.

cutoffs

The cutoff values that will be used to discretize values into ranges. All cutoff points must be enclosed in square brackets, for example [20,25,30].

mode

The method you want to use to discretize values. Possible values are:

  • “ef” for Equal Frequency (default value)

  • “ew” for Equal Width

rank

By default, the central value of each range is displayed. To display a ranking number for each range instead, set rank to True. It is False by default.

quantile

If set to True, the values will be discretized in quantiles. By default, this parameter is set to False.

This option matters most when a dataset contains many identical values. Standard discretization puts identical values in a single bin and then discretizes the remaining values. With the quantile parameter set to True, a value that is found in a high percentage of rows can be spread over several bins, so multiple bins may have the same central value.

min

If required, you can set a minimum value, which does not have to correspond to a value currently present in the dataset.

max

If required, you can set a maximum value, which does not have to correspond to a value currently present in the dataset.
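As a rough illustration of how nvalue, rank, min and max could interact in Equal Width mode, here is a minimal Python sketch. It is not the product's implementation: the function name equal_width, the argument names min_value and max_value, the clamping of out-of-range values and the sample ages are all assumptions made for this example.

```python
def equal_width(values, nvalue, rank=False, min_value=None, max_value=None):
    """Assign each value to one of `nvalue` bins of equal width.

    Returns the central value of the bin, or its 1-based ranking
    number when rank=True. All names here are hypothetical.
    """
    lo = min(values) if min_value is None else min_value
    hi = max(values) if max_value is None else max_value
    if hi <= lo:
        raise ValueError("the maximum must be greater than the minimum")
    width = (hi - lo) / nvalue
    result = []
    for v in values:
        # Assumption: values at or beyond the bounds go to the first/last bin.
        idx = max(0, min(int((v - lo) / width), nvalue - 1))
        result.append(idx + 1 if rank else lo + (idx + 0.5) * width)
    return result


ages = [32, 35, 41, 47, 52, 60]  # made-up values, not taken from Age_BMI

centres = equal_width(ages, 4)                            # bounds taken from the data
ranks = equal_width(ages, 4, rank=True)                   # ranking numbers instead
widened = equal_width(ages, 4, min_value=20, max_value=60)  # explicit bounds
```

With bounds taken from the data, each bin is 7 units wide; setting min_value to 20 widens each bin to 10 units and shifts the central values accordingly.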


Example

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

In the Age_BMI dataset, we have added a new attribute, called Age_Disc, to contain the discretized values of the Age attribute.

In this new attribute we have divided the Age values into 5 different bins containing the same number of values (Equal Frequency), with the formula discretize($"Age", 5, mode="ef").

The resulting 5 bins display the central value for each bin: 35, 40, 46, 52, 58.
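The idea behind the Equal Frequency mode ("ef") can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name equal_frequency and the sample ages are hypothetical, the sketch returns ranking numbers rather than central values, and it does not treat identical values specially.

```python
def equal_frequency(values, nvalue):
    """Return a 1-based bin number per value so that the bins
    hold the same number of values (or as close as possible)."""
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for position, i in enumerate(order):
        # The k-th smallest value lands in bin floor(k * nvalue / n) + 1.
        bins[i] = position * nvalue // n + 1
    return bins


ages = [58, 32, 47, 35, 60, 41, 52, 44, 39, 55]  # made-up values
age_bins = equal_frequency(ages, 5)
```

With 10 values and 5 bins, each bin receives exactly 2 values, whatever the spacing between them.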

Given the high number of parameters in the function, it is important to use keywords to identify the parameters we are providing values for (e.g. mode="ef"), so we do not need to include them all in every formula.

In this example, we will create another attribute, called BMI_Disc, to display the discretized values of the BMI attribute.

This time we will use the Equal Width discretization method to create 4 bins with the same range of values.

The formula will consequently be: discretize($"BMI", 4, mode="ew").

The resulting 4 bins will display the central values for each bin: 19, 24, 30, 35.

In the final example, we will again divide the BMI attribute into 4 bins, using the BMI_Disc attribute to display the results, but this time using cutoff values that we will insert into the formula.

To create 4 bins we must enter three cutoff points, which in our example are 20, 25 and 30.

The formula will consequently be: discretize($"BMI", 4, [20,25,30]).

The results display the central value for each range created using the cutoff values we supplied.
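To show why three cutoff points produce 4 bins, here is a minimal Python sketch. The helper name bin_by_cutoffs and the sample BMI values are hypothetical, and the choice to place a value equal to a cutoff in the upper bin is an assumption, not documented behavior of discretize.

```python
from bisect import bisect_right


def bin_by_cutoffs(values, cutoffs):
    """Return a 1-based bin number per value; n cutoffs give n + 1 bins."""
    cutoffs = sorted(cutoffs)
    # bisect_right counts how many cutoffs are <= the value, so a value
    # equal to a cutoff is placed in the upper bin (an assumption).
    return [bisect_right(cutoffs, v) + 1 for v in values]


bmi = [18.5, 20.0, 24.9, 27.3, 30.0, 33.1]  # made-up values
bmi_bins = bin_by_cutoffs(bmi, [20, 25, 30])
```

The cutoffs 20, 25 and 30 split the value range into four intervals: below 20, from 20 up to 25, from 25 up to 30, and 30 or above.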