discEqualWidth function in the Factory

The discEqualWidth function discretizes data values into bins of equal width.


Parameters

discEqualWidth(column, nvalue, rank, min, max)

Parameter

Description

column

The attribute whose values we want to discretize. The column parameter is mandatory.

nvalue

The number of bins you want to create. Cutoff values are generated automatically to divide the values into ranges of equal width. The nvalue parameter is mandatory.

rank

By default (False), the central value of each range is displayed. To display a ranking number for each range instead, set rank to True.

min

If required, you can set a minimum value for the binning range. It does not have to match the smallest value currently present in the dataset.

max

If required, you can set a maximum value for the binning range. It does not have to match the largest value currently present in the dataset.
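
To make the parameters concrete, the following Python sketch mimics the behavior described above. It is not the Factory implementation: the helper name disc_equal_width, the argument names min_value and max_value, the fallback to the data range when no minimum or maximum is given, and the treatment of values on a bin edge or outside the range are all assumptions made for illustration.

```python
def disc_equal_width(values, nvalue, rank=False, min_value=None, max_value=None):
    """Hypothetical stand-in for discEqualWidth, for illustration only."""
    if nvalue < 1:
        raise ValueError("nvalue must be at least 1")
    # Assumption: when no minimum/maximum is given, the data range is used.
    lo = min(values) if min_value is None else min_value
    hi = max(values) if max_value is None else max_value
    if hi <= lo:
        raise ValueError("max must be greater than min")
    width = (hi - lo) / nvalue
    result = []
    for v in values:
        # 0-based bin index. Assumption: the upper edge and any
        # out-of-range value are clamped into the first or last bin.
        idx = int((v - lo) * nvalue / (hi - lo))
        idx = max(0, min(idx, nvalue - 1))
        if rank:
            result.append(idx + 1)                   # ranking number, 1..nvalue
        else:
            result.append(lo + (idx + 0.5) * width)  # central value of the bin
    return result


ages = [32, 40, 61]  # made-up sample values
print(disc_equal_width(ages, 5, rank=True))              # [1, 2, 5]
print([round(c, 1) for c in disc_equal_width(ages, 5)])  # [34.9, 40.7, 58.1]
```

With rank left at False the sketch returns the unrounded center of each bin. How the Factory rounds the displayed value is not covered here.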


Example

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

In the Age_BMI dataset, we have added a new attribute, called Age_Disc, to contain the discretized values of the Age attribute.

In this new attribute we have divided the Age values into 5 different bins of equal width, with the formula discEqualWidth($"Age", 5).

The resulting 5 groups display the central value for each bin: 35, 41, 47, 52, 58.
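
These central values can be checked by hand. The Python sketch below assumes that the bins span the observed Age range of 32 to 61 (the range stated later in this example) and that centers are rounded half up to whole years. The rounding rule is our assumption, chosen because it reproduces the displayed values.

```python
from fractions import Fraction

lo, hi, nvalue = 32, 61, 5

# Exact bin width: (61 - 32) / 5 = 5.8
width = Fraction(hi - lo, nvalue)

# Center of bin i (0-based) is lo + (i + 1/2) * width
centers = [lo + (2 * i + 1) * width / 2 for i in range(nvalue)]
print([float(c) for c in centers])  # [34.9, 40.7, 46.5, 52.3, 58.1]

# Assumed display rule: round half up to a whole number
rounded = [int(c + Fraction(1, 2)) for c in centers]
print(rounded)  # [35, 41, 47, 52, 58]
```

Fractions are used instead of floats so that the midpoint 46.5 is represented exactly and rounds up as expected.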

If we want to display a ranking number for each range, we simply set the rank parameter to True, instead of leaving its default value. The formula will consequently be: discEqualWidth($"Age", 5, True).

The resulting 5 bins will now have a ranking number from 1 to 5.

We can change the range of values considered when defining bins by setting the min and max parameters.

For example, in our dataset the range of values for our Age attribute is from 32 to 61 years, but we would like to create bins that range from 18 years up to 75, as this is the potential age range of our survey. In this case, you can set the min parameter to 18, and the max to 75.

The formula in this case would be discEqualWidth($"Age", 5, False, 18, 75).

If you want to keep the default value for the rank parameter and set only the min and max parameters, you can omit rank. However, the remaining parameters can then no longer be identified by their position, because 18 would be interpreted as the value for rank. To get around this, name each parameter that follows the skipped one: discEqualWidth($"Age", 5, min=18, max=75).
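
The same positional versus named distinction exists in Python, so it can be illustrated there. The function describe_call below is a hypothetical stand-in that only reports how its arguments were bound. That Factory formulas bind arguments in the same way is an assumption based on the description above.

```python
def describe_call(column, nvalue, rank=False, min=None, max=None):
    """Hypothetical stand-in: reports how the optional arguments were bound."""
    # min and max shadow Python built-ins here, purely to mirror
    # the parameter names used in the formula.
    return {"rank": rank, "min": min, "max": max}


# Positional only: 18 lands in the rank slot and 75 in the min slot.
print(describe_call("Age", 5, 18, 75))
# {'rank': 18, 'min': 75, 'max': None}

# Named: rank keeps its default, min and max receive the intended values.
print(describe_call("Age", 5, min=18, max=75))
# {'rank': False, 'min': 18, 'max': 75}
```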

In our example, 5 bins have been created, but only 3 of them currently contain values: the lowest and highest bins (roughly 18 to 29 and 64 to 75) are empty.
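
The empty bins follow directly from the arithmetic, which can be sketched in Python as below. The three sample ages are hypothetical stand-ins that cover the observed range of 32 to 61, and the rule for assigning a value to a bin is an assumption.

```python
lo, hi, nvalue = 18, 75, 5

# Bin edges at steps of (75 - 18) / 5 = 11.4
edges = [round(lo + i * (hi - lo) / nvalue, 1) for i in range(nvalue + 1)]
print(edges)  # [18.0, 29.4, 40.8, 52.2, 63.6, 75.0]

ages = [32, 45, 61]  # hypothetical sample spanning the observed Age range

# 1-based bin number for each age, using integer arithmetic;
# the upper edge is clamped into the last bin.
occupied = sorted({min((a - lo) * nvalue // (hi - lo), nvalue - 1) + 1 for a in ages})
empty = [b for b in range(1, nvalue + 1) if b not in occupied]

print(occupied)  # [2, 3, 4]
print(empty)     # [1, 5]
```

Every age between 32 and 61 falls between the edges 29.4 and 63.6, so only the three middle bins can receive values.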