disc function in the Factory

The disc function discretizes data values into ranges defined by cutoff values.


disc(column, cutoffs, rank)




The column parameter is the attribute whose values we want to discretize. It is mandatory.


The cutoffs parameter holds the cutoff values used to split the values into ranges. The cutoff points must be enclosed in square brackets and separated by commas, for example [18, 25, 31]. It is mandatory.


The rank parameter is optional and is False by default, in which case the central value of each range is displayed. To display a ranking number for each range instead, set rank to True.
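To make the idea of assigning values to ranges concrete, here is a minimal Python sketch. It is not the Factory's implementation: the helper name range_index is hypothetical, the numbering starts at 0 only for convenience, and the rule that a value equal to a cutoff falls into the upper range is an assumption, since the text above does not state how boundaries are handled.

```python
from bisect import bisect_right

def range_index(value, cutoffs):
    """Return the 0-based index of the range that `value` falls into.

    Hypothetical helper for illustration only.
    Assumption: a value equal to a cutoff is placed in the upper range.
    """
    # bisect_right counts how many cutoffs are less than or equal to value
    return bisect_right(sorted(cutoffs), value)

cutoffs = [18, 25, 31]
sample = [16.5, 22.0, 25.0, 29.9, 35.2]  # made-up values, not taken from any dataset
indices = [range_index(v, cutoffs) for v in sample]
print(indices)
```

Note that n cutoff values always produce n + 1 ranges, so the three cutoffs used here yield four possible indices.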


The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.



We have added a new attribute, called Disc_BMI, to the Age_BMI dataset to hold the discretized values of the BMI attribute.

In this new attribute, the three cutoff values 18, 25 and 31 divide the BMI values into 4 ranges (below 18, 18 to 25, 25 to 31, and above 31). The formula is disc($"BMI", [18, 25, 31]).

The resulting 4 groups display the central value for each range.
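As a rough illustration of what a central value per range could look like, the Python sketch below takes the midpoint of each range. This is an assumption, not the Factory's documented behavior: the text does not say how the center of the two open-ended ranges (below 18 and above 31) is computed, so the sketch needs explicit outer bounds. The function range_centers and the bounds 15 and 35 are hypothetical placeholders, not values from the Age_BMI dataset.

```python
def range_centers(cutoffs, lower, upper):
    """Return the midpoint of each range delimited by the cutoffs.

    Hypothetical helper for illustration only.
    Assumption: `lower` and `upper` close the two open-ended ranges.
    """
    edges = [lower] + sorted(cutoffs) + [upper]
    # Pair each edge with the next one and average them
    return [(a + b) / 2 for a, b in zip(edges, edges[1:])]

# 15 and 35 are placeholder outer bounds chosen for the example
centers = range_centers([18, 25, 31], lower=15, upper=35)
print(centers)
```

With these placeholder bounds, the four ranges get one center each; different outer bounds would change only the first and last centers.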

To display a ranking number for each range instead, we set the rank parameter to True rather than leaving its default value. The formula then becomes: disc($"BMI", [18, 25, 31], True).