distinct function

The distinct function returns the number of distinct values of the column, evaluated within groups defined by the group parameter if required.

You can use this function in case you need to fill an attribute with the number of distinct values of another one.


Function and parameters

distinct(column, group,usemissing)

Parameter

Description

column

It identifies the column to which you want to apply the formula. The column parameter is mandatory.

group

It allows you to group the results by a certain column.

usemissing

Flag used to define if missing values have to be considered or not in distinct values. Default is False.


Example

The following example uses the HR employee attrition dataset.

Description

Screenshot

  • In this example, we want to count the number of distinct values of the YearsAtCompany attribute.

  • We write the following formula:

  • distinct($"YearsAtCompany")

  • The number of distinct values for the YearsAtCompany attribute is 37.

  • We want to group our results by the JobRole attribute. So we add to the formula the group parameter, and the formula will be:

  • distinct($"YearsAtCompany",$"JobRole")

  • The results will be:

    • For the Sales Executive role, there are 25 distinct values for YearsAtCompany.

    • For the Research Scientist role, there are 19 distinct values for YearsAtCompany.