Statistical functions in the Factory

Statistical functions perform statistical operations on the selected columns.

They are useful if you need to add the operations performed in the statistics manager directly to your dataset.

Each formula requires certain parameters, which are specified on the corresponding pages.

Click on the function’s name to see how to correctly use it:

Rulex Platform formula

Function

Formula

Description

anovap

anovap(column, attclass, group, usemissing)

Returns the ANOVA p-value, which is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis being tested is true. Values above 0.05 (the conventional value for alpha) mean that the null hypothesis cannot be rejected, while values below 0.05 mean that the null hypothesis is rejected in favor of the alternative one.

anovat

anovat(column, attclass, group, usemissing)

Returns the value of the ANOVA test statistic.

argMax

argMax(column, group)

Returns the ID of the row that contains the maximum value of a selected attribute.

argMin

argMin(column, group)

Returns the ID of the row that contains the minimum value of a selected attribute.

chisquare

chisquare(column1, column2, group, usemissing)

Returns the chi-square statistic, which measures the level of association between two nominal variables. The higher the value, the stronger the association between the selected nominal attributes.

chisquarep

chisquarep(column1, column2, group, usemissing)

Tests the null hypothesis, which assumes the variables are totally independent, against the alternative hypothesis suggested by the data, to evaluate how reliable the observed association is. The result is the p-value, which is the lowest significance level at which the null hypothesis would be rejected. Values range between 0 and 1: values below 0.05 (alpha) indicate that there may actually be an association between the variables, whereas higher values indicate that the results are probably due to chance and consequently cannot be considered reliable. This value is particularly important when the dataset has a limited number of samples.

cohenk

cohenk(column1, column2, group, usemissing)

Computes Cohen’s kappa coefficient between two columns. It is commonly used to compare real and predicted values when evaluating model performance, taking into account the probability of agreement by pure chance.

count

count(group)

Returns either the overall number of values present in an attribute, or the number of times each distinct value is present.

countIf

countIf(condition, group)

Returns the number of values in an attribute that meet a specified condition.

covariance

covariance(column1, column2, group)

Returns the covariance between two columns, which measures how changes in one variable are associated with changes in a second variable.

cumMax

cumMax(column, group)

Returns the cumulative maximum of the column, which is the greatest value among the current value of the column and all the previous values of the same column, evaluated within groups defined by the group parameter, if required.

cumMin

cumMin(column, group)

Returns the cumulative minimum of the column, which is the lowest value among the current value of the column and all the previous values of the same column, evaluated within groups defined by the group parameter, if required.

distinct

distinct(column, group)

Returns the number of distinct values of the column, evaluated within groups defined in the group parameter, if required.

entropy

entropy(column, group, usemissing)

Returns the entropy of the column.

fact

fact(column)

Returns the factorial of the values of the column.

gini

gini(column, group, usemissing)

Returns the Gini index of the column, evaluated within groups defined by the group parameter, if required.

inIqr

inIqr(column, coeff)

Identifies outliers: for each observation, it checks whether the value lies within the interquartile range or not. It returns a column of binary True/False values according to the result of this check.

max

max(column, group)

Returns the maximum of the column.

max2

max2(column1, column2)

Returns the greater of the values in two columns.

maxyoudencut

maxyoudencut(column, attclass, defclass, group)

Returns the value that maximizes the Youden index of the ROC curve defined by column and by the class attclass. The computation can be performed according to the groups defined in the group parameter, if required.

mean

mean(column, group)

Returns the mean of the column.

median

median(column, group)

Returns the median of the column.

min

min(column, group)

Returns the minimum of the column.

min2

min2(column1, column2)

Returns the lesser of the values in two columns.

mode

mode(column, group, usemissing)

Returns the mode of the column.

movMean

movMean(column, lag, group, front)

Returns the moving average of the column, evaluated over lag consecutive rows and computed within groups defined by the group parameter, if required.

pearson

pearson(column1, column2, group)

Returns the Pearson coefficient between two columns, evaluated within groups defined by the group parameter if required.

quantile

quantile(column, quant, group, weights)

Returns the specified quantile of the column, evaluated within groups defined by the group parameter, if required. A column of weights can also be defined. Quantiles are cut points that divide the range of a probability distribution into intervals with equal probabilities.

roc

roc(column, attclass, defclass, group)

Measures the relationship between a continuous variable and a binary target variable. It calculates a performance indicator, the AUC, which is the area under the ROC curve defined by column and attclass.

If more than two values are present in the class attribute, its default value can be specified through the optional parameter defclass. The whole computation can be performed according to the groups defined in the group parameter.

std

std(column, group)

Returns the standard deviation of the column, evaluated within groups defined by the group parameter, if required. The standard deviation is the square root of the variance.

variance

variance(column, group)

Returns the variance of the column, evaluated within groups defined by the group parameter, if required. The variance is a measure of dispersion, which indicates how far a set of values is spread out from their average value.

Parameters in bold are mandatory.
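
To make some of the less obvious descriptions in the table more concrete, the following is a minimal Python sketch of three of the ideas: a cumulative maximum evaluated within groups, an interquartile-range check, and a cut-off that maximizes the Youden index. It is not Rulex code and does not reproduce how Rulex computes these functions. The names cum_max, in_iqr and max_youden_cut are hypothetical, and two behaviours are assumptions made only for illustration: coeff widens the accepted range by a multiple of the interquartile range, and a row is treated as positive when its score is greater than or equal to the cut-off.

```python
from statistics import quantiles


def cum_max(values, groups=None):
    """Running maximum of values; restarts per group key when groups is given."""
    best = {}  # group key -> greatest value seen so far in that group
    out = []
    for i, v in enumerate(values):
        key = groups[i] if groups is not None else None
        best[key] = v if key not in best else max(best[key], v)
        out.append(best[key])
    return out


def in_iqr(values, coeff=1.5):
    """True where a value lies inside [Q1 - coeff*IQR, Q3 + coeff*IQR].

    ASSUMPTION: the role given to coeff here (widening the range by a
    multiple of the IQR) is illustrative and not taken from Rulex.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = coeff * (q3 - q1)
    return [q1 - spread <= v <= q3 + spread for v in values]


def max_youden_cut(scores, labels, positive):
    """Cut-off on scores that maximizes J = sensitivity + specificity - 1.

    ASSUMPTION: a row is predicted positive when score >= cut.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == positive)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y != positive)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


print(cum_max([3, 1, 2, 5, 4]))                             # [3, 3, 3, 5, 5]
print(cum_max([3, 1, 2, 5, 4], ["a", "b", "a", "b", "b"]))  # [3, 1, 3, 5, 5]
print(in_iqr([1, 2, 3, 4, 5, 6, 7, 8, 100]))                # only the last value is False
print(max_youden_cut([0.1, 0.2, 0.35, 0.8, 0.9], [0, 0, 0, 1, 1], positive=1))  # (0.8, 1.0)
```

In the grouped call, each row is compared only with earlier rows that share its group key, which is the sense in which the table says a function is evaluated within groups defined by the group parameter.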