Statistical functions in the Factory
Statistical functions perform statistical operations on the selected columns.
They are useful if you need to add the operations performed in the statistics manager directly to your dataset.
Each formula requires certain parameters, which are specified on the corresponding pages.
Click on a function's name to see how to use it correctly:
Rulex Platform formula | Description |
---|---|
anovap(column, attclass, group, usemissing) | Returns the ANOVA p-value, which is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis being tested is true. Values above 0.05 (the conventional value for alpha) mean that the null hypothesis cannot be rejected, while values below 0.05 mean that the null hypothesis is rejected in favor of the alternative one. |
anovat(column, attclass, group, usemissing) | Returns the value of the ANOVA test statistic. |
argMax(column, group) | Returns the ID of the row that contains the maximum value of a selected attribute. |
argMin(column, group) | Returns the ID of the row that contains the minimum value of a selected attribute. |
chisquare(column1, column2, group, usemissing) | Returns the chi-square statistic, which measures the level of association between two nominal variables. The higher the value, the stronger the association between the selected nominal attributes. |
chisquarep(column1, column2, group, usemissing) | Returns the p-value of the chi-square test. The test compares the null hypothesis, which assumes the variables are totally independent, with the results obtained by analysing the data (the alternative hypothesis), to evaluate the reliability of the association. The p-value is the lowest level of significance at which the null hypothesis would be rejected. Values range between 0 and 1: values below 0.05 (alpha) indicate that there may be a real association between the variables, whereas higher values indicate that the results are probably due to chance and consequently cannot be considered reliable. This value is particularly important when the dataset has a limited number of samples. |
cohenk(column1, column2, group, usemissing) | Returns Cohen's kappa coefficient between two columns. It is commonly used to compare real and predicted values when evaluating model performance, taking into account the probability of agreement by pure chance. |
count(group) | Returns either the overall number of values present in an attribute or the number of times each distinct value is present. |
countIf(condition, group) | Returns the number of times values that meet a specified condition are present in an attribute. |
covariance(column1, column2, group) | Measures how changes in one variable are associated with changes in a second variable. |
cumMax(column, group) | Returns the cumulative maximum of the column, which is the greatest value among the current value of the column and the previous values of the same column, evaluated within groups defined by the group parameter, if required. |
cumMin(column, group) | Returns the cumulative minimum of the column, which is the lowest value among the current value of the column and the previous values of the same column, evaluated within groups defined by the group parameter, if required. |
distinct(column, group) | Returns the number of distinct values of the column, evaluated within groups defined by the group parameter, if required. |
entropy(column, group, usemissing) | Returns the entropy of the column. |
fact(column) | Returns the factorial of the values of the column. |
gini(column, group, usemissing) | Returns the Gini index of the column, evaluated within groups defined by the group parameter, if required. |
inIqr(column, coeff) | Isolates outliers: for each observation, it identifies whether the value falls within the interquartile range or not. It returns a column of binary True/False values. |
max(column, group) | Returns the maximum of the column. |
max2(column1, column2) | Returns the maximum value between two columns. |
maxyoudencut(column, attclass, defclass, group) | Returns the value that maximizes the Youden index of the ROC curve defined by column and by the class attclass. The computation can be performed within the groups defined by the group parameter, if required. |
mean(column, group) | Returns the mean of the column. |
median(column, group) | Returns the median of the column. |
min(column, group) | Returns the minimum of the column. |
min2(column1, column2) | Returns the minimum value between two columns. |
mode(column, group, usemissing) | Returns the mode of the column. |
movMean(column, lag, group, front) | Returns the moving average of the column, evaluated over lag consecutive rows and computed within groups defined by the group parameter, if required. |
pearson(column1, column2, group) | Returns the Pearson coefficient between two columns, evaluated within groups defined by the group parameter, if required. |
quantile(column, quant, group, weights) | Returns the specified quantile of the column, evaluated within groups defined by the group parameter, if required. A column of weights can also be defined. Quantiles are cut points dividing the range of a probability distribution into intervals with equal probabilities. |
roc(column, attclass, defclass, group) | Measures the relationship between a continuous variable and a binary target variable. It returns a performance indicator, the AUC, which is the area under the ROC curve defined by column and attclass. If more than two values are present in the class attribute, the default class can be specified through the optional defclass parameter. The computation can be performed within the groups defined by the group parameter, if required. |
std(column, group) | Returns the standard deviation of the column, evaluated within groups defined by the group parameter, if required. The standard deviation is the square root of the variance. |
variance(column, group) | Returns the variance of the column, evaluated within groups defined by the group parameter, if required. The variance is a measure of dispersion, which indicates how far a set of values is spread from its average value. |
Parameters in bold are mandatory.
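
To make two of the descriptions in the table more concrete, here is a minimal Python sketch of what cumMax (with an optional group) and maxyoudencut compute conceptually. It is not Rulex code and does not use the Rulex formula syntax. The function names `cum_max` and `max_youden_cut`, the plain-list inputs, and the rule "predict the positive class when the value is greater than or equal to the cut" are assumptions made for illustration. The platform may treat ties, missing values, and the defclass parameter differently.

```python
def cum_max(values, groups=None):
    """Running maximum of `values`, restarted for each distinct group key.

    Hypothetical stand-in for the idea behind cumMax(column, group).
    """
    if groups is None:
        groups = [None] * len(values)  # treat all rows as a single group
    best = {}  # group key -> largest value seen so far in that group
    out = []
    for value, key in zip(values, groups):
        if key not in best or value > best[key]:
            best[key] = value
        out.append(best[key])
    return out


def max_youden_cut(scores, labels, positive):
    """Return (cut, J) where the cut maximizes Youden's J = TPR - FPR.

    Assumption: a row is predicted positive when its score is >= cut.
    Hypothetical stand-in for the idea behind maxyoudencut.
    """
    pos = sum(1 for label in labels if label == positive)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(scores)):  # every observed value is a candidate
        tp = sum(1 for s, l in zip(scores, labels) if s >= cut and l == positive)
        fp = sum(1 for s, l in zip(scores, labels) if s >= cut and l != positive)
        j = tp / pos - fp / neg  # sensitivity + specificity - 1
        if j > best_j:  # strict comparison keeps the lowest cut on ties
            best_cut, best_j = cut, j
    return best_cut, best_j


if __name__ == "__main__":
    print(cum_max([3, 1, 2, 5, 4]))                             # [3, 3, 3, 5, 5]
    print(cum_max([3, 1, 2, 5, 4], ["a", "a", "b", "a", "b"]))  # [3, 3, 2, 5, 4]
    print(max_youden_cut([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1], positive=1))  # (0.35, 0.5)
```

In the grouped call, the running maximum restarts for each group key, which is how the group parameter is described throughout the table. In the last call, two cuts reach the same Youden index (0.35 and 0.8); the sketch keeps the lower one, which is an arbitrary tie-breaking choice.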