Statistical functions in the Factory

Statistical functions perform statistical operations on the selected columns.

They are useful if you need to add the operations performed in the statistics manager directly to your dataset.

Each formula requires certain parameters, which are described on the corresponding pages.

Click on the function’s name to see how to correctly use it:

Rulex Platform formula
Function	Formula	Description
anovap	anovap(column, attclass, group, usemissing)	Returns the ANOVA p-value, which is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true. Values above 0.05 (the conventional value for alpha) mean that the null hypothesis cannot be rejected, while values below 0.05 mean that the null hypothesis is rejected in favor of the alternative hypothesis.
anovat	anovat(column, attclass, group, usemissing)	Returns the ANOVA test value.
argMax	argMax(column, group)	Returns the ID of the row that contains the maximum value of a selected attribute.
argMin	argMin(column, group)	Returns the ID of the row that contains the minimum value of a selected attribute.
chisquare	chisquare(column1, column2, group, usemissing)	Returns the level of correlation between two nominal variables. The higher the value, the stronger the correlation between the selected nominal attributes.
chisquarep	chisquarep(column1, column2, group, usemissing)	Tests the null hypothesis, which assumes the variables are totally independent, against the results obtained by analysing the data (the alternative hypothesis), to evaluate the reliability of the correlation. The result is the p-value, which is the lowest significance level at which the null hypothesis would be rejected. Values range between 0 and 1: values below 0.05 (alpha) indicate that there may be a genuine correlation between the variables, whereas higher values indicate that the results are probably due to chance and cannot be considered reliable. This value is particularly important when the dataset has a limited number of samples.
cohenk	cohenk(column1, column2, group, usemissing)	Computes Cohen's kappa coefficient between two columns. It is commonly used to compare real and predicted values when evaluating model performance, as it accounts for the probability of agreement by pure chance.
count	count(group)	Returns either the number of overall values present in an attribute, or the number of times each distinct value is present.
countIf	countIf(condition, group)	Returns the number of values in an attribute that meet a specified condition.
covariance	covariance(column1, column2, group)	Measures how changes in one variable are associated with changes in a second variable.
cumMax	cumMax(column, group)	Returns the cumulative maximum of the column, which is the greatest value among the current value of the column and all previous values of the same column, evaluated within groups defined by the group parameter, if required.
cumMin	cumMin(column, group)	Returns the cumulative minimum of the column, which is the lowest value among the current value of the column and all previous values of the same column, evaluated within groups defined by the group parameter, if required.
distinct	distinct(column, group)	Returns the number of distinct values of the column, evaluated within groups defined in the group parameter, if required.
entropy	entropy(column, group, usemissing)	Returns the entropy of the column.
fact	fact(column)	Returns the factorial of the values of the column.
gini	gini(column, group, usemissing)	Returns the Gini index of the column, evaluated within groups defined by the group parameter, if required.
inIqr	inIqr(column, coeff)	Isolates outliers: for each observation, it checks whether the value falls within the interquartile range. It returns a column of binary True/False values accordingly.
max	max(column, group)	Returns the maximum of the column.
max2	max2(column1, column2)	Returns the maximum value between two columns.
maxyoudencut	maxyoudencut(column, attclass, defclass, group)	Returns the value which maximizes the Youden index of the ROC curve defined by column and by the class attribute attclass. The computation can be performed within groups defined by the group parameter, if required.
mean	mean(column, group)	Returns the mean of the column.
median	median(column, group)	Returns the median of the column.
min	min(column, group)	Returns the minimum of the column.
min2	min2(column1, column2)	Returns the minimum value between two columns.
mode	mode(column, group, usemissing)	Returns the mode of the column.
movMean	movMean(column, lag, group, front)	Returns the moving average of the column, evaluated over lag consecutive rows and computed within groups defined by the group parameter, if required.
pearson	pearson(column1, column2, group)	Returns the Pearson coefficient between two columns, evaluated within groups defined by the group parameter if required.
quantile	quantile(column, quant, group, weights)	Returns the specified quantile of the column, evaluated within groups defined by the group parameter, if required. A column of weights can also be defined. Quantiles are cut points that divide the range of a probability distribution into intervals with equal probabilities.
roc	roc(column, attclass, defclass, group)	Evaluates the relationship between a continuous variable and a binary target variable by calculating a performance indicator, the AUC, which is the area under the ROC curve defined by column and attclass. The default value for the class attribute (if more than two values are present) can be specified through the optional defclass parameter. The computation can be performed within groups defined by the group parameter, if required.
std	std(column, group)	Returns the standard deviation of the column, evaluated within groups defined by the group parameter, if required. The standard deviation is the square root of the variance.
variance	variance(column, group)	Returns the variance of the column, evaluated within groups defined by the group parameter, if required. The variance is a measure of dispersion, which indicates how far a set of values is spread from its mean.

Parameters in bold are mandatory.
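
To make the row-wise functions in the table more concrete, here is a minimal Python sketch of what cumMax, movMean and inIqr compute. It is not the platform's implementation: the helper names (cum_max, mov_mean, in_iqr) are hypothetical, the trailing alignment of the moving window is an assumption, and so is the use of coeff to widen the interquartile range (Tukey-style fences). The group and front parameters are not modelled.

```python
from itertools import accumulate
from statistics import quantiles


def cum_max(values):
    """Running maximum: each output is the largest value seen so far."""
    return list(accumulate(values, max))


def mov_mean(values, lag):
    """Trailing moving average over up to `lag` consecutive rows.

    Assumption: the window ends at the current row and is shorter at the
    start of the series; the platform may align the window differently.
    """
    out = []
    for i in range(len(values)):
        window = values[max(0, i - lag + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def in_iqr(values, coeff=1.5):
    """Flag values inside [Q1 - coeff*IQR, Q3 + coeff*IQR].

    Assumption: `coeff` widens the interquartile range in this way, and
    quartiles are computed with linear interpolation ("inclusive" method).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - coeff * iqr, q3 + coeff * iqr
    return [low <= v <= high for v in values]


data = [3, 1, 4, 1, 5, 9, 2, 60]
print(cum_max(data))        # running maximum per row
print(mov_mean(data, 3))    # 3-row trailing average per row
print(in_iqr(data))         # False marks a candidate outlier
```

With this sample, only the last value (60) falls outside the widened interquartile range and is flagged False.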
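
The two model-evaluation entries in the table, cohenk and maxyoudencut, can be illustrated with a short Python sketch. The function names and signatures below are hypothetical and only mirror the textbook definitions: Cohen's kappa as (p_o - p_e) / (1 - p_e), and the Youden index as sensitivity + specificity - 1. The rule that a row counts as positive when its score is greater than or equal to the cut, and the choice of observed scores as candidate cuts, are assumptions; the platform may break ties or pick thresholds differently.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(actual)
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    count_a, count_p = Counter(actual), Counter(predicted)
    p_e = sum(count_a[label] * count_p[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def max_youden_cut(scores, labels, positive):
    """Return (cut, J) where the cut maximizes J = sensitivity + specificity - 1.

    Assumption: a row is predicted positive when score >= cut, and the
    first cut reaching the best J is kept. Both classes must be present.
    """
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == positive)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y != positive)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


actual = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
print(cohen_kappa(actual, predicted))

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(max_youden_cut(scores, labels, positive=1))
```

In the first example the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5, which shows how the coefficient discounts agreement that would occur by chance.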