Formulas and Functions in the Factory

Formulas and functions allow you to enrich your dataset by creating new attributes, and retrieve information on your attributes.

Function categories

 Arrays

Arrays in the Data Manager

Rulex Platform formula

Function

Formula

Description

enum

enum(group)

Enumerates the rows in a dataset, in ascending order, optionally grouped by a selected attribute.

fillDown

fillDown(column, group, fillall)

Returns a copy of the specified column, filling all the missing values with the last valid value, optionally grouped by selected attributes.

If there are rows which remain empty, as they do not have a previous valid value, a fillUp operation can then be performed, using the next valid value.

fillLinear

fillLinear(column, group)

Fills any missing values for a specified attribute with a value, based on the other values present in the attribute, using a linear interpolation method.

The linear interpolation links two adjacent values with a straight line, graphically: this means that the gap is filled with values with equal distance one to the other.

fillUp

fillLinear(column, group)

Returns a copy of the selected column, filling all the missing values with the next valid value, optionally grouped by selected attributes.

If there are rows which remain empty, as they do not have a next valid value, a fillDown operation can then be performed, using the previous valid value.

integer

integer(column)

Returns the values of the specified attribute, cast to the integer data type.

len

len(column)

Returns the number of values present in the attribute, including missing values.

matchHeaders

matchHeaders(column)

Returns the value in the corresponding row for the selected column.

nominal

nominal(column)

Returns the values of the specified attribute cast to the nominal data type.

perm

perm(column)

Returns the values of a selected attribute in a random order.

rank

rank(column, usemissing)

Returns the values of a column in ascending alphabetic order. By default missing values are considered in the ranking.

shift

shift(column, shift, group, cyclic)

Returns the attribute column shifted by the shift value. The shift can be performed according to the groups defined in the group parameter.

Parameters in bold are mandatory.

 Logical

Logical functions in the Data Manager

Rulex Platform formula

Function

Formula

Description

ifelse

ifelse(condition, iftrue, iffalse)

Checks whether a specific condition is true, and returns the value of iftrue if the condition is true, otherwise it returns the value of iffalse if the condition is false.

ifNone

ifNone(column, condition, iftrue)

The ifNone function fills the attribute’s None values, which satisfy the specified condition, with the desired value. If the condition is not satisfied, the cell is left empty.

isDate

isDate(string, binary)

Checks whether the data type of the attribute specified in the string parameter is date.

isDatetime

isDatetime(string, binary)

Checks whether the data type of the attribute specified in the string parameter is datetime.

isFloat

isFloat(string, binary)

Checks whether the data type of the attribute specified in the string parameter is continuous.

isInteger

isInteger(string, binary)

Checks whether the data type of the attribute specified in the string parameter is integer.

isMonth

isMonth(string, binary)

Checks whether the data type of the attribute specified in the string parameter is month.

isQuarter

isQuarter(string, binary)

Checks whether the data type of the attribute specified in the string parameter is quarter.

isTime

isTime(string, binary)

Checks whether the data type of the attribute specified in the string parameter is time.

isType

isType(string, type, binary)

Checks whether the data type of the attribute specified in the string parameter corresponds to the data type specified in the type parameter.

isWeek

isWeek(string, binary)

Checks whether the data type of the attribute specified in the string parameter is week.

Parameters in bold are mandatory.

 Statistical

Statistical functions in the Data Manager

Rulex Platform formula

Function

Formula

Description

anovap

anovap(column, attclass, group, usemissing)

Returns the ANOVA p value, which is the probability to obtain a worst case compared to the null hypothesis we are verifying. Values above 0.05 (i.e. the conventional value for alpha) state that we can’t reject the null hypothesis, while values below 0.05 state that we need to reject the null hypothesis and consider the alternative one.

anovat

anovat(column, attclass, group, usemissing)

Returns the ANOVA test value.

argMax

argMax(column, group)

Returns the ID of the row that contains the maximum value of a selected attribute.

argMin

argMin(column, group)

Returns the ID of the row that contains the minimum value of a selected attribute.

chisquare

chisquare(column1, column2, group, usemissing)

Returns the level of correlation between two nominal variables. The higher the value, the stronger the correlation between the selected nominal attributes.

chisquarep

chisquarep(column1, column2, group, usemissing)

Compares the null hypothesis, which assumes the variables are totally independent, to the results obtained by analysing the data (the alternative hypothesis), to evaluate the reliability of the correlation. The result is the p-value coefficient, which indicates the lowest level of significance at which the null hypothesis of the coefficient would be rejected. Values can range between 0 and 1, where low values below 0.05 (alpha) indicate that there may effectively be a correlation between the variables, whereas higher values indicate that the results are probably due simply to chance, and cannot consequently be considered reliable. This value is particularly important to consider when the dataset has a limited number of samples.

cohenk

cohenk(column1, column2, group, usemissing)

Applies the Cohen’s kappa coefficient to compare values. It is commonly used to compare real and predicted values to evaluate model performance, considering the probability of agreement by pure chance.

count

count(group)

Returns either the number of overall values present in an attribute, or the number of times each distinct value is present.

countIf

countIf(condition, group)

Returns the number of times a distinct value, which meets a specified condition, is present in an attribute.

covariance

covariance(column1, column2, group)

Measures how changes in one variable are associated with changes in a second variable.

cumMax

cumMax(column, group)

Returns the cumulative maximum of the column, which is the greatest value between the current value of the column and the previous values of the same column, evaluated within groups defined by the group parameter if required.

cumMin

cumMin(column, group)

Returns the cumulative minimum of the column, which is the lowest value between the current value of the column and the previous values of the same column, evaluated within groups defined by the group parameter if required.

distinct

distinct(column, group)

Returns the number of distinct values of the column, evaluated within groups defined in the group parameter, if required.

entropy

entropy(column, group, usemissing)

Returns the entropy of the column.

fact

fact(column)

Returns the factorial of the values of the column.

gini

gini(column, group, usemissing)

Returns the Gini index of the column, evaluated within groups defined by the group parameter, if required.

inIqr

inIqr(column, coeff)

Isolates outliers: for each data observation, it identifies whether it is in the interquartile deviation or not. It returns the column with a binary True/False value according to the interquartile range.

max

max(column, group)

Returns the maximum of the column.

max2

max2(column1, column2)

Returns the maximum value between two columns.

maxyoudencut

maxyoudencut(column, attclass, defclass, group)

Returns the value which maximizes the youden index of the ROC curve defined by column1 and by the class attclass. The computation can be performed according to the groups defined in the group parameter, if required.

mean

mean(column, group)

Returns the mean of the column.

median

median(column, group)

Returns the median of the column.

min

min(column, group)

Returns the minimum of the column.

min2

min2(column1, column2)

Returns the minimum value between two columns.

mode

mode(column, group, usemissing)

Returns the mode of the column.

movMean

movMean(column, lag, group, front)

Returns the moving average of the column, evaluated on the lag continuous rows, computed according to groups defined by the group parameter if required.

pearson

pearson(column1, column2, group)

Returns the Pearson coefficient between two columns, evaluated within groups defined by the group parameter if required.

quantile

quantile(column, quant, group, weights)

Returns the specified quantile of the column, evaluated within groups defined by the group parameter if required. A column of weights can also be defined. Quantiles are cut points dividing a range of probability distribution into intervals with equal probabilities.

roc

roc(column, attclass, defclass, group)

Returns the correlation between a continuous variable and a binary target variable. It calculates a performance indicator, the AUC, that is the area under the curve defined by the column and the attclass.

The default value for the class attribute (if more than two values are present) can be specified as the optional parameter defclass. All the computation can be performed according to the groups defined in the group parameter.

std

std(column, group)

Returns the standard deviation of the column, evaluated within groups defined by the group parameter, if required. The standard deviation is the square root of the variance.

variance

variance(column, group)

Returns the variance of the column, evaluated within groups defined by the group parameter, if required. The variance is a measure of dispersion, which displays how a set of values is far from their average value.

Parameters in bold are mandatory.

 Math and Trigonometry

Math and Trigonometry functions in the Data Manager

Rulex Platform formula

Function

Formula

Description

abs

abs(column)

Returns the absolute value of each value of the column.

acos

acos(column)

Returns the arccosine values of each value of the column.

acosh

acosh(column)

Returns the hyperbolic arccosine of each value of the column.

asin

asin(column)

Returns the arcsine of each value of the column.

asinh

asinh(column)

Returns the hyperbolic arcsine of each value of the column.

atan

atan(column)

Returns the arctangent of each value of the column.

atanh

atanh(column)

Returns the hyperbolic arctangent of each value of the column.

baseConv

baseConv(column, basein, baseout, compflagin, compflagout)

Converts a base 10 integer, or a string that corresponds to an integer, to a different base. Optional parameters allow the user to have a 2-complement code (if set to True) in the input and/or in the output value.

ceil

ceil(column)

Returns each value of the column rounded up to its next bigger integer unit.

cos

cos(column)

Returns the cosine of each row of the column.

cosh

cosh(column)

Returns the hyperbolic cosine of each row of the column.

cumProd

cumProd(column, group)

Returns the cumulative product of the column, evaluated within groups defined by the group parameter if required.

cumSum

cumSum(column, group)

Returns the cumulative sum of the column, evaluated within groups defined by the group parameter if required.

exp

exp(column)

Returns the exponential of each row of the column.

floor

floor(column)

Returns each value of the column rounded down to its next smaller integer unit.

isInteger

isInteger(string, binary)

Checks whether a string corresponds to an integer value.

log

log(column)

Returns the natural logarithm of each row of the column.

log10

log10(column)

Returns the logarithm (with respect to 10) of each row of the column.

prod

prod(column, group)

Returns the product of the column, evaluated within groups defined by the group parameter if required.

rand

rand(n, seed)

Returns a random column with the specified number of elements. If the number of elements is specified, a random column is created with n (n=number of examples) elements.

randGauss

randGauss(n, seed, mean, stddev)

Returns a normally distributed random column with the specified number of elements. If the number of elements is specified, a random column with n (=number of examples) elements is created.

round

round(column)

Returns the nearest integer value of each row of the column.

sign

sign(column)

Returns the sign of each row of the column.

sin

sin(column)

Returns the sine of each row of the column.

sinh

sinh(column)

Returns the hyperbolic sine of each row of the column.

sqrt

sqrt(column)

Returns the square root of each row of the column.

sum

sum(column, group)

Returns the sum of the column, evaluated within groups defined by the group parameter if required.

tan

tan(column)

Returns the tangent of each row of the column.

tanh

tanh(column)

Returns the hyperbolic tangent of each row of the column.

Parameters in bold are mandatory.

 Text

Text functions in the Data Manager

If you are using any of these functions with continuous attributes, please double check the Flow Execution Parameters, as you can set the number of digits to be displayed after the function’s computation.

Rulex Platform formula

Function

Formula

Description

charReplace

charReplace(column, oldchar, newchar, unchanged, charforothers, considersequence)

Replaces the current fonts with new ones.

distance

distance(column1, column2, method)

Computes the distance between the values of two columns, column1, column2, according to one of the following methods: "levenshtein" ("I"), "damerau-levenshtein" ("dl"), "lcs", "hamming".

find

find(column, value, binary, ischarlist, charpos)

Looks for a value in a column and returns True or False.

head

head(column, nchar)

Returns in each row of the result with the first n letters of the corresponding value contained in the column.

isPrefix

isPrefix(column, value, binary)

Checks whether a string is a prefix or not.

isSuffix

isSuffix(column, value, binary)

Checks whether a string is a suffix or not.

isWord

isWord(column, substring, delimiter, binary)

Checks whether a string (which can have a delimiter) is contained in an attribute or not.

numExt

numExt(column, onlyint, separator)

Returns a string containing only the numerical fonts of the input string. If more than one number is present, numbers are delimited by a separator decided by the user (by default "-").

pad

pad(column, len, value, where)

Returns in each row of the result, the values of the column, filled (padded) with the padstring value to reach the specified length. The string can be added at the beginning (where = "begin" or by default) or at the end (where = "end") of the string, according to the value of the parameter where.

phonetic

phonetic(column, component)

Returns the phonetic encoding of the strings contained in the column using the Metaphone algorithm. Phonetic may return the primary Metaphone component (component = "Primary" or component = "P") or the secondary component (component = "Secondary" or component = "S"). By default the primary component is returned.

prefix

prefix(column, value, last)

Considers the chosen value as prefix and returns the subsequent fonts.

replace

replace(column, oldvalue, newvalue, ntimes)

Replaces the current strings of the values in the column with the new ones.

strip

strip(column, value, where, ischarlist)

Returns the value without the specified characters or list of characters located at the beginning, at the end or on both sides of the value.

suffix

suffix(column, value, last)

Considers the chosen value as suffix and returns the preceding characters.

tail

tail(column, nchar)

Returns the last n letters of the corresponding value in the column.

textConcat

textConcat(column, separator, group)

Returns the concatenation of all the strings in a column.

textExtract

textExtract(column, startpos, endpos)

Returns the string ranging from a defined starting position to defined ending position.

textFormat

textFormat(column)

Returns the type of the strings in each row of the column.

textLen

textLen(column)

Returns the length of the string contained in each row of the column.

textLower

textLower(column, mode, leaveother)

Changes uppercase fonts of a nominal attribute to lowercase fonts.

textSort

textSort(column, ascending)

Sorts in ascending order the strings contained in each cell of a nominal attribute.

textUpper

textUpper(column, mode, leaveother)

Changes lowercase fonts of a nominal attribute to uppercase fonts.

Parameters in bold are mandatory.

 Date/Time

Date and Time functions in the Data Manager

Rulex Platform formula

Function

Formula

Description

addMonth

addMonth(date, nmonth)

Adds a given number of months to a date attribute.

addQuarter

addQuarter(date, nquarter)

Adds a given number of quarters to a date attribute.

addWorkingDays

addWorkingDays(date, nday)

Adds a given number of working days (excluding weekends) to a date attribute.

addYear

addYear(date, nyear)

Adds a given number of years to a date attribute.

currDate

currDate(utc)

Returns the current date according to local or UTC settings.

currDatetime

currDatetime(utc)

Returns the current datetime according to local or UTC settings.

date

date(year, month, day)

Returns a column with all values equal to the date consisting of given year, month and day. This description is valid only when the formula is filled with the year mandatory parameter and at least one optional parameter.

datetime

datetime(date, time)

Returns in each row of the result the datetime value obtained by the composition of the date value contained in the date entry and the time value contained in the time entry.

day

day(date)

Returns the day value of the date.

hour

hour(time)

Returns the hour value of time.

isDate

isDate(string, binary)

Checks whether the data type of the attribute specified in the string parameter is date.

isDatetime

isDatetime(string, binary)

Checks whether the data type of the attribute specified in the string parameter is date.

isMonth

isMonth(string, binary)

Checks whether the data type of the attribute specified in the string parameter is month.

isQuarter

isQuarter(string, binary)

Checks whether the data type of the attribute specified in the string parameter is quarter.

isTime

isTime(string, binary)

Checks whether the data type of the attribute specified in the string parameter is time.

isWeek

isWeek(string, binary)

Checks whether the data type of the attribute specified in the string parameter is week.

minute

minute(time)

Returns the minute value of time.

month

month(date, extract)

Returns the month value of date.

quarter

quarter(date, extract)

Returns the quarter value of date. If extract is True, it tries to cast variable date to quarter.

second

second(time)

Returns the seconds (values) of the time values.

time

time(hour, minute, second)

Composes a time starting from hours, minutes and seconds.

timeZone

timeZone()

Returns the current timezone, i.e. the difference between local time and UTC time. The resulting type is time.

week

week(date, extract)

Returns the week integer value of an attribute containing a value which can reconduct to a date.

weekDay

weekDay(date, mondaystart)

Returns the day of the week as an integer for each value of date.

year

year(date)

Returns the year value of an attribute containing a date.

Parameters in bold are mandatory.

 Graphs

Graphs functions in the Data Manager

Rulex Platform formula

Function

Formula

Description

connComp

connComp(parent, son, group)

This function operates on directed graphs. For each note, it returns the corresponding subgroup.

The node is the parent or son value of the graph.

In other words, the connComp function defines which subgroups of nodes are interconnected.

leaf

leaf(parent, son, group, whichpath, separator, weights, operator)

This function operates on a directed graph, the leaf of a node is meant as the very last node of a branch.
The leaf function retrieves the corresponding leaf for each node of the son attribute.

leafDistance

leafDistance(parent, son, group, whichpath, separator, weights, operator)

The leafDistance function calculates the distance, in terms of number of edges, of each node of the son attribute from its leaf.

A leaf is the very last node of the branch.

An edge is the link from one node to another.

root

root(parent, son, group, whichpath, separator, weights, operator)

The root function returns the corresponding root of a node. The root is the very first node of the branch.

rootDistance

rootDistance(parent, son, group, whichpath, separator, weights, operator)

The rootDistance function operates on a directed graph, two major components in a directed graph are nodes and edges.
A node is a data element while an edge is a link that helps to connect nodes.
The rootDistance function calculates the distance of each node from the root.

Parameters in bold are mandatory.

 System

System functions in the Data Manager

Rulex Platform formula

Function

Formula

Description

currDate

currDate(utc)

Returns the current date according to local or UTC settings.

currDatetime

currDatetime(utc)

Returns the current datetime according to local or UTC settings.

hostName

hostName()

Returns the hostname of the machine where Rulex Platform is running.

ipAddress

ipAddress()

Returns the IP address of the machine where Rulex Platform is running.

timeZone

timeZone()

Returns the current timezone, i.e. the difference between local time and UTC time. The resulting type is time.

Parameters in bold are mandatory.

 Data

Data functions in the Data Manager

Rulex Platform formula

Function

Formula

Description

cast

cast(column, newtype, forced)

Casts a column to the specified data type. If the flag forced is set to false (true by default) only those operations which do not result in a loss of precision will be performed.

catNames

catNames(indatt, values, separator, negate)

Searches for values in specific attributes, and returns the headers of the attributes where the values were found. All the corresponding headers are concatenated.

decideType

decideType(column)

Assigns the correct data type to an attribute, depending on the values it contains.

disc

disc(column, cutoffs, rank)

Discretizes values of a selected attribute according to cutoff values.

discEqualFrequency

discEqualFrequency(column, nvalue, rank, quantile)

Discretizes values of a selected attribute into bins with the same number of values in each.

discEqualWidth

discEqualWidth(column, nvalue, rank, min, max)

Discretizes values of a selected attribute into bins of equal width.

discretize

discretize(column, nvalue, cutoffs, mode, rank, quantile, min, max)

Discretizes values of a selected attribute into bins of equal width, or with the same number of values or according to cutoff values.

isAttribute

isAttribute(name, binary)

Checks whether a specified attribute is present in the dataset. The result can either be returned as a Boolean (true/false) or binary (0/1) result.

isFloat

isFloat(string, binary)

Checks whether the data type of the attribute specified in the string parameter is continuous.

isType

isType(string, type, binary)

Checks whether the data type of the attribute specified in the string parameter corresponds to the data type specified in the type parameter.

type

type(column)

Returns the data type of the selected column as a nominal value.

Parameters in bold are mandatory.

When using functions in formulas, pay close attention to the following syntax:

  • Function names are case sensitive.

  • When specifying the attributes, the syntax is $"attributename"

  • When specifying a string, the syntax is 'string'

  • When specifying continuous values and integers, just type them.