distance function in the Factory

The distance function computes the string distance between the values of two columns, column1 and column2, according to one of the following methods: "levenshtein" ("l"), "damerau-levenshtein" ("dl"), "lcs", or "hamming".


Parameters

distance(column1, column2, method)

If you are using continuous attributes, check the Flow Execution Parameters.

Parameter

Description

column1

The first attribute used to evaluate the distance. If it is not nominal, it is cast to nominal when the function is computed. The column1 parameter is mandatory.

column2

The second attribute used to evaluate the distance. If it is not nominal, it is cast to nominal when the function is computed. The column2 parameter is mandatory.

method

The algorithm ("levenshtein", "damerau-levenshtein", "lcs", "hamming") used to evaluate the string distance. Each method is identified by a string: 'l' for the Levenshtein algorithm, 'dl' for the Damerau-Levenshtein algorithm, 'lcs' for the lcs algorithm, and 'hamming' for the hamming algorithm. The method parameter is mandatory.

method algorithms:

  • The Levenshtein algorithm measures the difference between two strings as the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into the other. For example, the Levenshtein distance between 'like' and 'likely' is 2, because two operations are needed to transform 'like' into 'likely': adding an 'l' and adding a 'y'.

  • The Damerau-Levenshtein algorithm measures the edit distance between two different strings. It includes transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions) typical of the Levenshtein algorithm.

  • The lcs (longest common subsequence) algorithm finds the longest subsequence common to all sequences in a set of sequences, here the two strings being compared. A subsequence is not required to occupy consecutive positions within the original sequences.

  • The hamming algorithm counts the positions at which two strings of equal length differ, which makes it useful for finding errors in a sequence of code words. If its value is 0 or 1, then the word sequence is reliable.
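The four algorithms above can be sketched in plain Python. This is only an illustration of the general techniques, not the Factory's own implementation: the function names are hypothetical, the Damerau-Levenshtein sketch uses the "optimal string alignment" variant (which variant the Factory uses is not stated here), and the lcs sketch returns the length of the longest common subsequence, since how that length is turned into a distance is also not stated here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Levenshtein plus transposition of two adjacent characters
    (optimal string alignment variant, an assumption of this sketch)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(levenshtein("like", "likely"))   # 2
print(levenshtein("ab", "ba"))         # 2
print(osa_distance("ab", "ba"))        # 1
print(lcs_length("like", "likely"))    # 4
print(hamming("1011101", "1001001"))   # 2
```

The 'ab' / 'ba' pair shows the effect of allowing transpositions: swapping two adjacent characters counts as one operation instead of two.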


Example

The following example uses the Turkish calendar dataset.

Description

Screenshot

In this example, we want to calculate the distance between the RAMADAN_FLAG and PUBLIC_HOLIDAY_FLAG attributes using the Levenshtein algorithm.

Add a new attribute, called distance, and type the following formula:

distance($"RAMADAN_FLAG",$"PUBLIC_HOLIDAY_FLAG",'l')

The function returns 0 when the value in both attributes is the same, as no changes are needed to make them equal. It returns 1 when the values differ, indicating the number of operations needed to make the values equal.
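The result can be reproduced outside the Factory with a short Python sketch. The rows below are hypothetical, not taken from the Turkish calendar dataset, and assume both flags hold the integer values 0 or 1; str() stands in for the cast to nominal described in the parameters, which is also an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical rows: (RAMADAN_FLAG, PUBLIC_HOLIDAY_FLAG) as integer flags.
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]

# str() mimics casting a non-nominal value to nominal before comparing.
distances = [levenshtein(str(ramadan), str(holiday)) for ramadan, holiday in rows]
print(distances)  # [0, 1, 1, 0]
```

With single-character flags, one substitution is always enough to make the two values equal, so the distance can only be 0 or 1.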