Joining datasets

The Join task in Rulex merges two datasets by their rows, creating a single table with all the data.

Merge operations can be used to produce a global dataset from on which in-depth analyses can be performed.

All merge operations are performed around the key attributes, which act as a unique identifier of each row in a table.

Rulex offers a full range of join options:


Prerequisites


Procedure

  1. Drag the Join task onto the stage.

  2. Connect the tasks that contain the datasets you want to merge to merge the Join task.

  3. Double-click the Join task.

  4. Select the attributes you want to include in the merged table from the list of attributes on the left (lkeepnames) and right side tables (rkeepnames).

  5. Configure the options described in the table below.

  6. Save and compute the task.

Join options

Name

Description

Join type

Select the combination method you want to use from the Join type drop-down menu. Possible types are:

  • Inner join (0): only the pairs of samples whose values of the key attributes values satisfy all the conditions listed in the matching attributes panels will be included in the final dataset.

  • Left outer join (1): the final dataset includes all the records from the dataset on the left even if the join condition does not find matching records in the right dataset. 

    If a row in the left-hand table does not match any row in the right-hand table the columns relative to the right-hand table will present empty cells.

  • Right outer join (2): the final dataset includes all the records from the dataset on the right. If a row in the right-hand table does not match any row in the left-hand table the columns relative to the left-hand table will present empty cells.

  • Full outer join (3): the union of the tables produced by left outer join and right outer join is included in the final dataset.

  • Left complement (4): the final dataset includes all the records from the left-hand dataset that do not satisfy the conditions listed in the matching attribute panels. Since the rows in the left-hand table do not match any records in the right-hand table, the columns relative to the right-hand table will all be empty.

  • Right complement (5): the final dataset includes all the records from the right-hand dataset that do not satisfy the conditions listed in the matching attribute panels. Since the rows in the right-hand table do not match any records in the left-hand table, the columns relative to the left-hand table will all be empty.

  • Full complement (6): the final dataset includes all the records from both datasets that do not satisfy the conditions listed in the matching attribute panels. 

Missing policy

Select the policy you want to adopt for managing missing values when evaluating the constraints on the matching attributes:

  • Missing values are considered normal values (0): missing values are considered missing, and are matched only with other missing values.

  • Missing values always satisfy equality checks (1): missing values are always considered as matching.

  • Missing values never satisfy equality checks (2): missing values are never considered as matching.

Merge type

This option manages attributes with the same name (or the same position according to the Match columns by name options) that are present in both datasets.

Possible merge types are:

  • None (0): if the attributes are not merged, a separate column will be created in the resulting dataset for the left-hand and right-hand dataset attributes.

  • Left fill (1): the value in the right-hand dataset overwrites the value in the left-hand dataset. If there is no value in the right-hand dataset for the attribute, the corresponding value in the left-hand dataset is kept.

  • Right fill (2): the value in the left-hand dataset overwrites the value in the right-hand dataset. If there is no value in the left-hand dataset for the attribute, the corresponding value in the right-hand dataset is kept.

Match columns by name

If selected, columns with the same name are considered equal (default), rather than in the same position.

Matching attributes for joining datasets

Drag and drop the key attributes you want to use to combine the datasets from the left-hand dataset (lkeynames) and right-hand dataset (rkeynames) onto the Matching attributes panel below, specifying the relationship that must exist between each pair of attributes with an operator (oplist) and value (parlist). If no attributes are selected row numbers are employed to combine datasets. A search and order function is provided to make it easier to find attributes when the list is long:

  • Search attribute, which allows you to search the list of attributes for a specific string.

  • Order attributes by, which allows you to sort attributes in the original order, in alphabetical order or according to their type, ignored value, number of values and role.

More than one attribute can be defined as key, however the number of attributes must be the same in both panels.

 

Insert left key value when missing or Insert right key value when missing

Select Insert left key value when missing or Insert right key value when missing if you want all the key attributes that are not present in a dataset to be filled with the corresponding key value in the other dataset.


Example

The following example uses the northwind_customers and northwind_orders datasets.

Description

Screenshot

After having imported the northwind_customers and the northwind_orders datasets, we can state, by right clicking on the import task and selecting Take a look, that:

  • The northwind_customers dataset contains:

    • 91 records

    • 11 attributes (all nominal)

  • The northwind_orders dataset contains:

    • 830 records

    • 14 attributes (of various types)

Add a Join task and link it to both the import tasks.

Compute the Join task. If the task is executed with the default options no attributes are present in the matching panels and therefore the row numbers are employed to combine datasets.

In this case the resulting table has 91 records with 25 attributes.

Note that the name CustomerID of the first attribute deriving from the right dataset has been changed to Customer_ID_1 to avoid a conflict with the name of the first attribute.

If the CustomerID and City attributes from the dataset on the left are matched with the CustomerID and ShipCity attributes from the dataset on the right, and Inner join is chosen as the Join type, the resulting dataset includes 817 records with 23 attributes.

Only 817 different values in the CustomerID/ShipCity attribute pair from the right dataset match the values in the CustomerID and City attribute pair from the left dataset.