Joining datasets
The Join task in Rulex merges two datasets by their rows, creating a single table with all the data.
Merge operations can be used to produce a global dataset from on which in-depth analyses can be performed.
All merge operations are performed around the key attributes, which act as a unique identifier of each row in a table.
Rulex offers a full range of join options:
Prerequisites
You must have created a flow;
You must have imported the required datasets into the flow.
Procedure
Drag the Join task onto the stage.
Connect the tasks that contain the datasets you want to merge to merge the Join task.
Double-click the Join task.
Select the attributes you want to include in the merged table from the list of attributes on the left (lkeepnames) and right side tables (rkeepnames).
Configure the options described in the table below.
Save and compute the task.
Join options | |
Name | Description |
Join type | Select the combination method you want to use from the Join type drop-down menu. Possible types are:
|
Missing policy | Select the policy you want to adopt for managing missing values when evaluating the constraints on the matching attributes:
|
Merge type | This option manages attributes with the same name (or the same position according to the Match columns by name options) that are present in both datasets. Possible merge types are:
|
Match columns by name | If selected, columns with the same name are considered equal (default), rather than in the same position. |
Matching attributes for joining datasets | Drag and drop the key attributes you want to use to combine the datasets from the left-hand dataset (lkeynames) and right-hand dataset (rkeynames) onto the Matching attributes panel below, specifying the relationship that must exist between each pair of attributes with an operator (oplist) and value (parlist). If no attributes are selected row numbers are employed to combine datasets. A search and order function is provided to make it easier to find attributes when the list is long:
More than one attribute can be defined as key, however the number of attributes must be the same in both panels.
|
Insert left key value when missing or Insert right key value when missing | Select Insert left key value when missing or Insert right key value when missing if you want all the key attributes that are not present in a dataset to be filled with the corresponding key value in the other dataset. |
Example
The following example uses the northwind_customers and northwind_orders datasets.
Description | Screenshot |
---|---|
After having imported the northwind_customers and the northwind_orders datasets, we can state, by right clicking on the import task and selecting Take a look, that:
Add a Join task and link it to both the import tasks. | |
Compute the Join task. If the task is executed with the default options no attributes are present in the matching panels and therefore the row numbers are employed to combine datasets. In this case the resulting table has 91 records with 25 attributes. Note that the name CustomerID of the first attribute deriving from the right dataset has been changed to Customer_ID_1 to avoid a conflict with the name of the first attribute. | |
If the CustomerID and City attributes from the dataset on the left are matched with the CustomerID and ShipCity attributes from the dataset on the right, and Inner join is chosen as the Join type, the resulting dataset includes 817 records with 23 attributes. Only 817 different values in the CustomerID/ShipCity attribute pair from the right dataset match the values in the CustomerID and City attribute pair from the left dataset. |