Merging Datasets

It may happen that you need to merge datasets containing similar data to perform advanced analysis.

There are two distinct ways to merge data in Rulex:

  • Joining tables by their rows. If there are different attributes for each key attribute spread over different tables, it would make sense to join your tables. For example, for the client key attribute there are two tables, one with the client's address and phone number, and a second table with the client's purchase history. The resulting merged table would contain complete information on each client key attribute.

  • Concatenating tables by their columns. If your tables contain different subsets of a set of records, you should concatenate your tables by their columns. For example, table 1 may contain full details on clients whose names begin with A > M, while table 2 contains full details from N > Z. The resulting merged table would contain information on all clients from A >Z.

 

General guidelines on merging datasets

  • Whenever possible perform join operations after grouping data, since the size of data tables is smaller and the time required for computation is lower.

  • Carefully select the key attributes. Remember that many-to-many relationships between keys may lead to huge output tables. Although in some situations this is exactly what is expected from the join task, in several cases it depends on an incorrect definition of keys.

  • Define the data joining policy (Inner, Outer, Complement) according to your needs. Remember that selecting the wrong policy may remove significant rows (Inner Join) or, vice versa, include unnecessary rows in the final table (Outer Join).

  • Carefully select which attributes will be included in the final table. Including all the attributes can lead to huge tables, without adding information to the dataset. Pay particular attention to the attributes present in both the joined tables since they will result in duplicate columns in the final table.