Splitting Data with the Data Manager

When you need more control over how a dataset is divided, you can split it with a Data Manager task.

This lets you specify the criteria by which the dataset is split.


Prerequisites


The Modeling sets bar

The Modeling sets bar is a tool for dividing the dataset into the Training, Test and Validation sets, and for displaying either the complete dataset (All) or one of those sets.

It is made of four icons:

Name              Purpose

All               To display the complete dataset.
Training set      To identify patterns in the data and build the model.
Test set          To assess the accuracy of the model.
Validation set    To tune the model parameters.

Procedure

  1. Add a new Data Manager task to the process.

  2. In the Query Manager, drag the attributes that define the dataset division onto the Filter column.

  3. Configure the filters to create the required view.

  4. Right-click any cell in the data sheet and select Assign view to > Training set, Test set or Validation set, as required. By default, all patterns are in the training set.

  5. Remove the filter by selecting the filter cell in the Query Manager and pressing DELETE.

  6. Save and compute the task.
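
The logic behind this procedure can be sketched outside the product in a few lines of Python. This is only an illustration of filter-based assignment, not how the Data Manager is implemented: the function name assign_view_to, the labels list and the "score" attribute are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
VALID_SETS = {"training", "test", "validation"}


def assign_view_to(rows: List[Row], labels: List[str],
                   row_filter: Callable[[Row], bool], target_set: str) -> None:
    """Label every row that passes `row_filter` as `target_set`, in place."""
    if target_set not in VALID_SETS:
        raise ValueError(f"unknown set: {target_set!r}")
    for index, row in enumerate(rows):
        if row_filter(row):  # the row is visible in the filtered view
            labels[index] = target_set


# Hypothetical data with a made-up "score" attribute.
rows = [{"score": 12}, {"score": 87}, {"score": 45}, {"score": 91}]

# By default, every pattern is in the training set.
labels = ["training"] * len(rows)

# Steps 2-4: filter the view, then assign the visible rows to the test set.
assign_view_to(rows, labels, lambda row: row["score"] >= 80, "test")

# Step 5 has no counterpart here: the filter only exists inside the call.
print(labels)  # ['training', 'test', 'training', 'test']
```

Keeping the set membership in a separate list of labels mirrors the default described in step 4: rows that no filter ever matches simply stay in the training set.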


Example

The following example uses the Adult dataset.


  • After importing the required dataset, drag a Data Manager task onto the stage and link it to the imported dataset.

  • Double-click the Data Manager task to open it.

  • The original dataset contains 32561 patterns.

  • We want to divide the dataset as follows:

    • The training set contains the patterns where age is greater than 39.

    • The test set contains all the other patterns.

    • No validation set is required.

  • Drag the age attribute onto the pre-filter area and set the filter to > 39. Click Apply to apply the filter.

  • In the Modeling sets area, right-click the training set icon and click Assign displayed row to training set. The training set is now built, and it is displayed whenever you click the training set icon.
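
The division used in this example can be checked with a short Python sketch. It assumes the rows are available as dictionaries with an "age" key. The six rows below are made up for illustration and are not records from the Adult dataset, and split_on_age is a hypothetical helper, not part of the product.

```python
from typing import Dict, List

# Illustrative rows only: these are NOT real records from the Adult dataset.
rows = [{"age": 23}, {"age": 39}, {"age": 40},
        {"age": 52}, {"age": 31}, {"age": 67}]


def split_on_age(rows: List[Dict[str, int]],
                 threshold: int = 39) -> Dict[str, List[Dict[str, int]]]:
    """Put rows with age > threshold in training and all other rows in test."""
    sets: Dict[str, List[Dict[str, int]]] = {
        "training": [], "test": [], "validation": []}
    for row in rows:
        name = "training" if row["age"] > threshold else "test"
        sets[name].append(row)
    return sets


sets = split_on_age(rows)

# Every row must land in exactly one set, so the sizes add up to the total.
assert sum(len(members) for members in sets.values()) == len(rows)

for name, members in sets.items():
    print(name, len(members))
# training 3
# test 3
# validation 0
```

Note that a row with age equal to 39 goes to the test set, because the filter is strictly greater than 39. Applied to the full dataset, the same size check should add up to the 32561 patterns of the original data.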