Splitting Data with the Data Manager
When you want more control over how the dataset is divided, you can split it with a Data Manager task.
This lets you specify the criteria by which the dataset is split.
Prerequisites
you must have created a flow;
the required data must have been imported into the flow (see /wiki/spaces/RPDP/pages/2658467884);
the data used for the model must have been well prepared;
a single unified dataset must have been created by merging all the datasets imported into the flow.
The Modeling sets bar
The Modeling sets bar is a tool used to divide the dataset and to display either the whole dataset or its Training, Test and Validation sets.
It consists of four icons:
Name | Purpose |
---|---|
All | To display the complete dataset. |
Training set | To identify patterns in the data and build the model. |
Test set | To assess the accuracy of the model. |
Validation set | To tune the model parameters. |
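The different roles of the three sets can be illustrated with a minimal sketch in plain Python. It is not how the tool builds models: the data, the threshold "model" and the `accuracy` helper are all hypothetical and invented for illustration.

```python
# Hypothetical toy data: (feature value, label) pairs, invented for illustration.
training = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
validation = [(1.5, 0), (2.4, 0), (3.5, 1)]
test = [(0.5, 0), (2.2, 0), (3.8, 1)]


def accuracy(threshold, rows):
    """Fraction of rows where the rule 'value > threshold' matches the label."""
    hits = sum(1 for value, label in rows if int(value > threshold) == label)
    return hits / len(rows)


# Training set: build candidate models, here thresholds placed midway
# between neighbouring training values.
values = sorted(value for value, _ in training)
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

# Validation set: tune the parameter by keeping the best candidate.
best = max(candidates, key=lambda t: accuracy(t, validation))

# Test set: assess the accuracy of the chosen model on unseen rows.
test_accuracy = accuracy(best, test)
print(best, test_accuracy)
```

The test set is used only once, after the parameter has been chosen, so the reported accuracy is not influenced by the tuning step.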
Procedure
Add a new Data Manager task to the process.
Drag the attributes you want to use to divide the dataset onto the Filter column in the Query Manager.
Configure the filters to create the required view.
Right-click any cell in the data sheet and select Assign view to > Training set, Test set or Validation set, as required (by default, all patterns are in the training set).
Remove the filter by selecting the filter cell in the Query Manager and pressing DELETE.
Save and compute the task.
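The logic of the procedure (filter a view, assign it to a set, remove the filter, repeat) can be sketched outside the tool in plain Python. This is only an illustrative analogue, not the Data Manager's implementation: the rows, the `year` column and the `assign_view` helper are hypothetical.

```python
# Hypothetical rows and column names, invented for illustration only.
rows = [
    {"id": 1, "year": 2019},
    {"id": 2, "year": 2020},
    {"id": 3, "year": 2021},
    {"id": 4, "year": 2022},
    {"id": 5, "year": 2023},
]


def assign_view(rows, keep, target):
    """Assign every row that passes the filter `keep` to the set `target`."""
    for row in rows:
        if keep(row):
            row["set"] = target


# By default, every row belongs to the training set.
for row in rows:
    row["set"] = "training"

# Each call filters a view and assigns it; the filter is not kept afterwards,
# which mirrors removing the filter before defining the next one.
assign_view(rows, lambda r: r["year"] == 2022, "validation")
assign_view(rows, lambda r: r["year"] >= 2023, "test")

# Collect the row ids that ended up in each set.
sets = {
    name: [r["id"] for r in rows if r["set"] == name]
    for name in ("training", "validation", "test")
}
print(sets)
```

Rows that never match a filter stay in the training set, so only the test and validation views need to be defined explicitly.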
Example
The following example uses the Adult dataset.