Using Decision Tree to solve Classification Problems
The Decision Tree task can solve classification problems by building a tree structure of intelligible rules.
you must have created a flow;
the required datasets must have been imported into the flow;
the data used for the analysis must have been well prepared, and include a categorical output and a number of inputs. Data preparation may involve discretization before building the decision tree, to improve the accuracy of the model and to reduce the computational effort.
a unified model must have been created by merging all the datasets in the flow.
The Monitor tab, where it is possible to view the statistics related to the generated rules as a set of histograms, such as the number of conditions, covering value, or error value. Rules relative to different classes are displayed as bars of a specific color. These plots can be viewed during and after computation operations.
The Results tab, where statistics on the computation are displayed, such as the execution time, number of rules, average covering etc.
Drag the Decision Tree task onto the stage.
Connect a task, which contains the attributes from which you want to create the model, to the new task.
Double click the Decision Tree task.
Configure the options described in the table below.
Save and compute the task.
Browse through the Monitor and Results tabs to analyze the results.
Decision Tree options
Drag and drop the input attributes which will be used to classify data in the decision tree.
Drag and drop the attributes which will be used to form the final classes into which the dataset will be divided.
Minimum number of patterns in a leaf
The minimum number of patterns that a leaf can contain. If a node contains fewer patterns than this threshold, tree growth is stopped and the node is considered a leaf.
Maximum impurity in a leaf
Specify the threshold on the maximum impurity in a node. The impurity is calculated with the method selected in the Impurity measure option.
By default this value is zero, so trees grow until a pure node is obtained (if possible with training set data) and no ambiguities remain.
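The two stopping thresholds above can be sketched in a few lines of Python. This is only an illustration of the logic described here, not the task's actual implementation: the function names are hypothetical, and Gini impurity is assumed as the impurity measure.

```python
from collections import Counter

def gini(class_counts):
    """Gini impurity of a node, given its per-class pattern counts."""
    total = sum(class_counts.values())
    return 1.0 - sum((n / total) ** 2 for n in class_counts.values())

def is_leaf(labels, min_patterns=1, max_impurity=0.0):
    """Return True when tree growth should stop at a node holding `labels`.

    min_patterns and max_impurity are hypothetical names mirroring the
    two options described above.
    """
    # Too few patterns: stop growing and treat the node as a leaf.
    if not labels or len(labels) < min_patterns:
        return True
    # Impurity at or below the threshold: the node is pure enough.
    return gini(Counter(labels)) <= max_impurity
```

With the default threshold of zero, a node is declared a leaf only when all of its patterns belong to the same class, or when it holds too few patterns to be split further.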
The method used to prune redundant leaves after tree creation. The following choices are currently available:
Method for handling missing data
Select the method to be used to handle missing data:
Impurity measure
The method used to measure the impurity of a leaf. Considering a classification problem with c classes and a given node η, the following choices are currently available:
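As an illustration, three impurity measures commonly used in decision trees can be computed from the class counts of a node as sketched below. Whether these correspond exactly to the choices offered by the task is an assumption, and the function names are hypothetical.

```python
import math

def class_probabilities(counts):
    """Fraction of patterns in each non-empty class of the node."""
    total = sum(counts)
    return [n / total for n in counts if n > 0]

def entropy(counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probabilities(counts))

def gini(counts):
    """Gini index: 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in class_probabilities(counts))

def misclassification_error(counts):
    """Error rate when predicting the majority class: 1 - max(p_i)."""
    return 1.0 - max(class_probabilities(counts))
```

All three measures are zero for a pure node and reach their maximum when the c classes are equally represented.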
Initialize random generator with seed
If selected, a seed, which defines the starting point in the sequence, is used during random generation operations. Consequently, using the same seed each time makes each execution reproducible. Otherwise, each execution of the same task (with the same options) may produce different results, because different random numbers are generated in some phases of the process.
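The effect of a fixed seed can be illustrated with Python's standard random module. This is a generic sketch of seeded random generation, not the generator used by the task; the function name is hypothetical.

```python
import random

def shuffled_indices(n, seed=None):
    """Return the indices 0..n-1 in random order.

    With a fixed seed the order is the same on every call;
    with seed=None it generally differs between calls.
    """
    rng = random.Random(seed)  # private generator, does not touch global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices

# Same seed, same sequence: the two runs are identical.
first_run = shuffled_indices(10, seed=42)
second_run = shuffled_indices(10, seed=42)
```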
Select the attribute to split before the value
If selected, the QUEST method is used to select the best split. According to this approach, the best attribute to split is selected via a correlation measure, such as F-test or Chi-Square. After choosing the best attribute, the best value for splitting is selected.
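The "attribute first, value second" idea can be sketched for categorical attributes with a Pearson Chi-Square statistic. This is a simplified illustration, not the task's implementation: the function names are hypothetical, and raw statistics are compared directly, which is only meaningful when the candidate attributes have the same number of categories. QUEST as published compares p-values and uses an F-test for ordered attributes.

```python
from collections import Counter

def chi_square(values, labels):
    """Pearson Chi-Square statistic between a categorical attribute and the class."""
    n = len(values)
    joint = Counter(zip(values, labels))
    value_totals = Counter(values)
    label_totals = Counter(labels)
    stat = 0.0
    for v, n_v in value_totals.items():
        for c, n_c in label_totals.items():
            expected = n_v * n_c / n          # count expected under independence
            observed = joint.get((v, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def select_attribute(columns, labels):
    """Pick the attribute most strongly associated with the class."""
    return max(columns, key=lambda name: chi_square(columns[name], labels))

# Toy data (invented for illustration): 'married' determines the class, 'coin' does not.
columns = {
    "married": ["y", "y", "n", "n"],
    "coin": ["h", "t", "h", "t"],
}
labels = ["hi", "hi", "lo", "lo"]
best = select_attribute(columns, labels)
```

Only after the attribute has been chosen would the best splitting value be searched for, and only within that attribute.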
If selected, the results of this computation are appended to the dataset; otherwise, they replace the results of previous computations.
Aggregate data before processing
If selected, identical patterns are aggregated and considered as a single pattern during the training phase.
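Aggregation of identical patterns can be sketched as follows. Keeping a count next to each distinct pattern is an assumption made for this illustration, since the text above does not say whether the task retains the multiplicity; the function name is hypothetical.

```python
from collections import Counter

def aggregate_patterns(rows):
    """Collapse identical rows into (pattern, count) pairs.

    Rows are compared value by value; the order of first appearance is kept.
    """
    counts = Counter(tuple(row) for row in rows)
    return [(list(pattern), count) for pattern, count in counts.items()]

# Invented rows: the first two are identical and collapse into one pattern.
rows = [[39, "Private"], [39, "Private"], [50, "Self-emp"]]
aggregated = aggregate_patterns(rows)
```

Fewer distinct patterns mean less work during training, which is the main reason to enable this kind of option on datasets with many duplicates.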
The following example uses the Adult dataset.
After importing the Adult dataset with the Import from Text File task and splitting it into training, validation and test sets (60% training, 20% validation and 20% test) with the Split Data task, add a Decision Tree task to the process and double click the task.
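For readers who want to reproduce a comparable 60/20/20 partition outside the flow, the split can be sketched in plain Python. This is not the Split Data task itself: the function name, its parameters and the use of a simple shuffled split are assumptions.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test parts.

    The test part receives whatever remains after the first two parts.
    """
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_part = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return training, validation_part, test

training, validation_part, test = split_dataset(range(100))
```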
Compute the task to start the analysis.
The properties of the generated rules can be viewed in the Monitor tab of the Decision Tree task:
There are, for example, 656 rules with 4 conditions: 515 relative to class “<50K” and 141 relative to class “>50K”.
The total number of rules and the minimum, maximum and average number of conditions are also reported.
Analogous histograms can be viewed for covering and error, by clicking on the corresponding tabs.
Clicking on the Results tab displays a spreadsheet with
The rule spreadsheet can then be viewed by adding a Rule Manager task.
Each row displays all the conditions that belong to the specific rule.
The total number of generated rules is 1390, with a number of conditions ranging from 1 to 10.
The maximum covering value is 67.4%, whereas the maximum error is about 15%.
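The sketch below shows one way covering and error could be computed for a single rule. The definitions used are assumptions, since they are not given in this section: covering is taken as the fraction of patterns of the rule's class that satisfy the rule, and error as the fraction of patterns of the other classes that satisfy it. The function name, the toy rule and the class names are hypothetical.

```python
def covering_and_error(rule, rule_class, patterns, labels):
    """Return (covering, error) of a rule under the assumed definitions.

    rule is a function that returns True when a pattern satisfies
    all the conditions of the rule.
    """
    in_class = [p for p, y in zip(patterns, labels) if y == rule_class]
    out_class = [p for p, y in zip(patterns, labels) if y != rule_class]
    # Guard against empty groups to avoid dividing by zero.
    covering = sum(rule(p) for p in in_class) / len(in_class) if in_class else 0.0
    error = sum(rule(p) for p in out_class) / len(out_class) if out_class else 0.0
    return covering, error

# Invented toy data: the rule "age > 40" predicts the class "high".
patterns = [{"age": 50}, {"age": 45}, {"age": 42}, {"age": 30}, {"age": 41}, {"age": 20}]
labels = ["high", "high", "high", "high", "low", "low"]
covering, error = covering_and_error(lambda p: p["age"] > 40, "high", patterns, labels)
```

Under these definitions a good rule has a high covering and a low error, which is how the maximum values quoted above would be read.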
The application of this set of rules to the training and test patterns can be checked by right-clicking the Apply Model task and selecting Take a look.
The application of the rules generated by the Decision Tree task has added new columns containing: