
Using Regression Tree to solve Regression Problems

The Regression Tree task can build a regression model using the decision tree approach.

The output of the task is a tree structure, and a set of intelligible rules, which can be analyzed in the Rule Manager.

Prerequisites

you must have created a flow;
the required datasets must have been imported into the flow;
the dataset used for the analysis must have been well prepared, and must include an ordered output and a number of inputs. Data preparation may involve discretization before building the tree, to improve the accuracy of the model and to reduce the computational effort.
a unified model must have been created by merging all the datasets into the flow.

Additional tabs

The Monitor tab, where it is possible to view the statistics related to the generated rules as a set of histograms, such as the distribution of the number of conditions, covering and error. Rules relative to different classes are displayed as bars of a specific color. These plots can be viewed during and after computation operations.
The Results tab, where statistics on the Regression Tree computation are displayed, such as the execution time and the number of rules.

Procedure

Drag the Regression Tree task onto the stage.
Connect a task, which contains the attributes from which you want to create the model, to the new task.
Double click the Regression Tree task.
Configure the options described in the table below.
Save and compute the task.

Parameter Name	Description
Regression Tree options
Input attributes	Drag and drop the input attributes which will be used to generate predictive rules in the decision tree.
Output attributes	Drag and drop the attributes which will contain the results of the predictive analysis.
Minimum number of patterns in a leaf	The minimum number of patterns that a leaf can contain. If a node contains fewer patterns than this threshold, tree growth is stopped and the node is considered a leaf.
Maximum impurity in a leaf	Specify the threshold on the maximum impurity in a node. The impurity is calculated with the method selected in the Impurity measure option. By default this value is zero, so trees grow until a pure node is obtained (if possible with training set data) and no ambiguities remain.
Pruning method	The method used to prune redundant leaves after tree creation. The following choices are currently available: No pruning: leaves are not pruned and the tree is left unchanged. Cost-complexity pruning: in this approach, implemented in CART, a sequence of sub-trees is created and the best one is selected by applying each of them to a validation set. Each sub-tree is created from the previous one by minimizing a cost-complexity measure that takes into account both the error on the training set and the number of leaves.
Method for handling missing data	Missing data can be handled using one of the following methods: Replace with average: missing values are replaced with the value fixed by the user for the corresponding attribute (for example, by means of a Data Manager). If this value is not set, the average computed on the training set is used. Remove from splits: patterns with a missing value in the test attribute are removed from the subsequent nodes. Include in splits: patterns with a missing value in the test attribute at a given node are sent to both the sub-nodes deriving from the split.
Select the attribute to split before the value	If selected, the QUEST method is used to select the best split. According to this approach, the best attribute to split is selected via a correlation measure, such as F-test or Chi-Square. After choosing the best attribute, the best value for splitting is selected.
Aggregate data before processing	If selected, identical patterns are aggregated and considered as a single pattern during the training phase.
Initialize random generator with seed	If selected, a seed, which defines the starting point in the sequence, is used during random generation operations. Consequently, using the same seed each time makes each execution reproducible. Otherwise, each execution of the same task (with the same options) may produce different results, because different random numbers are generated in some phases of the process.
Append results	If selected, the results of this computation are appended to the dataset, otherwise they replace the results of previous computations.
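
The two stopping options in the table (Minimum number of patterns in a leaf and Maximum impurity in a leaf) can be illustrated with a minimal Python sketch. This is not the task's implementation: it assumes variance of the output as the impurity measure, works on a single ordered input, and all function names are hypothetical.

```python
from statistics import pvariance

def impurity(targets):
    """Assumed impurity measure: variance of the output values in a node."""
    return pvariance(targets) if len(targets) > 1 else 0.0

def is_leaf(targets, min_patterns=1, max_impurity=0.0):
    """Stop growing when the node is too small or its impurity is low enough."""
    return len(targets) < min_patterns or impurity(targets) <= max_impurity

def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best binary split on one input."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal input values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Impurity of the two sub-nodes, weighted by the patterns they contain.
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(pairs)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Made-up illustrative values, not taken from any real dataset.
ages = [22, 25, 47, 52]
hours = [20, 20, 40, 40]

split = best_split(ages, hours)
small_node = is_leaf([20, 40, 40], min_patterns=5)   # too few patterns
pure_node = is_leaf([20, 20])                        # impurity is zero
mixed_node = is_leaf([20, 40, 40], min_patterns=2)   # can still be split
```

With a maximum impurity of zero, as in the default described above, a node keeps splitting until all its patterns share the same output value or the minimum number of patterns is reached.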

Example

The following example uses the Adult dataset.


After importing the dataset with the Import from Text File task and splitting it into training and test sets (70% training and 30% test) with the Split Data task, add a Regression Tree task to the flow and define the following parameters. As this is a regression task, we need an ordered output, so we define the hours-per-week attribute as the output attribute.
Drag the following attributes onto the Input attributes pane:
- age
- workclass
- education
- occupation
- race
- sex
- native-country
- income
Specify No pruning in the Pruning method drop down list.
Save and compute the task.
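
The 70%/30% split used in this example, combined with a fixed seed, can be sketched in Python as follows. This only illustrates the idea of a reproducible random split; it does not reproduce the Split Data task, and the function name and the seed value are hypothetical.

```python
import random

def split_patterns(patterns, test_fraction=0.3, seed=None):
    """Shuffle the patterns and return (training, test) lists."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    shuffled = list(patterns)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Ten placeholder pattern identifiers stand in for the dataset rows.
rows = list(range(10))
train_a, test_a = split_patterns(rows, seed=42)
train_b, test_b = split_patterns(rows, seed=42)
```

Calling the function twice with the same seed yields identical training and test sets, which is the behaviour the seed option is meant to guarantee.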

The properties of the generated rules can be viewed in the Monitor tab of the RT task.

1588 rules have been generated, with the number of conditions per rule ranging from 2 to 7.

You can hover over the various colours of the histogram to see the distribution for each rule, with the corresponding number of conditions and results.

Analogous histograms can be viewed for covering and error, by clicking on the corresponding tabs.

We can check how this set of rules applies to the training and test patterns by adding an Apply Model task and computing it with the default options. Right-clicking the computed task and selecting Take a look displays the results.

The application of the rules generated by the Regression Tree task has added four columns containing:

the forecast for each pattern: pred(hours-per-week)
the confidence relative to this forecast: conf(hours-per-week)
the most important rule that determined the prediction: rule(hours-per-week)
the classification error, i.e. 1 if misclassified and 0 if correctly classified: err(hours-per-week).

The content of the parentheses is the name of the variable the prediction refers to.

From the summary panel on the left we can see that the model achieves a mean squared error of 12.35 on the training set.
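
For reference, the mean squared error is the average of the squared differences between the actual and the predicted output. A minimal Python sketch follows; the values are made up for illustration and are not taken from the Adult dataset, and the function name is hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one pattern is required")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only: actual hours-per-week and a model's forecasts.
hours = [40, 35, 50, 20]
pred = [38, 35, 45, 25]

mse = mean_squared_error(hours, pred)  # (4 + 0 + 25 + 25) / 4
```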