Reshaping, Transforming and Cleaning Datasets

It is frequently necessary to perform operations on the structure or contents of datasets prior to creating a predictive model.

For example, it may be necessary to reshape some of the datasets before merging them into a single table, to transform attributes into more manageable data types, or to clean up the dataset by removing outliers or attributes that could cause confusion in the final model.

| Task | Description | Corresponding page |
| --- | --- | --- |
| Reshaping Tasks | | |
| Reshape To Long | Transforms key attributes in a dataset into new rows. This operation is necessary when a table contains more than one key. | Reshaping Datasets to Long Format |
| Reshape To Wide | Transforms key attributes in a dataset into new columns. This operation is necessary when a table contains more than one key. | Reshaping Datasets to Wide Format |
| Transpose | Converts rows into columns and vice versa. | Transposing Data |
| Transforming Tasks | | |
| Discretize | Transforms continuous attributes into a finite set of intervals. | Discretizing Data |
| Moving Window | Defines temporal windows of data of a specific size and shape. | Performing Moving Windows Statistics on Data |
| Cleaning Tasks | | |
| Fill/Clean | Removes attributes that could create confusion in the resulting predictive model. | Cleaning Datasets |
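
As a rough illustration of what the tasks listed above do to data, the following Python sketch mimics each one on small in-memory tables. It is not the product's implementation. Every function name, the sample data, the flat (unweighted) window, and the reading of Fill/Clean as "drop columns that never vary" are assumptions made for this example.

```python
from statistics import mean

# Hypothetical wide table: one row per id, one column per month.
wide = [
    {"id": 1, "jan": 10.0, "feb": 12.0},
    {"id": 2, "jan": 7.0, "feb": 9.0},
]

def reshape_to_long(rows, key, name_col, value_col):
    """Turn every non-key column into its own row."""
    return [
        {key: row[key], name_col: col, value_col: val}
        for row in rows
        for col, val in row.items()
        if col != key
    ]

def reshape_to_wide(rows, key, name_col, value_col):
    """Turn each distinct value of name_col into a column."""
    out = {}
    for row in rows:
        out.setdefault(row[key], {key: row[key]})[row[name_col]] = row[value_col]
    return list(out.values())

def transpose(matrix):
    """Swap rows and columns of a list of equal-length rows."""
    return [list(col) for col in zip(*matrix)]

def discretize(value, cutoffs):
    """Return the index of the interval that value falls into."""
    return sum(value >= c for c in cutoffs)

def moving_average(values, size):
    """Mean over each full window of `size` consecutive values."""
    return [mean(values[i:i + size]) for i in range(len(values) - size + 1)]

def drop_constant_columns(rows):
    """Assumed cleaning rule: remove columns with one value in every row."""
    keep = [c for c in rows[0] if len({r[c] for r in rows}) > 1]
    return [{c: r[c] for c in keep} for r in rows]

long_rows = reshape_to_long(wide, "id", "month", "sales")
```

In this sketch, `reshape_to_wide(long_rows, "id", "month", "sales")` rebuilds the original `wide` table, which shows that the two reshaping operations are inverses of each other when the key and name columns identify each value uniquely.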

Outliers

It is also important to identify and manage outliers: anomalous data samples that can have a negative impact on predictive models if not handled correctly.

Outlier handling is not limited to a single task. For details, see Identifying and Managing Outliers.
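
To make the idea of an anomalous sample concrete, here is a minimal Python sketch that flags outliers with the common interquartile-range rule. The rule, the multiplier `k=1.5`, the function names, and the sample readings are all assumptions chosen for illustration. The source does not say which detection method the outlier tasks use.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences placed k interquartile ranges beyond Q1 and Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def split_outliers(values, k=1.5):
    """Separate values inside the fences from values outside them."""
    low, high = iqr_bounds(values, k)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

# Hypothetical sensor readings with one anomalous sample.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
inliers, outliers = split_outliers(readings)
```

Here the fences come out at 6.5 and 16.5, so only the reading of 95 is flagged. Whether a flagged sample should be removed, capped, or kept depends on the dataset and the model being built.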