Importing Data from an XML File

You can import data directly from XML files, specifying the data sheets. You can do it by either:

  • Dragging the XML file directly onto the stage, which creates an Import from XML File task.

  • Dragging an Import from XML File task onto the stage. This gives you finer control over the import, as you can:

    • Import a single file: only one file is imported, specifying the datasheet from which information will be taken.

    • Import multiple files together: in this case the files are concatenated to form a single table. This means that all files imported together must have the same structure.

A preview of the imported table is always displayed in the task.
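To illustrate what "the same structure" means when several files are concatenated, here is a minimal Python sketch. It is not how Rulex reads XML internally. It assumes a hypothetical layout in which every child of the root element is one record and every child of a record is one attribute; the sample data and the helper names xml_to_rows and same_structure are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout (an assumption, not a documented Rulex format):
# each child of the root is one record, each child of a record is one attribute.
SAMPLE_XML = """
<customers>
  <customer><name>Ada</name><city>London</city></customer>
  <customer><name>Linus</name><city>Helsinki</city></customer>
</customers>
"""


def xml_to_rows(xml_text):
    """Flatten record elements into a list of {attribute: value} dicts."""
    root = ET.fromstring(xml_text.strip())
    return [{field.tag: field.text for field in record} for record in root]


def same_structure(*tables):
    """Return True if every row of every table has the same attribute names, in the same order."""
    signatures = {tuple(row.keys()) for table in tables for row in table}
    return len(signatures) <= 1


rows = xml_to_rows(SAMPLE_XML)
print(rows)
print(same_structure(rows, rows))
```

A check like same_structure can be run on the files before importing them together, so that structural differences are found before the task is computed.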


Prerequisites

  • You must have created a flow.

  • If you are importing multiple files, they must have the same structure.


Procedure

  1. Drag an Import from XML File task onto the stage.

  2. Double-click the task to open the task menu.

  3. Select whether you want to use a Saved or a Custom source.

  4. From the drop-down list, choose whether to import the file from your computer (Local) or from a remote connection.

    1. If you are importing from a Remote Filesystem, choose it from the list. If you are using a Custom source, click the pencil button to set the required connection information. The tables are loaded in the Files tab.

    2. If you are using a Local Filesystem, click on the Select File button and choose the path where the file is stored.

  5. Click on the Add new path button, located next to the Select button, to add new paths where other files to be imported are stored. You can add as many paths as you want.

  6. Click on the X button, located next to the Select button, to remove the corresponding path.

  7. Click on the Delete all paths button, located under the Add new path button, to remove all the inserted paths.

  8. Choose the resources to import by clicking the Select button in the Path 1 section.

  9. Click on the Add new path button, located next to the Select button, to add a new path for a new resource. You can add as many paths as you want.

  10. Click on the X button, located next to the Select button, to remove the corresponding path.

  11. Click on the Delete all paths button, located under the Add new path button, to remove all the inserted paths.

  12. If you are importing multiple files together, click on the Concatenation Type’s drop-down arrow to choose Inner or Outer concatenation.

    1. Inner concatenation: the final table includes only the attributes that exist in all the imported tables.

    2. Outer concatenation: the final table includes all the attributes, filling in missing values where an attribute is absent from a table.

  13. If you are importing multiple files together, click on the Match Column by drop-down arrow and choose whether to match columns by their Name or by their Position.

  14. Click on the XML options tab and set the Parsing options and Import options, as shown in the table below.

  15. Save and compute the task.
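The difference between the Inner and Outer concatenation types, and between matching columns by Name or by Position, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Rulex implementation: each table is represented as a (columns, rows) pair, the function name concatenate and its parameters are hypothetical, and when matching by position the column names are assumed to come from the first table that has a column at that position.

```python
def concatenate(tables, how="outer", match_by="name", missing=None):
    """Stack (columns, rows) tables into a single (columns, rows) table."""
    if match_by == "position":
        # Identify every column by its index so that alignment ignores names.
        names = {}
        keyed = []
        for columns, rows in tables:
            for i, col in enumerate(columns):
                names.setdefault(i, col)  # first name seen at a position wins
            keyed.append((list(range(len(columns))), rows))
    else:
        names = None
        keyed = list(tables)

    # Collect every column key in order of first appearance.
    kept = []
    for columns, _ in keyed:
        for col in columns:
            if col not in kept:
                kept.append(col)

    # Inner concatenation keeps only the columns present in every table.
    if how == "inner":
        kept = [c for c in kept if all(c in columns for columns, _ in keyed)]

    # Outer concatenation fills absent columns with the `missing` value.
    out_rows = []
    for columns, rows in keyed:
        index = {c: i for i, c in enumerate(columns)}
        for row in rows:
            out_rows.append([row[index[c]] if c in index else missing for c in kept])

    out_columns = [names[c] for c in kept] if names is not None else kept
    return out_columns, out_rows


a = (["id", "name"], [[1, "Ada"]])
b = (["name", "city"], [["Linus", "Helsinki"]])

print(concatenate([a, b], how="outer"))
print(concatenate([a, b], how="inner"))
print(concatenate([a, b], match_by="position"))
```

With these two sample tables, the outer result has three columns with gaps filled in, the inner result keeps only the shared name column, and matching by position stacks the second table under the column names of the first regardless of what its columns are called.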


Parsing and Import options

Settings option

Description

Parsing options

Here you can set:

  • Number separators: the thousands separator and the decimal separator.

  • Missing string: enter the string that marks a missing value; it is removed from the dataset.

  • Key for types: specify the string that will be used to identify the type.

Import options

  • Remove empty rows: select the checkbox if you want to remove the empty rows from the imported dataset.

  • Remove empty columns: select the checkbox if you want to remove the empty columns from the imported dataset.

  • Strip spaces: select this option if you want to remove spaces surrounding strings. For example, the string “ class “ will be imported as “class”.

  • Add an attribute containing filename: select this option to add an extra column with the name of the file to the dataset.

  • Case sensitive: select the checkbox if you want uppercase and lowercase letters to be treated as different. The consequences of selecting this checkbox are described in the section on the Case Sensitive checkbox.

  • Use old computation data if the source file is not available: if selected, data from the previous computations will be used if the source table is not available.

  • Continue the execution if the file is missing: if selected, computation of the task continues, even if the selected source files are not available.

  • Wait until the target file is present: if selected, Rulex polls the target file with the frequency specified (sleeptime) until it is available.

  • Turn off smart type recognition: if selected, prevents automatic recognition of data types, leaving the generic nominal type. This option is useful when manual identification is preferable, for example when there is the risk of a code being misinterpreted as a date.

  • Number of records to preview: specifies how many records the table preview displays.

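The effect of some of these options can be approximated with the following Python sketch. It is an illustration only and does not reproduce Rulex's parser: the function names parse_number and clean_table are hypothetical, the Missing string option is interpreted as "the string that stands for a missing value", and empty strings are also treated as missing, which is an assumption.

```python
def parse_number(text, thousands=",", decimal="."):
    """Convert text such as '1.234,5' to a float using the given separators."""
    if thousands:
        text = text.replace(thousands, "")
    return float(text.replace(decimal, "."))


def clean_table(columns, rows, missing_string="NA", strip_spaces=True,
                remove_empty_rows=True, remove_empty_columns=True):
    """Apply strip-spaces, missing-string and empty row/column removal to a table."""
    cleaned = []
    for row in rows:
        new_row = []
        for value in row:
            if strip_spaces and isinstance(value, str):
                value = value.strip()  # " class " becomes "class"
            if value == missing_string or value == "":
                value = None  # assumption: empty strings also count as missing
            new_row.append(value)
        cleaned.append(new_row)

    if remove_empty_rows:
        cleaned = [r for r in cleaned if any(v is not None for v in r)]

    if remove_empty_columns:
        keep = [i for i in range(len(columns))
                if any(r[i] is not None for r in cleaned)]
        columns = [columns[i] for i in keep]
        cleaned = [[r[i] for i in keep] for r in cleaned]

    return columns, cleaned


columns = ["name", "class", "note"]
rows = [[" Ada ", " class ", "NA"], ["NA", "", "NA"]]
print(clean_table(columns, rows))
print(parse_number("1.234,5", thousands=".", decimal=","))
```

In this example the second row and the note column contain only missing values, so both are dropped, and the surrounding spaces are removed from the remaining strings.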


Focus on the Case Sensitive checkbox

We recommend that you do not select the Case Sensitive checkbox, as it has a significant impact on the data analysis.

If the Case Sensitive checkbox is selected, the number of distinct values in a column can increase, which changes the results of the data analysis.

For example, the two strings 'Word' and 'word' are considered two distinct values. This means that a function written for the string 'word' does not apply to the string 'Word'.

Attribute names are affected too: if you want to apply a function to the $"Word" attribute and you type $"word" in the formula bar, an error occurs during the computation process.
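The two effects described above, on distinct values and on attribute names, can be reproduced with a short Python sketch. It only mimics the behavior described in this section; the helpers distinct_values and find_attribute are hypothetical and are not part of Rulex.

```python
def distinct_values(values, case_sensitive):
    """Return the set of distinct strings, optionally ignoring case."""
    if case_sensitive:
        return set(values)
    return {v.lower() for v in values}


def find_attribute(columns, name, case_sensitive):
    """Return the column matching `name`, or raise KeyError if there is none."""
    for col in columns:
        if col == name or (not case_sensitive and col.lower() == name.lower()):
            return col
    raise KeyError(name)


values = ["Word", "word", "WORD"]
print(len(distinct_values(values, case_sensitive=True)))
print(len(distinct_values(values, case_sensitive=False)))
print(find_attribute(["Word"], "word", case_sensitive=False))
```

With case sensitivity enabled, the same three strings count as three distinct values, and looking up the attribute "word" in a table whose column is named "Word" raises an error, which mirrors the computation error mentioned above.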