Importing Data from a PDF File

You can import data stored in a PDF file, whether or not the data is laid out as a table.

You can do it by either:

  • Dragging the PDF file directly onto the stage, creating an Import from PDF File task.

  • Dragging an Import from PDF File task onto the stage. This gives you more control over the import, as you can:

    • Import a single file: only one file is imported, and you specify the datasheet from which the data will be taken.

    • Import multiple files together: in this case the files are concatenated to form a single table. This means that all files imported together must have the same structure.

A preview of the imported table is always displayed in the task.


Prerequisites

  • you must have created a flow;

  • if you are importing multiple files, they must have the same structure.


Procedure

  1. Drag an Import from PDF File task onto the stage or drag the file to import onto the stage.

  2. Double-click the task to open it.

  3. Select whether you want to use a Saved source or a Custom source.

  4. Choose from the drop-down list whether you want to import your file from your computer (Local Filesystem) or from a Remote Filesystem.

  5. If you are importing from a Remote Filesystem, choose it from the list and then click on the pencil button to set the connection information required (only if you are using a Custom source). The tables are loaded in the Files tab.

  6. If you are using a Local Filesystem, click on the Select button and choose the path where the file is stored.

  7. Click on the Add new path button, located next to the Select button, to add new paths where other files to be imported are stored. You can add as many paths as you want.

  8. Click on the X button, located next to the Select button, to remove the corresponding path.

  9. Click on the Delete all paths button, located under the Add new path button, to remove all the inserted paths.

  10. Choose the resources to import by clicking the Select button in the Path 1 section.

  11. Click on the Add new path button, located next to the Select button, to add a new path for a new resource. You can add as many paths as you want.

  12. Click on the X button, located next to the Select button, to remove the corresponding path.

  13. Click on the Delete all paths button, located under the Add new path button, to remove all the inserted paths.

  14. In the Concatenation type box, select either:

    1. Detach to keep the imported files, or sheets from the same file, separate, or

    2. Concatenate if you want to merge them. You must then specify the concatenation type:
      - Inner concatenation: the final table includes only the attributes that exist in both tables.
      - Outer concatenation (default): the final table includes all the attributes, filling in missing values where necessary.

  15. In the Match Column by box, select whether you want the columns to be matched by their Name or by their Position.

  16. Click on the PDF Configuration tab and set the Parsing options and the Import options, as described in the table below.

  17. Save and compute the task.
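
The effect of the Concatenation type and Match Column by settings can be sketched in a few lines of Python. This is only an illustration of the behavior described in the steps above, not Rulex code: the function name concatenate, its parameters, and the representation of a table as a (columns, rows) pair are all hypothetical, and missing values are shown as None.

```python
def concatenate(first, second, how="outer", match_by="name"):
    """Concatenate two tables, each given as a (columns, rows) pair.

    Hypothetical helper for illustration only. Column names are
    assumed to be unique within each table.
    """
    cols_a, rows_a = first
    cols_b, rows_b = second

    if match_by == "position":
        # Columns of the second table take the names of the columns
        # found at the same position in the first table.
        n = min(len(cols_a), len(cols_b))
        cols_b = cols_a[:n] + cols_b[n:]

    if how == "inner":
        # Inner: keep only the attributes present in both tables.
        columns = [c for c in cols_a if c in cols_b]
    else:
        # Outer: keep every attribute from both tables.
        columns = cols_a + [c for c in cols_b if c not in cols_a]

    result = []
    for cols, rows in ((cols_a, rows_a), (cols_b, rows_b)):
        index = {name: i for i, name in enumerate(cols)}
        for row in rows:
            # Attributes absent from this table become missing values.
            result.append([row[index[c]] if c in index else None
                           for c in columns])
    return columns, result


table_a = (["id", "name"], [[1, "x"]])
table_b = (["id", "city"], [[2, "Rome"]])

outer = concatenate(table_a, table_b, how="outer", match_by="name")
inner = concatenate(table_a, table_b, how="inner", match_by="name")
by_position = concatenate(table_a, table_b, how="outer", match_by="position")
```

In this sketch, the outer result has the columns id, name and city with missing values where a table lacks an attribute, the inner result keeps only id, and matching by position stacks the city values under the name column.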


Parsing and import options

Settings options

Description

Parsing options

Here you can set:

  • Data separators (comma, semicolon, space, tabbing, other).

  • Number separators, which are the thousands separator and the decimal separator.

  • Missing string: enter the string that you want to be removed from the dataset, so that the corresponding values are treated as missing.

  • Text delimiter: select ' or " if these symbols have been used as string delimiters. They will not be included in the imported data. For example, the string “apartment” will be imported as apartment. This option removes all instances of the text delimiter in the string, not only the opening and closing symbols. The only exception is a delimiter preceded by a backslash. For example, "ad\"cb" will be imported as ad"cb, while "ad"cb" will be imported as adcb.
    The data type for values with string delimiters is nominal, and this data type is not altered by the removal of text delimiters. For example, “3” will be imported as 3, but will remain a nominal value rather than being converted to an integer.

  • Use contiguous separators as a single one: select the check box if you want to force the parser to treat any group of adjacent separators as a single separator in text files. For example, with the comma as a separator, the string '1,2,,,3' will be parsed as 1, 2, 3 if this option is selected, and as 1, 2, '', '', 3 (with two empty values) if it is not.

  • Character encoding for input file: here you can set the file’s encoding:

    • ASCII

    • UNICODE

    • HTML
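
The Text delimiter and Use contiguous separators as a single one rules described above can be sketched in Python as follows. This is a minimal illustration written from the examples in this section, not the parser used by Rulex; the function names strip_text_delimiters and split_fields and their parameters are hypothetical.

```python
import re


def strip_text_delimiters(value, delimiter='"'):
    """Remove every text delimiter, except those preceded by a backslash.

    An escaped delimiter (backslash + delimiter) is kept as a plain
    delimiter character and the backslash is dropped.
    """
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] == delimiter:
            out.append(delimiter)  # escaped delimiter survives
            i += 2
        elif ch == delimiter:
            i += 1                 # unescaped delimiter is dropped
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def split_fields(line, separator=",", merge_contiguous=False):
    """Split a line on the separator, optionally merging adjacent separators."""
    if merge_contiguous:
        pattern = "(?:" + re.escape(separator) + ")+"
        return re.split(pattern, line)
    return line.split(separator)


escaped = strip_text_delimiters('"ad\\"cb"')    # the text "ad\"cb"
unescaped = strip_text_delimiters('"ad"cb"')    # the text "ad"cb"
merged = split_fields("1,2,,,3", merge_contiguous=True)
not_merged = split_fields("1,2,,,3", merge_contiguous=False)
```

Note that the values returned by strip_text_delimiters stay strings, which mirrors the rule that delimited values remain nominal after the delimiters are removed.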

Import options

  • Start importing from line: the number of the line from which the importing operations will start.

  • Stop importing at line: the number of the line where the importing operations will end. Leave the value 0 if you want the whole dataset to be imported.

  • Force number of columns: the number of columns which must be created upon importing, as the PDF file might not have been correctly formatted.

  • Get names from line: the number of the line from which the column names will be taken.

  • Get types from line: the number of the line from which the attributes' types will be taken.

  • Column to be imported (empty for all): the number of columns to be imported. If left empty, all the columns will be imported.

  • Add an attribute containing:

    • Filename, to add a column with the file name.

    • Sheetname, to add a column with the sheet name.

    • Both, to add two columns, one with the file name and one with the sheet name.

  • Remove empty rows: select the checkbox if you want to remove the empty rows from the imported dataset.

  • Compress white spaces: select it to remove extra consecutive spaces from within strings. For example, the string "university    program" would be imported as "university program".

  • Remove empty columns: select the checkbox if you want to remove the empty columns from the imported dataset.

  • Case sensitive: select the checkbox if you want uppercase letters to be considered different from lowercase letters. The consequences of selecting this checkbox are described in the Focus on the Case Sensitive checkbox section below.

  • Strip spaces: select this option if you want to remove spaces surrounding strings. For example, the string “ class ” will be imported as “class”.

  • Continue the execution if the folder is empty: if selected, computation of the task continues, even if the selected source files are not available.

  • Use old computation data if source file is not available: if selected, data from the previous computations will be used if the source table is not available.

  • Turn off smart type recognition: if selected, prevents automatic recognition of data types, leaving the generic nominal type. This option is useful when manual identification is preferable, for example when there is the risk of a code being misinterpreted as a date.

    However, if data types have been specifically defined in incoming MS Excel files, these data types will be maintained, even when the Turn off smart type recognition option has been selected.

  • Wait until the target file is present: if selected, Rulex polls the target file with the frequency specified (sleeptime) until it is available.
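
Some of the import options above are easier to picture with a small Python sketch. The code below is illustrative only and is not how Rulex implements these options. All function and parameter names are hypothetical, rows are assumed to have the same number of cells, an empty cell is represented by an empty string, and line numbers are assumed to be 1-based with the stop line included.

```python
import re


def select_lines(lines, start=1, stop=0):
    """Keep the lines from `start` to `stop`; a stop value of 0 means no limit.

    Assumption: numbering is 1-based and the stop line is included.
    """
    end = len(lines) if stop == 0 else stop
    return lines[start - 1:end]


def clean_cell(value, strip_spaces=False, compress_white_spaces=False):
    """Apply the Strip spaces and Compress white spaces options to a string."""
    if compress_white_spaces:
        # Collapse each run of consecutive spaces into a single space.
        value = re.sub(r" {2,}", " ", value)
    if strip_spaces:
        # Remove the spaces surrounding the string.
        value = value.strip(" ")
    return value


def drop_empty(rows, remove_empty_rows=False, remove_empty_columns=False):
    """Apply the Remove empty rows and Remove empty columns options."""
    if remove_empty_rows:
        rows = [row for row in rows if any(cell != "" for cell in row)]
    if remove_empty_columns and rows:
        keep = [i for i in range(len(rows[0]))
                if any(row[i] != "" for row in rows)]
        rows = [[row[i] for i in keep] for row in rows]
    return rows


compressed = clean_cell("university    program", compress_white_spaces=True)
stripped = clean_cell(" class ", strip_spaces=True)
table = drop_empty(
    [["a", "", "1"], ["", "", ""], ["b", "", "2"]],
    remove_empty_rows=True,
    remove_empty_columns=True,
)
whole = select_lines(["h", "a", "b", "c"], start=2, stop=0)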


Focus on the Case Sensitive checkbox

We encourage you not to select the Case Sensitive checkbox, as it has a significant impact on the data analysis.

If the Case Sensitive checkbox is selected, the number of distinct values in a column can increase, which changes the results of the data analysis.

For example, the two strings 'Word' and 'word' will be considered two distinct values. This means that a function written for the string 'word' will not apply to the string 'Word'.

Attribute names are affected too: if you want to apply a function to the $"Word" attribute and you type $"word" in the formula bar, an error occurs during the computation process.
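
The two consequences described in this section can be reproduced with a short Python sketch. It is only an analogy for the behavior of the Case Sensitive checkbox, not Rulex code; the helper names distinct_values and find_attribute are hypothetical.

```python
def distinct_values(values, case_sensitive=False):
    """Return the set of distinct values in a column."""
    if case_sensitive:
        return set(values)
    # Without case sensitivity, values differing only by case collapse.
    return {value.lower() for value in values}


def find_attribute(name, attributes, case_sensitive=False):
    """Look up an attribute by name, raising KeyError if it is not found."""
    for attribute in attributes:
        if attribute == name:
            return attribute
        if not case_sensitive and attribute.lower() == name.lower():
            return attribute
    raise KeyError(name)


column = ["Word", "word", "word"]
sensitive_count = len(distinct_values(column, case_sensitive=True))
insensitive_count = len(distinct_values(column, case_sensitive=False))

# With case sensitivity off, typing "word" still finds the "Word" attribute.
found = find_attribute("word", ["Word"], case_sensitive=False)
```

With case sensitivity on, the same column has two distinct values instead of one, and the call find_attribute("word", ["Word"], case_sensitive=True) raises an error, much like typing $"word" for the $"Word" attribute.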