Using Sequence Analysis to solve Association Problems

Rulex extracts frequent sequences from event logs with the Sequence Analysis task.


Prerequisites

Additional tabs

The results of the Sequence Analysis task can be viewed in two separate tabs (the respective columns are described in the table below):

  • The Frequent sequences tab, where it is possible to view the data resulting from the Sequence Analysis execution:

    • Frequent Sequence ID: sequential ID number for the frequent sequence.

    • Cardinality: number of events that make up the frequent sequence.

    • Couple characterization: Qualitative characterization of the behavior for the sequence of two events A-B. The possible outcomes are:

      • Weak sequence - B is likely to follow A, A is indifferent to B,

      • Strong sequence - B is likely to follow A, A is unlikely to follow B,

      • Complements - B is likely to follow A and vice-versa,

      • Substitutes - B is unlikely to follow A and vice-versa,

      • Independents - B is indifferent to A and vice-versa, or

      • Not enough information to determine.

    • #Occurrences: number of times the sequence occurs in the data.

    • Confidences: ratio of cases (a value between 0 and 1) in which the final part of the sequence follows once the initial part has occurred. The first confidence column refers to the initial event: it measures how often the rest of the sequence follows when the initial event occurs. If Maximum sequence cardinality is set higher than 2, further columns are generated, measuring how often the remaining events follow once the first two events have occurred, and so on.

    • All-confidence: ratio between the number of occurrences of the whole sequence and the number of occurrences of the least frequent event included in the sequence.

    • Minimum time interval, Maximum time interval, Average time interval, Std time interval: minimum, maximum, average and standard deviation of the time interval over the occurrences of the frequent sequence.

    • Event IDs: IDs of the events constituting the frequent sequence.

  • The Results tab, where statistics on the task computation are displayed, such as the number of frequent sequences detected:

    • Task identifier: ID code for the task, internally used by the Rulex engine.

    • Task name: name of the task.

    • Elapsed time: time required for the latest computation (in seconds).

    • Number of different events in input: number of distinct events which were fed to the task during the latest computation.

    • Number of different sequences in input: number of distinct sequences which were fed to the task during the latest computation.

    • Number of detected frequent sequences: number of frequent sequences detected by the task during the latest computation.

    • Number of generated frequent sequences: number of sequences which were found to be frequent, according to the support threshold.

    • Minimum event support: minimum number of occurrences for frequent events.
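The #Occurrences, Confidences and All-confidence columns described above can be illustrated with a small Python sketch. It is not the Rulex implementation: the toy event log, the `pair_metrics` helper and the choice to count an event or sequence at most once per sequence ID are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical event log: one time-ordered list of event IDs per sequence ID.
sequences = {
    "s1": ["A", "B", "C"],
    "s2": ["A", "B"],
    "s3": ["A", "C"],
    "s4": ["B", "A"],
}

def pair_metrics(sequences, first, second):
    """Occurrences, confidence and all-confidence for the sequence first -> second."""
    # Assumption: each event is counted at most once per sequence ID.
    event_counts = Counter(e for events in sequences.values() for e in set(events))
    # combinations() keeps positional order, so (a, b) means a occurs before b.
    occurrences = sum(
        1
        for events in sequences.values()
        if any(a == first and b == second for a, b in combinations(events, 2))
    )
    # Confidence: how often the rest of the sequence follows the initial event.
    confidence = occurrences / event_counts[first]
    # All-confidence: occurrences relative to the least frequent event involved.
    all_confidence = occurrences / min(event_counts[first], event_counts[second])
    return occurrences, confidence, all_confidence

occurrences, confidence, all_confidence = pair_metrics(sequences, "A", "B")
print(occurrences, confidence, all_confidence)
```

In this toy log, A is followed by B in two of the four sequences containing A, while B appears in three sequences, so the confidence is computed against A and the all-confidence against B.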


Procedure

  1. Drag the Sequence Analysis task onto the stage.

  2. Connect a task which contains event log data to the Sequence Analysis task.

  3. Double-click the Sequence Analysis task. The left-side pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.

  4. Configure the basic and advanced options as described under Sequence Analysis Basic options and Sequence Analysis Advanced options below.

  5. Save and compute the task.

Sequence Analysis Basic options

  • Minimum event support (#samples): all events which appear in sequences fewer times than this threshold are discarded. This value is relevant only if the Auto (specify #events) option is not selected.

  • Auto (specify #events): if this option is selected, the minimum support for events is computed automatically; the user specifies the number of events to take into account (most frequent first).

  • #Events to consider: number of events to take into account (most frequent first). This value is relevant only if the Auto (specify #events) option is selected.

  • Minimum sequence support (#samples): all sequences which occur fewer times than this threshold are discarded. This value is relevant only if the Auto (above average) option is not selected.

  • Auto (above average): if this option is selected, the minimum sequence support is set to the average support of sequences with the same dimension (i.e. made up of the same number of events).

  • Maximum sequence cardinality: maximum cardinality of generated sequences.

  • No maximum sequence cardinality: if this option is selected, all sequences with higher support than the specified threshold are generated, regardless of their cardinality.

  • Time attribute: attribute containing the timestamp of each event. The reference time unit can also be specified via the drop-down menu.

  • Minimum and maximum interval between sequence elements: the temporal distance between consecutive events in a sequence must lie between these minimum and maximum thresholds.

  • Allow repetitions (the same event can occur more than one time in a sequence): if this option is selected, repetitions of the same event in a single sequence are allowed.

  • Only print cyclic sequences (start event and end event have the same ID): if this option is selected, the output contains only the sequences in which the first event has the same ID as the last one.

  • Sequence ID attributes (NOMINAL): drag and drop here the nominal attributes which identify the sequences. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

  • Event ID attributes (NOMINAL): drag and drop here the nominal attributes which characterize the events. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.
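To make the two automatic support options more concrete, the following Python sketch shows one way such thresholds could be derived. It is only an illustration under stated assumptions, not the Rulex algorithm: the helper names `auto_event_support` and `above_average_support`, the toy counts, and the tie handling (events tied with the N-th most frequent event are kept) are all hypothetical.

```python
from collections import Counter
from statistics import mean

def auto_event_support(event_counts, n_events):
    """Support of the n-th most frequent event, used as minimum event support."""
    ranked = sorted(event_counts.values(), reverse=True)
    return ranked[min(n_events, len(ranked)) - 1]

def above_average_support(sequence_counts):
    """Average support per cardinality, used as minimum sequence support."""
    by_cardinality = {}
    for sequence, count in sequence_counts.items():
        by_cardinality.setdefault(len(sequence), []).append(count)
    return {size: mean(counts) for size, counts in by_cardinality.items()}

# Hypothetical event supports: keep the 2 most frequent events (ties included).
event_counts = Counter({"A": 40, "B": 25, "C": 25, "D": 3})
min_event_support = auto_event_support(event_counts, n_events=2)
frequent_events = {e for e, c in event_counts.items() if c >= min_event_support}

# Hypothetical sequence supports: each sequence is compared only with the
# average support of sequences of the same cardinality.
sequence_counts = {("A", "B"): 12, ("B", "C"): 4, ("A", "B", "C"): 3}
thresholds = above_average_support(sequence_counts)
kept = {s for s, c in sequence_counts.items() if c >= thresholds[len(s)]}

print(min_event_support, sorted(frequent_events), sorted(kept))
```

Computing the average per cardinality matters because longer sequences naturally have lower support; a single global average would discard most of them.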

Sequence Analysis Advanced options

  • Attribute to filter to select relevant data: drag and drop here the attribute you want to use as a filter to select relevant data, from the Available attributes or Proximity attributes lists, and configure the filter in the attribute filter dialog box. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

  • Attribute to filter to discard irrelevant data: drag and drop here the attribute you want to use as a filter to discard irrelevant data, from the Available attributes or Proximity attributes lists, and configure the filter in the attribute filter dialog box. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

  • Proximity attributes: drag and drop the ordered item attributes which, together with time, characterize the proximity among events onto the Proximity attributes list (mbaitemchildnames), and then set the corresponding thresholds in the Minimum-maximum proximity thresholds edit box. For example, if you need to mine frequent sub-sequences of events which occur in locations close to each other, drag the spatial coordinates onto this list. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

  • Minimum-maximum proximity thresholds: set the minimum and maximum proximity thresholds for the corresponding attribute in the Proximity attributes edit box.
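The combined effect of the time interval and the proximity thresholds can be sketched in Python as a check on two consecutive events. This is a minimal illustration, not the Rulex implementation: the `within_thresholds` helper, the event dictionaries, the `x` and `y` coordinate attributes, and the use of the absolute difference as the proximity measure are all assumptions.

```python
from datetime import datetime

def within_thresholds(prev, nxt, time_range_days, proximity_ranges):
    """Check whether `nxt` may follow `prev` under time and proximity thresholds."""
    # Temporal distance between the two events, expressed in days.
    gap_days = (nxt["time"] - prev["time"]).total_seconds() / 86400
    if not time_range_days[0] <= gap_days <= time_range_days[1]:
        return False
    # Assumption: proximity is the absolute difference on each attribute,
    # checked independently against its own (minimum, maximum) pair.
    for attribute, (low, high) in proximity_ranges.items():
        distance = abs(nxt[attribute] - prev[attribute])
        if not low <= distance <= high:
            return False
    return True

# Hypothetical events with a timestamp and two spatial coordinates.
a = {"id": "A", "time": datetime(2024, 1, 1, 8), "x": 0.0, "y": 0.0}
b = {"id": "B", "time": datetime(2024, 1, 1, 20), "x": 1.5, "y": 0.5}
c = {"id": "C", "time": datetime(2024, 1, 4, 8), "x": 0.2, "y": 0.1}

limits = {"x": (0.0, 2.0), "y": (0.0, 2.0)}
a_then_b = within_thresholds(a, b, (0, 1), limits)  # half a day apart, nearby
a_then_c = within_thresholds(a, c, (0, 1), limits)  # three days apart
print(a_then_b, a_then_c)
```

Here B can follow A because both the time gap and the coordinate differences fall within the thresholds, while C is rejected on the time interval alone.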


Example

The following example uses the san-test dataset.

After having imported the dataset with an Import from Text File task, add a Reshape to Long task and drag all the Event_ID (from 1 to 10) attributes onto the Attributes to be transformed in long format area.
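The effect of the reshape step can be pictured with a small Python sketch that turns one row per sequence (wide format) into one row per event (long format). It is an illustration only: the `reshape_to_long` helper, the sample rows, the `Event_ID_1` and `Event_ID_2` column names and the decision to skip empty cells are assumptions, while `Wide_1` is the attribute name used later in this example.

```python
# Hypothetical wide-format rows: one row per sequence, one column per event slot.
wide_rows = [
    {"Sequence ID": "s1", "Date": "2024-01-01", "Event_ID_1": "A", "Event_ID_2": "B"},
    {"Sequence ID": "s2", "Date": "2024-01-02", "Event_ID_1": "C", "Event_ID_2": None},
]

def reshape_to_long(rows, id_columns, value_columns, value_name="Wide_1"):
    """Emit one output row per non-empty value column of each input row."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            if row[column] is None:  # assumption: empty cells are skipped
                continue
            long_row = {key: row[key] for key in id_columns}
            long_row[value_name] = row[column]
            long_rows.append(long_row)
    return long_rows

long_rows = reshape_to_long(
    wide_rows,
    id_columns=["Sequence ID", "Date"],
    value_columns=["Event_ID_1", "Event_ID_2"],
)
for long_row in long_rows:
    print(long_row)
```

After reshaping, every event sits in a single column, which is the layout the Event ID attributes list expects.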

Then connect the Sequence Analysis task to the Reshape to Long task.

Configure the task as follows:

  • Drag and drop the Sequence ID attribute in the Sequence ID attributes list and the Wide_1 attribute in the Event ID attributes list.

  • Select the Auto (specify #events) option, to the right of Minimum event support.

  • Set #Events to consider to 30 (if you have problems setting this number, deselect and reselect the Auto option above).

  • Deselect the Auto (above average) option for Minimum sequence support (#samples) and set the value to 10.

  • Set the Maximum sequence cardinality to 2.

  • Select Date as the Time attribute (and day as the unit of measure).

  • Set the Minimum and maximum interval between sequence elements to 0 and 1, respectively.

The extracted frequent sequences can be seen in the Frequent sequences tab.