Analyzing the Data Sources

After building the Data Collection Plan (DCP), it’s time to collect the data we need. But collecting it directly, without any prior check, can yield bad or simply inappropriate data.

In practice, the data are almost never what we expect, and for many reasons. The purpose here is not to list them all, but let’s keep in mind that systems and applications live and change regularly (time is always their worst enemy), and that they also have to absorb organizational changes, etc.

So, before collecting anything, it’s worth evaluating the data from two angles:

  1. From a technical (usability) point of view. For example, let’s ask these questions:
    1. How is my data structured? Can I collect it directly, or do I need to transform it first?
    2. What about the data format? Is it always the same (homogeneous), or will I have to convert it?
    3. What about the granularity of my data? We’ll come back to this crucial point later, but take the timestamp as an example: can I be sure it is the timestamp of the event itself and not, for example, the starting point of the Process Flow?
  2. From a meaning (business) point of view. This task needs to be guided by the business user and the data expert together. The purpose here is to verify that the data are the ones expected (no misunderstanding).

We call this kind of analysis a Data Quality Assessment (DQA), and this phase can be handled efficiently with dedicated tools: Data Profiling and Data Quality tools.
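To make the format question more concrete, here is a minimal profiling sketch in Python. It is not the output of any particular Data Profiling tool: the function names and the list of candidate formats are assumptions made for illustration only. It simply counts how many timestamp values are missing, match a known format, or match none.

```python
from collections import Counter
from datetime import datetime

# Hypothetical candidate formats; adjust them to what your sources really contain.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d/%m/%Y %H:%M",
]


def detect_format(value):
    """Return the first candidate format that parses the value, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def profile_timestamps(values):
    """Count missing values, values per detected format, and unknown values."""
    counts = Counter()
    for value in values:
        if value is None or str(value).strip() == "":
            counts["<missing>"] += 1
        else:
            fmt = detect_format(str(value).strip())
            counts[fmt or "<unknown>"] += 1
    return counts
```

If more than one format shows up in the result, the data is not homogeneous and a conversion step is probably needed before collection.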

The Data Quality Conditions to check

The Process Mining solution does not require many data controls to get a viable and usable dataset.

  • CONTROL_M1 The Process Flow Identifier (PFI-KEY)
    • Must exist (not NULL)
    • Can be a String or a Number
  • CONTROL_M2 The Step Name (SN-KEY)
    • Must exist (not NULL)
    • Can be a String or a Number
  • CONTROL_M3 The Timestamp (T-KEY)
    • Must exist (not NULL)
    • [MOST OF THE TIME] Should follow a specific date format (or one of several).
  • CONTROL_M4 The tuple (PFI-KEY, SN-KEY, T-KEY)
    • Must be unique

These are the minimum controls we need to enforce. Of course, we may also consider other, optional data controls, such as ones on the additional and optional fields.
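As an illustration, the four minimum controls could be sketched in Python as follows. This is only a sketch under assumptions: the dictionary keys `pfi_key`, `sn_key` and `t_key`, the accepted timestamp formats and the `check_events` function are hypothetical names chosen for this example, not part of the Process Mining solution.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp formats; replace them with the ones your sources use.
ACCEPTED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")


def is_missing(value):
    """True when the value is NULL (None) or an empty / blank string."""
    return value is None or str(value).strip() == ""


def has_valid_format(value):
    """True when the value parses with at least one accepted format."""
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(str(value).strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def check_events(events):
    """Return a list of (row_index, control, message) for each failed control."""
    issues = []
    seen = Counter()
    for index, event in enumerate(events):
        pfi = event.get("pfi_key")
        step = event.get("sn_key")
        ts = event.get("t_key")

        # CONTROL_M1: the Process Flow Identifier must exist.
        if is_missing(pfi):
            issues.append((index, "CONTROL_M1", "PFI-KEY is missing"))
        # CONTROL_M2: the Step Name must exist.
        if is_missing(step):
            issues.append((index, "CONTROL_M2", "SN-KEY is missing"))
        # CONTROL_M3: the Timestamp must exist and follow an accepted format.
        if is_missing(ts):
            issues.append((index, "CONTROL_M3", "T-KEY is missing"))
        elif not has_valid_format(ts):
            issues.append((index, "CONTROL_M3", "T-KEY has an unexpected format"))
        # CONTROL_M4: the tuple (PFI-KEY, SN-KEY, T-KEY) must be unique.
        key = (pfi, step, ts)
        seen[key] += 1
        if seen[key] > 1:
            issues.append((index, "CONTROL_M4", "duplicate (PFI-KEY, SN-KEY, T-KEY)"))
    return issues
```

Calling `check_events` on a list of event dictionaries returns an empty list when all four controls pass. Reporting every failure instead of stopping at the first one makes it easier to review the findings with the business user and the data expert.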
