
The Data collection Plan (DCP)

What data we need to collect

Without data, there is no Process Mining. A Process Mining project is fully data-driven and must therefore be fed with good data. So the first important practical step in Process Mining is to gather data from the organization's information systems. This data will be used by the solution to reconstruct and create a visual representation of the business process. But this is where the problems really start, as the Information Systems landscape is rarely simple: we may have to deal with many systems, applications, structures, skills and organizations to gather the needed data.

The Data Collection Plan (DCP) defines the way we will collect the data for analyzing our process.

When we create a data collection plan, we may have to think about:

  • What data do we need? (This may also depend on the Process Mining solution we are currently using to drive our analysis.) 

Remember, we need:

  • The PFI-KEY: i.e. the Process Flow Identifier (mandatory)
  • The SN-KEY: the Process Step name or Event Name (mandatory)
  • The T-KEY: a timestamp specifying when the event actually occurred (mandatory)
  • In addition to the mandatory columns above, the dataset may contain any number of additional columns, which will be used as dimensional attributes.
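As an illustration, a minimal event log with the three mandatory columns might look like the sketch below (the column names and order IDs are invented for the example; every Process Mining solution has its own naming conventions):

```python
import csv
import io

# A minimal event log: each row is one event in one process instance.
# Column names (PFI_KEY, SN_KEY, T_KEY) are illustrative, not a standard.
raw = """PFI_KEY,SN_KEY,T_KEY
ORD-001,Order Created,2023-01-05T09:12:00
ORD-001,Order Approved,2023-01-05T11:40:00
ORD-001,Order Shipped,2023-01-07T08:05:00
ORD-002,Order Created,2023-01-06T10:02:00
ORD-002,Order Approved,2023-01-06T15:30:00
"""

events = list(csv.DictReader(io.StringIO(raw)))
for e in events:
    print(e["PFI_KEY"], "->", e["SN_KEY"], "@", e["T_KEY"])
```

Any extra columns (amount, region, operator, etc.) would simply be appended to each row as dimensional attributes.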

The purpose is really to locate where we can gather this data, and the first activity we must run is to use the Process Flow to identify which applications and datastores currently store this information.

Last point about the final dataset we have to gather: the granularity (the level of the rows) has to be the Process Step or Event. That means each line must represent one event occurring in one Process Flow instance at a specific time.
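This granularity rule can be checked mechanically once a first extract is available. A minimal sketch (the tuples below are invented sample rows) that flags rows duplicating the same event in the same process instance at the same time:

```python
from collections import Counter

# Each (case id, event name, timestamp) triple should appear exactly once.
rows = [
    ("ORD-001", "Order Created",  "2023-01-05T09:12:00"),
    ("ORD-001", "Order Approved", "2023-01-05T11:40:00"),
    ("ORD-001", "Order Approved", "2023-01-05T11:40:00"),  # duplicate row
]

counts = Counter(rows)
duplicates = [row for row, n in counts.items() if n > 1]
print(duplicates)
```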

How to proceed in practice

The easiest way to gather the needed information for the DCP is to start with the Process Flow. At this stage the Process Flow may be incomplete, but that does not really matter, as it is just a way to start the DCP. In any case, the DCP may change a lot during the qualification stages, and sometimes throughout the whole project life: the most important point here is to start the collection.

Based on the Process Flow given by the Process or Business Analyst, we first list all the Process Steps (or events). For each identified step/event we then have to find out which application or system is responsible for storing the status of the process at this specific stage. It is really important to manage this discovery one step at a time to avoid any confusion or mix-up between the different Process Steps. Once the system or application has been identified, we can use the Data Questionnaire below.

Be careful:

  • If the Process Flow is too big (too many events or Process Steps), it is a good practice to break it down into several sub-processes.
  • Sometimes several applications may be in charge of storing the status of one Process Step. In this case each application has to be investigated separately (Data Questionnaire). We just have to keep in mind, and write down, how to link these data later.
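In practice, linking data held by two applications for the same Process Step usually comes down to a shared business key. A hedged sketch of such a join (the system names, field names and key are all invented for illustration):

```python
# Two hypothetical extracts covering the same "Order Approved" step.
# "order_id" is the assumed shared business key linking them.
crm_rows = {
    "ORD-001": {"approved_by": "alice"},
    "ORD-002": {"approved_by": "bob"},
}
erp_rows = [
    {"order_id": "ORD-001", "approved_at": "2023-01-05T11:40:00"},
    {"order_id": "ORD-002", "approved_at": "2023-01-06T15:30:00"},
]

# Join the two sources into single event records for the log.
events = [
    {
        "PFI_KEY": r["order_id"],
        "SN_KEY": "Order Approved",
        "T_KEY": r["approved_at"],
        "approved_by": crm_rows[r["order_id"]]["approved_by"],
    }
    for r in erp_rows
]
print(events[0])
```

Writing down which key links the sources (here, the hypothetical `order_id`) is exactly the note the DCP should capture at this stage.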

The Data questionnaire

Then, we may ask ourselves some questions about this data:

  • What is its format or structure?
  • Where is it located? Most of the time we do not find all the needed data in the same place, so we may have to merge information from different data sources.
  • Can we extract it (do we have the tools, connectors, rights, etc.)?
  • How can we check that the data is correct?
  • Which part of the process is this data linked to?
  • How will it help us?
  • What analysis will we conduct with this data?

According to the DMAIC methodology (Lean/Six Sigma), when creating a Data Collection Plan we may have to consider 8 points:

  1. Performance Measures (what we need to know about the process and where to find measurement points)
  2. Operational Definition of those measures/metrics (a clear, precise description of the metric)
  3. The stratification factors (data slicing: i.e. capturing and using characteristics to sort data into different categories)
  4. Data Source and Location (and … does the data currently exist? For example, we may have to manage non-quantifiable outcomes)
  5. How will the data be collected?
  6. Who will collect the data?
  7. When will the data be collected?
  8. What are the sample sizes? (Will we have to manage sampling techniques in this case?)

The summary per data source

All the points from the Data Questionnaire must be addressed so as to ensure all the needed information can be collected in the Project Charter later. It is a good practice to manage this questionnaire via one or several DCP workshops before doing any Process Mining activities. We can, for example, hold separate workshops for each data system or application, or just one workshop with several iterations if all the needed people cannot attend at the same time.

After these workshops we can summarize, in a simple table, the needed information per data source for each field or data item we want to gather (i.e. Process Flow Identifier, Event Name, Timestamp, and any other needed attributes):

| Field              | Data Source 1                                   | Data Source n |
|--------------------|-------------------------------------------------|---------------|
| System             |                                                 |               |
| Application        | Order Management App                            |               |
| Data Source Type   | Database                                        |               |
| Structure          |                                                 |               |
| Format             |                                                 |               |
| Name or Identifier |                                                 |               |
| Other/Comments     | Extract the field [OTI] in the first XML field. |               |

Let's explain a bit what is expected in each field above:

  • System: the system represents where the data is stored. A system can be something physical (like a set of servers) or a logical area (for example an SAP instance or code model).
  • Application: several applications can co-exist in a system. An application has its own purpose and is most of the time the easiest asset to identify, as business users work with it every day.
  • Data Source Type: the nature of the data storage. It can be a file, a database, a message bus (broker), etc.
  • Structure: describes how the data is actually stored in the data source. It most of the time follows from the Data Source Type, but not always. Being aware of the real nature of the data is important for managing potential reshaping later. Examples: XML, JSON, flat, tabular, etc.
  • Format: no longer about the storage itself but about how the data has been recorded. The same information can have many formats (for example, a date can be stored as DD/MM/YYYY, YY-MM-DD, etc.).
  • Name or Identifier: how to identify, within the data source, the data we want to extract.
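Format differences matter in practice: timestamps coming from different sources usually have to be normalized to a single format before the event log can be built. A minimal sketch, assuming a small set of example input formats (extend the list as new sources are discovered):

```python
from datetime import datetime

# Source timestamp formats we assume we might encounter (illustrative).
KNOWN_FORMATS = ["%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%S", "%y-%m-%d %H:%M"]

def normalize_timestamp(raw: str) -> str:
    """Try each known format and return an ISO 8601 string."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")

print(normalize_timestamp("05/01/2023 11:40"))     # from a DD/MM/YYYY source
print(normalize_timestamp("2023-01-05T11:40:00"))  # already ISO 8601
```

Recording the source format in the DCP summary table is what makes this kind of normalization predictable later.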

Where to gather the data?

In fact there are many ways to get the data. Most of the time, when talking about Process Mining, we think about getting log data first. But unfortunately this data is not always accessible or really usable; in that case it may be interesting to get in touch with the Business Intelligence team or the Data Lake team, as they may have already ingested the needed data somewhere else.

Each data type obviously has its own pros and cons:

|                | Logs Data                                      | Task Mining                              | Operational Data                   | Analytics Data                     | Event Broker Data                  | RPA                    | IDP/OCR                                     |
|----------------|------------------------------------------------|------------------------------------------|------------------------------------|------------------------------------|------------------------------------|------------------------|---------------------------------------------|
| Format         | Files (CSV, JSON, etc.), XES format, others    | N.A.                                     | Files, DB, APIs                    | DB                                 | APIs, Stream Data                  | N.A.                   | Images, PDF                                 |
| Type           | Structured                                     | Semi-structured (Internal ABBYY records) | Structured                         | Structured                         | Structured, Semi-structured        | N.A.                   | Unstructured                                |
| Integration    | ABBYY Timeline, Hot Folder, Alteryx, ETL, etc. | N.A.                                     | ABBYY Timeline, Alteryx, ETL, etc. | ABBYY Timeline, Alteryx, ETL, etc. | ABBYY Timeline, Alteryx, ETL, etc. | Any                    | ABBYY FlexiCapture Connector + Alteryx, ETL |
| Data Freshness | File upload freq.                              | Scheduled                                | Based on Schedule                  | Based on Schedule                  | Real Time                          | Based on DW Scheduling | Based on Schedule                           |

In addition to these structural points, it is also important to have an idea of the estimated volume of the data to extract, as this information may have a big impact on how the data will be gathered later.
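A rough back-of-the-envelope estimate is usually enough at this stage. For example (all figures below are hypothetical):

```python
# Hypothetical figures for a one-year extraction window.
cases_per_day = 2_000    # process instances started per day
events_per_case = 12     # average events per instance
bytes_per_event = 250    # average size of one event row
days = 365

total_events = cases_per_day * events_per_case * days
total_bytes = total_events * bytes_per_event

print(f"{total_events:,} events ≈ {total_bytes / 1e9:.1f} GB")
# → 8,760,000 events ≈ 2.2 GB
```

Even a coarse estimate like this helps decide between a one-shot file upload and a scheduled, incremental extraction.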
