Checking the existence
The screenshot below shows the NULL detection we must perform for the PFI-KEY (the field TimelineID in the screenshot below):
Table Profiling with Ataccama Data Quality Analyzer
In the dataset above, we have 5 records for which the PFI-KEY does not exist. We’ll have to see with the business user what to do with these 5 records:
- Do we filter out these rows ?
- Do we create a dummy field value with a fixed PFI-KEY (equal to “DUMMY” for example) ?
- Can we attach these 5 records to another PFI-KEY thanks to the other data of the record ? (not common and not really easy to manage)
In any case we cannot manage to import the data as is !
Second thing we need to take care of is the unicity of this field. It’s not something mandatory to check but that may help in finding out if we have the right field for the PFI-KEY (which is sometimes not so obvious).
Checking the unicity
Here we see we have 82% duplicates of this field which is fine. Having for example unique values at almost 100% would not make sense as a Process Flow would have several belonging steps to be relevant (see figure above).
Looking at the distribution PFI-KEY / SN-KEY
We can also count the number of records (rows) per Process Flow Identifier. In the example below the result is relevant as the Process Flows (ie. nb of timelines) have between 5 and 16 steps or events.
At the opposite the example below shows a real problem as we have one step/event per Process flow :
Between these extreme situations (best and worst) we can have situations like this:
Here, it’s quite difficult to make a conclusion. 89% of the Process Flows (158 628 on the 180 000) have exactly 4 steps. The others 11% have between 1 and 31 steps. The situation is possible but not realistic somehow, that may alert on some data troubles. For example that can mean we had a problem in extracting the Process Flow Identifier (maybe we just did choose the best candidate for it). If this is the case we may review the Data Collection Plan.