A DQA (formally, Data Quality Assessment) is a way of examining and analyzing data to produce useful summaries of it. It is traditionally the first step of any good Data Integration project. The DQA reports on real data: the solution reads all of the data, not a sample, so that the results reflect its actual state.
Many solutions are available on the market (Informatica, Ataccama, SAS, Talend, OpenRefine, Power MatchMaker, etc.). Some of them are free and Open Source. It is also possible to carry out these data profiling activities with a language such as Python or R.
A Data Profiling solution lets you perform the following kinds of analysis:
- Structure Discovery
- Structure Analysis
- Content Discovery
- Relationship Discovery
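
To make these analyses more concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not how any of the tools above work: the function names (`infer_type`, `profile_rows`), the sample columns, and the "key candidate" rule are all assumptions made for the example. It reads every row (not a sample), infers a rough type per value (structure), counts empty and distinct values (content), and flags columns that could serve as a key (a first hint for relationship discovery).

```python
import csv
import io
from collections import Counter


def infer_type(value):
    """Classify one raw string as 'empty', 'int', 'float', or 'text'."""
    if value is None or value.strip() == "":
        return "empty"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"


def profile_rows(rows):
    """Profile every row (not a sample) and return per-column statistics."""
    profile = {}
    row_count = 0
    for row in rows:
        row_count += 1
        for column, value in row.items():
            if column is None:  # extra cells beyond the header are ignored
                continue
            stats = profile.setdefault(column, {"types": Counter(), "values": set()})
            kind = infer_type(value)
            stats["types"][kind] += 1
            if kind != "empty":
                stats["values"].add(value)

    summary = {}
    for column, stats in profile.items():
        nulls = stats["types"]["empty"]
        distinct = len(stats["values"])
        summary[column] = {
            "rows": row_count,
            "nulls": nulls,
            "distinct": distinct,
            "types": {k: v for k, v in stats["types"].items() if k != "empty"},
            # Hypothetical rule: no empty values and all values unique.
            "key_candidate": nulls == 0 and distinct == row_count,
        }
    return summary


# Hypothetical sample data standing in for a real source file.
SAMPLE = """order_id,status,amount
1001,created,10.5
1002,shipped,
1003,created,7
"""

summary = profile_rows(csv.DictReader(io.StringIO(SAMPLE)))
for column, stats in summary.items():
    print(column, stats)
```

On a real project the reader would be opened on the actual file instead of `SAMPLE`, and the mixed `int`/`float` counts or unexpected empty values in the output are the kind of findings to review with the Data Expert.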
Of course, such a tool is not a magic wand: the analysis has to be led by the Data Expert and the business analyst so that it matches the business expectations.
When starting the analysis, it is best to proceed as follows:
As a prerequisite, first make sure the 3 potential Process Mining candidate fields have been identified (be careful, as they may exist in several files). Check with the business again to validate these candidate fields. This task has normally been done in the previous stages, but here we need to find out where the data is actually located (in which Data Sources?).
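
A small sketch of how this prerequisite could be supported with a script: scan the header of each data source and list every column whose name suggests one of the candidate fields. Everything specific here is an assumption for illustration: the text does not name the 3 fields, so the sketch assumes a case identifier, an activity, and a timestamp; the keyword hints, the file names, and the function `locate_candidates` are hypothetical. The output is only a starting list to validate with the business, not a decision.

```python
import csv
import io

# Assumption: the three candidate fields are a case identifier, an activity
# name, and a timestamp. The keyword hints below are hypothetical and should
# be adapted to the naming used in the actual sources.
FIELD_HINTS = {
    "case_id": ("case", "order_id", "ticket"),
    "activity": ("activity", "status", "step"),
    "timestamp": ("time", "date", "_at"),
}


def locate_candidates(sources):
    """Map each candidate field to the (source, column) pairs matching a hint."""
    found = {field: [] for field in FIELD_HINTS}
    for source_name, handle in sources.items():
        # Only the header row is needed to locate columns by name.
        header = next(csv.reader(handle), [])
        for column in header:
            lowered = column.strip().lower()
            for field, hints in FIELD_HINTS.items():
                if any(hint in lowered for hint in hints):
                    found[field].append((source_name, column.strip()))
    return found


# Hypothetical in-memory sources standing in for real files.
SOURCES = {
    "orders.csv": io.StringIO("order_id,customer,created_at\n1,A,2024-01-01\n"),
    "events.csv": io.StringIO("order_id,status,event_time\n1,created,2024-01-01T10:00\n"),
}

candidates = locate_candidates(SOURCES)
for field, locations in candidates.items():
    print(field, locations)
```

A field that shows up in several sources, as the case identifier does in this example, is exactly the situation the warning above refers to: which source is authoritative still has to be confirmed with the business.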