DQA Kit (using KNIME)

What is it?

This kit simplifies checking the three mandatory keys (T-KEY, PFI-KEY and SN-KEY) and produces a complete report on the necessary controls. It is typically used to evaluate the datasets provided by the business during stage 4 (Prepare) of a DQA.

However, it has two limitations:

  • It can only evaluate one dataset at a time. If the business provides several data sources, it is strongly recommended to use a data profiling tool (like the ones mentioned in the toolbox), both to run the first data quality checks and to detect potential relationships between these data sources.
  • It assumes the key candidates (T-KEY, PFI-KEY and SN-KEY) are already known.
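
Since the kit assumes the key columns are already known, it can help to confirm beforehand that they actually exist in the CSV header and to see how many values are empty. The following is a minimal sketch of such a pre-check, not part of the kit itself: the function name `check_key_columns` and the sample data are hypothetical, and the column names are simply the ones used in the command-line example of this page.

```python
import csv
import io


def check_key_columns(handle, key_columns, delimiter=","):
    """Return (missing_columns, empty_counts) for the candidate key columns.

    handle is any open text file or file-like object containing CSV data.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []

    # Key candidates that do not appear in the header at all
    missing_columns = [c for c in key_columns if c not in header]

    # Count empty values for the key candidates that do exist
    present = [c for c in key_columns if c in header]
    empty_counts = {c: 0 for c in present}
    for row in reader:
        for c in present:
            value = row.get(c)
            if value is None or value.strip() == "":
                empty_counts[c] += 1
    return missing_columns, empty_counts


# Hypothetical sample data, using ';' as the separator
sample = io.StringIO(
    "TimelineID;Date;Event\n"
    "1;2021-01-01 10:00:00;Start\n"
    "1;;Review\n"
    ";2021-01-02 09:00:00;Start\n"
)
missing_columns, empty_counts = check_key_columns(
    sample, ["TimelineID", "Date", "Event"], delimiter=";"
)
print("Missing columns:", missing_columns)
print("Empty values per key column:", empty_counts)
```

For a real dataset, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample; the encoding is an assumption to adapt to the file.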

This kit does not require any action from the user and produces a complete PDF report as a result. It is good practice to share these results with the business in order to point out the different problems encountered in the data.

This kit has been developed:

  • with KNIME (open source version), and is provided as is (free to download and free to update);
  • and/or in Python.
Download it from GitHub: https://github.com/datacorner/pyProcessMining

Usage

The kit can be run directly from KNIME or from the command line. It needs a reference to the data source (a CSV file), and some parameters must be filled in:

Parameter descriptions:

  • -workflowDir: specifies the location of the kit (the workflow directory)
  • -workflow.variable="BPPI_OutputPath","C:\\knime-wk\\BPPI Toolbox","String": replace the path (second value) with the folder where the kit should write the PDF report
  • -workflow.variable="file","C:\\knime-wk\\BPPI Toolbox\\data.csv","String": replace the file path (second value) with the dataset to analyze
  • -workflow.variable="TIMELINEID_Column","TimelineID","String": replace the column name (second value) with the PFI-KEY column name
  • -workflow.variable="TIMESTAMP_Column","Date","String": replace the column name (second value) with the T-KEY column name
  • -workflow.variable="EVENTID_Column","Event","String": replace the column name (second value) with the SN-KEY column name
  • -workflow.variable="delimiter",";","String": replace the separator (second value) with the CSV separator used by the dataset (by default the comma is used)

This is an example of the command line:

"C:\Program Files\KNIME\knime.exe" --launcher.suppressErrors -reset -nosave -consolelog -nosplash -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="C:\knime-wk\BPPI Toolbox\BPPI_DataAnalysis_BuildReport" -workflow.variable="BPPI_OutputPath","C:\\knime-wk\\BPPI Toolbox","String" -workflow.variable="file","C:\\knime-wk\\BPPI Toolbox\\data.csv","String" -workflow.variable="TIMELINEID_Column","TimelineID","String" -workflow.variable="TIMESTAMP_Column","Date","String" -workflow.variable="EVENTID_Column","Event","String" -workflow.variable="delimiter",";","String"
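
When the kit has to be run repeatedly on different datasets, the command line above can be assembled from a script instead of being edited by hand. The sketch below only rebuilds the same arguments as a Python list; the helper name `build_knime_command` is hypothetical, and the paths and column names are the ones from the example. It assumes that every workflow variable is of type String and that the doubled backslashes in the variable values must be kept as shown above.

```python
def build_knime_command(knime_exe, workflow_dir, variables):
    """Build the KNIME batch command as a list of arguments.

    variables maps each workflow variable name to its (string) value.
    """
    args = [
        knime_exe,
        "--launcher.suppressErrors",
        "-reset",
        "-nosave",
        "-consolelog",
        "-nosplash",
        "-application",
        "org.knime.product.KNIME_BATCH_APPLICATION",
        f"-workflowDir={workflow_dir}",
    ]
    # One -workflow.variable=name,value,type argument per variable
    for name, value in variables.items():
        args.append(f"-workflow.variable={name},{value},String")
    return args


# Values taken from the example command line; adapt them to your setup.
# Raw strings keep the doubled backslashes exactly as written.
command = build_knime_command(
    r"C:\Program Files\KNIME\knime.exe",
    r"C:\knime-wk\BPPI Toolbox\BPPI_DataAnalysis_BuildReport",
    {
        "BPPI_OutputPath": r"C:\\knime-wk\\BPPI Toolbox",
        "file": r"C:\\knime-wk\\BPPI Toolbox\\data.csv",
        "TIMELINEID_Column": "TimelineID",
        "TIMESTAMP_Column": "Date",
        "EVENTID_Column": "Event",
        "delimiter": ";",
    },
)
for argument in command:
    print(argument)

# To actually launch KNIME (not done here):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list lets `subprocess` handle the quoting of paths that contain spaces, so the explicit double quotes of the hand-written command are not needed.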

Results

The report contains the following sections:

  • Dataset Header & Size
  • Candidate columns (evaluated)
  • Missing Data
  • Date format checks
  • Duplicate checks
  • Highest and lowest event counts per timeline, with some statistics on the distribution
  • Event frequency distribution
  • Number of rows per timeline
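
To give an idea of what some of these sections measure, here is a minimal Python sketch that computes comparable figures with the standard library only. It is not the kit's implementation: the function name `profile_events`, the sample data, the date format and the definition of a duplicate (same timeline, timestamp and event) are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def profile_events(handle, timeline_col, timestamp_col, event_col,
                   delimiter=",", date_format="%Y-%m-%d %H:%M:%S"):
    """Compute a few basic data quality figures on an event log."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows_per_timeline = Counter()
    event_frequency = Counter()
    seen = set()
    duplicates = 0
    bad_dates = 0

    for row in reader:
        # Assumption: a duplicate is a repeated (timeline, timestamp, event)
        key = (row[timeline_col], row[timestamp_col], row[event_col])
        if key in seen:
            duplicates += 1
        seen.add(key)

        # Date format check against the single expected format
        try:
            datetime.strptime(row[timestamp_col], date_format)
        except (ValueError, TypeError):
            bad_dates += 1

        rows_per_timeline[row[timeline_col]] += 1
        event_frequency[row[event_col]] += 1

    return {
        "rows_per_timeline": rows_per_timeline,
        "event_frequency": event_frequency,
        "duplicates": duplicates,
        "bad_dates": bad_dates,
    }


# Hypothetical sample data, using ';' as the separator
sample = io.StringIO(
    "TimelineID;Date;Event\n"
    "A;2021-01-01 10:00:00;Start\n"
    "A;2021-01-01 11:00:00;Review\n"
    "A;2021-01-01 11:00:00;Review\n"
    "B;01/02/2021;Start\n"
)
report = profile_events(sample, "TimelineID", "Date", "Event", delimiter=";")
print("Rows per timeline:", dict(report["rows_per_timeline"]))
print("Event frequency:", dict(report["event_frequency"]))
print("Duplicate rows:", report["duplicates"])
print("Rows with an unexpected date format:", report["bad_dates"])
```

The timeline with the most rows can then be read with `report["rows_per_timeline"].most_common(1)`.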