# Data Sample size

Many sampling techniques can be used when we have access to the global data set. Because it rarely makes sense to analyze every execution of a process over its whole lifetime, working with a sample is a more agile and flexible approach. These are the most commonly used sampling techniques:

• SIMPLE RANDOM SAMPLING: method of sampling in which every unit has an equal chance of being selected (random)
• STRATIFIED RANDOM SAMPLING: method of sampling in which the population is divided into subsets or groups (strata) and units are then picked randomly from each group
• SYSTEMATIC SAMPLING: method of sampling in which every xth unit is selected from the population
• CLUSTER SAMPLING: method of sampling in which the population is divided into clusters and whole clusters are then selected
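
The four techniques can be sketched with Python's standard `random` module. This is a minimal illustration, not a reference implementation: the function names, the `stratum_of` and `cluster_of` callbacks and the `seed` parameter are all hypothetical, and cluster sampling is shown here as a random selection of whole clusters.

```python
import random
from collections import defaultdict


def simple_random_sample(units, n, seed=None):
    """Pick n units, each with the same chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(units, n)


def stratified_random_sample(units, stratum_of, fraction, seed=None):
    """Group units into strata, then pick the same fraction randomly from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


def systematic_sample(units, step, seed=None):
    """Pick every step-th unit, starting from a random offset below step."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return units[start::step]


def cluster_sample(units, cluster_of, n_clusters, seed=None):
    """Group units into clusters, then keep every unit of n_clusters random clusters."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for unit in units:
        clusters[cluster_of(unit)].append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for key in chosen for unit in clusters[key]]
```

For process data, `units` would typically be a list of case identifiers, with the stratum or cluster derived from an attribute such as the region or the month of the case.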

In many cases the customer provides the data as is, and at this stage it can be important to raise an alert if the amount of data provided is insufficient or simply not representative. Fortunately, inferential statistics provides formulas that help calculate the minimum amount of data needed for a relevant sample.

## First sampling considerations for Process Mining

Because a Process works mainly on a time basis, we need to consider two aspects before calculating our sample. This information helps determine how many Process flows we need for a relevant analysis.

• The period: this comes from the nature of the Process itself and can be a second, a minute or a month.
• The number of Process flows per period: based on this period, we need to know how many Process flows occur in each period. For example, if we analyse an OTC process in 2022, our global population will be calculated on 2022 only.

This time range has to be relevant from a Process point of view: if the period is too short, we may miss some peaks and other periodic uses of the process. Unfortunately there is no mathematical rule to determine this time range, so it must be decided together with the business.

## What we need to calculate the sample size

To be relevant, the data provided needs to reach a “critical size”. Too little data is a clear pitfall that can produce wrong results, so when managing the Data Collection Plan it is also important to pay attention to the volume of data we will have to collect from the data sources.

Most of the time we work with samples of data coming from the different data sources we have to manage, so we must ensure the final data set is relevant by getting:

• The right amount of global data
• A representative sample of data from each data source

But what is a sample of data? Let’s assume that sampling is the process of collecting only a portion of the global data that is, or could be, available. The purpose is to draw conclusions about the total population based on this sample. This is known as statistical inference.

First of all, we need to get (or calculate) some information about the data set:

• The global population size (the Big dataset size)
• The margin of error (the degree of error that you accept in your results; it is half the width of the confidence interval). Most of the time we use a value of 5% (i.e. 0.05).
• The confidence level (it measures the degree of certainty that the sample correctly represents the population within the defined margin of error; we usually take 95% here)
• The standard deviation (it describes how spread out the values are)
• The z-score (very familiar to statisticians, it is a constant value determined by the confidence level). For example, a 95% confidence level has a z-score of 1.96.
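
The link between the confidence level and the z-score can be checked with Python's standard `statistics` module. This is a small sketch assuming a two-sided interval on a normal distribution; the function name `z_score` is hypothetical.

```python
from statistics import NormalDist


def z_score(confidence_level):
    """Two-sided z-score for a confidence level given as a fraction, e.g. 0.95."""
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    alpha = 1 - confidence_level
    # Half of alpha sits in each tail, so we look up the 1 - alpha/2 quantile
    # of the standard normal distribution.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_score(level):.2f}")
```

For a 95% confidence level this returns the 1.96 quoted above.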

## The magic formula

• z is the z-score (1.96 most of the time)
• o is the standard deviation of the global population
• e is the margin of error
• s is the global population size
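
A widely used formula that takes exactly these four inputs is Cochran's sample-size formula for a proportion, combined with a finite population correction. It is written out below as an assumption about the formula meant here, not as a confirmed transcription. In this form o enters through the term o(1 − o), so it behaves as an expected proportion between 0 and 1, and 0.5 is the most conservative choice because it maximizes that term.

$$
n = \frac{\dfrac{z^{2}\, o\,(1 - o)}{e^{2}}}{1 + \dfrac{z^{2}\, o\,(1 - o)}{e^{2}\, s}}
$$

The numerator is the sample size for an unlimited population; the denominator reduces it when the global population s is small.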

## Calculating the data sample needed

The easiest way to calculate the amount of data needed is to first look at the distribution of the number of Process flows per time period. We need to define the time period first, as it depends on how the process is used. For example, for a process that runs every day for a business purpose, such as an account reconciliation in a bank, the day is a sensible period. For other business processes (as in accounting) we choose a longer period such as a week or a month.

Once the period has been chosen we can calculate the standard deviation of this distribution. It shows how regular the usage of the process is over the timeline, and it is also used to calculate the sample size.
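
Counting Process flows per period and measuring their spread can be sketched as follows. The week is taken as the period, the function name `flows_per_week` is hypothetical, and the start dates are made-up sample values. The population standard deviation (`pstdev`) is used on the assumption that every period of the chosen range is observed.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def flows_per_week(start_dates):
    """Count process flows per ISO (year, week) based on their start date."""
    counts = Counter()
    for day in start_dates:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return counts


# Hypothetical start dates of six process flows.
starts = [
    date(2022, 1, 3), date(2022, 1, 4), date(2022, 1, 5),
    date(2022, 1, 10), date(2022, 1, 12),
    date(2022, 1, 17),
]

counts = flows_per_week(starts)
per_week = list(counts.values())
print("flows per week:", dict(counts))
print("mean:", mean(per_week), "standard deviation:", round(pstdev(per_week), 2))
```

Weeks without any Process flow do not appear in the counter, so they would need to be added with a count of zero if they belong to the analysed range.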

In the example below we can see a flat distribution of the Process flow count per week. The original data set has 1312 Process flows, but we want to work with a sample of it:

In this example, to get a significant sample we first need to calculate the standard deviation of this distribution (equal here to 7.5). Then we can apply the formula above.

o = 0.9, e = 0.05, s = 1312, z = 1.96

So in our case the minimum number of Process flows needed would be 207.
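
The calculation can be sketched in a few lines of Python. The function name `sample_size` is hypothetical and the formula is an assumption: Cochran's sample-size formula for a proportion with a finite population correction, in which o enters as o(1 − o). Under that assumption a result of 207 corresponds to o = 0.8, while o = 0.9 gives about 125, so the inputs are worth checking against the formula actually used.

```python
def sample_size(z, o, e, s):
    """Minimum sample size for a finite population (assumed formula).

    z: z-score of the confidence level
    o: spread parameter, used as o * (1 - o); must lie between 0 and 1
    e: margin of error, as a fraction
    s: global population size
    """
    if not 0 < o < 1:
        raise ValueError("o must be strictly between 0 and 1")
    if e <= 0 or s <= 0:
        raise ValueError("e and s must be positive")
    # Sample size for an unlimited population.
    n0 = (z ** 2) * o * (1 - o) / (e ** 2)
    # Finite population correction for a population of size s.
    return n0 / (1 + n0 / s)


for o in (0.5, 0.8, 0.9):
    n = sample_size(z=1.96, o=o, e=0.05, s=1312)
    print(f"o = {o}: about {n:.1f} Process flows")
```

In practice the result is rounded up to the next whole number, since a fractional Process flow cannot be sampled.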