Even if the data are correct (for a Data Quality standpoint) that does not mean we can explore and provide some interesting outcomes from them. So in this case we may need to reshape this data again considering the business goals.
These are some typical example of Data Preparation operations we can build:
- Derive fields: Sometimes the data gathered are not accurate enough themselves. So in this case we may have to create ourselves (based on several other data) the needed data. If a field does not exist as it should be, we may need to consider using the other fields and based on different rules build it. it can be:
- Cutting the Process Flow Identifier as to the current scope of the identified one is too large.
- Splitting a field into several other needed fields (sometimes we may have several fields concatenated in one)
- Creating new fields
- Creating the Event Name based on other fields
- Combining several fields: the most obvious example is when the Date and time are stored in different fields (for the T-KEY)
- Changing the data types
- Especially true for the timestamp field
- Reworking the timestamp
- Changing the format
- Managing the timezone
- Managing date/time operations (adding some days, months to fit business reasons, making a difference of two dates, etc.)
- Changing the data
- Managing the case
- Adding new string (concatenate) in an existing one to match a business codification for example
- Trimming or making basic formatting operations on the data to make them more readable or to prepare it for a more complex Data transformation.
- Removing or altering (replacing for example) data in a field
- Like the One Hot Encoding technique (famous in Machine Learning), the purpose is to transform a field into several fields (more simple) and with the same meaning.
- Because it’s just required by a business rule, the data is encrypted or it just needs to be anonymised.
These transformations are executed at the field level (or column, ie. vertically).
We can also rework the data at the row level (or records, ie. horizontally) by:
- Joining data from other data sources
- Be careful as joining can lead to filtering data from the different sources (Inner, left or Right joins)
- Making some lookup value in a dictionary is pretty common, especially for the event name field. Indeed as data can come from different sources they are not – most of the time – consistent. Merging in a consistent way the labels (Event names or steps) is an important task to avoid a wrong analysis. We’ll see in detail this rework in the next chapter.
- Filtering data
- When the Data in bad quality cannot be corrected
- When it’s just required by the business rules (all the data coming from the data sources may not be interesting to ingest or can lead to wrong conclusions)
- Aggregating data
- When the data are not at the right granularity (Time, location, organization, etc.)
- Sometimes there are too many events to manage (more than 1000 for example) in this case it’s also possible to aggregate the data before import.
- Because there’s too much data (volume too big)
- Transposing data
- Converting a specific column or field into several records or rows