The ten steps of information integration
by Friedhelm Reydt.
Information integration usually consists of ten (abstracted) steps:
1. Job specification
For which object or process within an organisation is data completeness required?
Example: In order to give customers a price indication with the help of a product calculator, all the necessary data must be available. The job specification describes the information needs of the addressees as well as the target state to be achieved.
2. Data identification
If the data for the calculator is incomplete, the product will be offered on the market either too expensively or too cheaply; both can damage the company's market positioning. Without going into the method of recursive data identification here, this phase covers the complete localisation of all data sources that exist across the organisation's different areas of activity and are needed to fulfil the information requirements of the addressed target group. Both formal and informal sources come into question, technical as well as non-technical, and technical data can be structured or unstructured. The goal of data identification is information completeness. All identified sources are recorded in an information map.
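The information map can be pictured as a simple inventory of identified sources. A minimal sketch, with hypothetical source names, that also answers the follow-up question of which sources still need transformation:

```python
# A minimal sketch of an information map: each identified source is
# recorded with its kind and whether it is already structured, so
# completeness can be checked against the job specification.
# All source names here are hypothetical.
information_map = {
    "crm_db":         {"kind": "technical",     "structured": True},
    "price_list_xls": {"kind": "technical",     "structured": True},
    "sales_notes":    {"kind": "technical",     "structured": False},
    "expert_knowhow": {"kind": "non-technical", "structured": False},
}

def unstructured_sources(info_map):
    """Return the sources that still need transformation into structured data."""
    return sorted(name for name, meta in info_map.items()
                  if not meta["structured"])
```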
3. Data extraction
To ensure information completeness, data from the identified technical source systems is continuously passed on for data transformation. This should happen regularly and automatically. Non-technical and unstructured data (e.g. Word files from document management systems) is transformed into structured technical data.
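Turning unstructured text into structured records might look like the following sketch. It assumes a hypothetical line format of "product: price" in the extracted document text; a real pipeline would read the source files themselves.

```python
import re

def extract_price_lines(raw_text):
    """Pull hypothetical 'product: price' lines out of unstructured
    text and return them as structured records."""
    records = []
    for line in raw_text.splitlines():
        m = re.match(r"\s*(?P<product>[\w ]+):\s*(?P<price>\d+(?:\.\d+)?)", line)
        if m:
            records.append({"product": m.group("product").strip(),
                            "price": float(m.group("price"))})
    return records
```

Lines that do not match the expected pattern are simply skipped, which is one reason extraction should run regularly rather than once.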
4. Data transformation
All data is converted into a common format and structure and collected in a temporary database.
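Mapping source-specific records onto one shared schema could be sketched like this. The source names and field names are hypothetical; the point is that every record leaves this step with the same keys.

```python
def to_common_format(record, source):
    """Map a source-specific record onto the shared schema.
    Source and field names are hypothetical."""
    if source == "crm":
        return {"customer": record["cust_name"],
                "item": record["svc"],
                "price": float(record["amount"])}
    if source == "shop":
        return {"customer": record["buyer"],
                "item": record["article"],
                "price": float(record["price_eur"])}
    raise ValueError(f"unknown source: {source}")

# The temporary database is stood in for by a plain list here.
staging = [to_common_format(r, s) for s, r in [
    ("crm",  {"cust_name": "Acme", "svc": "Support", "amount": "120"}),
    ("shop", {"buyer": "Acme", "article": "Licence", "price_eur": 300.0}),
]]
```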
5. Data cleansing
Inconsistent, duplicate and incomplete records are eliminated from the temporary database. A rule set is defined in advance for this purpose. Data cleansing can be automated, manual, or both.
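A two-rule version of such a rule set, as a sketch: drop records with missing values, then drop exact duplicates.

```python
def cleanse(rows):
    """Apply a minimal rule set to staging rows:
    rule 1 - drop rows with missing values,
    rule 2 - drop exact duplicates (first occurrence wins)."""
    seen, clean = set(), []
    for row in rows:
        if any(v in (None, "") for v in row.values()):
            continue                      # rule 1: missing value
        key = tuple(sorted(row.items()))
        if key in seen:
            continue                      # rule 2: duplicate
        seen.add(key)
        clean.append(row)
    return clean
```

Real rule sets also cover inconsistencies (e.g. negative prices), which fit the same pattern of one check per rule.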
6. Data reconciliation
To uphold the single-version-of-the-truth principle, semantic differences between data sources must be identified and eliminated. This is likewise done with the help of a rule set.
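Semantic reconciliation rules can be as simple as a synonym table that renames equivalent field names onto the agreed vocabulary. The terms below are illustrative:

```python
# Rule set: map source-specific terms onto the canonical vocabulary
# (single version of the truth). Terms are illustrative.
SYNONYMS = {"client": "customer", "buyer": "customer",
            "fee": "price", "cost": "price"}

def reconcile_keys(row):
    """Rename semantically equivalent field names to the canonical one."""
    return {SYNONYMS.get(k, k): v for k, v in row.items()}
```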
7. Data enrichment
To improve their quality and completeness, the data in our temporary database can be enriched or supplemented with additional information (metadata).
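A common form of enrichment is attaching provenance metadata, i.e. where a record came from and when it was loaded. A sketch with hypothetical field names:

```python
from datetime import datetime, timezone

def enrich(row, source):
    """Attach provenance metadata without touching the payload.
    The underscore-prefixed field names are a hypothetical convention."""
    out = dict(row)
    out["_source"] = source
    out["_loaded_at"] = datetime.now(timezone.utc).isoformat()
    return out
```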
8. Data storage
Once the data in our temporary database has been completely processed, it is formally added to the central database (repository). With this step, the extraction, transformation and loading (ETL) process is considered complete.
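The load step can be sketched with an in-memory SQLite database standing in for the central repository; table and column names are hypothetical.

```python
import sqlite3

def load(rows):
    """Move cleansed staging rows into the central repository.
    An in-memory SQLite database stands in for it here."""
    repo = sqlite3.connect(":memory:")
    repo.execute("CREATE TABLE offers (customer TEXT, item TEXT, price REAL)")
    repo.executemany("INSERT INTO offers VALUES (:customer, :item, :price)", rows)
    repo.commit()
    return repo

repo = load([{"customer": "Acme", "item": "Support", "price": 120.0}])
```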
9. Data linking
The data in the central database can be used to meet the information needs of different target groups inside and outside the organisation. The data required for this purpose can be linked virtually, both logically and mathematically, and combined into target-group-specific information sets. In this way, new information is created from previously independent data that would not have been available without information integration.
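A sketch of such a linkage, with hypothetical data: a logical join of orders to customers, plus a mathematical derivation (a net price) that neither source held on its own.

```python
# Hypothetical independent data sets in the central repository.
customers = {"C1": {"name": "Acme", "discount": 0.10}}
orders = [{"customer_id": "C1", "item": "Licence", "price": 300.0}]

def link(orders, customers):
    """Logically join orders to customers and mathematically derive
    the net price - new information from previously independent data."""
    info_set = []
    for o in orders:
        c = customers[o["customer_id"]]
        info_set.append({"customer": c["name"],
                         "item": o["item"],
                         "net_price": round(o["price"] * (1 - c["discount"]), 2)})
    return info_set
```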
10. Data dissemination
Target-group-specific information sets are made available to downstream systems via standard interfaces: they are published at a gateway, and the target system retrieves them from there. Examples of such systems are a mobile app, a web frontend acting as a configurator that needs daily updated data to calculate offers, or a simple database. Importantly, data dissemination does not preclude writing data back to the original source systems.
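One common standard interface is a JSON payload that a configurator frontend fetches from the gateway. A minimal sketch; the envelope fields are hypothetical:

```python
import json

def publish(info_set):
    """Serialise a target-group-specific information set as JSON,
    the kind of payload a configurator frontend would fetch.
    The 'version' envelope field is a hypothetical convention."""
    return json.dumps({"version": 1, "records": info_set}, sort_keys=True)
```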
To be continued.
Excursus
Data models vs. information sets
The information systems of the departments are based on specific data models, which can consist of components such as customer name, service item and the corresponding price for an invoice. Such information systems may or may not communicate with each other along the value chain.
If process-relevant systems do not communicate with each other, even though the data generated in system A is needed in the operational context of system B, a media break occurs within the digital value chain, which in the worst case remains undetected or must be remedied with the help of a manual process step. A media break always indicates a possible source of error that can lead to data being falsified.
To resolve the dilemma of insufficient data reconciliation, tools such as Information Integration Platforms are used: they map higher-level information sets and clean up missing, erroneous or redundant data. An information set can therefore be composed of attributes drawn from different data models.
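The distinction can be made concrete in a short sketch: two department-specific data models (field names hypothetical) contribute attributes to one higher-level information set.

```python
def information_set(invoice, crm_contact):
    """Compose one information set from attributes of two department
    data models; all field names here are hypothetical."""
    return {
        "customer": crm_contact["company"],   # from the CRM data model
        "service":  invoice["service_item"],  # from the billing data model
        "price":    invoice["price"],         # from the billing data model
    }
```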