The Data Integration Challenge and BI
The goal of any BI solution should be to provide accurate and timely information to the user organization. Users must be shielded from the complexities of data sourcing and data integration; it is up to the development team to deliver a robust architecture that meets these expectations.
The most important aspect of any BI solution is the design of the overall BI framework, which encompasses data acquisition, data integration and information access. Each of these components presents design challenges, but data integration is often the most complex, and the most important, component to get right. A solid architecture is required to support the data integration effort (see Claudia Imhoff’s article on why a data integration architecture is needed).
So what are some of the key challenges and considerations that should be addressed when thinking about data integration?
First, unless your project is tasked with building "one-off" or departmental solutions, it is important to separate the integration component of the architecture from the analytical component (this is the point where some readers may disagree, but separating these components allows for a more flexible and scalable architecture over time – a must for any enterprise solution today). With this rule in place, the data integration team can focus on what they do best (data integration) and the analytical team can focus on what they do best (designing for reporting and analytics).
With this structure in place, the data integration team has some tough challenges ahead that must be addressed:
(i) Identifying the correct data sources of information
(ii) Identifying and addressing data quality and integration challenges
(iii) Making information accessible to downstream applications
Before data can be integrated it must be identified and sourced. As simple as this sounds, it is not unusual for an organization to have multiple sources of the same data. It is important to identify the data source that is the true ‘system of record’ for that information, contains the elements that support current information requirements and can extend to support future requirements. Choose the data source that makes the most sense, not the one that is easiest to get to.
Once the appropriate sources of information have been identified, the integration team must determine how best to access that information. The team must identify how often the data needs to be extracted (once a day, once a week, and so on) and how the data will be extracted (push or pull, direct or indirect). The frequency should be based on future as well as current information requirements, even though it is always easier to build for what is required today than for what may be planned or needed tomorrow. Data volumes should be a consideration when determining the optimum acquisition method, and a more frequent data sourcing process may be beneficial irrespective of the final reporting expectations (this is a good example of where separating integration and analytics has merit, since the data integration layer can be designed for optimum integration without impacting the requirements of the analytic environment).
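As a purely illustrative sketch of this separation, the integration layer can record extraction frequency and method per source, independently of how often the analytic layer consumes the data. The source names and settings below are assumptions, not a prescription:

```python
from dataclasses import dataclass

@dataclass
class SourceExtractConfig:
    """Hypothetical per-source settings kept by the data integration layer."""
    source_name: str    # system of record being extracted
    method: str         # "pull" (we connect to the source) or "push" (IT delivers extracts)
    frequency: str      # how often the integration layer acquires the data
    full_extract: bool  # True = full refresh, False = changed records only

# The integration layer may acquire data hourly even though current reports
# are refreshed only daily -- designing for tomorrow's requirements as well as today's.
sources = [
    SourceExtractConfig("orders_erp", method="pull", frequency="hourly", full_extract=False),
    SourceExtractConfig("customer_crm", method="push", frequency="daily", full_extract=True),
]
```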
Getting at the data itself is often more politically challenging than technically challenging. Source data may exist in internally developed applications as well as in packaged and externally supported ones.
Pull paradigms are good when:
(a) Tools are available that can connect directly to the source systems (that’s a given) and, where needed, provide change data capture mechanisms
(b) Access to the systems is allowed; just because you can connect to a source system does not mean that the IT organization will allow it – these solutions can be invasive, and direct access may not be welcomed or allowed (so make sure you consider this)
(c) Source volumes are small and all data is being extracted in full, or there is a means to identify new or changed records. The latter is a definite consideration when data volumes are large, but the means of identifying changes must be reliable and efficient or source invasiveness becomes a concern (especially if the source system must be tuned to support these downstream processes); a simple high-water-mark approach is sketched below
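To illustrate point (c), here is a minimal sketch of a pull-style incremental extract that uses a 'last updated' column as a high-water mark for change data capture. The table name, column names and timestamp format are assumptions made for the example; this approach only works when the source exposes a reliable change indicator:

```python
import sqlite3  # any DB-API driver for the real source would work; sqlite3 keeps the sketch self-contained

def incremental_pull(conn, last_watermark: str):
    """Pull only rows changed since the previous extract (high-water-mark change data capture).

    Assumes the (hypothetical) orders table exposes a trustworthy last_updated
    timestamp in a sortable format; if it does not, this style of CDC is not safe.
    """
    cursor = conn.execute(
        "SELECT order_id, customer_id, amount, last_updated "
        "FROM orders WHERE last_updated > ?",
        (last_watermark,),
    )
    rows = cursor.fetchall()
    # The new watermark is the latest timestamp seen, recorded for the next run.
    new_watermark = max((row[3] for row in rows), default=last_watermark)
    return rows, new_watermark

# Usage (hypothetical local copy of the source):
# conn = sqlite3.connect("orders_source.db")
# rows, watermark = incremental_pull(conn, "2024-01-01T00:00:00")
```

Persisting the watermark between runs (in a control table, for example) is what keeps this style of extract both efficient and non-invasive to the source.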
Push paradigms (even when enterprise tools for pulling data are available) are good options when:
(a) Data with the desired granularity, frequency and content is readily available in a different format and can be leveraged
(b) Direct access to source systems is not an option and/or IT prefers to supply the data that is needed. In this scenario a solution for change data capture may need to be developed (one way to do this is sketched after this list)
(c) It is easier for IT to identify and provide the required data than for downstream applications to pull it directly
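When IT pushes full extract files and no change indicator exists (scenario (b) above), the integration layer can derive the changes itself by comparing consecutive snapshots. The sketch below assumes CSV extracts keyed on a hypothetical customer_id column:

```python
import csv

def snapshot_diff(previous_file: str, current_file: str, key: str = "customer_id"):
    """Compare two pushed full extracts and derive inserts, updates and deletes."""
    def load(path):
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    prev, curr = load(previous_file), load(current_file)
    inserts = [curr[k] for k in curr.keys() - prev.keys()]   # new records
    deletes = [prev[k] for k in prev.keys() - curr.keys()]   # records that disappeared
    updates = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return inserts, updates, deletes
```

Whether the changes are derived this way or supplied by the source team, the point is the same: the change data capture logic must be owned somewhere, and it belongs in the integration layer rather than in each downstream application.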
Before determining the best choice for your project, you also need to consider the limitations of the tools available within your environment.
Identifying and Addressing Data Integration Challenges
Once the method for data acquisition has been addressed, data must be cleansed, transformed and integrated to support downstream applications such as data marts. So what does this mean and what are the potential challenges?
The size of the data integration effort depends on several factors: (i) the number of data sources being integrated and the number of source systems from which data is provided, (ii) the quality of data within each of those systems, (iii) the quality of data and integration across those source systems, and (iv) the organization’s priority for improving data quality in general. When integrating data, the organization can choose to enforce data quality during the integration process or to ignore it.
So what are some of the key challenges for a typical data integration effort? They typically include:
(i) Transformation of data that does not meet expected rules (for example, the contents of data elements and the validation of referential integrity relationships)
(ii) Mapping of data elements to some standard or common value
(iii) Cleansing of data to improve its content (for example, cleansing and standardizing name and address data), which extends the data transformation process a step further
(iv) Determining what action to take when those integration rules fail (see the sketch after this list)
(v) Ensuring proper ownership of the data quality process
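To make items (i) through (iv) concrete, here is a minimal, hypothetical sketch of how an integration process might validate element contents and referential integrity, map source codes to a common standard value, and quarantine records that fail the rules rather than silently loading or dropping them. All values and mappings are invented for the example:

```python
# Hypothetical mapping of source-specific codes to a common standard value (item ii).
COUNTRY_STANDARD = {"US": "USA", "U.S.A.": "USA", "UNITED STATES": "USA"}

# Stand-in for a referential integrity check against a reference set (item i).
VALID_CUSTOMER_IDS = {"C001", "C002", "C003"}

def apply_quality_rules(record: dict):
    """Validate and standardize one record; return (record, list of rule failures)."""
    errors = []

    # Rule 1: contents of data elements -- amount must be a positive number.
    try:
        record["amount"] = float(record["amount"])
        if record["amount"] <= 0:
            errors.append("amount must be positive")
    except (KeyError, ValueError):
        errors.append("amount missing or not numeric")

    # Rule 2: referential integrity -- the customer must exist in the reference set.
    if record.get("customer_id") not in VALID_CUSTOMER_IDS:
        errors.append("unknown customer_id")

    # Rule 3: map to a common standard value; unmapped codes are flagged, not guessed.
    country = record.get("country", "").strip().upper()
    record["country"] = COUNTRY_STANDARD.get(country, country)
    if record["country"] not in COUNTRY_STANDARD.values():
        errors.append(f"country '{country}' has no standard mapping")

    return record, errors

def integrate(records):
    """Item (iv): records that fail any rule are quarantined for review, not loaded."""
    loaded, quarantined = [], []
    for rec in records:
        clean, errors = apply_quality_rules(rec)
        (quarantined if errors else loaded).append((clean, errors))
    return loaded, quarantined
```

Quarantining failures, rather than quietly correcting or discarding them, also supports item (v): the owners of the data quality process can see exactly which rules their data breaks.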
So what are some of the challenges and considerations within each of these areas? Tune in to Part Two of this article, where we will address some of these considerations as well as the need to make information easily accessible downstream of the integration process.