Friday, April 18, 2008

The Data Integration Challenge and BI (Part One)

This week I've asked a colleague of mine, Brian Swarbrick, to provide insight into some of the typical data integration challenges faced when developing Business Intelligence solutions. Brian, an expert in large-scale data warehouse and data integration initiatives, was so enthusiastic about the subject that we have decided to split his blog into two parts! Next week I will publish Part Two. Thank you Brian!

The Data Integration Challenge and BI
The goal of any BI solution should be to provide accurate and timely information to the User organization. The User must be shielded from any complexities related to data sourcing and data integration. It is up to the development team to ensure that they deliver a robust architecture that meets these expectations.

The most important aspect of any BI solution is the design of the overall BI framework that encompasses data acquisition, data integration and information access. There are challenges in designing each of these components correctly, but data integration is often the most complex, and the most important, component of the BI solution that must be developed. A solid architecture is required to support the data integration effort (see Claudia Imhoff’s article on why a Data Integration Architecture is needed).

So what are some of the key challenges and considerations that should be addressed when thinking about data integration?

First, unless your project is tasked with building “one off” or departmental type solutions, it is important to separate the integration component of the architecture from the analytical component (this is the point where some readers may disagree, but separation of these components allows for a more flexible and scalable architecture over time – a must for any Enterprise solution today). With this rule in place, the data integration team can focus on what they do best (data integration) and the analytical team can focus on what they do best (designing for reporting and analytics).

With this structure in place the data integration team has some tough challenges ahead of them that must be addressed:

(i) Identifying the correct data sources of information

(ii) Identifying and addressing data quality and integration challenges

(iii) Making information accessible to downstream applications

Identifying the Correct Data Sources of Information
Before data can be integrated it must be identified and sourced. As simple as this sounds, it is not unusual for an Organization to have multiple sources of the same data. It is important to identify the data source that is the true ‘system of record’ for that information, contains the elements that support current information requirements and can extend to support future information requirements. Choose the data source that makes the most sense and not the one that is the easiest to get to.

Once the appropriate sources of information have been identified, the integration team must then determine how best to access that information. The team must identify how often the data needs to be extracted (once a day, week, etc.) and how the data will be extracted (push or pull, direct or indirect). The frequency should be based on future as well as current requirements for information; it is easier to build for what is required today than for what may be planned or needed tomorrow, but doing so limits the solution. Data volumes should be a consideration when determining the optimum acquisition method, and often a more frequent data sourcing process may be beneficial irrespective of the final reporting expectations (this is a good example of where separation of integration and analytics has merit, since the data integration layer can be designed for optimum integration without impact to the requirements of the analytic environment).

Getting at the data itself is often more politically challenging than technically challenging. Source data may exist in internally developed applications as well as packaged and externally supported ones.

Pull paradigms are good when:

(a) Tools are available that can connect directly to the source systems (that’s a given) and, when needed, provide options for change data capture mechanisms

(b) Access to the systems is allowed; just because you can connect to a source system does not mean that the IT organization will allow that to happen – these solutions can be invasive and direct access may not be welcomed or allowed (so make sure you consider this)

(c) Source volumes are small and all data is being extracted in full, or there is a means to identify new or changed records. The latter is a definite consideration when data volumes are large, but the means of identifying changes must be reliable and efficient, or else source invasiveness becomes a concern (especially if the source system must be tuned to support these downstream processes)

Push paradigms (even when enterprise tools for pulling data are available) are good options when:

(a) Data with the desired granularity, frequency and content is readily available in a different format and can be leveraged

(b) Direct access to source systems is not an option and/or the IT organization prefers to source the data that is needed. In this scenario a solution for change data capture may need to be developed

(c) It is easier for IT to identify the data to be pulled and provide it instead of downstream applications pulling the data directly

Before determining the best choice for your project, you also need to consider the limitations of the tools available within your environment.
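To make the pull-with-change-data-capture idea concrete, here is a minimal sketch in Python, using SQLite as a stand-in source. The table name `src_orders`, the `last_modified` column and the watermark handling are all hypothetical – a real implementation depends on the tools, access rights and source characteristics discussed above.

```python
import sqlite3

def extract_changes(conn, watermark):
    """Pull only the rows created or modified since the last extract.

    This relies on a trustworthy last_modified value on the source
    table; if the source cannot guarantee one, pull-based change data
    capture becomes unreliable and a push (e.g. a file drop) from the
    source team may be the better option.
    """
    cur = conn.execute(
        "SELECT id, amount, last_modified FROM src_orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark to the newest change seen, and persist it
    # so the next scheduled run (daily, hourly, ...) resumes from here.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark
```

On each scheduled run the stored watermark is passed in, the delta is extracted, and the returned watermark is saved for the next run; a full extract is simply the same call with the watermark set to its minimum value.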

Identifying and Addressing Data Integration Challenges
Once the method for data acquisition has been addressed, data must be cleansed, transformed and integrated to support downstream applications such as data marts. So what does this mean and what are the potential challenges?

The size of the data integration effort is dependent on several factors: (i) the number of data sources being integrated and the number of source systems from which data is provided, (ii) the quality of data within each of those systems, (iii) the quality of data and integration across those source systems, and (iv) the Organization’s priority for improving data quality in general. When integrating data, the Organization has the choice of enforcing data quality during the integration process or ignoring it.

So what are some of the key challenges for a typical data integration effort? These typically include:

(i) Transformation of data that does not meet expected rules (for example, the contents of data elements and the validation of referential integrity relationships)

(ii) Mapping of data elements to some standard or common value

(iii) Cleansing of data to improve its content (for example, cleansing and standardizing name and address data), which extends the data transformation process a step further

(iv) Determining what action to take when those integration rules fail

(v) Ensuring proper ownership of the data quality process
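As an illustration of points (i) through (iv), a single record passing through the rules might look like the Python sketch below. The reference set, the mapping table and the field names are invented for the example; in practice they would come from the Organization’s own master data and quality priorities.

```python
# Hypothetical reference data: the customer keys known to the customer
# master, and a mapping of source values to a standard country code.
VALID_CUSTOMER_IDS = {"C001", "C002", "C003"}
COUNTRY_STANDARD = {"USA": "US", "U.S.": "US", "United States": "US", "UK": "GB"}

def integrate(record):
    """Apply transformation and quality rules to one source record.

    Returns (clean_record, errors). When a rule fails, the failure is
    recorded rather than the row silently dropped -- deciding whether
    failed rows are rejected, defaulted or suspended is exactly the
    "what action to take" question in point (iv).
    """
    errors = []
    clean = dict(record)

    # (i) referential integrity: the customer must exist in the master
    if clean.get("customer_id") not in VALID_CUSTOMER_IDS:
        errors.append("unknown customer_id")

    # (ii) map a source value to the common standard value
    country = clean.get("country", "")
    clean["country"] = COUNTRY_STANDARD.get(country, country)

    # (iii) light cleansing: collapse whitespace and standardize case
    clean["name"] = " ".join(clean.get("name", "").split()).title()

    return clean, errors
```

A real cleansing step for name and address data would of course use dedicated matching and standardization logic rather than simple case folding; the point here is only the shape of the rules pass.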

So what are some of the challenges and considerations within each of these areas? Tune in to Part Two of this article when we will address some of these considerations as well as addressing the need for making information easily accessible downstream of the integration process.

Thursday, April 17, 2008

Politics: There's No "I" in "DIG"

What do sports, politics and DIG have in common? Well, of course, it’s prediction markets. There’s Protrade, and Tradesports and the Iowa Electronic Markets and, well, Las Vegas itself, kind of. But thinking across to the other themes of the conference, the similarities disappear. Much has been said and written about the use of data and analytics in sports (Moneyball, for example), but the closest most politicos get to analysis is focus groups, commissioned polls and a cornucopia (or is it hodgepodge?) of cognitive biases (“we need to focus on ‘soccer moms’!”).

In the last few years, some individuals and organizations have begun to make a dent in this space; notably among them Get Out the Vote: How to Increase Voter Turnout by a couple of Yale professors who base their recommendations on actual research. More recently, Brendan Nyhan at Duke reports on his blog the founding of “The Analyst Institute,” which states as its mission “for all voter contact to be informed by evidence-based best practices. To ensure that the progressive community becomes more effective with every election, we facilitate and support organizations in building evaluation into their election plans.”

It’s not as if there isn’t incentive to win, and it’s not as if there’s a lack of interested funding. So why is politics behind the curve on data and analytics? Is there a rational (or irrational) belief that politics need to be managed by gut? Or are there structural reasons? Or am I mistaken in thinking politics is late to the game, and that McCain is hiding the next Billy Beane somewhere on the Straight Talk Express?

Wednesday, April 16, 2008

Can I get the Consumer Reports for these Appliances?

I wanted to pull together a quick summary of the Data Warehouse and Business Intelligence Appliance space. It is a continually maturing space with a set of strong vendors still fighting for market share. Teradata, DATAllegro, Netezza, NeoView from HP and Dataupia all provide solutions that combine hardware, operating system and database software into a single unit. Calpont, Kognitio, Vertica and ParAccel provide software-only, platform-independent solutions. The benefits of these appliance solutions include a reduced total cost of ownership, increased performance through massively parallel systems, reduced administration and database administration effort, and high availability and scalability. These solutions typically sell through a proof of concept in which a customer has a very specific performance issue and the vendor can show proven results. Industries that collect massive amounts of transaction data, such as retailers, or web clickstream data are inherent sweet spots for DW appliances.

If you aren’t familiar with the DW Appliance space, I would recommend taking a look at a series of articles from Krish Krishnan (intro, part 1, part 2) on the topic. I also came across this blog posting fact or fiction that unwinds some of the misconceptions on the DW appliance space.

Another interesting area that has followed the DW Appliance trend is Business Intelligence. I have come across fewer vendors here, but Celequest (acquired by Cognos) and Ingres Icebreaker are two that provide bundled hardware, operating system, database software and reporting tools. Business Objects has also partnered with Netezza to provide a single-point solution in data warehousing and business intelligence. All of these solutions adhere to standards, which allows for integration with the majority of BI vendor software, since most are SQL-based tools.

Tuesday, April 15, 2008

Deregulation of Utility Computing and my Gmail account

I may be jumping the gun on this one a bit since there isn’t as of yet a “Computing Utility” the way natural gas, telephone and electricity are currently piped into my home. Thus, there is no need to deregulate the industry the way the natural gas industry was in the 1980s. But will there be?

Google last week announced their foray into utility computing with their Google App Engine. Google is opening up their computing horsepower to allow scalable, web-based application development for anyone. And it’s free. They aren’t the only ones providing utility computing. Amazon has been providing a similar platform with Elastic Compute Cloud (EC2) for “resizable” computing capacity and their Simple Storage Service (S3) for inexpensive storage.

The idea of utility computing has been batted around for a while, but Nicholas Carr’s book “The Big Switch” draws an interesting parallel with the switch manufacturers made 100 years ago from generating their own electricity to tapping into the expanding power grids. Carr makes a compelling case that this is the direction of computing for businesses and consumers.

If you aren’t familiar with Carr, he is a bit of a lightning rod in the IT industry based on his controversial point of view of IT. He was just named #93 on the Ziff Davis Most Influential People in IT. Not everyone necessarily agrees with Carr’s view of IT, but he has forced the industry to take a look in the mirror and question the value being provided.

So you may be asking yourself, what does this post have to do with DIG, and why did I start on the topic of utility computing? Honestly, there is no direct relationship beyond the fact that I have been having some “constructive” budgetary discussions with a client around disk storage sizing. When I got home tonight I asked myself, “This has to be easier,” hence my research on utility computing. Why is it that I can get 6.6 gigabytes of free storage from Google for my email but not enough storage for a data mart? (btw – this is a hypothetical question and doesn’t need to be answered via a comment).


Monday, April 14, 2008

Social Objects for Business Conversation

Recently, I have been intrigued by this idea of “social objects” as the basis for driving the success of E2.0 applications. I first heard the term just a few months back on Hugh MacLeod’s blog. I have since looked around for more information and really enjoyed watching Jyri Engestrom’s lecture on social objects as used within social computing platforms. The conference video is out on YouTube. It seems that Jyri, the founder of Jaiku, has been instrumental in bringing the idea to life.

MacLeod defines Social Objects this way: “The Social Object, in a nutshell, is the reason two people are talking to each other, as opposed to talking to somebody else. Human beings are social animals. We like to socialize. But if you think about it, there needs to be a reason for it to happen in the first place. That reason, that ‘node’ in the social network, is what we call the Social Object.”

Honestly, I don’t know if I had ever thought about the idea before seeing these posts – but it seems to make a lot of sense. It’s certainly true for me. There exists some social object in the mix most every time I talk to my friends or colleagues. It might be a movie. It could be a baseball game. It could be a friend’s job situation. It could be a finance report. The bottom line is that “social objects” are the basis for most all of my conversations. It’s a fascinating concept if you think about it.

The idea got me thinking. What are the typical social objects within business conversation? What social objects attract the most attention and could become the basis for a robust conversation? It would seem to me that these objects could be the obvious building blocks of a productive corporate social network or social computing application. Here’s my informal list from a 5-minute brainstorm. I have put an “x” next to the Top 10 from my perspective!

Variable compensation plans
Performance objectives
Market factors
Executive Leaders (x)
Management (x)
Mission statement
Culture (x)
Norms (x)
Office environment (x)
Corporate Communications (x)
Public Advertising
Performance Review Process (x)
Benefits Package
Finance function
IT function (x)
Budgeting Process (x)
Key initiatives (x)
Budget variance explanations
Forecasting assumptions
Customer needs
Customer experience
Lunch destination

MacLeod goes on to say, “The thing to remember is, Human beings do not socialize in a completely random way. There’s a tangible reason for us being together, that ties us together. Again, that reason is called the Social Object. Social Networks form around Social Objects, not the other way around.”

I wonder. I just wonder what it takes to influence and/or transform the core social objects within our business conversations? At first glance, I would suspect that we would be better off if the top few objects in our business dialog were the following:

Customer needs
Customer experience
Performance objectives

As an aside, thinking back to my many years of consulting, I must say that there is only one client where I remember hearing this last set of “social objects” integrated into almost every conversation. It was WalMart! I wonder what that says?

Can you think of other social objects that I have missed? What are your thoughts on the topic? Please drop me your comments.

Sunday, April 13, 2008

Immelman crowned Masters champion!

Congratulations to South African Trevor Immelman who secured a maiden major title win today with a three-shot victory at the 72nd Masters Golf tournament at Augusta.

If you are a golf fan it was for sure a quality weekend in front of the TV.

If you were immersed in the golf, waiting for Tiger to yet again make a run for the championship, did you ever find yourself wondering where on earth the commentators get all those performance statistics that they continuously feed you during their commentary?

Well, check out the article on “How the PGA Tour Manages Its Data” to see how “IT Chief Steve Evans relies on legions of golf-crazed volunteers, high-tech lasers and the input of golf pros to help him identify, manage and display the Tour's most critical data.”

Thomas Wailgum’s article provides interesting insight into the effort made to capture data in real time from the field and translate it into interesting statistics – both for the general public, to enhance their golf-watching experience, and for players on the course, who use them not just to track their progress but also to evaluate their risk exposure when contemplating their next shot.

The system used is ShotLink, a revolutionary system that “tracks every shot at every event—where a player's golf ball starts and lands, and all the ground covered in between.” ShotLink requires over 1,000 volunteers out on the course to help “capture” the required data on over 32,000 shots, which is fed into the system.

Impressively, in the article Mr. Evans states that through some minor modifications to ShotLink they have been able to ensure a very high level of data quality in the statistics that they produce. He states that “Our goal is to have any data corrections made inside of one minute, and we consistently meet that metric.”

How accurate is the data that you use? What data quality management process do you have in place? Do you tend to resolve data quality issues upstream at the source, do you cleanse within the applications you report from, or do you “tweak” the actual reports?