Talk DIG: data quality

Showing posts with label data quality. Show all posts

Wednesday, June 11, 2008

Calling all data quality software vendors….TDWI needs your help

I couldn’t resist posting this one, but I want to start off by saying that I am big fan of The Data Warehouse Institute. They provide great information, their conferences are extremely valuable and their training classes provide the right level of practical learning.

But, their CRM system is in desperate need of some cleansing. Much like how Glyn Heatley discussed Data Quality Going Green (btw – until that post I didn’t realize that Glyn was a tree hugger. You think you know someone), TDWI needs to clean up their registration records. I received 3 emails from Wayne Eckerson (it wasn’t really from Wayne Eckerson) asking me to participate in a TDWI Benchmark survey. The emails all arrived within a few minutes of each other. The first one starts by addressing me as “Dear Graham”, while the second two start with “Dear Peter”. Now, on further inspection it is clear that this is partly my own doing. Each email was sent to a different email address since over the last 3 years we have internally changed our emails here at Palladium. But, that is the whole purpose of organizations house holding their data and identifying duplicates. I just find it entertaining that the organization (TDWI) that espouses having quality data is an offender. And it’s costing them money.

I did a quick search on the TDWI site to see if I could find any whitepapers or studies that I would recommend they take a look at. I found the following best practices report on taking data quality to the enterprise. The great thing is that if they are looking for some software vendors who can help, they can just take a look at the sponsor list for the best practices report.

I am hoping they won’t take away my membership based on this posting.

Tuesday, May 27, 2008

Example of Data Quality Gone Bad

At the DIG conference, Glyn Heatley introduced our data theme with an entertaining example of data quality gone bad. The story was told through a YouTube video. I had stumbled across this story in an industry magazine a few years ago, but I can’t seem to find the original article. Thought I would share the video so people can use it any time someone questions the quantifiable value of data quality initiatives.

Tuesday, May 6, 2008

Quality Data helps us go GREEN!

Yesterday was another day of coming home from work late and having to force the door open to get past all the junk mail inside. After picking up, taking into the kitchen and spending 10 minutes going through I was yet again presented with another fine example of poor data quality (i.e. the majority of organizations really don’t have a grip on their customer data let alone the ability to household).

3 copies of a news letter from the same software company (no names mentioned!), the exact same letter from a State Insurance agency for both my wife and I, and then two copies of the Crate & Barrel latest summer catalog addressed to me (how on earth I became registered on their list I’ll never know!).

I wonder what the impact to the environment would be if organizations simply got a better understanding of their customer data and improved their marketing functions alone?

So once I finished my nightly chore of “shredding” I did some quick research to see what sort of impact to the environment today junk mail has. Check out the following facts listed by New America Dream:

More than 100 million trees’ worth of bulk mail arrive in American mail boxes each year – that’s the equivalent of deforesting the entire Rocky Mountain National Park every four months. (New American Dream calculation from Conservatree and U.S. Forest Service statistics)

In 2005, 5.8 million tons of catalogs and other direct mailings ended up in the U.S. municipal solid waste stream – enough to fill over 450,000 garbage trucks. Parked bumper to bumper these garbage trucks would extend from Atlanta to Albuquerque. Less than 36% of this ad mail was recycled. (U.S. Environmental Protection Agency)

The production and disposal of direct mail consumes more energy than 3 million cars. (New American Dream calculation from U.S. Department of Energy and the Paper Task Force statistics)

Citizens and local governments spend hundreds of millions of dollars per year to collect and dispose of all the bulk mail that doesn’t get recycled. (New American Dream estimate from EPA statistics)

California's state and local governments spend $500,000 each year collecting and disposing of AOL’s direct mail disks alone. (California State Assembly)

With companies trying to put on a more “Green” face you would think this would be a nice eco friendly place to start. Imagine the impact of cutting bulk/junk mail in half by just knowing who your customer is and the fact that you may have multiple that live at the same address?

Even though the challenges surrounding customer data are not new, more is being spoken in the industry around Customer Data Integration. Check out Tony Fisher’s article on TDWI for an introduction on Data Quality and the Emergence of Customer Data Integration as well as go directly to such vendor sites as DataFlux and Trillium Software for innovative solutions that work to address data quality challenges, deduplication and relationship identification.

Lastly while not being one to solicit an audience, if you do have any interest in helping the environment and stopping all that junk mail look at GreenDimes. I signed up last night… I’ll let you know how it works out!

Friday, April 25, 2008

The Data Integration Challenges and BI (Part Two)

In Part One of this topic Brian introduced some of the key data integration challenges for a typical BI engagement and left off by highlighting some of the specific data integration challenges that included:

(i) Transformation of data that does not meet expected rules (contents of data elements and the validation of referential integrity relationships for example)

(ii) Mapping of data elements to some standard or common value

(iii) Cleansing of data to improve the data content (for example to cleanse and standardize name and address data) that extends the data transformation process a step further

(iv) Determining what action to take when those integration rules fail

(v) Ensuring proper ownership of the data quality process

In this second part of the article he takes a little deeper into several of these components.

Data transformations may be as simple as replacing one attribute value with another or validating that a piece of reference data exists. The extent of this data validation effort is dependent on the extent of the data quality issues and may require a detailed data quality initiative to understand exactly what data quality issues exist. At a minimum the data model that supports the data integration effort should be designed to enforce data integrity across the data model components and to enforce data quality on any component of that model that contains important business content. The solution must have a process in place to determine what actions to take when a data integration issue is encountered and should provide a method for the communication and ultimate resolution of those issues (typically enforced by implementing a solid technical solution that meets each of these requirements).

As Organizations grow via mergers and/or acquisitions, so too does the number of data sources and eventually lack of insight into overall corporate performance. Integration of these systems upstream may not be feasible and so the BI application may be tasked with this integration dilemma. A typical example is the integration of financial data from what used to be multiple Organizations or the integration of data from different geographical systems.

This integration is a challenge. It must consider (i) the number of sources to be integrated, (ii) commonality and differences across the different sources, (iii) requirements to conform attributes [such as accounts] to a common value but retain visibility to the original data values and (iv) how to model this information to support future integration efforts as well as downstream applications. This task is indeed a challenging one. All attributes of all sources must be analyzed to determine what is needed and what can be thrown away. Common attribute domains must be understood and translated to common values. Transformation rules and templates must be developed and maintained. The data usage must be clearly understood especially if the transformation of data is expected to lose visibility into any data that is transformed (for example if translating financial data to common charts of accounts).

Making Information Accessible to Downstream Applications

With this data integration effort in place, it is important to understand the eventual usage for this information (downstream applications and data marts) and to ensure that downstream applications can extract data efficiently. The data integration process should be designed to support the requirements for integrating data, that is to support the data acquisition and data validation/data quality processes (validation, reporting, recycling, etc), to be flexible to support future data integration requirements and to support historical data changes (regardless of any reporting expectations that may require a subset of this functionality requirement). The data integration process should also be designed to support the push or pull of data in addition. With that in mind the data integration model should provided metadata that can assist downstream processes (timestamps for example that indicate when data elements are added or modified), partition large data sets (to enable efficient extraction of data), reliable effective dating of model entities (to allow simple point in time identification) and be designed consistently.

The data integration process may at first seem a daunting process. But by breaking the BI architecture into it’s core components (data acquisition, data integration, information access), developing a consistent data model to support the data integration effort, establishing a robust exception handling and data quality initiative and finally implementing processes to manage the data transformation and integration rules, the goal of creating a solid foundation for data integration can be met.

Tuesday, April 22, 2008

The Price You Pay When Your Data is Questioned

I read this article in yesterday’s Wall Street Journal that I found interesting and relevant to DIG. There is always a debate on the value of having “one version of the truth” and the necessity of accuracy in corporate data. To date, that hasn’t been the case with certain types of performance measurement, especially website visits. Well, comScore is paying the price through shareholder value and their stock price. The issue stems from the accuracy of “clickstream” data that comScore, like their competitor Nielson, collect and track the popularity of websites on the web. Google announced that advertising clicks grew by 20%, while comScore reported only a 1.8% growth. Well, who is right?

This data is critical for marketers when deciding where to spend their ad dollars. You should read the full article to gain a full appreciation of the entire story, but here are a few snippets that are relevant to the importance of having “one version of the truth”.

Sarah Fay, chief executive of both Carat and Isobar US, ad companies owned by Aegis group said “We have not expected the numbers to be 100%”. It’s good to see that no expectations were being set out of the gates. Not sure this would fly when discussing something like revenue for an organization.

The article goes on to point out that comScore and Nielson data doesn’t always match up. “To complicate matters, disparities between comScore and Nielson data are common, as the two companies use different methodologies to measure their audience panels.” This isn’t something we don’t here inside the four walls of a corporation for something like a measures calculation rule.

Brad Bortner, an analyst with Forrester Research points out “There is no truth on the Internet, but you have two companies vying to say they are the truth of the Internet, and they disagree.”

And finally, my favorite quote in the article came from Sean Muzzy, senior partner and media director at digital ad agency http://www.ogilvy.com/neo/. “We are not going to look at comScore to determine the effectiveness of Google. We are going to look at our own campaign-performance measures”. This would be the equivalent of “if you don’t like the results, try a different measure.”

I have always wavered on the need for accurate data for certain types of measurement, especially something like clickstream analysis. I guess that wavering has now fallen to the side of the camp with the other types of data that require precision and accuracy.

Sunday, April 13, 2008

Immelman crowned Masters champion!

Congratulations to South African Trevor Immelman who secured a maiden major title win today with a three-shot victory at the 72nd Masters Golf tournament at Augusta.

If you are a golf fan it was for sure a quality weekend in front of the TV.

If you were immersed in the golf, waiting for Tiger to yet again make a run for the championship, did you ever find yourself wondering where on earth do the commentators get all those performance statistics from that they continuously feed you during their commentary?

Well check out the article on “How the PGA Tour Manages Its Data” to see how “IT Chief Steve Evans relies on legions of golf-crazed volunteers, high-tech lasers and the input of golf pros to help him identify, manage and display the Tour's most critical data.”

Thomas Wailgum’s article provides interesting insight into the efforts made to capture data real-time from the field, then to translate it into interesting statistics that both the general public can hear about to enhance their golf watching experience, as well as players on the course to not just help them track their progress but also help them evaluate their risk exposure when contemplating their next shot.

The system used is ShotLink, a revolutionary system that “tracks every shot at every event—where a player's golf ball starts and lands, and all the ground covered in between.” ShotLink requires the use of over 1000 volunteers out on the course to help “capture” the required data on over 32,000 shots which is feed into the system.

Impressively in the article Mr. Evans states that through some minor modification to ShotLink they have been able to ensure a very high level of data quality in the statistics that they produce. He states that “Our goal is to have any data corrections made inside of one minute, and we consistently meet that metric.”

How accurate is the data that you use? What data quality manaagement process do you have in place? Do you tend to resolve data quality issues upstream at its source, or do you cleanse within the applications your report from, or "tweak" the actual reports?

Sunday, April 6, 2008

Information Quality & Master Data Management?

Master Data Management is the process used to create and maintain a “system of record” for core sets of data elements and their associated dimensions, hierarchies and properties which typically span business units and IT systems.

Master Data, often referred to as “Reference Data”, may in your organization take the form of Charter of Accounts, Product Catalogue, Stores Organization, Suppliers and Vendor Lists but to name a few.

In his article “Demystifying Master Data Management”, Tony Fischer uses Customer as an example of Master data and how, if not understood and managed appropriately, can cause all sort of headaches for a company, in this case the CEO himself!

“Years ago, a global manufacturing company lost a key distribution plant to a fire. The CEO, eager to maintain profitable relationships with customers, decided to send a letter to key distributors letting them know why their shipments were delayed—and when service would return to normal.

He wrote the letter and asked his executive team to "make it happen." So, they went to their CRM, ERP, billing and logistics systems to find a list of customers. The result? Each application returned a different list, and no single system held a true view of the customer. The CEO learned of this confusion and was understandably irate. What kind of company doesn't understand who its customers
are?”

So what are the typical barriers that hinder organizations from addressing their master data management problem? My colleagues and I typically encounter four primary barriers:

Multiple Sources and Targets: Reference data is created, stored and updated in multiple transactional and analytic systems causing inaccuracies. Synchronization challenges between disparate systems
Ability to Standardize: Most organizations cannot agree on a standardized view of master data. There are a lack of audit policies that comply with federal regulations
Organizational Ownership: Disagreement within the organization as to who takes ownership of master data management, business or IT. Assignment of accountability with cross-functional processes is difficult
Centralization of Master Data: Organizational resistance to centralizing master data since there is a sense that control will be lost. Challenges to find a technology solution that supports existing systems and the lifecycle of master data management

Organizations that are addressing such barriers typically have a successful master data management process in place that contains the following components:

Data Quality: Focus on the accuracy, correctness, completeness and relevance of dataIncorporate validation processes and checkpoints. Effort is highest in the beginning of a MDM initiative to correct quality issues.
Governance: Cross functional team formed to establish organizational standards for MDM related to ownership, change control, validation and audit policies. Focus includes establishing a standard meeting process to discuss standards, large changes and organizational issues.
Stewardship: Assignment of ongoing ownership of MDM stewardship. Typically MDM stewards are business users. Accountable for the implementation of standards established through MDM governance

Technology: Create an architectural foundation that aligns with the other three components. Implement a technology that centralizes reference data. Align processes with the technology solution to synchronize master data across source and analytic systems

As we can see, master data management is not a one-time initiative but rather a long-term program that runs continuously within the organization. To be successful organizations need to instill an iterative approach that helps develop a program that continuously monitors, evaluates, validates and creates master data in a consistent, meaningful and well communicated way.

What is your organization doing about Master Data Management? Have you had success in establishing a Data Governance program? Who own the process in your organization, IT or the business?