The Webhouse that Ralph Built: An Interview with Dr. Ralph Kimball
Dr. Ralph Kimball is a leading visionary in the data warehouse industry and the celebrated author of The Data Warehouse Toolkit, The Data Warehouse Lifecycle Toolkit, and, most recently, The Data Webhouse Toolkit. His “Webhouse Architect” column appears in Intelligent Enterprise, and his company, Ralph Kimball Associates, focuses on developing, teaching, and delivering dimensional data warehouse design techniques to IT professionals. Kimball University, operated by The Kimball Group, offers vendor-neutral classes on data warehousing.
You say that dimensional data warehousing is a discipline in itself, not to be confused with the OLTP paradigm. Can you tell us what dimensional data is and why it is so critical to maintaining data quality?
Our task in the data warehousing world is to present the data, not to collect it. The issues we face when we present data are understandability and performance. If we step back from all the techniques and technology, that’s really what the dimensional approach is about. It’s not a religion, it’s not exclusive, it’s not good or bad, but it does help focus your efforts in the right direction.
The problem with the OLTP approach is that it is focused on collection: making sure there’s no redundant data, making sure that transactions are really fast, making sure that we record the interrelationships between the data at collection time. That’s all wonderful, but once that work has been finished and put to bed, the tables and the complexities of that approach defeat understandability and performance. A big transaction system like Oracle Financials, for example, has an average installation of 2,000 tables. And an average installation of SAP, which is by far the largest of the ERP vendors, is 10,000 tables.
The dimensional approach is a very simple, symmetrical approach that’s organized around taking measurements. I won’t go on at length about this, but if you think of computers as taking measurements of things, like how much the customer bought today or what the charge was that we made, then any transaction made in any business can be thought of as a measurement. We put those measurements at the core of our databases, and those end up being what I call fact tables in the dimensional approach. Then you surround those measurements, which are very physical, with what you know: OK, we know that this was the customer, we know that this was the store, we know that this was the time, and those end up being the dimensions. So that’s sort of the whole thing — it’s a very natural, simple approach. Then in my books I develop a lot of the details of how you build these, how you think about them, how you organize them efficiently, but those are all techniques.
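As a concrete sketch of that measurement-centric layout (the table names, columns, and figures here are invented for illustration, not taken from the interview), a minimal star schema with one fact table surrounded by its dimensions might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Dimension tables: what we know about each measurement.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store    (store_key    INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, day  TEXT);

-- Fact table: one row per measurement (here, a sale), keyed to the dimensions.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL
);
""")

cur.execute("INSERT INTO dim_customer VALUES (1, 'Alice')")
cur.execute("INSERT INTO dim_store VALUES (1, 'Springfield')")
cur.execute("INSERT INTO dim_date VALUES (20000701, '2000-07-01')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 20000701, 19.95)")

# A typical dimensional query: constrain and group by dimensions,
# then aggregate the facts.
cur.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON f.store_key = s.store_key
    GROUP BY s.city
""")
print(cur.fetchall())
```

The symmetry Kimball describes shows up directly in the query: every report has the same join-and-aggregate shape, whichever dimensions you pick.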
And so this is sort of a fundamental way of ensuring your quality of data and that the way in which you view data across the company is uniform?
Yes. One of the things I encountered when I was working with A. C. Nielsen back in the early 80s was that they had discovered that the data they sold to large companies like Procter &amp; Gamble needed to be connected to the data inside Procter &amp; Gamble. And they educated me about this. They said, look, this doesn’t work unless we make these dimension things, like our markets and our product lists and our timeframes, identical to the same lists inside Procter &amp; Gamble. Otherwise we can’t link these together; it’s like they’re two isolated islands. They call it conforming the dimensions. I like the word [conforming] because it’s a specific word, and it’s not political. So when you ask me about data quality, that’s what comes to mind: you get this consistent view of the main things in a company, like the customers or the products or the markets.
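A toy illustration of conforming (the product list and all figures are invented): once the external and internal facts share one agreed-upon dimension, the two islands can be combined in a single report.

```python
# Conformed product dimension: one agreed-upon list shared by both parties.
dim_product = {101: "Detergent 32oz", 102: "Shampoo 16oz"}

# Two fact sources keyed to the SAME product keys: syndicated market data
# (the Nielsen side) and internal shipment data (the manufacturer side).
nielsen_market_share = {101: 0.31, 102: 0.12}     # share of category
internal_shipments   = {101: 84000, 102: 22500}   # units shipped

# Because the keys conform, a single report can span both sources.
report = [
    (name, nielsen_market_share[key], internal_shipments[key])
    for key, name in sorted(dim_product.items())
]
for name, share, shipped in report:
    print(f"{name}: share={share:.0%}, shipped={shipped}")
```

With non-conforming keys (say, each side numbering products its own way), the lookup in the comprehension would simply fail; that is the “two isolated islands” problem in miniature.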
Is everyone taking this dimensional approach?
If you look at the Fortune 500 or the Fortune 2000, I daresay that every single one of them, their IT department would say, we have a data warehousing strategy. The whole idea of collecting and analyzing data, at that level, everybody’s doing it. Now, when any one of those organizations faces the issue of pulling together all the different data sources and trying to present them in a reasonable way, they run headlong into the problem of building a framework around all the data so that they can link it together. They have also figured out that they don’t like these different islands; I think virtually everybody is aware that that’s bad. So, at that level, absolutely, everybody’s doing it.
Now, what I’ve tried to do with the 20 criteria is to very deliberately do something like what Ed Codd published in the early 80s: a list of 12 criteria for OLTP systems. They were very scientific, very demanding criteria that disciplined everyone to think clearly about relational systems. He didn’t really get into the business of scoring systems, but everybody else scored their systems against his criteria. Nobody ever got 12 out of 12 — in fact, 7 or 8 out of 12 was awesome. This was appealing because it was tough; it wasn’t a slam dunk.
I decided that I could write a set of criteria very much in the spirit of Codd’s original criteria, but this time for data warehousing, and that’s what these 20 criteria are. The intent was to provide a very challenging, very precise framework for excellence in data warehouse systems built with the dimensional approach, because I believe the dimensional approach is actually the most effective one for delivering a warehouse. So I would say it is early to say whether everyone is doing it. I will tell you that I am getting some very strong interest from vendors who are anxious to rate their systems against these criteria, and I’m about to do some press releases about vendors who are doing so. So I’m actually somewhat hopeful that this will be a useful set of criteria for people.
Well, it seems like you’ve set the benchmark and now everybody’s going to try to live up to it.
Well, that would be great. I’ve tried not to bias it toward any particular vendor; it came more from a number of issues that I have experienced over the years in building systems, and various aggravations and problems that have always come up.
We’ve all heard the cliché “the customer is the data.” Do you find that the advent of e-business, with its focus on understanding and predicting customer buying patterns, personalization, recommendation engines, and so on, has made data warehousing more critical than ever before?
I think that several things have come together in the last few years. Certainly, the web revolution is providing the foundation for collecting this data, but remember that retailers, for example, have been doing this for twenty years, trying to predict buying patterns, but they were anonymous patterns. Marketing departments have been interested in this kind of information for a long time. What changes with the web is that you usually know who the customer is, and you can figure out not just whether they are young or old, but who they are as individuals. The other thing that’s really incredible is that this data contains not just what they bought, but also all the gestures, as I call them, that they made before they bought. So literally, you can see what shelves they looked at, what items they picked up and put back, how long they spent in the store, what items they spent time reading about, and so on. Of course, we can’t see whether the customer is smiling, or what his eyes are looking at, but we can learn a great deal by looking at the clickstream.
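Those pre-purchase gestures are just timestamped page events, so even a simple pass over a session log recovers how long a visitor lingered on each page. A minimal sketch (the session and page names are made up):

```python
from datetime import datetime

# Hypothetical clickstream for one session: (timestamp, page) "gestures".
events = [
    ("2000-07-01 10:00:00", "/catalog/cameras"),
    ("2000-07-01 10:00:45", "/product/camera-x/specs"),
    ("2000-07-01 10:03:10", "/product/camera-x/reviews"),
    ("2000-07-01 10:04:00", "/checkout"),
]

def dwell_times(events):
    """Seconds spent on each page before moving to the next one."""
    fmt = "%Y-%m-%d %H:%M:%S"
    times = []
    for (t1, page), (t2, _) in zip(events, events[1:]):
        delta = datetime.strptime(t2, fmt) - datetime.strptime(t1, fmt)
        times.append((page, delta.total_seconds()))
    return times

for page, secs in dwell_times(events):
    print(f"{secs:6.0f}s  {page}")
```

Here the visitor spent over two minutes on the spec page before buying, exactly the kind of signal that what-they-bought data alone never shows.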
The other thing, of course, is that in the twenty years we’ve been building these data warehouses, our techniques have improved a lot, and the hardware and software have improved, so we can store this data and do something with it. What’s really crazy is that in the mid-80s I did a lot of database installs: we installed 400 data warehouses at places like Procter &amp; Gamble, Kraft Foods, banks, and other places, and the average database we built was only 40 or 50 megabytes. Once in a while we saw a database that was a gigabyte!
The “unified view of the customer” seems to be the Holy Grail of Customer Relationship Management. What advice do you have for enterprises that find they have problems such as:
a) the lack of integration between the call-center and the web-support
b) global islands of customer data (e.g., each international division has its own customer data)
c) customer registration and profiles are not shared between the sales application, the e-commerce catalog application, and the product support site (yet another application), all on various servers.
This is a fascinating area because it’s both so important and so scary at the same time. Let’s look first at the business benefit, which I think you’ve already implied in the way you’ve asked the question.
It’s really clear that, if you’re a large enterprise, you encounter your customer in a number of different ways: you have a credit relationship in addition to a product relationship, you have advertising or marketing, you have returns and all sorts of other things. Each of those facets of your business develops its own relationship with your customer and knows specific things, and you’d obviously like to tie all of those things together.
Patricia Seybold [in her book, Customers.com] — I like that book a lot because, the way I read it, it’s literally a marketing person saying in her own words what she wants IT to do. Now, I don’t think she actually wrote the book with that specifically in mind, but in my view it’s easy to read the book that way and say, wow, this is really a useful book; I just have to interpret a few of the things she says. In particular, she talks in several places about presenting a seamless view of the whole enterprise to the customer, where the different parts of the enterprise know about the other parts. This is just the same thing we’re talking about. And what this implies technically, when I put on my IT hat and ask, OK, how do we do this, is that we have to have a uniform view of the customer that’s shared up and down the line, and I will call that a conformed customer dimension. I really like how clearly she stated that, and it seems that there’s a way to do it. The business benefit is really obvious.
Now, the technical problems: I’ve begun to touch on those, but they have to do with just collecting the information and even recognizing that you’ve got the same customer. There are situations, especially in banks, where the mortgage department signs you up separately from the savings account department, and they may not even be quite sure that you’re the same person. Sometimes they don’t even have an incentive to tie those together. And then it’s very difficult after the fact: you’ve got a slightly different spelling of your name, or you’ve left out your middle initial, or you’ve moved, and pretty soon they’re not even sure you’re the same person. I looked at the customer list at Wells Fargo Bank, for example, when I was designing their system, and San Francisco has a very large Chinese population, and you can’t believe how many people there are named, basically, John Lee. About 200 of them! Some of them are identical people, and some of them are different. There are very interesting and fairly difficult technical issues in matching these. OK, we’ve recognized the business benefit, and we’ve recognized that we’ve got some interesting technical challenges; that’s all stuff we can work on. And what we’re focusing on is the upside, the business value. But there is a real reaction occurring to the downside.
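A crude sketch of that matching problem, using Python’s standard-library `difflib` (the records and the threshold are invented; production matching engines score many more fields with far more sophisticated methods):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical customer records from two separate departments.
mortgage = {"name": "John Q. Lee", "zip": "94110"}
savings  = {"name": "John Lee",    "zip": "94110"}

def likely_same(r1, r2, threshold=0.8):
    # Require a matching zip code plus a fuzzy name match; a real engine
    # would also weigh address, birth date, phone, and so on.
    return (r1["zip"] == r2["zip"]
            and similarity(r1["name"], r2["name"]) >= threshold)

print(likely_same(mortgage, savings))
```

The two hundred John Lees are exactly where a rule like this breaks down: identical names with identical zip codes can still be different people, which is why matching stays a hard problem rather than a solved one.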
I recently wrote an article about the difference between the beneficial uses and the insidious abuses of this technology. It’s the unexpected or unintended or unadvertised ways that you can take the same data and abuse the privilege, abuse the end user. And there are lots of different ways this can happen. We in the United States are very touchy about our privacy, even though we allow our privacy to be violated, and we’re on a big downhill slide right now; our privacy is just about gone. And it’s not clear that we can stuff the genie back in the bottle very easily. Currently on the web, there are no real limits on the collecting of all this data.
I was interested to read this morning in the newspaper that there’s just been a law passed in England. They have allowed the government to declare that any use of the Internet by any entity in England is subject to full view by the government and that any form of encryption is illegal unless you give the government the key to your encrypted data. So all email, all documents, everything that is transmitted is open. And the Labour government, which is in a solid majority in their Parliament, was able to vote this in. And the whole sort of privacy allergy reaction is kind of missing there. The English tolerate a very heavy-duty form of government intrusion into their privacy. We’re far from the end of that story here. But the downside to the collecting of data is that loss of privacy, and I’m not sure I can predict how that’s going to turn out.
What about B2B exchanges? What does it take to create an intelligent demand chain and supply chain? Are there different techniques that don’t just analyze last week’s data but look ahead instead: a predictive, forward-looking interpolation of information from the data in the data warehouse?
Well, that kind of hope has been around forever. All of our linear predictive modeling goes back to the sixties, and people have been trying to predict the stock market with all kinds of sophisticated models. So it’s important to understand that there’s absolutely nothing new in the desire, or the hope, to do it. And I think that, partly because of the experience of looking at things like stock market prediction and other predictive models, you have to step back and let the hype settle down a bit, because it’s too easy to believe that with just a little technology you can predict the future. Now, not everything is like playing the stock market; it may just be predicting demand based on seasonal variations or on some other factors in the marketplace. I think that data warehouse data is actually pretty good at providing short-range predictions as long as your models are good. The problem is that many, many areas of business are now changing so rapidly — the business models are changing, the markets are altering — that I’m a little bit skeptical that you can predict very far in advance. If you want to predict daily fluctuations, is Tuesday going to be a busier day than Thursday, then I can sort of believe that, but if you’re trying to predict out two years in one of these marketplaces, forget it; I think that’s a little bit misleading.
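A toy example of the kind of short-range, seasonality-based prediction he finds believable (the order counts are invented): average each weekday’s history and use that as the forecast.

```python
from collections import defaultdict

# Hypothetical daily order counts over two weeks, keyed by weekday.
history = [
    ("Mon", 120), ("Tue", 180), ("Wed", 150), ("Thu", 90), ("Fri", 140),
    ("Mon", 130), ("Tue", 170), ("Wed", 160), ("Thu", 100), ("Fri", 150),
]

# Seasonal-average forecast: predict each weekday as its historical mean.
totals = defaultdict(lambda: [0, 0])   # weekday -> [sum, count]
for day, units in history:
    totals[day][0] += units
    totals[day][1] += 1

forecast = {day: total / n for day, (total, n) in totals.items()}
print(forecast["Tue"], forecast["Thu"])
```

This answers the “is Tuesday busier than Thursday” question well; it says nothing about where the market will be in two years, which is exactly the distinction being drawn.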
I guess my question there is, how good are the ERP vendors in terms of dimensional data? Is that the approach they use, or do they not use that approach at all?
Well, we’re talking about SAP simply because they are the overwhelmingly dominant ERP vendor; they have 30 percent of the ERP market. In second place is Oracle, with 15 percent, and then there are PeopleSoft and Baan and some of the others. SAP has done a great job in terms of the dimensional approach. In fact, they have done the enterprise a service which often goes unappreciated: the process of implementing SAP is really a process of conforming the dimensions, as we talked about earlier. SAP helps get rid of the islands of data. Now what they need to improve, and they have been working on this, is reporting performance.
Have you seen a shortage of skilled data warehouse analysts? Are there any business intelligence (to use Gartner’s phrase) service providers that an enterprise could outsource its data analyses to, or is this too risky to be a serious consideration?
There is no doubt that the ASP phenomenon is here to stay for a while. Outsourcing is not a new thing, but ASPs provide the skills and capabilities that many IT shops are desperate for. Many IT shops are having a hard time figuring out how to hire the right skills to interpret web data, or even what to look for. So ASPs do have a serious role to play.
What is dangerous about the ASP model is that there is a propensity to create the “stove-pipe” approach to data we just mentioned. The other point is that a wave of consolidation is hitting the ASP market, and companies are leery of outsourcing to “Digital Data-Mart” down the street. Companies want assurances that their ASPs will be around in the future. The current wave of mergers and acquisitions has something to do with this concern.
It’s one thing to capture and analyze transactional data or customer profiles and quite another to create a data warehouse for sharing information between co-workers: a knowledge base. Are there dimensional data techniques you would apply to create a “knowledge management” system to codify enterprise processes and the specialized knowledge inside the heads of employees?
This is something that does come up in my classes every now and then, and I have to remind my students that what we call “business rules” are not the same as the rules of business. Artificial intelligence did create some excitement back in the 80s, but I think we still have a long way to go. I’m not prepared to give up on this question yet, but there are too many factors to really try to take on the knowledge inside the heads of employees, unless it is knowledge tied specifically to a process, to how a task is accomplished.
What are the availability requirements for the data warehousing environment for most customers? Do you find that companies have Service Level Agreements in place for their data warehouses?
It’s all about availability. I don’t see companies doing much more than ensuring availability. Performance, as we discussed, is also a function of design, and good design does significantly improve performance. I can’t say that I’ve seen many Service Level Agreements for databases that go beyond availability.
Are there other issues you would like to talk about?
Recently I was at a DCI conference that brought the data warehouse vendors and the business intelligence vendors together. The business intelligence area was humming with activity, while the data warehouse area was slow, reflecting the nature of the market today. This is a reflection of a disconnect between IT and its end users. Let’s look at what I view as two extremes: in one, the marketing department goes off and buys itself a packaged application and creates its own little data mart, an island of data, if you will. At the other extreme is an IT department that is not really responsive to the real needs of its internal customers. Both extremes should be avoided. I believe and strongly advocate that IT must deploy its people to literally sit with the marketing folks and develop a true data warehouse solution that conforms the dimensions across the company. I believe the pendulum will swing again and bring data warehousing back to the forefront as companies realize they need business intelligence across the company.
Contributor: Christian Sarkar
Last modified 2005-04-12 06:21 AM