A Data Modeler's Bag of Tricks

by Bill Lewis

“The justification of all...models is that...they force analysts to confront possibilities that would not have occurred to them otherwise.”

Sylvia Nasar, A Beautiful Mind

Consider if you will: an assortment of concepts, guidelines, and opinions gathered over 15 years’ experience in applying the craft of data modeling under a wide range of circumstances. Maybe not enough to fill a book, but enough for a bag of tricks: portable, and, hopefully, applicable to many situations.

Let’s Start At the Very Beginning: Crafting Definitions

Developing a good entity definition is more than just an exercise in documentation. Good entity definitions are essential to the identification and demarcation of the set of entities in a model. To do this, each entity definition needs to be intentional [1]; that is, comprising a comprehensive set of criteria that an instance must meet in order to be distinguishable as a member of the entity set. An extensional definition, in contrast, is a list of examples. An extensional definition may provide good additional clarification for the “real,” intentional definition; but, by definition (ahem), an extensional definition can never be comprehensive.

An example [2] of an intentional definition for the entity “Customer” might be: “An individual or group of individuals who has purchased, or is highly likely to purchase, products or services of our company.”

A good set of intentionally-defined entities provides a strong foundation for a data model. One of the first tests of a model is to identify candidate entity instances within the scope of functions the model is to support, and “throw them against the model to see if they stick,” so to speak. Instances that don’t clearly and unambiguously “stick” to any entity, based on meeting the defining criteria, should lead to adaptations or enhancements to the model, so that they will stick. When the model is proven to be sufficiently “sticky,” it’s time to move on to attribution.

Functional Dependency

Functional Dependency (FD) is the basis for accurately attributing (assigning attributes to) the entities in a data model. A clear, intuitive understanding of FD is essential for transforming entities into relations.

FD is the relationship between two variables (e.g., attributes, columns), or sets of variables, such that if the value of the first (the “independent” variable) changes, the value of the second (the “dependent” variable) changes also, “as a function” of the first.

Functional dependency is the mathematical foundation of the relational model. In much the same way that subatomic forces hold matter together, functional dependency (FD) is the glue that holds a data model together. Sets of attributes “stick together” as normalized relations (do we see a recurring theme here?), comprising two subsets of attributes: a) the primary key set, and b) the set of attributes that are all functionally dependent on the primary key set. Normalized relations then “stick to each other” through relationships between primary and foreign keys. A foreign key is, by definition, the primary key of the relation it references.

Because FD is the relationship between primary key attributes and non-primary key attributes, it provides confirmation that an entity, after being rigorously, intentionally defined, continues to hang together at the next level of detail. It’s what is really meant by the old normalization saw “Dependent on the key, the whole key, and nothing but the key, so help me Codd.”
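As a concrete illustration (the table and column names here are hypothetical, purely for the sake of example): the FD customer_id → last_name asserts that a given customer_id value can never be paired with more than one last_name value. A quick query against a staging table of candidate instances will surface any violations:

    -- Rows returned here violate the claimed FD customer_id -> last_name
    SELECT customer_id
    FROM   customer_stage
    GROUP  BY customer_id
    HAVING COUNT(DISTINCT last_name) > 1;

If the query returns nothing for a representative sample, the attribution holds up; if it returns rows, either the sample data is wrong or the attribute belongs somewhere else in the model.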

Keep FD firmly in mind when evaluating candidate entity instances as a means of “testing” or validating a model.

Laying Out an ER Diagram

When developing an entity-relationship diagram, especially one of significant complexity, avoiding the appearance of a bowl of spaghetti is a necessary yet thankless task. What data modeler has not fervently wished that some modeling-tool vendor would develop an automated layout mechanism that would optimize the placement of entities and relationships in a diagram and arrive at the absolute minimum number of line crossings?

Making an ER diagram look no more complicated than it actually is is essential to its usability as a communication mechanism. (More on this below.)

Close to the top of my data-modeler’s wish list is a three-dimensional modeling tool. The extra dimension would allow lines and objects to cross underneath each other, and models could be rotated in all dimensions. VRML (Virtual Reality Modeling Language, yes, really) might just be able to do this, if someone really applied it to the task.

Many years ago I was introduced to one of the most compelling yet demanding graphical layout techniques I’ve come across before or since. Dubbed with the acronym OSSAC (Objects, Subjects, Strata, Associations, Cardinality . . . but never mind), it was introduced within the context of an enterprise data modeling effort by an independent consultant with whom I long ago lost touch [3]. The most valuable lesson learned from OSSAC was its attempt to embed what was called “Logical Order of Precedence” into a data model. This was done by imposing the discipline of “dead crows”; that is, the “crow’s feet” of all relationships in the diagram should point upward.

The result of dead-crows layout discipline is that the more “fundamental” entities (e.g., dimensions, reference data) wind up closer to the bottom of the diagram, and the more “dependent” entities (e.g., “facts,” “transactions”) near the top. What’s the point? Such a diagram inherently represents, in a very concise form, a data life-cycle viewpoint. Regardless of whether enterprise, application, subject-area, or data mart/warehouse in scope, the model must be instantiated “from the bottom up.” This provides valuable clues into how the application(s) that manage the state of the model entities must perform.
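A minimal sketch of what “from the bottom up” means in practice (the tables and columns are hypothetical): the foreign key constraints themselves enforce the logical order of precedence, because a “transaction” row cannot exist before the “reference” rows it points to.

    CREATE TABLE product (               -- fundamental (reference) entity
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE order_line (            -- dependent (transaction) entity
        order_id     INTEGER,
        line_no      INTEGER,
        product_id   INTEGER NOT NULL REFERENCES product (product_id),
        quantity     INTEGER NOT NULL,
        PRIMARY KEY (order_id, line_no)
    );

    -- Loading must follow the order of precedence; reversing these two
    -- statements fails on the foreign key.
    INSERT INTO product    VALUES (1, 'Widget');
    INSERT INTO order_line VALUES (1001, 1, 1, 5);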

Even if a model is not constructed strictly in accordance with these graphical layout conventions, keeping the logical order of precedence in mind when testing and presenting the model significantly increases its usefulness and understandability.

Supertypes and Subtypes

Creating a supertype-subtype hierarchy that faithfully captures business data requirements is probably half technique and half judgement. The number of shared or distinct attributes and relationships provides the most objective method for deciding what subtypes and supertypes should exist independently or be combined.

It may be helpful to keep in mind two opposing forces as overall guiding principles of subtyping and its opposite, generalization. The first, the existence of common primary keys and relationships, tends to cluster entities into generalized supertypes. The second, non-shared attributes (“nonhomogeneity” according to Hawryzkiewicz, see below), tends to separate entities into specialized subtypes, in order to avoid non-applicable null values. (For example, a Last Name attribute in a Customer entity instance that is a corporation is a denormalization, since always-null values are not, strictly speaking, functionally dependent on the supertype primary key.)
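One common way to carry this into a physical design (a sketch only, with hypothetical names; it is by no means the only implementation option) is a supertype table holding the shared key and attributes, plus one table per subtype that shares the supertype’s primary key, so that a corporate customer never carries an always-null Last Name column:

    CREATE TABLE customer (                -- supertype: shared key and attributes
        customer_id   INTEGER PRIMARY KEY,
        customer_type CHAR(1) NOT NULL     -- 'P' = person, 'C' = corporation
    );

    CREATE TABLE person_customer (         -- subtype: person-only attributes
        customer_id INTEGER PRIMARY KEY REFERENCES customer (customer_id),
        first_name  VARCHAR(50) NOT NULL,
        last_name   VARCHAR(50) NOT NULL
    );

    CREATE TABLE corporate_customer (      -- subtype: corporation-only attributes
        customer_id INTEGER PRIMARY KEY REFERENCES customer (customer_id),
        legal_name  VARCHAR(100) NOT NULL,
        tax_id      VARCHAR(20)
    );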

Because attribute lists provide the most objective decision-making input for subtype entity design, supertype/subtype hierarchies typically evolve most actively during the attribution stage of the modeling process. Obvious candidate hierarchies should certainly be documented in the early pre-attributed stages, but wrangling about the distinction between, for example, a Non-Profit Corporation and a For-Profit Corporation in your Customer model, before attributes are discovered, risks wasting precious time reaching tenuous decisions that are likely to be changed later.

When considering supertype/subtype hierarchies, keep in mind that you are working on a continuum: at one extreme, combining all candidate subtypes together into the supertype; at the other extreme, a large number of subtypes, at perhaps multiple hierarchical levels, each with only a handful of differentiating attributes. Finding the right place on this continuum is a result of

      • achieving an understandable, accurate representation of the business requirements
      • limiting the number of null attributes resulting from combining multiple subtypes into a single supertype
      • creating an implementable model (if implementation is the outcome)

Spreadsheets are a great tool for working with attribute lists for groups of subtype entities. Candidate subtype entity names can be assigned as the column headings, then candidate attributes for each sub-entity can be listed down the rows below the headings. Merging and splitting candidate subtypes can then be accomplished and evaluated by moving attributes from one list to another.

With the recent proliferation of portals and search facilities, both corporate and Web, the analysis of information assets with the goal of presenting them in an organized, hierarchical format — variously termed a taxonomy or an ontology — has achieved unprecedented visibility. Combined with library science disciplines, the more “conventional” modeling of supertype/subtype hierarchies can provide a methodical context for these increasingly fashionable analysis activities.

Presenting Data Models: A Facilitated Session, or a Deliverable Walkthrough?

A data model serves at least two functions: as the basis for implementation, and as a communication medium. Presenting a data model to an audience, in contrast to, say, printing and distributing it, or posting it on an internal Web site, is a very effective means of communicating documented business data requirements.

When it comes to assimilating data presented in a graphical format, some people “get it” right away, others will “get it” later, yet others, unfortunately, may never get it. The graphical presentation capabilities of business intelligence tools such as Microsoft Access have helped immeasurably in introducing a wide audience to the “boxes and arrows” data representation metaphor. As a result, more neophytes will grasp data models sooner than might have been the case five or ten years ago.

When a model is in development, it can be presented in the context of a facilitated “working” session, with the objective of eliciting and reaching consensus on business data requirements. After the work is done, a presentation can introduce a data model to its user community.

In presenting a data model, communication is the objective. Before conducting either type of walkthrough, it’s important that both presenter and audience understand and agree on why the presentation is taking place.

Star Schema vs. 3NF: What’s All the Fuss About?

Let’s get over it, already. A star schema model is a special case of an ER model. Dimensions are classification, or “reference data,” hierarchies. Sure, the measures in the fact table are often aggregate, derived attributes, and this is sometimes cited as evidence of denormalization, but the “key” point is that these attributes are still functionally dependent on the fact table’s set of primary key attributes.
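A minimal sketch of the point (hypothetical names throughout): the measures in the fact table, aggregated or not, are functionally dependent on the fact table’s composite primary key of dimension foreign keys, which is exactly the same FD rule that governs any other relation.

    CREATE TABLE date_dim (
        date_key      INTEGER PRIMARY KEY,
        calendar_date DATE NOT NULL
    );

    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE sales_fact (
        date_key     INTEGER NOT NULL REFERENCES date_dim (date_key),
        product_key  INTEGER NOT NULL REFERENCES product_dim (product_key),
        units_sold   INTEGER NOT NULL,         -- measures: functionally dependent
        sales_amount DECIMAL(12,2) NOT NULL,   -- on (date_key, product_key)
        PRIMARY KEY (date_key, product_key)
    );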

Testing a Model

Use cases are very useful (as it were) for testing or “walking” a data model as it is being developed.

With data modeling tools such as ERwin, it is very easy to generate a first-cut physical database, and from there populate it with test data. Microsoft Access can function quite adequately as your first target database. Show your users the populated tables, in spreadsheet format, or even in quick-and-dirty forms and reports. This is a great way to get confirmation of the model, poke holes in it, and get more buy-in and enthusiastic support. Doing this before generating a line of code helps to assure a rock-solid foundation for whatever application the data model will be supporting.
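For instance (a rough sketch, reusing the hypothetical customer tables from the subtype example above), a handful of hand-entered rows and one query can serve as the “report” you put in front of users:

    -- A few hand-crafted test rows against the first-cut schema
    INSERT INTO customer           VALUES (1, 'P');
    INSERT INTO person_customer    VALUES (1, 'Pat', 'Jones');
    INSERT INTO customer           VALUES (2, 'C');
    INSERT INTO corporate_customer VALUES (2, 'Acme Holdings', NULL);

    -- A quick review query for users to confirm, or poke holes in
    SELECT c.customer_id, c.customer_type, p.last_name, o.legal_name
    FROM   customer c
    LEFT JOIN person_customer    p ON p.customer_id = c.customer_id
    LEFT JOIN corporate_customer o ON o.customer_id = c.customer_id;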

The more testing of this type you can do before the evil DBA gets hold of your beautiful model, the better (see “Denormalization” below). Just kidding! Find a DBA who agrees with the next section, take him/her out to lunch, and stay on his/her good side forever.

Denormalization

As Fabian Pascal has pointed out, once you start denormalizing, when do you stop [4]?

If you denormalize a model — with pre-joined tables, summary tables, and the like — before you test the normalized model, how will you ever know how much, if any, denormalization was really necessary? Premature denormalization increases the likelihood of imposing an unnecessary burden of maintaining a significant amount of redundant data. Hold off denormalization as long as possible (forever, ideally). Have the DBA reach into his bag of tricks: build indexes; physically distribute (e.g., stripe) the data; find more, smaller, faster disks. If all of these physical tuning techniques fail, only then go back to the drawing board, and do so with much discretion.
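A sketch of that order of attack, using the hypothetical star-schema tables from earlier: before storing a pre-joined or summary table, try an index on the columns driving the slow joins and filters, or expose the join as a view so users see one wide “table” without any redundant data being stored.

    -- First resort: index the join/filter columns instead of duplicating data
    CREATE INDEX ix_sales_fact_product ON sales_fact (product_key);

    -- Present a pre-joined shape without materializing redundant rows
    CREATE VIEW v_sales_detail AS
    SELECT f.date_key, d.calendar_date, p.product_name,
           f.units_sold, f.sales_amount
    FROM   sales_fact  f
    JOIN   date_dim    d ON d.date_key    = f.date_key
    JOIN   product_dim p ON p.product_key = f.product_key;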

The Young Data Modeler’s Curriculum

There are several good books on various facets of data modeling currently on the market; the authors Dave Hay and Graeme C. Simsion come immediately to mind. But the classics listed here have withstood the test of time. A smart, motivated person with a logical mind and an affinity for creating order out of chaos can get a good start on data modeling by self-study with these resources, in the order listed.

DM101: CASE*Method: Entity Relationship Modeling by Richard Barker is an excellent introduction to ER modeling.

DM201: Relational Database Design by Fleming and von Halle goes step by step into the nuts and bolts, all the way to the physical side.

DM301: Practical Issues in Database Management by Fabian Pascal introduces many of the perennial tough problems in data modeling, and will help assure the new data modeler that there’s more to data modeling than what is supported by current commercial implementations of SQL and relational database management products.

Extra credit A: I.T. Hawryzkiewicz, Database Analysis and Design. Unfortunately, it is long out of print (there may be used copies left on Amazon.com, where I found one). Excellent coverage of supertype/subtype hierarchies and functional dependency.

Extra credit B: Len Silverston, The Data Model Resource Book, volumes 1 and 2. Volume 1 presents patterns of data models that occur across all businesses, and that a data modeler will be faced with repeatedly during his or her career. And if you want to come up to speed quickly on the data requirements in one of several business domains, Volume 2 is for you.

All Requirements are Data Requirements

When I started data modeling, to most software developers it was a strange, arcane kind of black art, viewed with skepticism, and requiring intense lobbying and proselytizing to achieve its reluctant adoption within a development project. These days, no development project gets very far along without a data model — even if disguised (some might say misappropriated) as an object model.

While many challenges have been overcome, others remain. Contrary to the prevailing wisdom, a data model (and its equivalent database implementation) is not just a part of a computer application, an unfortunate necessity to accommodate this pesky “persistence” thing. It is the foundation upon which all of the processing is built. Processing has to happen to something.

Data modelers should take every opportunity to connect with the business analysts and software developers that they work with, and get them engaged in data modeling tasks at the very beginning of every project. Don’t wait for them to come to you!

“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”

Frederick P. Brooks Jr., The Mythical Man-month: Essays on Software Engineering


1. As the humorist Dave Barry says, “I am not making this up.” Intentionality and extensionality are concepts of epistemology, the study of knowledge and its representation.
2. An extensional definition of an intentional definition, if you will.
3. Jay Smith.
4. Fabian Pascal’s “Against the Grain,” www.searchdatabase.com, 3/1/2001

--

Bill Lewis has over 20 years’ experience as a leading-edge practitioner and thought leader in software development, data integration and data modeling. His book Data Warehousing and E-Commerce is available at online and bricks-and-mortar booksellers.

