Working SQL: The Trouble with Normal

In the first of two articles, Peter provides the essential background on normalization. In this article, Peter looks at first normal form and provides a different perspective on primary keys. More importantly, here’s why you should care about first normal form.

Up here in Canada we have a singer-songwriter named Bruce Cockburn. One of his songs has a chorus that goes, “The trouble with normal is/It always gets worse.” Many people feel that way about normalization. The process of normalizing your database isn’t a theoretical issue: Non-normalized databases cost more to build, run, and maintain than normalized databases. And, as I’ll demonstrate later in this article, in non-normalized databases, some activities are practically impossible. Yet it’s an unusual project where someone doesn’t declare that the data should be “de-normalized” for some reason.

Normalization consists of moving an initial database design to a normalized form. When the relational theory was developed, a set of five normalized forms for data (cleverly numbered one through five) was defined. A database can be in a “pre-normal” form or in one of the five numbered states.

As my horrible example in this edition of the column, I’m going to use the table design that you see in Table 1.

Table 1. A horrible example.

Field Name Field Description
CustomerNumber An arbitrary code to designate a customer
CustomerName The customer name
Year4Purchases Total purchases from four years previous
Year3Purchases Total purchases from three years previous
Year2Purchases Total purchases from two years previous
Year1Purchases Total purchases from the previous year
CurrentYearPurchases Total purchases in 1991
State State the customer lives in
StateAbbrv Short form for the state name
Telephone Customer’s phone number

Uniquely yours

Before beginning any normalization, every table in the database must have its primary key field(s) defined. The primary key field is the field (or combination of fields) that will uniquely identify each record in the table. In practical terms, if you use the primary key field, you should retrieve one and only one record from the table for any value supplied.

In our example, we have several candidate keys (fields that could be used as primary keys), including the customer number and the customer name. Obviously, the customer number was generated by the table designer in order to be the primary key for the table.

I often think that we leap to using arbitrary keys like customer number too quickly. Why, for instance, was the customer name not used as the primary key? Presumably, because the designer was concerned that there might be two customers with the same name.

This concern usually turns out to be pretty unlikely. However, if you did want to worry about this, how about using the customer phone number as the primary key? It would satisfy the passionate desire to use numbers to identify everything and is guaranteed not to duplicate. Again, the concern might be raised that the customer could change their phone number. This was certainly a problem in an earlier generation of database management systems where the primary couldn’t be changed. However, most current RDBMSs will allow you to change a record’s primary key and, through the wonders of referential integrity, even flow those changes through all related records.

I often think that the imposition of a unique, meaningless identifier is done primarily to satisfy the database designer’s sense of control. By selecting an arbitrary identifier, the designer takes control away from the database users and puts it in the hands of the database designer. In order to satisfy this sense of control, an identifier is created that the customer doesn’t understand and can’t remember, and an extra data item is added to the database purely to make the design work.

Normal functioning

With the primary key identified, there are two ways to move your database design to the normal form that you want. One method is through the use of functional dependency analysis.

The basis of this method is that every field in a table must be dependent on the key field(s) of the table. From a practical point of view, functional dependency means that changing the key field of the table will cause the contents of all of the fields in the record to change. For instance, the customer name is functionally dependent on the customer number, because changing the customer number will cause you to retrieve a different customer name.

The corollary to this is that a field can’t be functionally dependent on any other field in the record. For instance, changing the customer number will also retrieve different values for the State and StateAbbrv fields. However, changing the State field will also change the StateAbbrv fields (and vice versa). Since those two fields are functionally dependent on each other, they belong in a different table.

Functional dependency can extend to fields that aren’t present in the table, but are implied. The four “Purchase” fields are distinguished by an implied date field, for example. These fields are dependent not only on the primary key, but the date of the record. Next year, for instance, all of these values will change as the data in Year1Purchase is moved to Year2Purchase and Year2 to Year3, and so on. This suggests that the Purchase information belongs in another table whose key also includes an explicit year field.

Normal rules

I must admit to being uncomfortable with using functional dependency analysis as a means of normalizing database tables. I’m a right-brain kind of guy and I find that spotting the dependencies (especially the implied ones) requires too much imagination for me. Instead, I tend to use the set of rules associated with the various normal forms to move my design up through the normal form hierarchy. While functional dependency can move you straight to fifth normal form, I tend to refine my design through successive application of the normal forms.

At least in theory that’s what I do. In fact, my initial design is always in first normal form. However, on reviewing that design, I usually find that I’ve violated the higher normal forms and need to revise my table layouts. This isn’t only true when I’ve designed my tables “by hand.” If I’ve used a data modeling tool (I use InfoModeler), I tend to make the same mistakes, but they show up earlier in the process. However, for the purposes of this article, we’ll stick with an example that assumes that we’re designing our tables without benefit of a modeling tool.

The rule to move your design to first normal form is simple: no repeating fields. In our case, that means removing the “YearXPurchase” fields to their own table and linking them back to the original table with the customer number used as a foreign key (see Table 2).

Table 2. The Purchase History table.

Field Name Field Description
CustomerNumber An arbitrary code to designate a customer
YearOffSet The number of years in the past for the data
Purchases Total purchases for the year

I think it’s important to understand why I’m generating this extra table. Typically, this justification would involve long discussions about “update anomalies” and the like. I have a more important reason: without normalizing this table, you can’t ask any interesting questions of your database. And, since the purpose of the database is to provide information, it means that your database is useless.

You might feel that I’m overstating the case to say that the database is useless. I suppose that you can ask questions of your database when it’s non-normalized. It just costs more and takes longer. In addition, the process is so difficult that your users can’t do it. The result is that reporting from non-normalized tables must be done by the programming staff. This, in turn, results in job security for the programming staff and an impressive backlog for the department. However, since these reporting tasks are usually low priority, they tend to get deferred and, as a result, also usually contribute to poor customer satisfaction. The net effect is that many questions never get asked because the task never comes off the backlog. The practical result is — as I said — that if the table isn’t normalized, you can’t ask any interesting questions of it.

Here’s an example: Imagine that someone wants to know in what year each customer had his or her largest purchase. In the version of the record in Table 1, there’s no simple way to retrieve that information. Typically, a program would have to be written that compares each of the YearXPurchase fields to the others in order to determine which had the largest value. Then, this code would have to be executed for each record in the table and the results written to another temporary reporting table. With the layout in Table 2, the same end can be accomplished with a single SQL query (though it does require a subquery):

Select CustomerNumber, YearOffSet
From PurchaseHistory A
Where Purchases =
 (Select Max(Purchases)
  From PurchaseHistory B
 Where A.CustomerNumber = B.CustomerNumber)

The subquery finds the record with the largest purchases, while the superquery retrieves the YearOffset for that record. In Access, you’d probably set up the subquery as a standalone query and then use it in another query modeled after the superquery.

This still isn’t a very good table design, because the use of the YearOffSet field requires the company to run a year-end job to reset the YearOffSet field (and woe betide the operator who accidentally runs the job twice). However, it’s worthwhile to point out that, with the table in first normal form, that year-end job becomes three SQL statements:

Delete
From PurchaseHistory
Where YearOffSet = 4;

Update PurchaseHistory
Set YearOffSet = YearOffSet+1;

Insert Into Purchase History
Select CustomerNumber,1,CalcCurrentPurchase(CustomerNumber)
From Customers;

Normal decisions

Repeating fields aren’t always as clearly identifiable as this example suggests. For instance, look at the design in Table 3.

Table 3. The Customer Current Totals table.

Field Name Field Description
CustomerNumber An arbitrary code to designate a customer
Returns The total value of all products returned by the customer
Purchases Total purchases for the customer
Demos The total cost of all goods given to the customer

In this situation, do Returns, Purchases, and Demos count as repeating fields? They probably all have the same type declaration (Currency) and effectively represent dollar amounts associated with the customer.

I determine whether or not something is a repeating field by asking one question: Would selecting the largest of the fields in question give my users a useful answer? If the question results in comparing “apples and oranges,” then the data isn’t repeating.

In this case, the test I’d use would be a question like “Give me a list of customers with the larger of their Returns, Purchases, or Demos.” In the scenario I’m using, the answer wouldn’t be useful, which means these aren’t repeating fields. On the other hand, “Give me a list of all of the customers with their largest purchase in the last four years” does seem like a reasonable request — in my scenario.

Notice, though, that making this decision requires some understanding of what the data means and how it’s used by the system’s users. While the normalization rules appear mechanical, they aren’t. Applying them requires the involvement of the people who truly understand the business.

Does normalization matter? I was on a project that used a single table for processing sales orders. The reason for putting all the data in a single table had been to “improve performance.” Ignoring all of the other problems we had with this design, the program took 30 seconds and 500 lines of code to prepare an order for release. After I broke the table down into five separate tables, the program took three seconds to prepare an order — and did it by calling a single SQL query.

Normal concerns

Unfortunately, one of the results of normalizing is that the number of tables in the database tends to increase dramatically. The usual reaction to this increase in the number of objects in the database is to “de-normalize” it. This is also usually the worst reaction. Third normal form has been considered the appropriate trade-off point between purity of design and drowning in tables. I won’t try to challenge that, but I’ve also found that most developers have only a foggy idea of what fourth and fifth normal form are, so the decision not to use them might not be being made on the best grounds.

In the next issue, I’m going to go through the other four levels of normalization, up to and including the obscure fifth normal form. In addition, I’d like to debunk the idea that there are any benefits to be gained by denormalizing (except in very specific instances). And in no circumstances should you ever go back past first normal form. See you next month …
Parts 2-5 Normalization and Management Advice

About Peter Vogel

After 10 years of editing Smart Access, Peter continues to develop with Access (though he also does a lot of .NET stuff). Peter is also still editing other people's article--he even self-published a book on writing user manuals ("rtfm*") with a blog here.
This entry was posted in Design and Tables and tagged . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *


*


This site uses Akismet to reduce spam. Learn how your comment data is processed.