Reasons for normalization
As database designers we are interested in designing a database that can both accurately reflect important facts about parts of the world we are interested in and stay accurate no matter how many times data is added to it, changed, or deleted from it.
To understand why we should be careful in designing databases, let's consider an example of a bad database design (shown in the Stock table below).
|Cust ID||Name||Addr||Stock||Curr Price||Dividend||Shares|
Consider the first row of data in this table. It should be interpreted as:
Customer Smith has ID 003 and lives in LA. She owns 16 shares of IBM whose most recent price was 100 and whose last dividend was 3.
Another way of thinking about what information this table really contains is to determine how you can uniquely identify a particular row of the table, not by using the row number, but by referring to the data itself. For example, if you could use Cust ID to uniquely identify a row in the Stock table, then you would say that this table contains information about customers. Why can you say this? Let's look into this point for a minute.
Consider the Stock column. IBM appears on two different rows. Why? Well, two different customers own that stock. So, is this table about stocks? That is, does each line of the table tell us information related to just that particular stock? Not exactly. On the other hand, if the table never contained two rows with the same stock on it, then it would seem that each row must contain something uniquely associated with that particular stock, wouldn't it?
Two attributes that, when taken together, can uniquely identify any row in the Stock table are [cust id, stock]. Attributes that can perform this function are referred to as the primary identifier of a relation. (These are really important when thinking about database design; we'll study these in much greater depth later.) There are many problems with this table but they can be classified into three types of anomalies. The process of normalizing a database helps us avoid these anomalies.
Overview of the normalization process
The goal of the normalization process is to define relations (and, hence, tables) so that each relation is about one kind of thing. Not two. Not three. One. This seems like a reasonable condition, given the problems that it prevents (see above). All that remains is to determine how to create well-normalized tables. This section provides an overview of the process while later sections go into much more detail.
Consider, again, the Stock relation whose structure and data are shown [#tab:stock-ex above in the Stock table]. Is this relation well-normalized? This is the wording that database designers use but it essentially means the same as asking if everything in the table is about one thing. Think about this before you go on.
I contend that it seems to be about at least two things: customers and stocks. There are customer names and addresses; there are stocks and their prices and recent dividends. A graphical means of representing the relationships among the attributes (fields) in the Stock relation is shown in the following figure.
|Representation of determinancy information
for the stock table
This graph tells us the following: If you know a customer id, then you know the person's name and address. If you know a stock identifier, then you know its current price and most recent dividend. Finally, for any pairing of a customer id and a stock identifier, you know how many shares that person owns of that stock. Make sure you can see how this graph contains this information. (Don't worry. You'll see this again and it'll be explained in much more detail later in the chapter. Still, think about it.)
Now we know that the Stock relation is not well-normalized. (Why?) But what is? The next figure shows how the graph (and, hence, the relation and tables) for Stock should be broken up so that only well-normalized relations are represented.
|Graphical representation of “well normalized” relations|
These graphs tell us that there are now three separate relations since there are three separate graphs. The tables constructed from these relations, with the sample data from the original Stocks table filled in, are shown in the following three tables.
Notice that each table is filled with data about one type of thing. The first has information about stocks. The second has information about customers. And the third has information about the relationship between stocks and customers — which customers own which stocks and how many of a stock does a customer own. These are well-normalized relations. Verify for yourself that none of the anomalies are present.
Relationship to database design and ER modelling
In other contexts you might have been introduced to ER modelling. This is one tool that is available to a database designer. It takes a top-down approach to database design: “Let's figure out what the entities are and the relationships among them and then we'll start assigning attributes to the appropriate entities.” Normalization is another tool but it takes a bottom-up approach: “Give me a list of all the attributes that you want to store and I'll organize them appropriately.”
The process of designing a database generally involves using both ER modelling and normalization. It all depends on the situation. If a badly-normalized database is available or if the designer is given a big list of attributes that need to be included in a database, then normalization is the primary tool. Or, if the designer is told to investigate a situation and design a database, then usually the designer will use ER modelling to form an overall structure for the solution and then use his or her knowledge of normalization to assign attributes to tables correctly. Generally, normalization is used as a check to verify that a set of attributes that are collected in one entity type should, in fact, be assigned to that entity type.
Continue with the first normal form page.