Using Graphs

## Introduction to using graphs

In this section I am going to spell out, in as simple terms as possible, a method of normalizing that uses graphs. These graphs are called determinancy diagrams. The reason for learning this method in addition to the first method is that it has proven to be easier to grasp. You may then wonder: “If it is easier to grasp, then why learn the first method?” Well, you will see normalization discussed in terms of the first method in many more places than you will see the second — so I thought that you should learn both. In this section I will work through a problem, describing the method as we solve the problem. At the end of the section I summarize the method in a figure so that you can have something to which you can refer.

### The problem

We have a form for which we want to construct a normalized database (i.e., one that is in 3rd normal form). The form is shown in this figure.

#### A purchase order

 PO # Date Vendor ID Name Address City ST ZIP
Part ID Name Category ID Category Name Delivery date Price Quantity Extended price
Subtotal
Tax rate
Tax
Total
For use by office
Input by:
Accepted by (Y/N):
Date input:
Tax rate is based on the vendor's address. It is 6% in MI, 5% otherwise.

Look at this form and figure out what each field and field name means. Do this now.

The Date field at the top of the form is the date the purchase order was filled in by the vendor. The name and address all apply to the vendor. The office-related information indicates the clerk that input the data, whether the purchase order has been accepted or not, and the date the clerk input the data. The table holds the following information:

part id
the number of the part being requested
name
the name of the part being requested
category id
the id of the category to which the part belongs
category name
the name of the category
delivery date
the date for which the part is being requested
price
price of the part
quantity
number of the part being delivered on this date
extended price
value of all of this particular part being delivered on a certain date

At the bottom of the form are fields for subtotal, tax rate, tax, and total. There is also a note that says the tax rate is 6% in Michigan and 5% elsewhere.

### The solution

#### 1NF

The first step is to list all the fields:

PO#, PO date, vendor id, v name, addr, city, state, zip, input by, accepted, date input, part id, name, category id, c name, delivery date, price, quantity, ext price, subtotal, tax rate, tax, total

These are all the fields mentioned on the form. Now I need to remove the repeating groups of attributes. The repeating groups in this form are the multiple values that can be listed under the field names. You can eliminate many of the repeating groups that exist on forms by using names of columns rather than naming every field on a form (e.g., part-id1, part-name1, part-id2, part-name2). The above list does not have any repeating groups since I listed field names and column names.

Now I need to remove fields that contain derived data. These are the fields that can be calculated using mathematical equations. Subtotal is the sum of all the values in the extended price column; tax is calculated by multiplying subtotal by the tax rate; total is calculated by adding subtotal and tax; ext price is calculated by multiplying price and quantity. Thus, I should remove subtotal, tax, total, and ext price from the list of fields. Here are the remaining fields:

PO#, PO date, vendor id, v name, addr, city, state, zip, input by, accepted, date input, part id, name, category id, c name, delivery date, price, quantity, tax rate

Now I draw the determinancy diagram that will represent the 1NF relation (sometimes called the unnormalized relation).

1. Get a blank sheet of paper.
2. Draw a box in the middle of the page that is one-fourth the size of the page.
3. Find a determinant. Put the name of that field in the big box and draw a smaller box around the name. Cross the field's name off of the list of fields.
4. Look in the list of fields and find all the fields that are functionally determined by this field. For each of these fields, put the name near the field's determinant but outside of the box created in step 2, put a box around the field, draw an arrow from the determinant to this field, and cross the name off of the list.
5. Perform the above step for each field that you add to the diagram.
6. When no more fields are functionally determined by the fields in the diagram, get another field from the list, put its name in the box below the first field you put on the diagram, cross its name off the list, and go back to step 4.

You are done with this process when there are no more fields in the field list.

For this example I started with PO# in the middle of the page (step 3). I crossed it off the list. The fields date input, accepted, input by, PO date, and vendor id all are functionally determined by PO#. I determined this by asking myself the following questions: “If I know the PO#, will there be one and only one (_)? Does this (_) depend on any other field more directly?” For each of the above five fields I answered “yes” to the first question and “no” to the second — so these fields are functionally determined by PO#. Cross each of these five fields off of the list (step 4).

For each of these fields, determine if any fields are functionally dependent on it (step 5). It turns out that v name, addr, and zip are determined by vendor id. For each of these fields, put it on the diagram, draw an arrow from vendor id to it, and cross it off the list.

For each of these fields, determine if any fields are functionally dependent on it (step 5). It turns out that city and state are determined by zip. Put these on the diagram, draw the arrows, and cross them off the list. The following are the fields that are remaining:

part id, name, category id, c name, delivery date, price, quantity, tax rate

No other fields seem to be determined by the fields currently on the diagram, so I will start the process again. Put part id just below PO# on the diagram and cross it off the list (step 6). Now start the process all over with step 4. The fields name, category id, and price all depend on part id. C name depends on category id. The fields remaining on the list are delivery date, quantity, and tax rate. Delivery date does not depend on other fields so it should be put below part id in the diagram. Quantity requires an additional assumption: For any purchase order, a part will only be ordered once for any date. Thus, if you know the purchase order, part, and date, then you know the quantity ordered. This implies that you should draw a box around PO #, Part id, and Delivery date, put Quantity outside the box you drew in step 2, and draw an arrow from the box around the three fields to Quantity. As for the tax rate field, we are told that it is determined by the state from which the customer is billed; this is the state field. Go back and put tax rate on the diagram and show that it is determined by state. When you are done your diagram should look something like this figure.

A 1NF determinancy diagram for the form

If there were any fields left in the list, you would put them inside the dark box — but there aren't, so you're done. The primary identifier for this relation are the fields within the box you drew in step 2. The diagram in this figure is in 1NF. It is sometimes referred to as an unnormalized relation because it is not in 3NF.

#### 2NF

To get the relation into 2NF, we have to remove partial key dependencies. (Identifiers are sometimes referred to as keys.) This instructs us to get rid of all lines that are coming out from the inside of a box. This figure(a) demonstrates what should be done. The top figure represents the type of structure you should look for. Notice that C depends on A and A is contained within a larger box. The bottom figure shows the results of the action you should take.

Transformations into 2NF and 3NF

For this example there are two partial key dependencies. The only place where this can occur is in a box that contains multiple fields; there is only one in this figure (A 1NF determinancy diagram for the form). Both PO# and part id are determinants of other fields. An additional relation needs to be created for both of these determinancy relationships. Now there are three separate relations, none of which have any partial key dependencies. The following figure shows what the relations look like now that they are in 2NF.

2NF determinancy diagrams for the form

#### 3NF

To get the relation into 3NF, we have to remove transitive dependencies. This instructs us to get rid of all boxes that have lines going into and coming out of them. The ``Transformations into 2NF and 3NF'' figure(b) demonstrates what should be done. The top figure represents the type of structure you should look for. Notice that B is dependent on A and also determines C. The bottom figure shows the results of the action you should take.

For this example there are several transitive dependencies:

PO# —> vendor id —> zip —> city
PO# —> vendor id —> zip —> state —> tax rate
part id —> category id —> c name

The most natural way to handle these are to work from the “leaves” back to the “root.” Start with the PO# relation: tax rate is determined by state and should be moved to another relation; city and state are determined by zip and should be moved to another relation; v name, addr, and zip are determined by vendor id and should be moved to another relation. And now for the part id relation: c name depends on category id and should be moved to another relation. Now there are many separate relations, none of which have any partial key dependencies (because of 2NF) or transitive dependencies (because of 3NF). The figure below shows what the relations look like now that they are in 3NF.

3NF determinancy diagrams for the form

The form is now in 3NF.

#### 4NF

Now we need to look for violations of fourth normal form. This involves looking for multi-valued dependencies. The only candidate is the relation [PO #, Part id, delivery date] —> quantity. (Why?) It turns out that in this case there is not MVD. However, we should go through the general procedure anyway.

In any case the next figure shows simple 4NF violations and the figure below that shows a more complex set of violations.

Graphical representation of 4NF violations and their solution

Graphical representation of 4NF violations and their solution

In any of these cases the solution is to do the following:

1. Ensure that the relations are in 3NF. Once the relations are in 4NF, you will never have a table that contains both a determinancy and a multi-determinancy. So, to simplify your task, you want to handle all the determinancy relations before you start handling the multi-determinancy relations.
2. So, at worst, you should have one (or a very few) relations that contain at least two attributes in the primary identifier.
3. Identify the multi-dependencies by drawing the double-arrow.
4. Now, work from the leaves back to the branches.
1. Find any MVD of the form X —» Y | Z.
2. Are there any attributes D that are determined by [X, Y]? If yes, create a new relation [X, Y] —> D. Delete D from the original relation. If no, create a new relation X —» Y.
3. Are there any attributes P that are determined by [X, Z]? If yes, create a new relation [X, Z] —> P. Delete P from the original relation. If no, create a new relation X —» Z.
4. Delete Y and Z from the original relation. If X is the only attribute (or set of attributes) remaining in the original relation, then delete the original relation.
5. Repeat the above steps for each pair of MVDs (remembering to work from the leaves to the branches).

Once you understand the above rules, they can be given the following, more graphical, interpretation:

• Any 4NF relation can have, at its outside layer, no more than 1 double-headed arrow.
• Any 4NF relation can have, at its outside layer, either dependencies or MVDs but not both.

For some practice on this, create a graphical representation of the relations in [fourth-normal-form#tab-violates-4nf this table], [fourth-normal-form#tab-4nf-simple-prob this table], and [fourth-normal-form#tab:complicated-4nf-prob this table]. Your answers should be the same as the answers we got with the other method.

#### Conditional dependencies

Now that the relations are in 4NF, we need to look for conditional dependencies. The result of this process is essentially equivalent to the result of the process of dealing with MVDs — the conditional dependencies are isolated in their own relations. The solution is to do the following:

1. Ensure that the relations are in 4NF.
2. Consider each existing dependency in the 4NF relations. It will have the form PI —> D, where PI is a relation's primary identifier and D is some subset of the relation's non-identifier attributes. If a dependency should be conditional, replace the solid arrow with a dashed arrow (- ->).
3. For each conditional dependency PI - -> D, create a new relation PI - -> D. Remove D from the original relation. Do this for all the conditional dependencies.
4. For each conditional dependency with the same PI, group the relations together if they have the same condition. Do not group them together if they do not have the same condition.

Once you understand the above rules, they can be given the following, more graphical, interpretation:

• No relation may have both a conditional dependency and any other dependency.
• No relation may have more than one conditional dependency.

The steps that we have gone through are summarized in this list. (This list is also contained in a separate page for more convenient reference.) This is fully equivalent to the method we previously used.

#### Normalizing by drawing graphs

1. Get to first normal form.
1. List all the fields.
2. Get rid of synonym problems; that is, don't use different names for the same attribute.
3. Get rid of homonym problems; that is, don't use the same name for different attributes.
4. Ensure field definitions make sense.
5. Remove from this list all the repeating groups of fields.
6. Combine paired, mutually exclusive fields if possible.
7. Remove from this list all those values that can be derived from other fields in the list.
8. Construct a determinancy diagram representing all the functional dependencies.
9. Put a rectangle around the primary identifier for this relation. You should now have one large diagram.
2. Get to second normal form. You are removing partial key dependencies. For each separate diagram, apply the following rule: “No box shall have lines coming out from inside it.” When you can no longer apply this rule, group together all relations that have the same primary identifier.
3. Get to third normal form. You are removing transitive dependencies. For each separate diagram, apply the following rule: “No box shall have lines going in and coming out.” When you can no longer apply this rule, group together all relations that have the same primary identifier.
4. Get to fourth normal form.
1. You should have a few relations that contain at least two attributes in the primary identifier.
2. Identify MVDs by drawing double-headed arrows.
3. Now, work from the leaves back to the branches when applying the following rules.
• Any 4NF relation can have, at its outside layer, no more than 1 double-headed arrow.
• Any 4NF relation can have, at its outside layer, either dependencies or MVDs but not both.
4. Repeat the above steps for each pair of <FONT SIZE="-1">MVD</FONT>s (remembering to work from the leaves to the branches).
5. Get rid of conditional dependencies.
1. Identify all conditional dependencies by changing regular arrows to dashed arrows.
2. Apply the following rules.
• No relation may have both a conditional dependency and any other dependency.
• No relation may have more than one conditional dependency.

The conclusion is next.

page revision: 17, last edited: 18 Aug 2008 23:58