Normalization
Schema design is difficult; however, one of the perks of the relational model is that, by virtue of being rather mathematical in nature, we have a large body of knowledge to draw from in their construction. Normalization is one way to guide schema design; indeed, if we can ensure that our schemas fulfill some properties, then we can benefit from the guarantees that are provided by these normal forms. In this chapter we'll explore normalization through the lens of schema design and see the benefits and drawbacks of adhering to a normal form.
Functional Dependencies
In order to discuss normalization, we must discuss the nature of our data. In particular, we want to know how our data interact with each other, and we do this by discussing functional dependencies. Functional dependencies describe how different attributes relate to other attributes. Formally, an attribute \(A\) is functionally determined by \(B\) (equivalently, \(B\) determines \(A\)) if tuples with the same value for \(B\) always have the same value for \(A\). We write this as \(B \to A\). We can write dependencies over any non-empty set of attributes; say \(A_1, A_2, \ldots, A_s \to B_1, B_2, \ldots, B_t\).
Typically, there is a little leeway in discussing what functional dependencies exist. Indeed, although we may have that every customer named "Ronnie" also shops at the "Irvine" location, it doesn't necessarily imply that all future Ronnies must also shop in Irvine. We cannot necessarily determine that a dependency exists by looking at populated databases. Thus, we must determine our dependencies by analyzing the nature of our data somewhat philosophically, then moving forward.
We have quite a few sub-definitions surrounding functional dependencies. We say that if \(Y \subseteq X\), then \(X \to Y\) is a trivial dependency and should be removed. Indeed, of course if two tuples share all of their attributes in \(X\), then they will share all of their attributes in a subset of \(X\).
Consider an attribute set \(A\). We say that if \(X\) determines \(A \setminus X\), then \(X\) is a super key. Of course, \(A\) itself is a super key. If there is no \(Y \subset X\) such that \(Y\) determines \(A \setminus X\), then \(X\) is a candidate key. We can safely choose any candidate key as a primary key; however, some might be more natural than others.
Armstrong's Axioms
It turns out that given a set of constraints, there might be a better way to write them. There are three fundamental axioms that govern rewriting functional dependencies:
- Reflexivity: \(Y \subseteq X \implies X \to Y\)
- Augmentation: \(X \to Y \implies WX \to WY\)
- Transitivity: \(X \to Y \wedge Y \to Z \implies X \to Z\)
From these rules, we can prove three more useful rules (as an exercise, try proving them):
- Union: \(X \to Y \wedge X \to Z \implies X \to YZ\)
- Decomposition: \(X \to YZ \implies X \to Y \wedge X \to Z\)
- Pseudotransitivity: \(X \to Y \wedge WY \to Z \implies WX \to Z\)
Using these rules, we can rewrite our functional dependencies to nicer to work with forms. In particular, finding a minimal functional dependency set is somewhat easy using the Canonical Cover Algorithm.
Normal Forms
It is often incredibly helpful to put our tables into a normal form. A normal form is a set of properties that, if our tables hold, certain nice properties also hold. If nothing else, they serve as a good guideline for schema design. We will explore the canonical normal forms, original proposed by Codd, in increasing restrictivity using an example.
Consider the following schema and data:
course (c_id, dept_id, dept, evaluations, inst, office, sect, time_slot)
c_id | dept_id | dept | evaluations | inst | office | sect | time_slot |
---|---|---|---|---|---|---|---|
61 | 1 | CS | HW, Midterm, Final | Eddie Kohler | 345 | A | 8am-10am |
61 | 1 | CS | HW, Midterm, Final | Eddie Kohler | 345 | B | 1pm-3pm |
165 | 2 | MATH | Project, Final | Stratos Idreos | 346 | A | 10am-11am |
165 | 2 | MATH | Project, Final | Stratos Idreos | 346 | B | 7pm-8pm |
265 | 2 | MATH | HW | Andy Pavlo | 346 | B | 1pm-2pm |
455 | 3 | PHYS | Midterm, Final | Nesime Tatbul | 347 | B | 6pm-7pm |
First Normal Form
A schema is in the first normal form when:
- The attributes are atomic, and
- The attribute names are unique.
We already have the latter, so we'll focus on the former. To ensure atomicity, we split time_slot
into two attributes time_start
and time_end
, and pull evaluations out into an evaluation
table, as well as a course_evaluation
table to join evaluations back to courses.
course (**c_id**, **sect**, dept_id, dept, inst, office, time_start, time_end)
evaluation (**e_id**, e_name)
course_evaluation (**c_id**, **e_id**)
Here is the resulting data for this decomposition:
c_id | sect | dept_id | dept | inst | office | time_start | time_end |
---|---|---|---|---|---|---|---|
61 | A | 1 | CS | Eddie Kohler | 345 | 8am | 10am |
61 | B | 1 | CS | Eddie Kohler | 345 | 1pm | 3pm |
165 | A | 2 | MATH | Stratos Idreos | 346 | 10am | 11am |
165 | B | 2 | MATH | Stratos Idreos | 346 | 7pm | 8pm |
265 | B | 2 | MATH | Andy Pavlo | 346 | 1pm | 2pm |
455 | B | 3 | PHYS | Nesime Tatbul | 347 | 6pm | 7pm |
e_id | e_name |
---|---|
1 | HW |
2 | Project |
3 | Midterm |
4 | Final |
c_id | e_id |
---|---|
61 | 1 |
61 | 3 |
61 | 4 |
165 | 2 |
165 | 4 |
265 | 1 |
455 | 3 |
455 | 4 |
Second Normal Form
A schema is in the second normal form when:
- The schema is in first normal form (done!)
- There are no partial dependencies.
What is a partial dependency? Say there's a candidate key \(A\). Then, every other attribute in the relation should depend on all of \(A\), not a subset of it. If there is an attribute such that only a part of \(A\) functionally determines it, there is a partial dependency.
Let's lay out our dependencies:
c_id -> dept_id, inst
c_id, sect -> time_start, time_end
inst -> office, c_id
dept_id -> dept
Notice that in the first table above, c_id -> dept_id, inst
is a partial dependency, as the LHS is a subset of the candidate key c_id, sect
. Thus, we should split c_id, sect, time_start, time_end
into its own table:
course (**c_id**, dept_id, dept, inst, office)
section (**c_id**, **sect**, time_start, time_end)
evaluation (**e_id**, e_name)
course_evaluation (**c_id**, **e_id**)
Here is the resulting data for this decomposition:
c_id | dept_id | dept | inst | office |
---|---|---|---|---|
61 | 1 | CS | Eddie Kohler | 345 |
165 | 2 | MATH | Stratos Idreos | 346 |
265 | 2 | MATH | Andy Pavlo | 346 |
455 | 3 | PHYS | Nesime Tatbul | 347 |
c_id | sect | time_start | time_end |
---|---|---|---|
61 | A | 8am | 10am |
61 | B | 1pm | 3pm |
165 | A | 10am | 11am |
165 | B | 7pm | 8pm |
265 | B | 1pm | 2pm |
455 | B | 6pm | 7pm |
e_id | e_name |
---|---|
1 | HW |
2 | Project |
3 | Midterm |
4 | Final |
c_id | e_id |
---|---|
61 | 1 |
61 | 3 |
61 | 4 |
165 | 2 |
165 | 4 |
265 | 1 |
455 | 3 |
455 | 4 |
Third Normal Form
A schema is in the third normal form when:
- The schema is in second normal form (done!)
- There are no transitive dependencies.
A transitive dependency is a dependency between two sets of attributes where both sets are still dependent on the primary key. Formally, if \(A \to B\) and \(B \to C\) and NOT \(B \to A\) , then \(A \to C\) is a transitive dependency. Once you've identified a transitive dependency, you can separate the dependent attributes out into another relation.
Let's lay out our dependencies from last time, even though they are unchanged:
c_id -> dept_id, inst
c_id, sect -> time_start, time_end
inst -> office, c_id
dept_id -> dept
Notice that we have a transitive dependency in c_id -> dept_id
and dept_id -> dept
and NOT dept_id -> c_id
. So, we pull dept_id, dept
into its own table.
course (**c_id**, dept_id, inst, office)
section (**c_id**, **sect**, time_start, time_end)
evaluation (**e_id**, e_name)
course_evaluation (**c_id**, **e_id**)
department(**dept_id**, dept)
Here is the resulting data for this decomposition:
c_id | dept_id | inst | office |
---|---|---|---|
61 | 1 | Eddie Kohler | 345 |
165 | 2 | Stratos Idreos | 346 |
265 | 2 | Andy Pavlo | 346 |
455 | 3 | Nesime Tatbul | 347 |
c_id | sect | time_start | time_end |
---|---|---|---|
61 | A | 8am | 10am |
61 | B | 1pm | 3pm |
165 | A | 10am | 11am |
165 | B | 7pm | 8pm |
265 | B | 1pm | 2pm |
455 | B | 6pm | 7pm |
e_id | e_name |
---|---|
1 | HW |
2 | Project |
3 | Midterm |
4 | Final |
c_id | e_id |
---|---|
61 | 1 |
61 | 3 |
61 | 4 |
165 | 2 |
165 | 4 |
265 | 1 |
455 | 3 |
455 | 4 |
dept_id | dept |
---|---|
1 | CS |
2 | MATH |
3 | PHYS |
BCNF (Third-and-a-half Normal Form)
A schema is in BCNF when:
- The schema is in third normal form (done!)
- For all dependencies, either:
- \(a \to b\) is a trivial dependency ( \(a \subseteq b\) ), or
- \(a\) is a superkey.
As an exercise, convert this schema into BCNF. You should have six relations by the end of your conversion.