Introduction
The purpose of this content is to analyze the methods and rationale for decomposing a relational database into different normal forms. This content is drawn from several DBMS texts; however, its main foundation is Codd’s seminal RDBMS framework. Recall that RDBMS theory is based on the mathematical theory of relations, predicate logic and set theory. As a result, a mathematical relation of degree n is expressed as a relational table (R-table) with n columns (Codd, 1990, p. 45). A mathematical attribute is represented as an R-table column and the attribute’s domain is an RDBMS extended data type (Codd, 1990, p. 45). A mathematical tuple is represented as an R-table row and the cardinality of a relation is the number of R-table rows (Codd, 1990, p. 45).
RDBMS Normalization Definition and Purpose
By definition, normalization is based on mathematical functional dependence and is the process of evaluating and correcting relations in accord with Codd’s relational model (Codd, 1990; Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007). Functional dependence describes the scenario where an R-table attribute or set of attributes uniquely determines the values of a dependent set of attributes within a relation (Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007). As previously introduced, RDBMS component retrieval and manipulation are based on tuple values rather than on an extraneous facet such as tuple ordering or memory or storage address location (Codd, 1990).
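To make functional dependence concrete, the following minimal sketch (with hypothetical attribute names and data, not drawn from any cited text) checks whether a dependency holds in a given relation instance. A violation is simply two tuples that agree on every determinant attribute but disagree on some dependent attribute.

```python
def fd_holds(rows, determinant, dependent):
    """Check whether determinant -> dependent holds in a relation instance.

    rows is a list of dicts mapping attribute names to values.  The
    dependency is violated when two tuples agree on every determinant
    attribute but disagree on some dependent attribute.
    """
    seen = {}
    for row in rows:
        lhs = tuple(row[a] for a in determinant)
        rhs = tuple(row[a] for a in dependent)
        if lhs in seen and seen[lhs] != rhs:
            return False
        seen[lhs] = rhs
    return True

# Hypothetical instance: emp_id -> (emp_name, dept) holds here.
employees = [
    {"emp_id": 1, "emp_name": "Ada",   "dept": "R&D"},
    {"emp_id": 2, "emp_name": "Grace", "dept": "IT"},
    {"emp_id": 1, "emp_name": "Ada",   "dept": "R&D"},
]
print(fd_holds(employees, ["emp_id"], ["emp_name", "dept"]))  # True
```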
Functional dependencies serve as the foundation for determining a relation’s keys. A key is defined as an attribute or set of attributes that uniquely determines all other attributes in a relation’s tuple (Codd, 1990; Garcia-Molina, Ullman & Widom, 2009). Keys can be further differentiated as candidate keys, super keys and primary keys. Candidate keys are minimal keys in the sense that no proper subset of a candidate key’s attributes is itself a key. In contrast, a super key is any superset of a minimal candidate key. If a relation has more than one candidate key, one of the candidate keys is typically chosen to be the primary key. It should be noted that the choice of the primary key requires knowledge of an organization’s business rules and information needs, as this choice impacts data storage and retrieval (Garcia-Molina, Ullman & Widom, 2009).
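These definitions can be exercised with a brute-force sketch. Note the caveats: the true keys of a relation are fixed by the business rules, so checking a sample instance only approximates them, and the subset search is exponential; the code below (hypothetical data again) is illustration only.

```python
from itertools import combinations

def candidate_keys(rows, attributes):
    """Brute-force the candidate keys of a relation instance.

    A subset is a superkey if no two tuples agree on all of its
    attributes; a candidate key is a superkey none of whose proper
    subsets is also a superkey.
    """
    def is_superkey(attrs):
        projections = [tuple(row[a] for a in attrs) for row in rows]
        return len(projections) == len(set(projections))

    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Skip supersets of keys already found: they are not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if is_superkey(combo):
                keys.append(combo)
    return keys

employees = [
    {"emp_id": 1, "emp_name": "Ada",   "dept": "R&D"},
    {"emp_id": 2, "emp_name": "Grace", "dept": "IT"},
    {"emp_id": 3, "emp_name": "Ada",   "dept": "IT"},
]
print(candidate_keys(employees, ["emp_id", "emp_name", "dept"]))
# [('emp_id',), ('emp_name', 'dept')] -- the second appears only because
# the sample is small, illustrating why instances alone cannot fix keys
```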
Central to the normalization process is an effort to eliminate data redundancy. Data redundancy occurs when equivalent data is unnecessarily stored in different places within the database. Redundancy wastes disk space; more importantly, it increases RDBMS complexity and management overhead and renders the database susceptible to data inconsistencies or anomalies. The presence of anomalies decreases the RDBMS’s ability to provide accurate and relevant information and therefore negatively impacts organizational effectiveness. Data anomalies may be further distinguished as update anomalies, insertion anomalies and deletion anomalies. Note that the ability to correctly update, insert and delete data is a core requirement of the relational model and its first-order predicate logic basis; the presence of anomalies is therefore not in accord with the relational model (Codd, 1990, p. 29).
To provide a necessary foundation, a few examples of the anomalies that result from data redundancy and improperly normalized relations follow. When redundant information is stored in multiple places throughout the database, extra effort must be taken to update all data occurrences consistently (Rob & Coronel, 2007). Improperly normalized data may require the insertion of dummy or null values when data is added to the database (Rob & Coronel, 2007). The presence of nulls increases RDBMS design and management complexity, as nulls must be represented and acted upon in a systematic way independent of the missing database values (Codd, 1990, p. 39). To clarify, if the systematic RDBMS null representation is not in accord with the data’s domain, the RDBMS must take this into account with respect to its constraints, predicate logic and relational algebra processing (Codd, 1990). Lastly, deleting data should not cascade and delete other associated yet independent information if that information cannot be regenerated or located through relational algebra (Rob & Coronel, 2007).
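The update anomaly in particular is easy to demonstrate. In this hypothetical un-normalized table (names and values invented for illustration), the advisor’s office is repeated in every enrollment row, so a single fact has multiple stored copies:

```python
# Hypothetical un-normalized table: each row repeats the advisor's
# office, so one fact is stored in several places.
students = [
    {"student": "Kim", "advisor": "Lee", "advisor_office": "B-204"},
    {"student": "Raj", "advisor": "Lee", "advisor_office": "B-204"},
]

# Update anomaly: changing the office in only one tuple leaves the
# database answering the same question two different ways.
students[0]["advisor_office"] = "C-101"
print({row["advisor_office"] for row in students if row["advisor"] == "Lee"})
# {'C-101', 'B-204'} -- inconsistent copies of the same fact
```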
As introduced above, it must be emphasized that proper RDBMS design also requires professional judgment, data modeling and detailed knowledge of the organization’s capabilities, culture and business rules (Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007; Satzinger, Jackson & Burd, 2002). As examples of different design goals, consider that organizations may need their RDBMS design to optimize transaction speed, support timely information retrieval or support extensive data warehousing, as these goals will alter the RDBMS’s design tenets (Elmasri & Navathe, 2000; Rob & Coronel, 2007; Silberschatz, Korth & Sudarshan, 1999).
Normalization Methodology
Database normalization is the process of analyzing data and business rules so that the RDBMS designer can decompose, or disaggregate, data into separate R-tables in an attempt to reduce and control data redundancy (Codd, 1990; Rob & Coronel, 2007). It should be noted that properly normalized data may still contain redundant data; however, this redundancy is necessary to retain relationships and referential integrity through foreign keys.
The end result of normalization is a set of tables that are more flexible, scalable and less complicated to use (Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007). In this sense, proper normalization establishes the constraints necessary for a database system to operate effectively (Garcia-Molina, Ullman & Widom, 2009). It should also be noted that a natural and positive side effect of database analysis and normalization is to reveal and document an organization’s operations (Rob & Coronel, 2007).
The normalization process disaggregates relations by reassigning attributes from un-normalized R-tables to new, normalized R-tables based on their functional dependence. This disaggregation is in accord with the splitting/combining rule, as the original relation’s information can be retrieved through relational algebra (Garcia-Molina, Ullman & Widom, 2009, p. 73). The end goal of the normalization process is to create a structure where: (a) each R-table represents a single entity, (b) redundant data is only stored when necessary to support the system’s relational constructs and, (c) all attributes in a table are dependent on a primary key (Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007).
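The splitting/combining rule can be illustrated with a small sketch (hypothetical data, invented for this illustration): projecting a relation onto two attribute sets that share the dependency’s determinant and then recombining them with a natural join recovers exactly the original tuples.

```python
def project(rows, attrs):
    """Project a relation onto attrs, removing duplicate tuples."""
    result = []
    for row in rows:
        r = {a: row[a] for a in attrs}
        if r not in result:
            result.append(r)
    return result

def natural_join(left, right):
    """Natural join on the attributes the two relations share."""
    shared = set(left[0]) & set(right[0])
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in shared)]

# Hypothetical relation with the FD advisor -> advisor_office.
rows = [
    {"student": "Kim", "advisor": "Lee", "advisor_office": "B-204"},
    {"student": "Raj", "advisor": "Lee", "advisor_office": "B-204"},
    {"student": "Ana", "advisor": "Wu",  "advisor_office": "A-310"},
]
advising = project(rows, ["student", "advisor"])
offices  = project(rows, ["advisor", "advisor_office"])

# The decomposition is lossless: joining the projections reproduces
# the original relation exactly.
assert sorted(natural_join(advising, offices), key=str) == sorted(rows, key=str)
```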
The normalization process was developed and refined by Edgar Codd and operates by moving sequentially from un-normalized data to successively higher normal forms (Codd, 1990). As defined by Codd, these normal forms are: (a) first normal form (1NF), (b) second normal form (2NF), (c) third normal form (3NF) and, (d) Boyce-Codd normal form (BCNF) (Codd, 1990; Rob & Coronel, 2007). While it is possible to normalize data even further to fourth, fifth or sixth normal forms, 3NF or BCNF is typically the end goal for production, transaction-driven database design (Rob & Coronel, 2007).
It must be noted that increasingly higher normal forms often require more relational join operations, which consume more system resources, resulting in increased costs and slower system response. On this basis, normal forms higher than BCNF are typically reserved for RDBMS research or for distinct and precise applications (Rob & Coronel, 2007). This paper focuses on the normalization of data beginning with 1NF and culminating with BCNF.
1NF (First Normal Form)
By definition, a database is in 1NF if: (a) its tables do not contain tuples with repeating groups, (b) all key attributes are defined (i.e., they do not contain null values), and (c) all attributes are dependent on the primary key (Codd, 1990; Elmasri & Navathe, 2000; Rob & Coronel, 2007). According to Rob and Coronel (2007, p. 152), the methodology used to achieve 1NF can be viewed as a three-step process; a minimal sketch of the conversion follows the steps below.
1. The designer must eliminate repeating groups or multiple entries. This can be achieved by requiring that each cell be atomic. The designer must also eliminate nulls by ensuring that each attribute in a group contains an appropriate data value.
2. The designer identifies the primary key.
3. The designer identifies all partial and transitive dependencies within the table in accord with the data and business rules. By definition, a partial dependency exists when part of a composite primary key determines non-prime attributes (Rob & Coronel, 2007). Non-prime attributes are defined as attributes that are not part of any candidate keys (Rob & Coronel, 2007). A transitive dependency exists when non-prime attributes determine other non-prime attributes (Rob & Coronel, 2007).
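The sketch below walks through the first two steps on a hypothetical student-registration record (names and data invented): the repeating group of courses is flattened so that every cell is atomic, after which (student_id, course) serves as the composite primary key.

```python
# Hypothetical un-normalized records: the "courses" attribute holds a
# repeating group, violating 1NF's atomicity requirement.
raw = [
    {"student_id": 1, "name": "Kim", "courses": ["DB101", "OS201"]},
    {"student_id": 2, "name": "Raj", "courses": ["DB101"]},
]

# Step 1: eliminate the repeating group -- one tuple per course, each
# cell atomic.  Step 2: the primary key becomes (student_id, course).
first_nf = [
    {"student_id": r["student_id"], "name": r["name"], "course": c}
    for r in raw
    for c in r["courses"]
]
for row in first_nf:
    print(row)
```

Step 3 then operates on this flattened table: student_id alone determines name, a partial dependency that the 2NF conversion below will remove.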
2NF (Second Normal Form)
By definition, a database is in 2NF if it is in 1NF and does not contain any partial dependencies (Rob & Coronel, 2007). This improves RDBMS integrity, rendering it less susceptible to anomalies (Codd, 1990; Elmasri & Navathe, 2000). Note that an R-table in 2NF may still have transitive dependencies and therefore may still suffer from anomalies. Conversion to 2NF builds on the framework established in the conversion to 1NF and proceeds as follows.
1. The dependencies identified in the 1NF conversion can be depicted graphically in a data model, allowing the designer to disaggregate the data into additional tables satisfying the 2NF definition above (Rob & Coronel, 2007, p. 154).
2. The identified partial dependencies and their determinant and determined attributes can be separated into new tables, allowing those prime attributes to become the primary keys of the new tables. As introduced above, candidate keys are defined to be minimal keys: no proper subset of a candidate key is itself a key of the R-table.
3. The original composite key is used to create an R-table that captures any attributes not accounted for in the newly created partial-dependency relations and links the disaggregated tables through the use of foreign keys.
This 2NF transformation process preserves all of the database’s information since the original information can be retrieved through relational algebra (Codd, 1990).
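As a minimal sketch, continuing the hypothetical 1NF table above with composite key (student_id, course) and partial dependency student_id → name, the conversion looks like this:

```python
# Hypothetical 1NF table with composite key (student_id, course).
# The FD student_id -> name is a partial dependency: part of the
# composite key determines a non-prime attribute.
enrollment_1nf = [
    {"student_id": 1, "name": "Kim", "course": "DB101", "grade": "A"},
    {"student_id": 1, "name": "Kim", "course": "OS201", "grade": "B"},
    {"student_id": 2, "name": "Raj", "course": "DB101", "grade": "A"},
]

# Step 2: the partial dependency moves to its own table, its
# determinant (student_id) becoming the new table's primary key.
students = {r["student_id"]: {"student_id": r["student_id"], "name": r["name"]}
            for r in enrollment_1nf}

# Step 3: the original composite key keeps the fully dependent
# attribute (grade), with student_id acting as a foreign key.
enrollment_2nf = [
    {"student_id": r["student_id"], "course": r["course"], "grade": r["grade"]}
    for r in enrollment_1nf
]
```

A natural join of the two new tables on student_id reproduces the original 1NF table, which is the information-preservation property the paragraph above describes.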
3NF (Third Normal Form)
By definition, a database is in 3NF if it is in 2NF and it does not contain transitive dependencies as defined above (Codd, 1990; Elmasri & Navathe, 2000; Rob & Coronel, 2007). Removal of transitive dependencies follows a similar methodology as was taken with the conversion to 2NF.
1. Using the dependencies identified in the third step of the 1NF transformation and the dependency graph developed for the 2NF transformation, the designer disaggregates the data even further by creating a new relation for each transitive dependency.
2. In this new relation the transitive determinant becomes the primary key, and it remains in the original table as a foreign key. As cited above, it is typical for transaction-driven production databases to go only as far as 3NF, as this eliminates most update, insertion and deletion anomalies (Codd, 1990; Rob & Coronel, 2007).
It should also be noted that 3NF often meets the requirements of the higher BCNF and fourth normal forms (Rob & Coronel, 2007).
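Assuming a hypothetical employee table keyed by emp_id with the transitive chain emp_id → dept_id → dept_name (invented for illustration), a sketch of the 3NF conversion is:

```python
# Hypothetical 2NF table keyed by emp_id.  The chain
# emp_id -> dept_id -> dept_name is a transitive dependency: a
# non-prime attribute (dept_id) determines another (dept_name).
employees_2nf = [
    {"emp_id": 1, "emp_name": "Ada",   "dept_id": 10, "dept_name": "R&D"},
    {"emp_id": 2, "emp_name": "Grace", "dept_id": 10, "dept_name": "R&D"},
    {"emp_id": 3, "emp_name": "Alan",  "dept_id": 20, "dept_name": "IT"},
]

# The transitive determinant dept_id becomes the primary key of the
# new relation and stays behind in employees as a foreign key.
departments = {r["dept_id"]: {"dept_id": r["dept_id"], "dept_name": r["dept_name"]}
               for r in employees_2nf}
employees_3nf = [
    {"emp_id": r["emp_id"], "emp_name": r["emp_name"], "dept_id": r["dept_id"]}
    for r in employees_2nf
]
```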
BCNF
By definition, an R-table is in BCNF when every determinant in the R-table is a candidate key (Codd, 1990; Elmasri & Navathe, 2000; Rob & Coronel, 2007). Couched another way, an R-table is not in BCNF when it contains a determinant that is not a candidate key. Intuitively, BCNF is equivalent to 3NF when an R-table contains only one candidate key (Rob & Coronel, 2007). It must be noted that it has been shown that BCNF is not always achievable while preserving all dependencies, as certain 3NF schemas cannot be decomposed into BCNF without losing dependencies (Beeri & Bernstein, 1979). Recall that the conversion to 2NF removed partial dependencies and the conversion to 3NF removed transitive dependencies; however, dependencies may still exist where a key attribute is determined by a non-key attribute. The existence of this type of dependency renders the database susceptible to logical or semantic errors (Codd, 1990). To overcome this shortcoming, it may be possible to rearrange an R-table’s primary key to include this non-prime determinant and facilitate further disaggregation. If this rearrangement is possible, additional partial dependencies may be identified, facilitating R-table disaggregation as described in the 2NF conversion process above. The resulting relation schemas preserve the database’s information, as the disaggregation follows from the relation’s functional dependencies and their closure serves as an equivalent basis (Garcia-Molina, Ullman & Widom, 2009).
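The BCNF condition can be tested mechanically with the standard attribute-closure computation: a determinant X is a superkey exactly when its closure under the schema’s functional dependencies is the full attribute set. The sketch below applies this to the classic (student, course, instructor) schema, where instructor → course holds but instructor is not a candidate key, giving a table that is in 3NF yet not in BCNF.

```python
def violates_bcnf(fds, all_attrs):
    """Return the FDs whose determinant is not a superkey (BCNF violations).

    fds is a list of (determinant, dependent) pairs of attribute sets.
    """
    def closure(attrs):
        # Standard attribute-closure: keep applying FDs whose left side
        # is already in the result until nothing new is added.
        result = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if set(lhs) <= result and not set(rhs) <= result:
                    result |= set(rhs)
                    changed = True
        return result

    return [(lhs, rhs) for lhs, rhs in fds
            if closure(lhs) != set(all_attrs)]

# Classic example: (student, course) -> instructor and
# instructor -> course, but instructor is not a candidate key.
fds = [({"student", "course"}, {"instructor"}),
       ({"instructor"}, {"course"})]
print(violates_bcnf(fds, {"student", "course", "instructor"}))
# [({'instructor'}, {'course'})] -- the table is 3NF but not BCNF
```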
Summary
RDBMS support today’s business environment by allowing organizations to store, access, and modify data quickly and easily on low-cost computers (Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007). Proper RDBMS design facilitates IS integration across supply and value chains and minimizes data inconsistency, enhancing organizational productivity and agility (Rob & Coronel, 2007; Satzinger, Jackson & Burd, 2002). As with IS analysis and design, normalization must follow a formal methodology to eliminate anomalies while preserving relationships and functional dependencies in order to guarantee lossless decomposition.
Database normalization is part of the IS design process, as it allows the RDBMS to conform to design standards, minimizing data redundancies and destructive data anomalies while concomitantly creating well-defined components necessary for integration and evaluation (Rob & Coronel, 2007). It must also be noted that RDBMS design is an iterative process consistent with contemporary IS analysis and design best practices (Rob & Coronel, 2007; Satzinger, Jackson & Burd, 2002). On this basis, 3NF and BCNF are not static end goals, as designers must continually seek to improve database design. These continuing improvement initiatives seek to enhance the database’s: (a) operational characteristics, (b) information presentation, (c) naming conventions, (d) relationships, (e) attributes, (f) surrogate keys and, (g) historical accuracy (Rob & Coronel, 2007, p. 158). With this understanding, RDBMS design must be in accord with organizational needs; extensive normalization may therefore conflict with the RDBMS’s intended purpose. As a result, it may also be necessary to de-normalize portions of the database as the organization and RDBMS evolve.
The sections above go into more detail, but for a quick overview see the video “1st, 2nd & 3rd Normal Forms” on YouTube.
References
Beeri, C., & Bernstein, P. A. (1979). Computational problems related to the design of normal form relational schemas. ACM Transactions on Database Systems, 4(1), 30–59. Retrieved June 26, 2010, from http://portal.acm.org/citation.cfm?id=320066
Codd, E. F. (1990). The relational model for database management: Version 2. Boston, MA: Addison-Wesley Longman Publishing Co. Retrieved June 1, 2010, from http://portal.acm.org/citation.cfm?id=77708
Elmasri, R., & Navathe, S. B. (2000). Fundamentals of database systems. Reading, MA: Addison-Wesley.
Garcia-Molina, H., Ullman, J. D., & Widom, J. (2009). Database systems: The complete book (2nd ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Pearlson, K. E., & Saunders, C. S. (2006). Managing and using information systems (3rd ed.). Hoboken, NJ: Wiley Publishing.
Rob, P., & Coronel, C.M. (2007). Database systems design, implementation, and management (7th ed.). Boston, MA: Thomson Course Technology.
Satzinger, J. W., Jackson, R. B., & Burd, S. D. (2002). Systems analysis and design (2nd ed.). Boston, MA: Course Technology.
Silberschatz, A., Korth, H. F., & Sudarshan, S. (1999). Database system concepts (3rd ed.). Boston, MA: McGraw-Hill.