Introduction
The purpose of this module is to evaluate indexing and hashing and how they improve the performance of relational database management systems (RDBMS). As a basis, an RDBMS must be designed to make optimal use of the storage hierarchy, since the access speeds of its various components (e.g. registers, cache, memory, on-line secondary storage and off-line secondary storage) often differ by an order of magnitude or more (Elmasri & Navathe, 2000; Nutt, 2004; Silberschatz & Galvin, 1999). Efficient management of the storage hierarchy is necessary since most databases are too large to fit in main memory, and an RDBMS requires tighter access and transactional control than most operating systems (OS) provide (Silberschatz, Korth & Sudarshan, 1999). To illustrate this context, recall that file systems can only sort and store their records in secondary storage based on a single field. While a sorted record structure facilitates binary search at the disk block level, maintaining a sorted file is expensive and does not account for the mixed record files that typically comprise contemporary RDBMS applications (Elmasri & Navathe, 2000).
Investigating RDBMS processing in further detail, recall that an RDBMS maintains logical relationships between tuples through the use of foreign keys. In order to perform a query or process information based on tuple attributes from more than one relation, a tuple must be brought into memory to obtain the connecting foreign keys and retrieve any foreign-key-related tuples (Elmasri & Navathe, 2000). These diverse RDBMS processing needs require enhanced mechanisms to access records stored on disk blocks in secondary storage. In order to provide fine-grained access to an RDBMS's database, contemporary RDBMS necessarily implement indexing mechanisms to maintain a mapping to records located in secondary storage (Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009). To this extent, enterprise RDBMS may add functionality to or even assume control of the storage hierarchy to achieve their requisite performance (Silberschatz, Korth & Sudarshan, 1999).
Indexing Structures
By definition, an RDBMS index is a secondary access path and auxiliary structure that maintains pointers mapping to record locations on secondary storage disk blocks (Codd, 1990; Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009). Indexes seek to improve performance by allowing the RDBMS to perform data retrieval and join processing with a single pass over the data (Codd, 1990). Indexes also allow the RDBMS to optimize storage, as they facilitate records that span disk block boundaries, decreasing internal disk fragmentation (Elmasri & Navathe, 2000).
Indexes are commonly created on column or attribute values; however, they can also be extended to include domains (Codd, 1990). According to Codd (1990), a domain-based index is a single level index on all the columns that share a specified domain. This allows a domain-based index to act as a multi-relational index, quickly locating all qualifying data throughout the database (Codd, 1990). It is intuitive that domain-based indexes can be applied to primary and foreign keys, and this serves to enhance constraint, referential integrity and trigger processing performance (Codd, 1990). This performance enhancing functionality is so fundamental that many RDBMS, including Oracle, automatically index all primary keys and unique attributes (Garcia-Molina, Ullman & Widom, 2009; Groff, Weinberg & Oppel, 2010).
In addition to selecting the right components to index, and similar to OS directory structures, RDBMS indexes must complement the system's resources, as a large number of dense indexes can consume large amounts of memory and decrease system performance (Silberschatz & Galvin, 1999). By definition, dense indexes have an index entry for every search key value and therefore every record, whereas sparse indexes have incomplete entries and consume less memory (Elmasri & Navathe, 2000). In the extreme case, a large number of dense indexes and insufficient memory could require the system to page the indexes into and out of memory, incurring significant delays.
The database administrator (DBA) must also consider that record insertion and deletion requires that all indexes associated with the affected key fields be updated within each transaction (Codd, 1990). On this basis, index creation and the selection of appropriate index structures must be performed in accord with the system's resources, the information's intended use and organizational policies (Codd, 1990).
As outlined by Codd (1990), the DBA is responsible for creating indexes to improve RDBMS performance and to use the storage hierarchy efficiently. Index creation may take place when relations or tables are defined; however, for flexibility, an RDBMS should also provide the ability to manage indexes through the CREATE INDEX and DROP INDEX SQL statements (Codd, 1990). The first step is choosing which fields or attributes to index. Depending on the RDBMS implementation, the DBA may then have several index structures to choose from, including: (a) single level indexes that include primary, clustering and secondary indexes, (b) tree structures that include multi-level indexes and balanced trees, and (c) hash based structures (Elmasri & Navathe, 2000).
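To make this index management interface concrete, the following minimal sketch uses Python's built-in sqlite3 module to issue CREATE INDEX and DROP INDEX statements; the employee table and its columns are hypothetical examples rather than constructs from the cited sources.

    import sqlite3

    # Open an in-memory database for illustration.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Declaring a primary key causes the system to enforce uniqueness
    # through an automatically created index.
    cur.execute(
        "CREATE TABLE employee (ssn TEXT PRIMARY KEY, name TEXT, dept_no INTEGER)"
    )

    # The DBA adds a secondary index on a frequently queried non-key field...
    cur.execute("CREATE INDEX idx_employee_dept ON employee (dept_no)")

    # ...and can later drop it if its maintenance cost outweighs its benefit.
    cur.execute("DROP INDEX idx_employee_dept")
    conn.close()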
Single Level Indexes
A single level index is an ordered index structure similar to the indexes found at the end of a textbook. Textbook indexes typically have entries for important topics and keywords along with the page numbers where the entries may be found. This allows the reader to search on a topic and retrieve the relevant page number. Mapping this structure into an RDBMS environment requires selecting an indexing attribute and a corresponding pointer to a disk block (Elmasri & Navathe, 2000). This type of ordered index supports binary search, which can be accomplished in logarithmic time (Elmasri & Navathe, 2000).
Single level indexes can be further categorized as primary, clustering and secondary indexes. Both primary and clustering indexes are based on an ordered or sorted file of records (Elmasri & Navathe, 2000). Consistent with the introduction above, only one primary or clustering index can be defined for a file of records, since the file can only be sorted on a single field. On this basis, primary and clustering indexes can be sparse: because they support binary search, they need only maintain a reference to the first record of each disk block, known as the block anchor, rather than a mapping for each record's index field (Elmasri & Navathe, 2000). It should be noted this requires that the disk block be brought into memory and searched for the desired record.
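The following minimal sketch illustrates a sparse primary index over block anchors, assuming fixed blocks of records sorted on the primary key; the keys, payloads and block layout are invented for illustration.

    from bisect import bisect_right

    # Simulated ordered file: each "disk block" holds records sorted by
    # primary key; a record here is just (key, payload).
    blocks = [
        [(2, "a"), (5, "b"), (8, "c")],
        [(11, "d"), (13, "e"), (17, "f")],
        [(21, "g"), (25, "h"), (29, "i")],
    ]

    # Sparse index: one entry per block, holding the block anchor's key.
    anchors = [block[0][0] for block in blocks]  # [2, 11, 21]

    def lookup(key):
        # Binary search the index (logarithmic time) for the block whose
        # anchor is the largest key less than or equal to the search key.
        i = bisect_right(anchors, key) - 1
        if i < 0:
            return None
        # Bring the block "into memory" and scan it for the record.
        for k, payload in blocks[i]:
            if k == key:
                return payload
        return None

    assert lookup(13) == "e"
    assert lookup(14) is None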
Primary and clustering indexes differ in that a primary index is based on a primary key field and implicitly requires that every index key value be unique. In contrast, a clustering index is based on a non-unique clustering field and maintains one pointer for each distinct index value. Since a clustering index entry can lead to several records, it is possible that more than one disk block will have to be read into memory and searched to obtain the required record.
Primary and clustering indexes that map to an ordered file of records incur substantial overhead, as the file must be maintained in sorted order when records are inserted or deleted. Insertion cost can be mitigated through the use of linked records or an unordered overflow file; however, these mechanisms complicate record search and retrieval (Elmasri & Navathe, 2000). The time spent re-sorting the file following record deletion can be avoided through the use of deletion markers; however, this intuitively wastes disk space.
A secondary index is an ordered index structure that maps an index key to an unordered file of records (i.e. a file sorted on an attribute other than the secondary index key). In this framework, a secondary index creates a logical ordering based on the index field. A DBA may specify many secondary indexes, and these indexes can be based on distinct key values or, similar to clustering indexes, on non-distinct values. Since the file of records is not ordered on this attribute, secondary indexes are necessarily dense and can consume considerable memory if the index field is large. While maintaining a secondary index consumes resources, the performance improvement can be substantial, since the file would have to be searched linearly if the index were not present.
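The following minimal sketch contrasts this with a dense secondary index on a non-ordering, non-distinct field; the records and field names are invented for illustration.

    from collections import defaultdict

    # Unordered file: record number -> (ssn, dept); dept is the index field.
    records = {0: ("111", "sales"), 1: ("222", "hr"), 2: ("333", "sales")}

    # Dense secondary index: every record contributes an entry, and
    # non-distinct values map to a list of record pointers.
    dept_index = defaultdict(list)
    for rec_no, (ssn, dept) in records.items():
        dept_index[dept].append(rec_no)

    # Without the index, finding all "sales" records would require a
    # linear scan of the file; with it, the pointers are retrieved directly.
    assert dept_index["sales"] == [0, 2]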
Multi-level Indexes
Multi-level indexes reduce the portion of the index that must be searched at each step and can therefore reduce search time and speed retrieval. Functionally, the top level of a multi-level index is similar to a single level index; however, instead of pointing to a disk block, each of its references points to a block of the next index level. That level can in turn point to records or disk blocks, or to subsequent levels that extend the multi-level index even further.
Typically the need for additional levels is driven by the size of the index structure in relation to the disk block (Elmasri & Navathe, 2000). If the first-level index structure is larger than a disk block, a second-level index is created over it (Elmasri & Navathe, 2000). If the second level is larger than a disk block, a third level is created, and this process may continue until the topmost index level fits within a single disk block (Elmasri & Navathe, 2000). An example of this type of multi-level index is the indexed sequential access method (ISAM) file implemented on many early IBM mainframes; however, multi-level indexes have largely been replaced by balanced tree implementations (Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009).
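A short worked calculation shows why only a few levels are needed; the fan-out of 100 index entries per disk block and the file size of one million entries are invented round numbers.

    import math

    fan_out = 100        # index entries per disk block (assumed)
    entries = 1_000_000  # entries in the first-level index (assumed)

    # Each additional level indexes the blocks of the level below it, so
    # the index shrinks by the fan-out at every step until one block remains.
    levels, blocks = 0, entries
    while blocks > 1:
        blocks = math.ceil(blocks / fan_out)
        levels += 1

    print(levels)  # 3: three levels cover a million index entries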
Balanced Trees
Balanced trees (B-trees) have largely replaced multi-level indexes due to their enhanced efficiency and as a result have become the accepted default structure for indexes generated dynamically on demand (Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009). B-trees are a special form of tree in which the leaf nodes are all at the same depth and whose structure is defined by a parameter N that is often chosen in relation to the disk block size (Garcia-Molina, Ullman & Widom, 2009). Ensuring all leaf nodes are at the same level and choosing the parameter N in relation to the disk block size serves to minimize disk block accesses (Elmasri & Navathe, 2000). The parameter N determines the maximum number of index values and pointers maintained by each B-tree node.
A B-tree dynamically adapts its structure to ensure each interior node maintains a number M of index values between N/2 and N, along with M + 1 pointers. Requiring that nodes maintain at least N/2 index values minimizes wasted space and serves to keep the tree balanced (Elmasri & Navathe, 2000). B-trees adapt their structure by splitting internal nodes to accommodate insertions and joining internal nodes to accommodate deletions, resulting in a structure that maintains sorted order (Elmasri & Navathe, 2000). An internal node must be split when a new value arrives and the node is already full (Elmasri & Navathe, 2000). Internal nodes must be joined when a deletion leaves a node with fewer than the minimum number of children (i.e. fewer than N/2) (Elmasri & Navathe, 2000). In either case the changes may propagate up the tree, maintaining a balanced tree structure (Elmasri & Navathe, 2000).
B-trees support efficient logical indexing by minimizing the disk I/O required for record access (Garcia-Molina, Ullman & Widom, 2009). B-trees become less useful when the number of index keys grows very large, since B-tree maintenance and disk access can become expensive with a large number of levels (Elmasri & Navathe, 2000). In order to locate a desired record, the system begins its traversal at the root and follows pointers from parent to child by comparing the search value against each node's index values. If the desired index value matches a value in the current node, that node supplies the record's disk block address; otherwise, the traversal continues downward. It should be noted that a B+-tree is a special case of a B-tree in which only leaf nodes contain pointers to records or disk blocks (Elmasri & Navathe, 2000).
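The following minimal sketch illustrates B+-tree-style search, in which internal nodes only route the traversal and leaf nodes supply the record pointers; the node layout and keys are simplified inventions, not a production implementation.

    from bisect import bisect_right

    class Node:
        def __init__(self, keys, children=None, pointers=None):
            self.keys = keys          # sorted index key values
            self.children = children  # child nodes (internal node only)
            self.pointers = pointers  # record/block pointers (leaf only)

    def search(node, key):
        # Descend from the root, choosing the child whose key range
        # covers the search key, until a leaf is reached.
        while node.children is not None:
            node = node.children[bisect_right(node.keys, key)]
        # At the leaf, look the key up among its entries.
        for k, ptr in zip(node.keys, node.pointers):
            if k == key:
                return ptr
        return None

    # Tiny two-level tree: the root routes keys < 10 left, >= 10 right.
    left = Node(keys=[3, 7], pointers=["blk0", "blk1"])
    right = Node(keys=[10, 15], pointers=["blk2", "blk3"])
    root = Node(keys=[10], children=[left, right])

    assert search(root, 15) == "blk3"
    assert search(root, 8) is None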
Hashing
As an alternative to indexing, hashing can be used to provide a mapping from an attribute value to a disk block (Elmasri & Navathe, 2000). Similar to indexes, hashing uses structures (e.g. hash tables, extendible hashing directories and linear hashing) that consist of a key and a reference that points to the information's disk location. Similar to the sparse indexes introduced above, the disk block is brought into memory and searched further to obtain the desired record (Elmasri & Navathe, 2000).
Hashing derives its hash key by applying a randomizing hash function to the hash field (Elmasri & Navathe, 2000). This hash function is typically an arithmetic or logical function that folds the key into an established range of numbers (Elmasri & Navathe, 2000; Silberschatz & Galvin, 1999). With this basis, the difference between hashing and indexing is that a hash-based disk reference is related to a derived field whereas an index-based disk reference is related to the index field itself (Elmasri & Navathe, 2000). It should also be noted that some data sets are intrinsically ordered and therefore lend themselves to ordered hashes (e.g. invoice numbers) (Elmasri & Navathe, 2000).
In an RDBMS environment, the hash function is applied to an attribute value to obtain the hash key, and this key is applied to the hash structure to obtain the desired record's disk block. Efficient hashing mechanisms can typically locate a record's disk block in a single access and can therefore provide the quickest mechanism for accessing random records (Elmasri & Navathe, 2000). The tradeoff for this quick access is the memory required for the hashing structure and the time it takes to compute the hash key and resolve collisions. Collisions occur when two different hash field values map to the same hash key, or when the hash field value itself is not unique. A system can accommodate collisions through various methods, including open addressing, which uses locations adjacent to the computed hash key; chaining, which uses references that point to overflow positions; and multiple hashing, where a second hash function is applied following a collision.
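The following minimal sketch illustrates hashing with collision resolution by chaining; the fold-by-modulus hash function and the bucket count are illustrative choices rather than prescriptions from the cited sources.

    N_BUCKETS = 7  # assumed number of hash table slots

    def hash_key(field_value):
        # Arithmetic hash function folding the field into [0, N_BUCKETS).
        return hash(field_value) % N_BUCKETS

    buckets = [[] for _ in range(N_BUCKETS)]  # each slot chains overflow

    def insert(field_value, record_ptr):
        # Colliding values simply extend the chain at the computed slot.
        buckets[hash_key(field_value)].append((field_value, record_ptr))

    def lookup(field_value):
        # One hash computation locates the slot; the chain is then scanned.
        for value, ptr in buckets[hash_key(field_value)]:
            if value == field_value:
                return ptr
        return None

    insert("ssn-123", "blk4")
    insert("ssn-456", "blk9")
    assert lookup("ssn-123") == "blk4"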
Hashing can be further categorized as internal hashing and external hashing. Internal hashing is used when the information is located in memory and is typically implemented with an array-based hash table. Various OS use internal hashing to maintain their file directory structures (Silberschatz & Galvin, 1999). External hashing is typically used to access records stored on secondary storage. Consistent with this module's purpose, the following analysis focuses on external hashing.
In a general RDBMS external hashing environment, a hash key computed from a hash field is used to locate target disk blocks. The system must accommodate a dynamically expanding and shrinking number of buckets without a loss of efficiency, and the two primary mechanisms for achieving this goal are extendible hashing and linear hashing. Extendible hashing groups disk blocks into buckets, where a bucket can comprise a single disk block or a cluster of contiguous disk blocks, and maintains a directory that maps hash key prefixes to those buckets. This storage abstraction not only accommodates dynamically expanding databases but also alleviates collisions, since a bucket built from a cluster of disk blocks can hold many records. Linear hashing allows the hash file to expand without the directory structure required by extendible hashing: an overflowing bucket is split into two separate buckets, and its existing and colliding entries are remapped using a second hash function. This mechanism allows the hash file to grow as necessary.
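The following minimal sketch illustrates extendible hashing, assuming the common variant that indexes the directory with the low-order bits of the hash value; the bucket capacity of two is an invented small number chosen to force directory growth.

    BUCKET_CAPACITY = 2  # records per bucket (invented small number)

    class Bucket:
        def __init__(self, local_depth):
            self.local_depth = local_depth
            self.items = {}  # hash field value -> record pointer

    global_depth = 1
    directory = [Bucket(1), Bucket(1)]  # 2**global_depth directory slots

    def dir_index(value, depth):
        # Low-order `depth` bits of the hash value select a directory slot.
        return hash(value) & ((1 << depth) - 1)

    def insert(value, ptr):
        global global_depth, directory
        bucket = directory[dir_index(value, global_depth)]
        bucket.items[value] = ptr
        while len(bucket.items) > BUCKET_CAPACITY:
            if bucket.local_depth == global_depth:
                directory = directory * 2  # double; buckets stay shared
                global_depth += 1
            bit = bucket.local_depth       # split on the next hash bit
            low, high = Bucket(bit + 1), Bucket(bit + 1)
            for v, p in bucket.items.items():
                (high if (hash(v) >> bit) & 1 else low).items[v] = p
            for j in range(len(directory)):  # repoint slots sharing the bucket
                if directory[j] is bucket:
                    directory[j] = high if (j >> bit) & 1 else low
            bucket = directory[dir_index(value, global_depth)]

    def lookup(value):
        return directory[dir_index(value, global_depth)].items.get(value)

    for i in range(8):
        insert("inv-%d" % i, "blk%d" % i)
    assert all(lookup("inv-%d" % i) == "blk%d" % i for i in range(8))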
Summary
By design an RDBMS maintains intimate logical relationships among diverse objects, and this information must be stored on persistent secondary storage. Additionally, an RDBMS must provide responsive access to its information in accord with organizational needs. By definition an index is a performance oriented object that supports information access (Codd, 1990). As posited by Codd (1990), indexing, whether index-based or hash-based, provides a requisite performance enhancement to today's RDBMS implementations. Both indexing and hashing provide efficient mechanisms for fast information access based on a search condition; however, they must be implemented judiciously since they can consume considerable amounts of memory if not designed well (Elmasri & Navathe, 2000). Central to hashing's effectiveness in a dynamic RDBMS environment is its ability to use memory efficiently, distribute records uniformly and minimize collisions.
While this module has discussed indexing and hashing, typical file organizations also allow logically related items to be physically clustered on disk if the relationships are expected to be used frequently, enhancing performance even further (Elmasri & Navathe, 2000). Due to length requirements, this module has omitted analysis of the records themselves (e.g. null values and variable versus fixed length records), even though they also impact the selection of appropriate storage and indexing structures.
References
Codd, E. F. (1990). The relational model for database management: version 2. Boston, MA: Addison-Wesley Longman Publishing Co. Retrieved June 1, 2010, from http://portal.acm.org/citation.cfm?id=77708
Elmasri, R., & Navathe, S. B. (2000). Fundamentals of database systems. Reading, MA: Addison-Wesley.
Garcia-Molina, H., Ullman, J. D., & Widom, J. (2009). Database systems: The complete book (2nd ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Groff, J. R., Weinberg, P. N., & Oppel, A. J. (2010). SQL: The complete reference (3rd ed.). New York: McGraw-Hill.
Nutt, G. (2004). Operating systems (3rd ed.). Boston, MA: Addison Wesley.
Silberschatz, A., & Galvin, P. B. (1999). Operating systems concepts (5th ed.). New York: John Wiley & Sons.
Silberschatz, A., Korth, H. F., & Sudarshan, S. (1999). Database system concepts (3rd ed.). Boston, MA: McGraw-Hill.