Query Optimization

Introduction

In this module we will analyze how a query compiler plan is generated and optimized in a relational database management system (RDBMS).  As previously introduced,  SQL queries allow agents to extract and modify RDBMS data.  SQL is an abstract fourth generation language that allows agents to specify what is to be done rather than how it is to be done.  This high level of abstraction requires that the RDBMS parse and translate the agents’ SQL commands into operational specifics (e.g. algorithms and data structures) that access and possibly control the system’s resources (e.g. CPU, distributed communications, memory and secondary storage) (Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).

Background

As previously introduced, there are often several ways to compose equivalent SQL queries (Codd, 1990; Silberschatz, Korth & Sudarshan, 1999).   This ambiguity is consistent with the RDBMS’ theoretical basis as there are often several ways to express an equivalent clause in relational algebra and predicate calculus (Codd, 1990; Silberschatz, Korth & Sudarshan, 1999).   With this basis it is clear there may be several ways to translate agents’ SQL queries into the operational specifics that comprise the physical query execution plan.

Query processing can generally be assessed according to its parsing and translation, optimization and evaluation techniques (Silberschatz, Korth & Sudarshan, 1999).  While these three aspects or phases will be evaluated separately below, the reader should note that they are integrated across the query generation process.  As an example, it is intuitive that evaluative statistics may create or verify heuristic approaches and thereby drive heuristic optimization into the parsing and translation phase.  Parsing and translation can be further broken down into three distinct steps as the SQL query is scanned, preprocessed and transformed into a relational algebra parse tree, transformed into a logical query plan from the relational parse tree, and transformed into a physical query plan from the logical query plan (Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).

It should be noted that consistent with other RDBMS concepts, query generation and optimization also relies on an understanding of organizational processes, the RDBMS’ intended purpose, the computer architecture, the operating system and programming theory and constructs.  To understand this integration from a system’s perspective consider that disk access times can dominate the cost of executing a query (Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).   This requires that the query generator translate or compile the queries into efficient execution plans and data structures that avail themselves to system capabilities minimizing disk access (Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).

Parsing and Translation

The process necessary to generate a physical query execution plan begins by taking an SQL query as input, scanning, preprocessing and parsing the input to generate a parse tree based on relational algebra (Silberschatz, Korth & Sudarshan, 1999).  While beyond the scope of this paper, this initial phase is in accord with compiler theory and its use of recursive top-down parsing to build grammatically correct parse trees in Backus-Nuar form (Aho, Sethi & Ullman, 2006; Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).  This relational algebra tree representation created is a more useful internal representation as it can be manipulated facilitating transformation into multiple logical query plans that can be evaluated, contrasted and optimized (Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).   It should be noted that this first step may alternatively create a parse graph relational algebra representation (Elmasri & Navathe, 2000) however this module will focus on the use of parse trees.

In the pre-processed parse tree representation, lexical elements or tokens (e.g. SQL keywords, relation names, attributes, operators and schema elements) exist as atomic leaf nodes (Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009).   Internal nodes represent syntactic categories and query subparts that are described further by rules of grammar (Garcia-Molina, Ullman & Widom, 2009).  The creation of this tree structure allows the preprocessor to verify and resolve the naming, scope, use and syntax of attributes, relations, indexes and domains in relation to the catalog information (Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).  The parse tree representation also allows the system to replace views with their relational equivalent of how the view was created facilitating optimization in subsequent steps.

The parse tree is next transformed into a logical query plan in the form of an expression tree and lastly, this logical expression tree is transformed into a physical query execution plan (Garcia-Molina, Ullman & Widom, 2009).  In contrast to compiling procedural high level languages, creation of a physical query execution plan requires that the system decide on the operational specifics.  This physical query execution plan augments the relational parse tree by specifying necessary operations and their order of execution as well as how the data is stored and transferred within the system (Garcia-Molina, Ullman & Widom, 2009).  With this basis a physical query execution plan represents a logical query annotated with execution instructions its components may be referred to as evaluation primitives (Silberschatz, Korth & Sudarshan, 1999).

Heuristic Optimization

Query optimization can be based on cost estimations or on established heuristics.  This section will focus on heuristic optimization and the subsequent section will focus on cost based optimization.  As previously cited, there are several ways to equivalently encode a SQL statement and its relational tree representation can be represented in several ways as well.  This allows and requires the RDBMS and DBA to decide on the optimal query execution plan based on query costs in accord with system resources and organizational processes (Silberschatz, Korth & Sudarshan, 1999).  From a system resource and execution cost perspective, consider that a single query may have multiple selections, projections and set operations.   If a particular query’s execution can be reordered to perform selections and projections early in its execution sequence, the query may be able to minimize disk I/O and avoid the use of materialized relations (Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).

To provide another heuristic optimization example, consider the evaluation ordering of joins as it makes sense to bring the smaller relation into memory first and use available indexes or hashes to subsequently join with the larger relation.  In this instance it may also be beneficial to sort the records and remove duplicates early in the process to further improve performance.  While these latter two approaches allow the system to bring only necessary components into memory and perform the join more efficiently, this optimization analysis may further assist the DBA with index selection (Codd, 1990; Silberschatz, Korth & Sudarshan, 1999).

Additionally depending on the available system resources, evaluation primitives may be able to pipeline their results between components further enhancing execution efficiency and decreasing the need to use secondary storage and its expensive disk I/O (Silberschatz, Korth & Sudarshan, 1999).  Note that this further emphasizes the integration of query optimization with the computer system as the execution enhancements achieved through pipelining vary depending on the operating system architecture’s implementation of pipelining (e.g. process or thread based, demand or producer driven and inter-process communications mechanisms) (Nutt, 2004; Silberschatz, Galvin & Gagne, 2010; Silberschatz, Korth & Sudarshan, 1999).

             To achieve these aforementioned heuristic based performance enhancements, a logical query plan represented in a relational logic-based tree structure may be transformed into an equivalent plan or tree structure through the use of relational algebra rewrite rules (e.g. relational set theory, commutative laws and associative laws) (Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).  The use of rewrite rules to create equivalent tree representations is also consistent with modern compiler optimization theory and practices (Aho, Sethi & Ullman, 2006). These equivalent representations are typically developed based on standard programming and data-structure heuristics that include dynamic programming, hill climbing and branch and bound and specific SQL heuristics as introduced above (Garcia-Molina, Ullman & Widom, 2009; Mitchell, 1997; Rich & Knight, 1991; Silberschatz, Korth & Sudarshan, 1999).

As introduced in the examples above, the optimizer will try to perform selections and projections as early as possible.  From a representation standpoint this can be achieved by moving projections and selections down in the tree closer to the leaves as execution begins at the leaf level.  While this early heuristic-based optimization is necessary, it must be noted that computing the exact cost of a query can only be done when the execution engine executes the physical execution plan and returns the results to the requesting agent (Silberschatz, Korth & Sudarshan, 1999).  Also note that this execution cost will vary dependent on many factors including the system’s load (Herodotou & Babu, 2009; Silberschatz, Korth & Sudarshan, 1999; Yagoub et al., 2008).

Evaluation and Cost Estimation

It must be noted at the outset that estimating costs can be an expensive endeavor by consuming considerable resources (Silberschatz, Korth & Sudarshan, 1999).  As introduced above, the cost to execute a query can only be determined after it has been transformed into a physical execution plan and executed by the query execution engine (Silberschatz, Korth & Sudarshan, 1999).  There are numerous factors that impact the cost of executing a query and include the system’s resources (e.g. CPU, memory, storage and communications infrastructure) and the nature and purpose of the RDBMS (e.g. distributed database, transactional system or analytical system).  As cited in previous CS 77005 coursework, it has even been asserted that correctly ascribing a cost to a particular query is impossible in a complex system (Yagoub et al., 2008).

Query cost evaluation is based on statistical information maintained in the catalog.  These statistics are typically computed periodically since they do not change radically over shorter intervals and can remain relevant if applied consistently (Garcia-Molina, Ullman & Widom, 2009).  This statistical information includes information on the relations and their number of tuples, attributes, bytes, blocking factor, number of distinct tuples and selection cardinalities (i.e. the average number of attributes that satisfy a particular condition) (Silberschatz, Korth & Sudarshan, 1999).   Catalogs also maintain information on indexes that detail the index’s tree structures (e.g. average fan out and depth) or hash structures allowing the system to estimate record access further (Silberschatz, Korth & Sudarshan, 1999).   These statistics allow the system and DBA to estimate the size of the base tables and intermediary and resulting relations allowing the optimizer to choose the optimal solution.

Summary

SQL query compilers take SQL queries as input and attempt to generate an optimal query execution plan that is subsequently executed by the query execution engine.  As introduced above, there are many factors in achieving an optimal query execution plan that go far beyond the scope of this module that has restricted analysis solely to the RDBMS where query optimization can be achieved through heuristic or cost based statistical approaches.

It is intuitive that actual statistical approaches can achieve better results than heuristic approaches however as cited in previous modules, today’s production RDBMS operate in an emergent, transitory and dynamic environment and with this basis, even an optimal approach based on statistics will have limited relevance in the long term.  To this extent it may be propitious to implement a hybrid approach as certain heuristics such as pushing selections and projections down in the tree to execute them early nearly always pays dividends (Garcia-Molina, Ullman & Widom, 2009).  Oracle offers a compelling example of this approach as its query optimizer has both heuristic and cost based components (Silberschatz, Korth & Sudarshan, 1999).  In conclusion, information must be timely, relevant and accurate.  It is intuitive that organizations must take every opportunity to improve their RDBMS performance and a critical component of this is query optimization.

References

Aho, A. V., Sethi, R., & Ullman, J. D. (2006). Compilers: Principles, techniques, and tools (2nd ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Codd, E. F. (1990). The relational model for database management: version 2. Boston, MA: Addison-Wesley Longman Publishing Co.  Retrieved June 1, 2010 from, http://portal.acm.org/citation.cfm?id=77708.

Elmasri, R., & Navathe, S. B. (2000).  Fundamentals of database systems. Reading MA: Addison-Wesley.

Garcia-Molina, H., Ullman, J.D., & Widom, J.   (2009).   Database systems: The complete book, (2nd ed.).  Upper Saddle River, NJ: Pearson Prentice Hall.

Herodotou, H., & Babu, S. (2009).  Automated SQL tuning through trial and (sometimes) error. Retrieved July 31, 2010 from, http://portal.acm.org/ft_gateway.cfm?id=1594171&type=pdf&coll=GUIDE&dl=GUIDE&CFID=93632109&CFTOKEN=77805535

Mitchell, T. M. (1997).  Machine Learning. Boston, MA: McGraw-Hill.

Nutt, G. (2004). Operating systems (3rd ed.). Boston, MA: Addison Wesley.

Rich, E., & Knight, K. (1991).  Artificial Intelligence.  New York: McGraw-Hill.

Silberschatz, A., Galvin, P. B., & Gagne, G.  (2010). Operating systems concepts with Java (8th ed.). New York: John Wiley & Sons.

Silberschatz, A., Korth, H. F., & Sudarshan, S. (1999). Database system concepts (3rd ed.). Boston, MA: McGraw-Hill

Yagoub, K., Belknap, P., Dageville, B., Dias, K., Joshi, S., & Yu, B. (2008). Oracle’s SQL Performance analyzer.  DEB, 31(1).  Retrieved July 31, 2010 from, ftp://ftp.research.microsoft.com/pub/debull/A08mar/yagoub.pdf.