Structured Query Language (SQL)

Introduction

This page presents the relational database management systems (RDBMS) structured query language (SQL) queries described in the summaries of chapter six and chapter seven of the Garcia-Molina, Ullman and Widom (2009, p. 307, p. 339) text.

As a basis, SQL statements or commands can be differentiated as components of either the Data Description Language (DDL) or the Data Manipulation Language (DML) (Groff, Weinberg & Oppel, 2010).  From a functional perspective, relational DBMS (RDBMS) inventor Edgar Codd states an RDBMS must minimally support retrieve, insert, update, and delete commands.  Codd states that he uses the terms retrieval and query synonymously and continues to cite that while queries perform actions, they represent read-only operations (1990, p. 21).  Citing this definitional basis, the queries and query components presented in the order are: (a) select-from-where queries, (b) set operations on relations, (c) join and outer join expressions, (d) subqueries and, (e) aggregations (Codd, 1990; Garcia-Molina, Ullman and Widom, 2009, p. 307, p. 339).

Coding Conventions

For completeness and consistency, this analysis will utilize the American National Standards Institute (ANSI) SQL query statements. By convention, SQL programmers typically format query components on separate lines to make the code more readable (Groff, Weinberg & Oppel, 2010).   ANSI standard SQL formatting places spaces between SQL commands and their components and separates multiple components with commas.  SQL statements may also include parentheses and will be terminated with the standard semi-colon to clearly differentiate SQL queries from the paper’s text (Casteel, 2010; Groff, Weinberg & Oppel, 2010).   While the use of semicolons depends on the specific SQL implementation, semicolons are necessary to separate SQL statement when more than one SQL statement is presented to the system at once (Casteel, 2010; Rob & Coronel, 2007; Silberschatz, Korth & Sudarshan, 1999).

SQL is not case sensitive however for clarity; this paper will adopt a standard convention of formatting SQL command keywords in uppercase letters (e.g. SELECT, FROM, WHERE), capitalizing relation table (R-table) names (e.g. Movies, StarsIn) and beginning attribute or column names with a lower case letter (e.g. title, year).  It must be noted that character attribute instance values are case sensitive however these are clearly identified since they are enclosed in single quotes (e.g. ‘Pixar’).   Lastly, we will use the terms relation and R-table and attribute and column name interchangeably and in context however we will refer to attribute instance values to clearly denote user data from DBMS data dictionary values.

Chapter Six Summary Query and Query Components

As previously presented, RDBMS processing is based on set theory, predicate logic and relational algebra (Codd, 1990; Elmasri & Navathe, 2000; Garcia-Molina, Ullman & Widom, 2009, Silberschatz, Korth & Sudarshan, 1999).  As a basis, every atomic attribute stored in a RDBMS is guaranteed to be accessible by specifying its relation table (R-table) name, primary key and attribute name rather than through an extraneous feature (e.g. reference) (Codd, 1990, p. 229).  Central to SQL query processing are domains as they serve as the most elemental integrity constraint and their atomic instances provide the necessary basis for meaningful comparison (Codd, 1990; Silberschatz, Korth & Sudarshan, 1999).   Functionally, before a RDBMS compares database values it ensures that the attributes or columns draw their values from a common or comparable domain.

ANSI standard data retrieval is accomplished using the SQL SELECT command statement.  By definition retrieve operations are defined to operate on relations and return relations as their results (Codd, 1990).  In contrast to this definition, modern production RDBMS query operations operate on relations and bags and may generate relations and bags as their results (Garcia-Molina, Ullman & Widom, 2009).  In contrast to RDBMS’ formal relational algebra and set theory basis, bags may contain duplicate tuples (Codd, 1990; Garcia-Molina, Ullman & Widom, 2009).  We will analyze the various components of the SQL SELECT command based on its formal mathematical relational definition as presented in Codd’s seminal work (Codd, 1990).

SELECT FROM

The SQL SELECT statement is derived from Codd’s (1990) theta-select operator that was originally named a theta-restrict.  The SQL SELECT FROM command statement’s minimal implementation is: SELECT attributes FROM Tables; where attributes are one or more column names and Tables are one or more R-table names (Casteel, 2010).   It should be noted that SELECT commands not only provide a mechanism to retrieve data and but they serve as the basis to create periodic and exception views and reports (Rob & Coronel, 2007; Satinger, Jackson & Byrd, 2002).  When parsed and executed, the SELECT FROM statement can return constants or specific attributes from specified tables resulting in a new relation (Codd, 1990; Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).  Choosing specific attributes is formally known as projection and all of an R-table’s attributes can be projected using the asterisk wild card as follows: SELECT * FROM Tables; (Garcia-Molina, Ullman & Widom, 2009; Silberschatz, Korth & Sudarshan, 1999).  If multiple attributes or tables are referenced they must be separated with commas.

For flexibility, the resulting relation returned by a SQL SELECT statement may also contain computed columns as well as column and table aliases.  Computed columns use the standard arithmetic operators, may include parentheses and follow the standard arithmetic and logical precedence rules (Rob & Coronel, 2007).  Column and table aliases may be designated using the optional SQL keyword AS.   Aliases serve to retain relations in memory for additional processing in addition to providing contextual meaning to output and standardization for subsequent integration (Codd, 1990; Garcia-Molina, Ullman & Widom, 2009).

SQL programmers may tailor their output further by using the ORDER BY statement.  The ORDER BY statement sorts the resulting relation’s rows based on a one or more of the rows’ attribute values (Rob & Coronel, 2007).    By default ORDER BY sorts items in ascending order however ascending or descending order can be explicitly specified by using the ASC or DESC keywords respectively (Rob & Coronel, 2007).  If the user desires unique rather than duplicate attribute values in a specific projected column, the user can force distinct values using the DISTINCT clause (Rob & Coronel, 2007).  It should be noted the DISTINCT clause allows developers to enforce compliance with the pure relational model outlined by Codd (1990).  In summary, the combination of these functionalities allows developers to tailor their output and enhance interoperability.

WHERE

It is conventional for the SELECT statement to include a WHERE clause.  A WHERE clause specifies a condition to constrain or restrict the resulting relation’s rows or tuples to specific tuples that meet the specified condition’s criteria (Rob & Coronel, 2007).  Alternatively couched, any tuple that is in accord with the WHERE’s conditional expression contributes components to a new tuple that is added to the resulting relation or bag (Garcia-Molina, Ullman & Widom, 2009).  The WHERE clause may use a combination of arithmetic, Boolean and special SQL comparison operators to compare attribute values with other attribute values, constants or derived values (Silberschatz, Korth & Sudarshan, 1999).

ANSI standard SQL enumerates the following arithmetic comparison operators: (a) equal to denoted by =, (b) less than denoted by <, (c) greater than denoted by >, (d) less than or equal to denoted by <=, (e) greater than or equal to denoted by >=, and (f) not equal to denoted by <> or != (Rob & Coronel, 2007).  SQL supports comparison of dates and times and when comparing character based data, SQL uses American Standard Code for Information Interchange (ASCII) encoding by default.  As introduced above, character data can be clearly identified within SQL statements as it is enclosed in single quotes.  Examples of the comparison operators are provided in Figure 3 and Figure 4 below.

SQL supports the standard OR, AND and NOT Boolean operators (Rob & Coronel, 2007).  In accord with RDBMS’ set theory basis, SQL provides the standard set operations union, intersection and difference using the SQL keywords UNION, INTERSECT and EXCEPT respectively (Rob & Coronel, 2007).   It is assumed that these set operations are applied to equivalent relation schemas with equivalent attributes and domains (Codd, 1990).   An example of the OR operator is presented in Figure 1 below.  An example of the UNION clause is presented further below in Figure 4.

ANSI standard SQL also supports the special SQL operators BETWEEN, IN, IS NULL, EXISTS and LIKE that return Boolean values.   BETWEEN can be used to test if an attribute is within a range of values whereas IN tests if an attribute is a member of value list or a sub-query’s resulting relation.  IS NULL intuitively determines if an attribute is null whereas EXISTS determines the existence of a tuple.  With this basis, EXISTS is conventionally used with sub-queries detailed further below.  LIKE provides the where clause with string pattern matching functionality and supports single and multiple character wildcards.  The percent sign can be used to match multiple characters and the underscore is used to match a single letter.   The reader is reminded that while SQL keywords are not case sensitive, attribute value instances are case sensitive.  To assist users with character string comparisons of differing cases, SQL provides the UPPER keyword that converts character strings to their upper case representation.  The Boolean operators can be used to test for multiple conditions or negate the other operators.  An example using the SQL keywords LIKE and UPPER is provided in Figure 1 below.  Lastly, it must be noted that the resulting relation from a SQL SELECT FROM WHERE query may be an empty set.

The WHERE clause allows the resulting relation to be tailoring even further.    The SQL GROUP BY command allows the user to collate tuples based on attribute values.  The SQL HAVING clause is used in conjunction with the GROUP BY clause and restricts tuple selection based on its condition (Rob & Coronel, 2007).  This HAVING functionality is similar to specifying a WHERE clause in a SELECT FROM statement however the HAVING clause is applied to the GROUP BY’s output (Rob & Coronel, 2007).   It should be noted that the GROUP BY HAVING commands are often used in conjunction with SQL’s aggregate functions introduced below (Rob & Coronel, 2007).  An example of the SQL GROUP BY command it presented in Figure 4 below.

To illustrate the material presented above in a conventionally formatted SQL query, one solution to the 6.1 exercise query to “find all stars who are either male or live in Malibu (i.e. have string Malibu as part of their address)” (Garcia-Molina, Ullman & Widom, 2009, p. 257) is presented in Figure 1 below.  To illustrate additional functionality, this SQL query will test all variants of the character strings, rename the resulting relation’s column name to MovieStarName and order the output by the star’s last name.  It should be noted that the column name should be renamed to movieStarName if the desired result was a relation to be subsequently used in additional computations and integration consistent with our naming conventions introduced at the outset.

Figure 1 – Exercise 6.1 Solution

SELECT  name AS MovieStarName, address

FROM    MovieStar

WHERE   UPPER(gender) = ‘M’

     OR UPPER(address) LIKE ‘% MALIBU %’

ORDER BY name;

Join Expressions

The SQL SELECT FROM WHERE statement presented above represents a simple query since it accesses a single R-table (Garcia-Molina, Ullman & Widom, 2009).  The efficiency and utility of RDBMS comes from its ability to disaggregate data into normalized data however this requires a mechanism to regenerate a data set’s information in a lossless fashion.  Join expressions satisfy this requisite functionality as they provide the mechanism to retrieve all of the information from a normalized database.

Join expressions are derived from variants of Codd’s (1990) equi-join and are necessary when data must be retrieved from multiple tables in a SELECT FROM query.  Functionally, a join is a special case of the Cartesian product that constrains the resulting relation to tuples that meet the WHERE clause’s conditions (Codd, 1990).  The Cartesian product between two RDBMS relations S and T is defined to be a resulting relation U that contains every tuple of T concatenated with every tuple of S (Codd, 1990, p. 65).  Note that the programmer may have to disambiguate Cartesian product attributes using the TableName.attributeName convention.

Simple SQL joins are generated by including multiple table names in the FROM clause and restricting tuple selection through the WHERE clause (Garcia-Molina, Ullman & Widom, 2009).  Typically, join expressions compare primary and foreign key values however minimally, the attribute comparison requires the values to be from equivalent or comparable domains (Codd, 1990; Rob & Coronel, 2007).   SQL contains several variants of join clauses that include the JOIN, CROSS JOIN and OUTER JOINs and may be further selectively qualified using the keywords ON, NATURAL, FULL, LEFT and RIGHT.   The keyword ON serves to identify the comparative criteria in lieu of placing the comparative condition in the WHERE clause.  As an example the generic SQL clause FROM Table1 JOIN Table2 ON value1 = Table1.attribute1 AND value2 = Table2.attribute2; could be equivalently coded as FROM Table1, Table2 WHERE value1 = Table1.attribute1 AND value2 = Table2.attribute2;

Continuing the definitions of the various joins, their keywords and their functionality, the CROSS JOIN may be used to create a Cartesian product of relations.  The keyword NATURAL constrains the JOIN operation’s output to tuples with equivalent attribute values in columns with identical names (Garcia-Molina, Ullman & Widom, 2009).   OUTER JOINs contain all tuples that satisfy the restricting condition and are padded with dangling tuples from either the first or second relation being joined (Garcia-Molina, Ullman & Widom, 2009).  Dangling tuples are tuples that do not match the joining criteria and will be padded with NULL values as necessary.  With this basis a LEFT OUTER JOIN will contain all tuples from the first relation and condition qualifying tuples from the second relation.  Conversely a RIGHT OUTER JOIN will contain all tuples from the second relation and condition qualifying tuples from the first relation.

To illustrate join functionality, one solution to exercise 6.3.9 (Garcia-Molina, Ullman & Widom, 2009, p. 281) is presented in Figure 2 below.  This solution projects all attributes from the join between the Classes and Ships tables.  It is instructive to note that the join is based on the Classes’ primary key class that references the Ships’ foreign key.

Figure 2

Exercise 6.3.9 Solution

SELECT  *

FROM    Classes RIGHT NATURAL

        OUTER JOIN Ships ;

Subqueries

Subqueries are also known as nested or inner queries since they are embedded inside another query (Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007).  Queries may be nested many levels deep in either the FROM, WHERE or HAVING clauses (Garcia-Molina, Ullman & Widom, 2009; Rob & Coronel, 2007).    Recall that the SELECT FROM WHERE command statement may return either a constant or a relation of zero or more tuples.   This provides SQL with a flexible, powerful and succinct flexibility.  This flexibility allows subqueries to be used to retrieve data in the form of tuples that can subsequently be used in comparisons, used to insert tuples into tables or even used to create entirely new tables (Rob & Coronel, 2007).   Often, a SELECT FROM WHERE sub query will return a scalar defined as a unary tuple with a single attribute (Codd, 1990; Garcia-Molina, Ullman & Widom, 2009).  This scalar can be subsequently used as a constant in an outer WHERE clause (Garcia-Molina, Ullman & Widom, 2009).  Syntactically, a sub query can be easily identified since it is enclosed in parenthesis.

The use of subqueries may introduce naming, reference or scope complexity.  Self contained subqueries that do not reference the outer query’s data are simple in the sense that they are executed only once.  A more complex scenario exists when a sub query references data in an outer query.  Subqueries that reference an outer query’s data are known as correlated or repeating subqueries and are evaluated every time a comparison is made.  Typically this results in executing the correlated sub query for each tuple result.  Correlated queries are more complex from a programming perspective since the developer must remain aware of nesting and scoping.  For clarity and system documentation, developers may have to use aliasing and fully disambiguated attribute references of the form AliasTableName.attributeName (Garcia-Molina, Ullman & Widom, 2009).  For illustration, one solution to Garcia-Molina, Ullman and Widom exercise 6.3.1.f  is presented in Figure 3 below.  This query finds “the maker(s) of the PC(s) with the fastest processor among all those PCs that have the smallest amount of RAM” (Garcia-Molina, Ullman & Widom, 2009, p. 279).

Figure 3

Exercise 6.3.1.f  Solution

SELECT  R1.maker

FROM    Product R1,

        PC P1

WHERE   R1.model=P1.model

    AND P1.ram IN

        (SELECT MIN(ram)

        FROM    PC

        )

    AND P1.speed >= ALL

        (SELECT P1.speed

        FROM    Product R1,

                PC P1

        WHERE   R1.model=P1.model

            AND P1.ram IN

                (SELECT MIN(ram)

                FROM    PC

                )

        );

Aggregate Functions

Aggregate functions calculate and return a single value from specified columns.  While aggregate functions can appear by themselves in a SELECT clause, they are typically used in conjunction with the GROUP BY clauses to further constrain column entries (Rob & Coronel, 2007).    The ANSI standard SQL aggregate functions and their definitions are: (a) COUNT that counts the number of entries in a group, (b) MIN returns a group’s minimum value, (c) MAX that returns a group’s maximum attribute value, (d) SUM that sums a group’s values and, (e) AVG that computes the average or arithmetic mean of a group.   To provide an example, one solution to Garcia-Molina, Ullman and Widom exercise 6.2.2.d is presented in Figure 4 below.  For the purposes of this paper, the solution to exercise 6.2.2.d  presented below has been augmented to retrieve the hard-disk sizes that occur in two or more PCs or two or more Laptops to illustrate the UNION clause  (Garcia-Molina, Ullman & Widom, 2009, p. 267).

Figure 4

Exercise 6.2.2.d Solution

SELECT DISTINCT PC.hd

FROM    PC

GROUP BY PC.hd

HAVING COUNT(PC.model) >= 2

UNION

SELECT DISTINCT Laptop.hd

FROM    Laptop

GROUP BY Laptop.hd

HAVING COUNT(Laptop.model) >= 2;

Summary

From a user’s perspective, SQL allows users to specify what should be done rather than the far more complex procedural task of specifying how to accomplish the task (Elmasri & Navathe, 2000; Rob & Coronel, 2007).  SQL provides a minimal and intuitive set of command statements that serve to enhance its flexibility and power (Rob & Coronel, 2007).   As a result, SQL is considered to be user friendly and allows users to retrieve data and create meaningful information with minimal effort (Rob & Coronel, 2007).   It should be noted that this flexible retrieval mechanism concomitantly assists database administrators (DBAs) with testing and verifying database content and integrity (Codd, 1990).  With this basis it is critical to understand RDBMS functionality to not only retrieve meaningful data but to also administer the RDBMS as RDBMS system health is critically important to an organization’s fiscal stability.

References

Casteel, J. (2010). Oracle 11g SQL. Boston, MA: Thomson Course Technology.

Codd, E. F. (1990). The relational model for database management: version 2.  Boston, MA:  Addison-Wesley Longman Publishing Co.  Retrieved June 1, 2010 from, http://portal.acm.org/citation.cfm?id=77708.

Elmasri, R., & Navathe, S. B. (2000).  Fundamentals of database systems. Reading MA: Addison-Wesley.

Garcia-Molina, H., Ullman, J.D., & Widom, J.   (2009).   Database systems: The complete book, (2nd ed.).   Upper Saddle River, NJ: Pearson Prentice Hall.

Groff, J. R., Weinberg, P. N., & Oppel, A. J. (2010).  SQL: The complete reference (3rd ed.). New York: McGraw-Hill.

Satinger, J. W., Jackson, R. B., & Burd, S. D., (2002). Systems analysis and design (2nd ed.). Boston, MA: Course Technology.

Silberschatz, A., Korth, H. F., & Sudarshan, S. (1999). Database system concepts (3rd ed.). Boston, MA: McGraw-Hill.

Rob, P., & Coronel, C.M. (2007). Database systems design, implementation, and management. (7th ed.). Boston, MA: Thomson Course Technology.