SQL #3: DQL

4 minute read

This is the third in a series of SQL notes I made during the Kubrick Data Engineering training course. The others are:
#1: Database Design
#2: SQL Design
#4: DDL
#5: DML
#6: Sorting
#7: Joining
#8: Temporary Tables
#9: Programmable Objects
#10: Indexing
#11: Operation & Optimization

All the posts in this series are my personal note taking and may be updated as the course progresses.


Data query language consists of SELECT and it’s associated options. It is used to find and pull out data from a table. While SELECT is simple in itself, it can be combined with many other key words and functions to form very complex queries. (All of the examples in this series use data from the AdventureWorks dataset).


Single SELECT

SELECT is the core syntax of SQL queries, a SELECT statement in it’s simplest form queries a table with no filters:

SELECT *
FROM Production.Product

FROM is used to select the table to query from.

WHERE

A more specific query would state the column names and use WHERE to apply a logical condition on a column using an equality operator. In SQL, the precursor to comments is --. The text qualifier is the square bracket [ ].

--VERTICAL FILTER (Columns):
SELECT
     [Name]
    ,[Color]
    ,[Partno]
FROM Production.Product
--HORIZONTAL FILTER (Rows):
WHERE [Cost] <= 500

The ISO standard equality operators are =,<,>,>=,<=,<>, SQL also has BETWEEN, LIKE, IN, and EXISTS. BETWEEN specifies an inclusive number range:

SELECT * FROM Production.Product
WHERE ListPrict BETWEEN 50 AND 100

LIKE matches strings using REGEX (_ is ANY single character, % is ANY number of ANY character):

SELECT * FROM Production.Product
WHERE NAME LIKE 'A%'

IN match against a chosen set of options:

SELECT *
FROM Production.Product
WHERE colour IN (silver,blue,green)

EXISTS matches where some row of a subquery exists:

SELECT ListPrice
FROM Production.Product
WHERE EXISTS
(SELECT SalesID FROM Sales.Sales WHERE OrderQty > 50);

This EXISTS example shows an important feature of WHERE and other filters; they can use a subquery as an arguement.

HAVING

HAVING is similar to WHERE in that it filters rows. However HAVING is always used after a GROUP BY statement and is applied to the grouped rows.

The following query shows products grouped by color, then by size which have an individual cost of < 1000 and a grouped cost of < 1000. AS can optionally be used to set an alias for a column or table.

SELECT 
     [color]
    ,[size]
    ,SUM(standardcost) AS sumcost
    ,AVG(standardcost) AS avcost
FROM Production.Product
    WHERE standardcost < 1000
    GROUP BY [color], [size]
    HAVING SUM(standardcost) < 1000

While SQL server is case insensitive, other systems are not so it’s also a good idea to have consistent casing (either upper or lower).

The user can also specify DISTINCT to return only distinct values. TOP n is used to return the top n values or TOP n PERCENT to return a percentage from the top. If the query is ordered and two values are tied then TOP 1 WITH TIES will return all the first tied values for example. The INTO command (before FROM) is used to save the query into a new table.

CASE

CASE is used to SELECT a different value for a column depending on some condition, this can be useful for ‘binning’ an attribute, price for example.

SELECT
	CASE
		WHEN ListPrice >= 2000 THEN 'Expensive'
		WHEN ListPrice >= 500 THEN 'Moderate'
		WHEN ListPrice > 0 THEN 'Cheap'
		WHEN ListPrice <= 0 THEN 'Free'
		ELSE 'Unknown'
	END price_bin
FROM Production.Product

The above script will create a new column called price_bin containing labels based on the ListPrice column values.


Order of Operations

SQL queries have a defined logical order of operations which is the order in which filters and groupings are applied. It’s important to understand this order as it affects every aspect of how a SELECT statement is written. Confusingly, the lexical (written) order is different to the logical order, which is also different to the actual order of operations (apparently).

The logical order of operations is:

  1. FROM: Choose table
  2. WHERE: Filter rows
  3. GROUP BY: Group filtered rows
  4. Aggregations: Summarize rows
  5. HAVING: Filter groups
  6. OVER(Window): Aggregate over groups
  7. SELECT: Filter columns
  8. DISTINCT: Remove duplicates
  9. UNION, INTERSECT, EXCEPT: Combine SELECT statements
  10. ORDER BY: Choose ordering
  11. LIMIT, FETCH, TOP: Filter ordered rows

Each statement can only use arguements that are defined before it in the list. This means that for example;

  • WHERE can not include aggregations but window functions (OVER) can
  • ORDER BY is the only statement that can use aliases defined by SELECT

There are a few ways to get around the limitations of the single SELECT. The simplest is a nested SELECT.


Nested SELECT

A nested SELECT is an (inefficient) method for circumventing the logical order of operations and form a more complex query.

Any example would be, to find the discontinued products from Production.Product we would need to search Sales.SalesOrderDetail for the most recent order. If the last time the product was modified was in the same year then it’s probably still stocked. An EXCEPT statement can be used to find the products which are not in this list.

SELECT ProductID FROM Production.Product
	except
	SELECT ProductID FROM Production.Product
	WHERE ProductID IN(
		SELECT ProductID FROM sales.SalesOrderDetail
			WHERE year(ModifiedDate) IN(
				SELECT 
					 max(year(orderdate)) lastorder
				FROM sales.SalesOrderHeader ))

Updated: