SQL #3: DQL
This is the third in a series of SQL notes I made during the Kubrick Data Engineering training course. The others are:
#1: Database Design
#2: SQL Design
#4: DDL
#5: DML
#6: Sorting
#7: Joining
#8: Temporary Tables
#9: Programmable Objects
#10: Indexing
#11: Operation & Optimization
All the posts in this series are my personal note taking and may be updated as the course progresses.
Data query language consists of SELECT and it’s associated options. It is used to find and pull out data from a table. While SELECT is simple in itself, it can be combined with many other key words and functions to form very complex queries. (All of the examples in this series use data from the AdventureWorks dataset).
Single SELECT
SELECT is the core syntax of SQL queries, a SELECT statement in it’s simplest form queries a table with no filters:
SELECT *
FROM Production.Product
FROM is used to select the table to query from.
WHERE
A more specific query would state the column names and use WHERE to apply a logical condition on a column using an equality operator. In SQL, the precursor to comments is --. The text qualifier is the square bracket [ ].
--VERTICAL FILTER (Columns):
SELECT
[Name]
,[Color]
,[Partno]
FROM Production.Product
--HORIZONTAL FILTER (Rows):
WHERE [Cost] <= 500
The ISO standard equality operators are =,<,>,>=,<=,<>, SQL also has BETWEEN, LIKE, IN, and EXISTS.
BETWEEN specifies an inclusive number range:
SELECT * FROM Production.Product
WHERE ListPrict BETWEEN 50 AND 100
LIKE matches strings using REGEX (_ is ANY single character, % is ANY number of ANY character):
SELECT * FROM Production.Product
WHERE NAME LIKE 'A%'
IN match against a chosen set of options:
SELECT *
FROM Production.Product
WHERE colour IN (‘silver’,’blue’,’green’)
EXISTS matches where some row of a subquery exists:
SELECT ListPrice
FROM Production.Product
WHERE EXISTS
(SELECT SalesID FROM Sales.Sales WHERE OrderQty > 50);
This EXISTS example shows an important feature of WHERE and other filters; they can use a subquery as an arguement.
HAVING
HAVING is similar to WHERE in that it filters rows. However HAVING is always used after a GROUP BY statement and is applied to the grouped rows.
The following query shows products grouped by color, then by size which have an individual cost of < 1000 and a grouped cost of < 1000. AS can optionally be used to set an alias for a column or table.
SELECT
[color]
,[size]
,SUM(standardcost) AS sumcost
,AVG(standardcost) AS avcost
FROM Production.Product
WHERE standardcost < 1000
GROUP BY [color], [size]
HAVING SUM(standardcost) < 1000
While SQL server is case insensitive, other systems are not so it’s also a good idea to have consistent casing (either upper or lower).
The user can also specify DISTINCT to return only distinct values. TOP n is used to return the top n values or TOP n PERCENT to return a percentage from the top. If the query is ordered and two values are tied then TOP 1 WITH TIES will return all the first tied values for example.
The INTO command (before FROM) is used to save the query into a new table.
CASE
CASE is used to SELECT a different value for a column depending on some condition, this can be useful for ‘binning’ an attribute, price for example.
SELECT
CASE
WHEN ListPrice >= 2000 THEN 'Expensive'
WHEN ListPrice >= 500 THEN 'Moderate'
WHEN ListPrice > 0 THEN 'Cheap'
WHEN ListPrice <= 0 THEN 'Free'
ELSE 'Unknown'
END price_bin
FROM Production.Product
The above script will create a new column called price_bin containing labels based on the ListPrice column values.
Order of Operations
SQL queries have a defined logical order of operations which is the order in which filters and groupings are applied. It’s important to understand this order as it affects every aspect of how a SELECT statement is written. Confusingly, the lexical (written) order is different to the logical order, which is also different to the actual order of operations (apparently).
The logical order of operations is:
FROM: Choose tableWHERE: Filter rowsGROUP BY: Group filtered rows- Aggregations: Summarize rows
HAVING: Filter groupsOVER(Window): Aggregate over groupsSELECT: Filter columnsDISTINCT: Remove duplicatesUNION,INTERSECT,EXCEPT: CombineSELECTstatementsORDER BY: Choose orderingLIMIT,FETCH,TOP: Filter ordered rows
Each statement can only use arguements that are defined before it in the list. This means that for example;
WHEREcan not include aggregations but window functions (OVER) canORDER BYis the only statement that can use aliases defined bySELECT
There are a few ways to get around the limitations of the single SELECT. The simplest is a nested SELECT.
Nested SELECT
A nested SELECT is an (inefficient) method for circumventing the logical order of operations and form a more complex query.
Any example would be, to find the discontinued products from Production.Product we would need to search Sales.SalesOrderDetail for the most recent order. If the last time the product was modified was in the same year then it’s probably still stocked. An EXCEPT statement can be used to find the products which are not in this list.
SELECT ProductID FROM Production.Product
except
SELECT ProductID FROM Production.Product
WHERE ProductID IN(
SELECT ProductID FROM sales.SalesOrderDetail
WHERE year(ModifiedDate) IN(
SELECT
max(year(orderdate)) lastorder
FROM sales.SalesOrderHeader ))