SQL #6: Sorting
This is the sixth in a series of SQL notes I made during the Kubrick Data Engineering training course. The others are:
#1: Database Design
#2: SQL Design
#3: DQL
#4: DDL
#5: DML
#7: Joining
#8: Temporary Tables
#9: Programmable Objects
#10: Indexing
#11: Operation & Optimization
All the posts in this series are my personal note taking and may be updated as the course progresses.
GROUP BY
GROUP BY
is used to aggregate values by some grouping and defines a new level of detail for the query. This means it can only return columns which are either in the GROUP BY
or in the aggregation. For example:
SELECT
color
,sum(standardcost) AS sumcost
,count(*) AS cnt
FROM Production.Product
GROUP BY color
This query displays the total standard costs of the products in Production.Product
by colour. color
is in the GROUP BY
and both standardcost
and *
are aggregated.
Window Functions
Window functions perform a calculation over a set of rows. The OVER
keyword determines the window (partitioning and ordering of a set of rows) and is required in every window function. A function (calculation) is then applied on the window. It keeps the same level of detail as the original table. This means a single query can return both detailed AND aggregate data.
The various options for OVER
are listed below.
OVER (
PARTITION BY <clause>
ORDER BY <clause>
ROW or RANGE <clause>
)
PARTITION BY
PARTITION BY
acts a bit like a GROUP BY
, applying an aggregate to a group but it will return the result for each row.
SELECT
OrderLineID
,StockItemID
,UnitPrice
,AVG(UnitPrice) OVER (PARTITION BY StockItemID) Avg_LinePrice
From Sales.Orderlines
This query will return some details about each product and an aggregates average price for that StockItemID. This allows us to compare the price for that particular product to the average for the line.
ORDER BY
If the order of rows is important to the function then ORDER BY
is used (this is independent of the SELECT
ORDER BY
). Ordering is generally computationally expensive as it requires scanning.
Ranking Functions
Ranking functions will return an index number related to some ranking, they must be used with ORDER BY
to define the ranking criteria. Each of the following, except NTILE
, differs in how ‘tied’ rows are labelled.
ROW_NUMER() OVER (ORDER BY <col2>)
: Gives a row number if ordered by a condition, ignores ties [1,2,3]RANK() OVER (ORDER BY <col2>)
: LikeROW_NUMBER
but tied values are given tied position in rank [1,1,3]DENSE_RANK() OVER (ORDER BY <col2>)
: LikeRANK
but tied values only occupy one rank value [1,1,2]NTILE(n)
: Segments data inton
evenly sized tiles/buckets
Analytics Functions
These are four common Analytics functions that can be used over a window.
FIRST_VALUE
and LAST VALUE
give the highest or lowest ranked value for some sorting. The following query gives the OrderID
for the first order by each customer.
SELECT
CustomerID
,OrderID
,FIRST_VALUE(OrderID) OVER (PARTITION BY OrderID ORDER BY OrderDate, CustomerID ROWS UNBOUNDED PRECEDING)
FROM Sales.Sales
The ROWS UNBOUNDED PRECEDING
setting specifies how far into the past or future to search. The options are UNBOUNDED
or CURRENT
for distance and PRECEDING
or FOLLOWING
to choose past or future.
Similar to FIRST_VALUE
and LAST_VALUE
, LAG
and LEAD
allow us to effectively join the table to itself on the partition keys and reach into the joined set to retrieve the next value, or previous value. LEAD
gives the first value and LAG
the last.
Pivot
Pivot is a way of re-organising and aggregating tables. A table can be ‘pivotted’ to list the total price of each ProductCode by color:
SELECT *
FROM
(
SELECT
left(productnumber, 2) AS ProductCode
,ISNULL(color, 'NA') Color
,ListPrice
FROM Production.Product
) R
PIVOT
(
SUM(listprice) FOR color IN ([NA], [Black], [Blue], [Grey], [Multi], [Red], [Silver] ,[Silver/Black] ,[White] ,[Yellow])
)Piv
The pivot command can be useful when there are multiple columns with the same details except for one, multiple email addresses for one person for example. A second column can be made and pivotted into to store ‘email2’. In that example there’s no real aggregation but an aggregation must be defined, so just use MAX()
or something.
More likely is the need to ‘unpivot’ a table to group some spread out data into buckets if it was set out column wise:
SELECT *
FROM
(
SELECT
left(productnumber, 2) AS ProductCode
,ISNULL(color, 'NA') Color
,ListPrice
FROM Production.Product
) R
UNPIVOT
(
SUM(listprice) FOR color IN ([NA], [Black], [Blue], [Grey], [Multi], [Red], [Silver] ,[Silver/Black] ,[White] ,[Yellow])
)Unpiv
The unpivot command knows that the columns in the SELECT
statement are the unique combination of values by which to aggregate.
Dynamic Pivot
Typing every column to pivot on is slow, it is possible to do a dynamic pivot by declaring a list of columns to pivot on.
DECLARE @cols NVARCHAR(MAX)
DECLARE @query NVARCHAR(MAX)
SET @cols = STUFF((SELECT DISTINCT ',' + QUOTENAME(AgeBucket)
FROM training.populationByAge c
for XML PATH(''), TYPE
).VALUE('.', 'NVARCHAR(MAX)'),1,1,'')
SELECT @cols
@cols
now stores a list of age buckets that could be used for a pivot. NVARCHAR
must be used here becuase dynamic SQL commands require Unicode.