SQL #6: Sorting

4 minute read

This is the sixth in a series of SQL notes I made during the Kubrick Data Engineering training course. The others are:
#1: Database Design
#2: SQL Design
#3: DQL
#4: DDL
#5: DML
#7: Joining
#8: Temporary Tables
#9: Programmable Objects
#10: Indexing
#11: Operation & Optimization

All the posts in this series are my personal note taking and may be updated as the course progresses.


GROUP BY

GROUP BY is used to aggregate values by some grouping and defines a new level of detail for the query. This means it can only return columns which are either in the GROUP BY or in the aggregation. For example:

    SELECT 
         color
        ,sum(standardcost) AS sumcost
        ,count(*) AS cnt
    FROM Production.Product
    GROUP BY color

This query displays the total standard costs of the products in Production.Product by colour. color is in the GROUP BY and both standardcost and * are aggregated.


Window Functions

Window functions perform a calculation over a set of rows. The OVER keyword determines the window (partitioning and ordering of a set of rows) and is required in every window function. A function (calculation) is then applied on the window. It keeps the same level of detail as the original table. This means a single query can return both detailed AND aggregate data.

The various options for OVER are listed below.

OVER (    
    PARTITION BY <clause> 
    ORDER BY <clause>   
    ROW or RANGE <clause>   
      )

PARTITION BY

PARTITION BY acts a bit like a GROUP BY, applying an aggregate to a group but it will return the result for each row.

SELECT
     OrderLineID
    ,StockItemID
    ,UnitPrice
    ,AVG(UnitPrice) OVER (PARTITION BY StockItemID) Avg_LinePrice
From Sales.Orderlines

This query will return some details about each product and an aggregates average price for that StockItemID. This allows us to compare the price for that particular product to the average for the line.

ORDER BY

If the order of rows is important to the function then ORDER BY is used (this is independent of the SELECT ORDER BY). Ordering is generally computationally expensive as it requires scanning.

Ranking Functions

Ranking functions will return an index number related to some ranking, they must be used with ORDER BY to define the ranking criteria. Each of the following, except NTILE, differs in how ‘tied’ rows are labelled.

  • ROW_NUMER() OVER (ORDER BY <col2>): Gives a row number if ordered by a condition, ignores ties [1,2,3]
  • RANK() OVER (ORDER BY <col2>): Like ROW_NUMBER but tied values are given tied position in rank [1,1,3]
  • DENSE_RANK() OVER (ORDER BY <col2>): Like RANK but tied values only occupy one rank value [1,1,2]
  • NTILE(n): Segments data into n evenly sized tiles/buckets

Analytics Functions

These are four common Analytics functions that can be used over a window.

FIRST_VALUE and LAST VALUE give the highest or lowest ranked value for some sorting. The following query gives the OrderID for the first order by each customer.

SELECT
     CustomerID
    ,OrderID
    ,FIRST_VALUE(OrderID) OVER (PARTITION BY OrderID ORDER BY OrderDate, CustomerID ROWS UNBOUNDED PRECEDING)
FROM Sales.Sales

The ROWS UNBOUNDED PRECEDING setting specifies how far into the past or future to search. The options are UNBOUNDED or CURRENT for distance and PRECEDING or FOLLOWING to choose past or future.

Similar to FIRST_VALUE and LAST_VALUE, LAG and LEAD allow us to effectively join the table to itself on the partition keys and reach into the joined set to retrieve the next value, or previous value. LEAD gives the first value and LAG the last.


Pivot

Pivot is a way of re-organising and aggregating tables. A table can be ‘pivotted’ to list the total price of each ProductCode by color:

SELECT *
FROM
(
SELECT 
	 left(productnumber, 2) AS ProductCode
	,ISNULL(color, 'NA') Color
	,ListPrice
FROM Production.Product
) R

PIVOT
(
SUM(listprice) FOR color IN ([NA], [Black], [Blue], [Grey], [Multi], [Red], [Silver] ,[Silver/Black] ,[White] ,[Yellow])
)Piv

The pivot command can be useful when there are multiple columns with the same details except for one, multiple email addresses for one person for example. A second column can be made and pivotted into to store ‘email2’. In that example there’s no real aggregation but an aggregation must be defined, so just use MAX() or something.

More likely is the need to ‘unpivot’ a table to group some spread out data into buckets if it was set out column wise:

SELECT *
FROM
(
SELECT 
	 left(productnumber, 2) AS ProductCode
	,ISNULL(color, 'NA') Color
	,ListPrice
FROM Production.Product
) R

UNPIVOT
(
SUM(listprice) FOR color IN ([NA], [Black], [Blue], [Grey], [Multi], [Red], [Silver] ,[Silver/Black] ,[White] ,[Yellow])
)Unpiv

The unpivot command knows that the columns in the SELECT statement are the unique combination of values by which to aggregate.

Dynamic Pivot

Typing every column to pivot on is slow, it is possible to do a dynamic pivot by declaring a list of columns to pivot on.

DECLARE @cols NVARCHAR(MAX)
DECLARE @query NVARCHAR(MAX)

SET @cols = STUFF((SELECT DISTINCT ',' + QUOTENAME(AgeBucket)
FROM training.populationByAge c
for XML PATH(''), TYPE 
).VALUE('.', 'NVARCHAR(MAX)'),1,1,'')

SELECT @cols

@cols now stores a list of age buckets that could be used for a pivot. NVARCHAR must be used here becuase dynamic SQL commands require Unicode.

Updated: