SQL #10: Indexing

3 minute read

This is the tenth in a series of SQL notes I made during the Kubrick Data Engineering training course. The others are:
#1: Database Design
#2: SQL Design
#3: DQL
#4: DDL
#5: DML
#6: Sorting
#7: Joining
#8: Temporary Tables
#9: Programmable Objects
#11: Operation & Optimization

All the posts in this series are my personal note taking and may be updated as the course progresses.

Indexes

Database access usually accounts for the biggest share of disk-IO in any system and frequently consumes more memory (cache) and processor cycles than other kinds of workload. In SQL server the smallest basic unit of storage is a page which is 8Kb (8192 bytes).

An index can be thought of in the same way as an index in a book. The more ‘selective’ a dataset is, the more useful it is to have an index to prevent the server searching many unique values.

The main index techniques in SQL Servers are:

B+Tree index (clustered and non-clustered) {MAIN}
Column store
Spatial index
XML index

Each of these index the data in different ways, B-Tree is the most common and has two index methods: clustered (all columns) and non-clustered (selected columns). A B-Tree is like a decision tree with an associated depth (number of levels), the lowest level of the tree is called the leaf level. A B+Tree structure makes range scan queries more efficient by putting every value into the leaf level of the tree and then chaining those leaves together with pointers. In a database index this is referred to as a page chain and it is doubly-linked so the chain can be traversed in both directions. Having an index means the server can SEEK by index instead of a SCAN of every row, and seeking is much more computationally efficient.

Non-Clustered Index

A non-clustered index is one where the leaf level of the index contains rowIDs pointing to rows of data. It acts like a look-up table for the server to use for finding a specific item quickly. There can be many clustered indexes for one table as each requires columns to be specified. Creating an index is like creating a table:

CREATE INDEX [name] ON Customer([Name])

Where [Name] is a highly unique column. Searching via the index is then achieved by a normal select:

SELECT [name], [address], [city], [country]
FROM Customer
WHERE [name] = 'Jake'

If you INDEX on a VIEW a ‘materialised view’ is created as the index creates a real object from the query that the view contained.

Clustered Index

A clustered index is one where the leaf level of the index contains rows of data. There can only be one clustered index for a table (it is the table). The clustered index acts like a B+Tree until the leaf level, where all the actual data is contained, including fields which aren’t part of the index key. The data is however sorted in index key order which is a big advantage for range scans. A clustered index also has the effect of compressing the data ~10X.

The clustered index is essentially an ordered version of the table. Creating a primary key for a table will automatically create a clustered index as it is then ordered on that column by definition. It is usually good practice for every table to have a clustered index. A table without a clustered index is called a heap because the data rows are not logically ordered.

Defining a clustered index is as simple as setting a PRIMARY KEY for the table:

CREATE TABLE table1
(
    ID int not null identity(1,1) primary key
)

Clustered index can also be defined explicitly: CREATE CLUSTERED INDEX idx2 ON ad (ModifiedDate, City)

Filtered Index

A filtered index is applied only to specific rows.

CREATE INDEX idx_fil ON Production.TransactionHistory (TransactionDate) WHERE [TransactionType] = N'P'

Note: N here means the ‘P’ string should be treated as an NAVARCHAR rather than VARCHAR denoting that the string is in UNICODE.

The following table lists the advantages and uses of different index schemes on tables.

Index Type	Attributes
Heap	Good for `INSERT`
Clustered Index	1 per table, leaf contains sorted data row, Does NOT allow `FILTER` or `INCLUDE`
Non-Clustered Index (On Heap)	999 per table(max), leaf contains RowIDs, `FILTER` and `INCLUDE` to select fields
Non-Clustered Index (On Clustered Index)	999 per table(max),leaf contains cluster key, `FILTER` and `INCLUDE` to select fields