SQL #11: Operation & Optimization

5 minute read

This is the eleventh in a series of SQL notes I made during the Kubrick Data Engineering training course. The others are:
#1: Database Design
#2: SQL Design
#3: DQL
#4: DDL
#5: DML
#6: Sorting
#7: Joining
#8: Temporary Tables
#9: Programmable Objects
#10: Indexing

All the posts in this series are my personal note taking and may be updated as the course progresses.

Transactions

A transaction is a sequence of operations performed as a single logical unit of work, which must adhere to ACID. Transactions are only used within DML (INSERT, UPDATE, DELETE) as these commands are not instant, the DDL commands commit to the database immediately so have no transaction control. When using these commands the transaction is implicit.
It is the responsibility of the database engine to provide mechanisms ensuring the physical integrity of each transaction. SQL programmers are responsible for starting and ending transactions at points that enforce the logical consistency of the data.

The commands to control a transaction are

COMMIT to save changes
ROLLBACK to undo changes
SAVEPOINT to create a point to ROLLBACK to
SET TRANSACTION to name the transaction

Transactions can be a good way to encorporate error handling into your code, an explicit transaction can be started with BEGIN TRAN, or using one of the commands above. WAITFOR DELAY can be used to set a physical wait time before an action.

BEGIN TRAN
BEGIN TRY
	DELETE FROM Production.Product
	WHERE ProductID = 55
	COMMIT TRAN
END TRY
BEGIN CATCH
	PRINT 'ProductID 55 does not exist, could not delete'
	WAITFOR DELAY '00:00:05'
	ROLLBACK TRAN
END CATCH

Transaction Isolation Levels

There are various levels of transaction isolation that can be set for a table or query. Good examples of the effects of each can be viewed here

A query level can be set using WITH (<LEVEL>) for example; SELECT * FROM Production.Product WITH (NOLOCK) gives the query READ UNCOMMITTED access to the table.

To set a global isolation level for a session, the following script should be executed after connecting to a server.

SET TRANSACTION ISOLATION LEVEL 
    { READ UNCOMMITTED --Pessimistic 
    | READ COMMITTED   --(DEFAULT) 
    | REPEATABLE READ 
    | SNAPSHOT 
    | SERIALIZABLE  --Optimistic 
    }

Read Uncommitted:

Queries can read rows that have been modified by other transactions but not yet committed. It’s possible with this statement to read rows that have been modified but not committed to the database, a dirty read. Nonreapeatable reads or phantom reads are also possible, these are when data is modified by another transaction between statements in the user transaction.

Read Committed:

Queries cannot read data that has been modified but not committed by other transactions. Nonrepeatable reads and phantom reads are still possible.

Repeatable Read:

Queries cannot read data that has been modified but not yet committed by other transactions and other transactions can modify data that has been read by the current transaction until the current transaction completes. This prevents diry reads and nonrepeatable reads. Phantom reads are still possible if the user transaction reads rows twice and another transaction inserts rows inbetween.

Snapshot:

The query works off a snapshot of the data as it was at the start of the transaction. No locking or blocking happens with snapshot and this prevents all types of dirty, nonrepeatable, or phantom reads.

Serializable:

Range locks are placed in the range of key values that match the search conditions of each statement executed in a transaction. This blocks other transactions from updating or inserting any rows that would create dirty, nonrepeatable, or phantom reads. However serializable isolation can have a big impact on computational cost.

Optimization

Optimizing queries is important for many reasons but there are two obvious ones; datasets are often very large so inefficient queries take time and they can also literally cost money for companies using external servers (AWS/Azure).

When a DML query is executed it is parsed and then processed by the query optimizer which calculates the most efficient execution plan. This is done by modelling many different execution plans and choosing the least cost option, for repeated statements the plan cache is consulted. DDL statements do not need to do any of this as there is only one way to perform a CREATE for example.

Execution Plans

An execution plan shows the way the optimizer implemented (actual )or will implement (estimated) the requested query. They are the primary way to diagnose an inefficient or expensive query as they show the cost by individual command. A plan can come in graphical, text, or XML format. Estimated and actual plan views can be turned on with CTRL-L and CTRL-M respectively.

System commands

System Information

sys.dm_db_index_usage_stats - Shows usage stats for all indexes
sys.dm_db_index_physical_stats - Shows physical stats for all indexes
sys.schemas - Contains info on schemas
sys.tables - Contains info on tables
sys.columns - Contains info on columns
sys.types - Contains info on types
sys.sql_dependencies - Contains info on table dependencies
sys.sql_modules - Contains info on stored procedures

Database Console Commands

DBCC statements allow a user to write a ‘console command’ which controls the database in some high level way. Commands can be grouped into Maintanence, Miscellaneous, Informational, and Validation.
An example would SHOW_STATISTICS which shows key figures for a supplied table.

DBCC SHOW_STATISTICS('Sales.SalesOrderDetail')

Error Handling

The following script will return an error code for the current transaction (0 is no error)

SELECT @@ERROR

The following script will create log_table, attempt to divide by 0 and if (when) it fails, print the ERROR_MESSAGE() and insert it into the log_table.

CREATE TABLE log_table(error_msg NVARCHAR(500))

BEGIN TRAN
DECLARE @denom FLOAT = 0
BEGIN TRY
	SELECT 500/@denom
END TRY
BEGIN CATCH
	PRINT 'uh oh, ' + ERROR_MESSAGE()  
	ROLLBACK TRAN
	INSERT INTO log_table (error_msg)  -- Log error
	SELECT ERROR_MESSAGE()
END CATCH

Statistics

SET STATISTICS IO ON before an operation will display the scan count, logical reads, physical reads and more of the operation in the messages tab.
SET STATISTICS TIME ON displays the CPU and elapsed time for an operation to complete.

Slowly Changing Dimensions

SCD’s are tables which have data columns that are changing over time. For example prices for items may change over time. When this happens an administrator may want to keep a record of the old prices and when they changed. There are a few ways to do this, they are summarised by the Kimball methodology. Ralph Kimball invented dimensional modeling and listed 8 ‘types’ of SCD technique which deal with SCD’s in different ways. I will just cover the first 4 here.

Type 0: Retain Original

Don’t do anything and keep the original.

Type 1: Overwrite

Overwrite the original and just keep the newest value.

Type 2: Add New Row

Add a new row with the new value while keeping the old value in an ‘original’ row. New columns are also required to signify either the altered date or the current row.

SupplierID	Name	Country	StartDate	EndDate
123	Apple	UK	01/01/2012	05/06/2015
124	Apple	GER	05/06/2015	NULL

Type 3: Add New Attribute

Add a new column denoting the current value which becomes the default, overwriting the original as in type 1.

SupplierID	Name	OriginalCountry	EffectiveDate	CurrentCountry
123	Apple	UK	05/06/2012	GER