SQL #1: Database Design

6 minute read

This is the first in a series of SQL notes I made during the Kubrick Data Engineering training course. The others are:
#2: SQL Design
#3: DQL
#4: DDL
#5: DML
#6: Sorting
#7: Joining
#8: Temporary Tables
#9: Programmable Objects
#10: Indexing
#11: Operation & Optimization

All the posts in this series are my personal note taking and may be updated as the course progresses.

Data Modelling

Entities & Attributes

An entity in database (DB) theory is described as just ‘a thing’. Things can have attributes and it’s attributes which constitute data. An entity can also have a relationship with another entity and share similar data. With attributes and known relationships we can access knowledge about the thing and find insights.

Attributes and relationships have qualities of their own and can be categorized by 3 measures.

Cardinality
- Measure of scalability of an entity relationship
- Applies seperately to either direction of relationship
- [A tweet has 1 author but an author may have many tweets]
Scalability
- Whether an attribute may hold multiple correct values
- Scalar:[Age, Weight], Non-Scalar:[Email, Address]
Determinism
- Whether an attribute value changes or is constant
- Deterministic:[D.O.B., Name], Non-Deterministic:[Time, Age]

Entity Relationship Diagrams

An ERD shows the entities, attributes, and relationships in a database. Entities are represented by shapes, attributes by (shown as a list in the diagram below) and relationships by .

If an attribute can hold multiple values (Non-Scalar) it should have 2 rings around it. The cardinality of a relationship is displayed by a symbol on the connecting line.

1: [|]
0: [o]
Many: [<]
1 to many: [|<]
0 to many: [o<]

Databases

RDBMS

Relationship DataBase Management Systems are very common structures for databases where the structure of data is defined by it’s relationships. An RDBMS requires a query language to be able to access it’s contents, a query language must be based on either relational algebra or relational calculus. Structured Query Language (SQL) is based on relational algrebra. The most popular RDBMS’s currently are Oracle (PL/SQL), MySQL (SQL/PSM), and Microsoft SQL (T-SQL). At some point, large tech companies such as Facebook and Google required a more expansive type of database. Thus NoSQL (Not Only SQL) was invented as a more flexible solution to big data storage. Some of the best known are Cassandra, Mongo DB, and Neo4J.

CRUD

The basic transactions that must be available on a database are: CREATE, UPDATE, READ, and DELETE. These are the requirements of any RDBMS or NoSQL system.

ACID

The additional requirements of an RDBMS are: ATOMICITY, CONSISTENCY, ISOLATION, and DURABILITY.

Atomicity means transactions are indivisible so always either fully complete for fully fail.
Consistency ensures that any written data is valid by all rules of that database.
Isolation states a row must be isolated and frozen while being edited.
Durability ensures that data is not loss if, for example, power to the database is lost.

ACID is implemented most commonly through either write ahead logging or shadow paging, both require locks on all information that is being updated when editing. Write ahead logging copies all unchanged data to a log before editing. In shadowing, updates are applied to a partial copy of the database, and the new copy is activated when the transaction commits.

3NF (OLTP)

An RDBMS can contain tables of the type 3NF or dimensional or a mixture of the two. 3NF stands for 3rd Normal Form and is the third in a set of table types, each with their own conditions and reliant on the previous.

1NF:

Each row has a unique key
Each field has one value
No repeating groups

2NF:

Be 1NF compliant
Every listed attribute is wholly dependent on the primary key

3NF:

Be 2NF compliant
Only store original data, nothing that could be calculated

3NF tables come under the umbrella of On-Line Transaction Processing. Because of the simplicity of each table the system is filled with many small tables. These are difficult to query as many tables may need to be consulted to get a figure. Howver they are easy to edit as the locking only occurs on a small area. 3NF OLTP systems are used in transaction-heavy areas such as banking where edits are needed very quickly and queries are less important.

Dimensional (OLAP)

A dimentional table combines the many tables of 3NF into a few very wide tables. This makes them much simpler to query and only one or two tables may need to be consulted. However editing dimensional tables can cause problems due to the amount of fields that will be locked when a single row is edited. Dimensional tables are used when lots of queries are made, primarily in analytics, part of On-Line Analytical Processing.

A dimensional warehouse will have a central controlling table referred to as the FACT table, the surrounding tables are ‘dimensions’. The FACT table is made of foreign keys to dimensions, and ‘FACTS’ or ‘measures’ drawn from those dimensions. This makes for a single very wide table containing a lot of information.

Operation

All databases have at least 2 files; a Transaction Log (TL) and a Masked Data File (MDF). The TL is a log of all actions taken on the database, basically the database metadata. The MDF is the actual data file. Both files are stored on hard memory and are updated from the more immediate RAM memory.

When a user makes an edit it is stored in the RAM, then transferred to the TL. When a user then queries the DB it checks the RAM first for the change, before going to the final MDF to check. The MDF is updated with actions from the TL after a set time (60s on MS SQL). This time is short enough that no actions are missed by the query as the RAM will store it for over 60s.

SQL

SQL is a powerful language and engine used for a vast array of data related functions. Strict SQL systems can be distinguished from NoSQL systems which are ‘Not Only’ SQL but include some extra ability. However, almost all database systems are built on the logic and assumptions of SQL.

SQL Server

SQL servers are computers configured to deliver data in the way descibed above. A computer may be especially configured to be an SQL server, but server software may also be installed on a normal PC. The following details relate to a Microsoft Windows install of MS SQL.

The default MSSQL port is 1433. The commands net stop <service> and net start <service> can be used to control the server where <service> is the service name (mssqlserver).

To access the server as a client a user must use client tools. The Microsoft client tool is SQL Server Management Service (SSMS) but there are many alternatives. SSMS uses T-SQL, the MSSQL language variant.

The default tables included in an SQL server are;

master: Central database, contains user info
MSDB: Automated system jobs
model: Template database
tempdb: Stores temporary tables, shared by all users

Pages are the fundamental unit of data in SQL, they are 8KB in size including a 96 byte header. When SQL performs disk I/O operations they are always on whole pages. An extent is a collection of eight physically contiguous pages. A large table could be spread over a single uniform extent, alternatively a mixed extent can hold data from multiple seperate tables. The allocation of uniform extents is recorded in a Global Allocation Map (GAM), mixed extents are recorded in a Shared GAM (SGAM).

Keys are unique identifiers for users or instances of an entity. There could be multiple candidate keys from which one primary key is chosen. Keys can also either be readable business keys or indexed surrogate keys. The choice of key is mostly up to the user.

Collation defines the encoding of the SQL language type. For example CI_AS is english, case insensitive & accent sensitive. The user should be aware of the exact collation of the language flavour before setting up an SQL server.

Note: collation relates to the data within tables. The case sensitivity of object names in SQL is a different thing, MSSQL is non case sensitive wheras Oracle and MySQL both are.