SQL

This chapter is all about SQL, which is the industry standard database query language.

What is SQL?

Structured Query Language, or SQL, is a language used to communicate with a database. It is the standard language for a relational database, and is used for structuring, storing, manipulating, and retriving data. While SQL is a standard language, every database implements its own version of SQL with slightly different behaviours. SQL usage between databases doesn't vary widely, so the techniques here will transfer readily to other database systems.

SQL is split into two main sublanguages; a Data Definition Language (DDL) and a Data Manipulation Language (DML). It can also be generated programmatically through other applications; for example, you may have worked with JDBC in the past. A popular method of generating SQL from object-oriented definitions is an ORM like SQLAlchemy. We'll jump right into describing the SQL syntax and usage.

Tables

All data in a relational database is stored in tables. A table is not unlike an Excel table; each column defines an attribute name and type, and each row is an instance of the table's schema. A schema is a blueprint for what data in the given table should look like, including types and constraints.

To create a table in SQL, we use the CREATE TABLE syntax to define our table name, attribute names and types, as well as any constraints on our data. A sample table creation is below:

CREATE TABLE person (
    id int PRIMARY KEY,
    full_name varchar(255) NOT NULL,
    age int
);

Here, we've defined a table called person with three attributes, id, full_name, and age. The id attribute is of type int and is a primary key - this means that it uniquely identifies a row in the person table and cannot be null. The full_name attribute if of type varchar(255) - a string of up to length 255 characters - and cannot be null. The age attribute is of type int and can be null.

If we define multiple tables, we can define foreign key relationships between them. A foreign key in one table references the primary key of another, and is the key behind the relationality of a relational database:

CREATE TABLE dogs (
    id int PRIMARY KEY,
    pet_name varchar(255) NOT NULL,
    owner_id int REFERENCES person(id)
);

To drop a table, use the DROP TABLE <table_name> syntax. This will delete the table and all of the data in it.

To alter a table, use the ALTER TABLE <table_name> syntax. You can ADD, DROP, and RENAME columns and constraints.

Constraints

Constraints are a way for SQL to check that certain properties always hold for your data. You've already seen some examples of constraints: NOT NULL and PRIMARY KEY are two such constraints. Constraints on a single attribute can appear after the attribute in table creation. Constraints on multiple attributes can appear outside of any column declaration. As we've studied, certain sets of constraints can fulfill certain properties, which can make our databases easier to work with; try to conform to a normal form if possible. The following is an example of some common constraints:

CREATE TABLE Cats (
    id int PRIMARY KEY, -- UNIQUE and NOT NULL
    num_legs int NOT NULL, -- Cannot be NULL
    pet_name varchar(255) UNIQUE, -- Must be unique
    owner_id int DEFAULT 0, -- Specify a default value

Constraints can come after DDL:

CREATE TABLE Cats (
    id int,
    owner_id int,
    PRIMARY KEY (id),
    FOREIGN KEY (owner_id) REFERENCES Person(id)

And here's an example of table CHECK constraints:

CREATE TABLE Cats (
    id int PRIMARY KEY,
    num_legs int NOT NULL,
    CHECK (num_legs < 5)

Global constraints are not supported by all databases, and are thus omitted from discussion here.

Selects

Now that we understand how data is organized into tables, we can start selecting data from tables. Select queries in SQL follow a basic syntax:

SELECT <columns...>
    FROM <tables...>
    WHERE <filters...>;

For example, let's say I had the Person table above. I could select all of the rows and columns:

SELECT * FROM Person;

Only the id and age columns:

SELECT id, age FROM Person;

I could order my output or limit the number of rows emitted:

SELECT * FROM Person ORDER BY age [ASC|DESC];

SELECT * FROM Person LIMIT 10;

I can ask for non-duplicate output:

SELECT DISTINCT full_name FROM Person;

I can also rename columns when outputting:

SELECT id AS person_id FROM Person;

Inserts, Updates, and Deletes

To insert data into a table, use the INSERT INTO syntax, optionally specifying the order of the columns you'll be inserting:

INSERT INTO Person (id, full_name, age) VALUES (1, "Barry", 19);
INSERT INTO Person VALUES (1, "Barry", 19);

Updates can be done like so:

UPDATE Person SET age=20 WHERE id=1;

And deletes:

DELETE FROM Person WHERE id=1;

To delete all data, but leave the table intact:

DELETE FROM Person;

Filters and Aggregation

Specifying filters on your data is the job of the WHERE clause. You can use comparison operations (=, <, <=, >, >=), text operators (LIKE), range operators (BETWEEN), or many more (IS NOT NULL, IN). Each filter can be joined using logical operators AND, OR, and NOT. The following are some examples:

SELECT id
    FROM Person
    WHERE id < 10;

SELECT id
    FROM Person
    WHERE full_name = "Barry the Bee";

SELECT id
    FROM Person
    WHERE id BETWEEN 10 AND 100;

SELECT id
    FROM Person
    WHERE age IS NOT NULL;

SELECT id
    FROM Person
    WHERE full_name LIKE "B%"; /* Everything that starts with 'B' */

You can also aggregate your data using a suite of aggregation functions: AVG, COUNT, MAX, MIN, SUM, and more. To get the average of each subset of data, use the GROUP BY clause. The attribute you group by must be a selected column. To filter data based on aggregations, use the HAVING clause. The following are some examples:

/* Get the average age across all people. */
SELECT AVG(age) FROM Person;

/* Get the number of people in each age group. */
SELECT COUNT(*), age
    FROM Person
    GROUP BY age;

/* Get the age groups with more than 10 people in them. */
SELECT COUNT(*) AS num, age
    FROM Person
    GROUP BY age
    HAVING num > 10;

Joins

Given multiple tables, it may be beneficial to combine tables on some common attribute. Given the Cat and Person relations above, it might be useful to join the Person(id) attribute with the Cat(owner_id) attribute. To do so, SQL exposes a number of JOIN operations, each one preserving a different amount of data on either side of the join.

We'll focus on CROSS JOIN and INNER JOIN. The CROSS JOIN returns every possible combination of rows from the left and right table combined. If the left table had \(m\) rows, and right right \(n\) rows, expect to have \(mn\) rows in the joined table. An example is below:

-- Get all pairs of people and cats
SELECT * FROM Person, Cat;

The INNER JOIN returns pairs of rows that share some common attribute. If a row on one of the two tables has no corresponding match in the other table, then it will simply not be included in the result. Other joins (LEFT JOIN, RIGHT JOIN, OUTER JOIN) will include elements with no matches. An example is below:

-- Get all owners and pets together
SELECT * FROM Person INNER JOIN Cat ON Person.id = Cat.owner_id;

The NATURAL (INNER) JOIN is an INNER JOIN that infers the columns that should be joined on.

Set Operations

Given the results of multiple subqueries, SQL allows you to combine the results using set operations. The UNION (ALL) operator adds the two result sets together, returning rows that are in either set (adding ALL preserves duplicates). The INTERSECT operator intersects the result sets, returning rows that are in both sets. The MINUS operator returns rows that are in the first set, but not the second. An example of how to use one of these operators is below:

-- Gets everybody below 10 years of age and above 18 years of age.
(SELECT *
    FROM Person
    WITH age < 10)
UNION -- Add the top set to the bottom set
(SELECT *
    FROM Person
    WITH age > 18)

Subqueries

Sometimes for very complex queries, it is helpful to break a query up into several smaller queries, then combine their results. This is achievable using SQL subqueries, which can be nested inside of a SELECT, INSERT, UPDATE, or DELETE statement by enclosing the subquery with parentheses. For example:

(SELECT * FROM Person WITH age == (
    SELECT MAX(age) FROM Person
));

It is often helpful to avoid nesting by naming the subqueries:

WITH maximum_age(age) AS (SELECT MAX(age) FROM Person)
SELECT * FROM Person WITH age == maximum_age.age;

Where to now?

There is a whole host of advanced SQL functions, including views, triggers, indices, functions, recursion, and more. Depending on the system you are using or building, a different set of these features may be available at varying levels of fidelity. For reference on what might be available, peruse the PostgreSQL documentation!