Query Evaluation
This chapter is all about query evaluation. We've learned about the query languages and operations available to the user, and we've learned how to store data on disk efficiently. In this chapter we learn how the query engine unifies these two concepts to efficiently answer complex questions about data. We will learn about query optimization and query execution, as well as the numerous techniques used in each step.
The Query Engine
Database systems exist to automate the work of describing how data should be stored and retrieved; as a result, they should also perform well no matter how the user specifies the data they wish to retrieve. As much as possible, the database should find the fastest way to produce a valid result for the user's query, no matter how convoluted that query is.
The life of a query usually follows three steps:
- Parsing: A parsing application transforms raw SQL into a logical query plan, typically represented by a tree.
- Optimization: The logical query plan is transformed via a set of rules to an optimized query plan, which is then realized into a physical query plan, typically by comparing multiple candidate physical plans by their estimated cost.
- Execution: The physical plan is executed by the database.
The first and last steps aren't nearly as interesting as the second. First, some definitions. A logical query plan specifies the operations the database should take to transform the data, but not the specific algorithms or techniques it should use; it is essentially a tree version of a SQL statement. A physical query plan, on the other hand, is a logical query plan annotated with exactly which algorithm to use at each step. Query optimization, therefore, transforms a logical query plan into a (close to optimal) physical query plan. Simple, right?
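To make the distinction concrete, here is a minimal sketch in Python of a logical join node and its physical counterpart; the class names and algorithm labels are made up for illustration and don't correspond to any particular system's internals.

from dataclasses import dataclass

# Logical plan node: describes *what* to compute, not *how*.
@dataclass
class LogicalJoin:
    left: object        # child logical plan node (here just a string placeholder)
    right: object       # child logical plan node
    condition: str      # e.g. "orders.cust_id = customers.id"

# Physical plan node: the same operation, plus a concrete algorithm choice.
@dataclass
class PhysicalJoin:
    left: object
    right: object
    condition: str
    algorithm: str      # e.g. "hash_join", "block_nested_loop", "sort_merge"

# The optimizer's job, at its core, is to turn the former into the latter
# (and to pick the cheapest among the many valid ways of doing so).
logical = LogicalJoin(left="scan(orders)", right="scan(customers)",
                      condition="orders.cust_id = customers.id")
physical = PhysicalJoin(logical.left, logical.right, logical.condition,
                        algorithm="hash_join")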
Unfortunately, query optimization is an incredibly hard problem, especially for larger queries. As we know from studying relational algebra, there are infinitely many equivalent query plans, and a potentially huge number of equivalent normalized query plans. Moreover, even if we could enumerate all possible query plans, deciding which one is best is difficult - there isn't a great way to know how long a query plan will take to execute without actually executing it.
We will first go over how a logical query plan can be manipulated to perform better, using no knowledge beyond the query itself, via rule-based optimization. Then, we will survey some of the more interesting algorithms that a query optimizer may select from, and outline how their costs might be estimated in a cost-based optimizer.
Rule-Based Optimization
From relational algebra, we know that there are certain transformations we can apply to a query plan without altering its final result. In fact, it turns out that a few of these transformations will almost always result in faster execution times; we call these transformation rules. Enumerating and applying these rules is the job of a rule-based optimizer. We will go over a few common rules.
Consider a query plan in which a filter is applied to the result of a join, but the filter's condition only references attributes from one of the two joined relations. Applying that filter to the relevant relation before the join (rather than after) shrinks the join's input, and therefore the work the join has to do, without changing the result.
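For instance, one classic rule of this shape (commonly called selection or predicate pushdown; R, S, and the condition θ are placeholder names here) pushes a filter below a join when the filter only mentions attributes of one input:

\sigma_{\theta}(R \bowtie S) \;\equiv\; \sigma_{\theta}(R) \bowtie S \qquad \text{when } \theta \text{ references only attributes of } R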
Consider a query plan with a complicated conjunctive condition in a filter. Splitting the conjunction into a cascade of simpler filters does no harm on its own, but it frees each individual conjunct to be pushed further down the plan (or matched against an index) independently.
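In relational algebra, this corresponds to cascading the selection (again with placeholder names):

\sigma_{\theta_1 \wedge \theta_2}(R) \;\equiv\; \sigma_{\theta_1}\!\left(\sigma_{\theta_2}(R)\right)

Each of the smaller selections can then be moved around independently by the other rules.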
Consider a query plan that only outputs a few of the columns of the relations it reads. Discarding the unneeded columns as early as possible (while retaining any columns still needed by later joins or filters) keeps intermediate tuples narrow, so more of them fit in each block and in memory.
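In relational algebra, this is projection pushdown (placeholder names again); note that the join key must be kept until the join has used it:

\pi_{L}\left(R \bowtie_{R.k = S.k} S\right) \;\equiv\; \pi_{L}\!\left(\pi_{L_R \cup \{R.k\}}(R) \bowtie_{R.k = S.k} \pi_{L_S \cup \{S.k\}}(S)\right)

where L_R and L_S are the attributes of the output list L that come from R and S respectively.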
Many such rules can be dreamt up and used, and depending on slight changes in the data and execution model, different rules may be effective. Whether to apply each rule once, many times, or repeatedly until convergence is up to the database designer.
Query Execution Algorithms
Now that we've reduced our logical query plan to a more optimized form, we'll need to decide how exactly we should execute it; that is, which algorithms we should ask our database engine to use to evaluate each step of the query plan. Indeed, for each operation we could be doing, multiple valid algorithms exist with different time and space complexities. We'll gloss over scans, then go over the canonical algorithms for joins (algorithms for selection boil down to using indexes when we can, and for the rest are not particularly interesting).
Scans
In order to actually have any data in memory to work with, we must first scan it from our tables. For the most part, we have two main choices: scan the relevant blocks directly without an index, or rely on an index and do an index scan. While it might seem that an index scan is always superior, different scanning approaches suit different queries. For example, if a query is a restrictive range query over a non-primary attribute that has an index over it, then using that index might make sense; however, if the range query isn't particularly restrictive, then using the secondary index might cause many unnecessary page evictions as we access pages out of order. In that case it may be smarter to simply read all of the rows and then filter them. Another consideration is sortedness: if we would like to apply a merge join later, scanning in sorted order now might let us avoid a separate sort step, which can save substantial time. Optimizing for all these cases is difficult, and often left to a cost-based optimizer.
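As a rough illustration of this trade-off, here is a deliberately simplified sketch of the kind of comparison a planner might make; the cost model and the random_io_penalty constant are assumptions for illustration, not any real system's formula.

def estimate_scan_cost(num_blocks, num_rows, selectivity, random_io_penalty=4.0):
    """Very rough block-transfer estimates for a full scan vs. a secondary-index scan.

    selectivity: estimated fraction of rows the filter keeps (0.0 - 1.0).
    random_io_penalty: assumed factor by which a random page read is more
                       expensive than a sequential one (illustrative only).
    """
    full_scan = num_blocks                           # read every block, sequentially
    # A secondary index scan may touch one page per matching row, out of order.
    index_scan = selectivity * num_rows * random_io_penalty
    return full_scan, index_scan

# A restrictive predicate (0.1% of rows) favors the index...
print(estimate_scan_cost(num_blocks=10_000, num_rows=1_000_000, selectivity=0.001))
# ...while a loose one (20% of rows) favors simply scanning everything.
print(estimate_scan_cost(num_blocks=10_000, num_rows=1_000_000, selectivity=0.2))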
Loop Joins
The most basic join algorithm is the loop join! The loop join essentially works as follows:
for row s in S:
for row t in T:
if s.key == t.key:
output(s, t)
This join is called the nested loop join. It might not be immediately clear why, but this is probably the worst possible approach. Thinking about this in terms of block transfers: if the outer relation S has n_S rows and occupies b_S blocks, and the inner relation T occupies b_T blocks, then we re-read all of T once for every single row of S, for roughly n_S * b_T + b_S block transfers.
We can do a little better. Instead of going through each row in S individually and re-reading all of T per row, we can read a block of S at a time and re-read T only once per block, matching rows a block at a time:
for block b_s in S:
for block b_t in T:
for row s in b_s:
for row t in b_t:
if s.key == t.key:
output(s, t)
This join is called the block nested loop join and is much better: we now scan T once per block of S rather than once per row, for roughly b_S * b_T + b_S block transfers. We can do even better by choosing the smaller relation as the outer relation, and by devoting as many buffer pages as we can spare to holding outer blocks on each pass. In cases where we only need the first few results (say, as a result of the LIMIT keyword), or in cases where one of our relations is sufficiently small, a block nested loop join is a great option.
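To see how much the block-at-a-time variant saves, here is a back-of-the-envelope comparison of the two cost formulas above; the table sizes are hypothetical.

def nested_loop_cost(rows_outer, blocks_outer, blocks_inner):
    # One full scan of the inner relation per outer *row*, plus reading the outer once.
    return rows_outer * blocks_inner + blocks_outer

def block_nested_loop_cost(blocks_outer, blocks_inner):
    # One full scan of the inner relation per outer *block*, plus reading the outer once.
    return blocks_outer * blocks_inner + blocks_outer

# Hypothetical tables: S has 100,000 rows in 1,000 blocks; T occupies 500 blocks.
print(nested_loop_cost(100_000, 1_000, 500))    # 50,001,000 block transfers
print(block_nested_loop_cost(1_000, 500))       # 501,000 block transfers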
Hash Joins
When performing a classic hash join on two tables, we have to find tuples that match on some key or value across both tables. To do so, the classic hash join builds a hash table over the inner table, and then linearly scans the outer table, probing the hash table for matching values. However, due to the nature of hash functions, these tuples are distributed at random throughout the inner hash table. If the hash table is too large to reside entirely in memory, then we have to fetch data from disk at random, which greatly increases our disk I/O cost because pages keep getting churned through our buffer cache. To illustrate this point, imagine that our inner table gets hashed into 100 buckets, and memory can only hold 8 of them. For each tuple in the outer table, we are likely going to need to go to a different bucket. So, we end up doing roughly one block transfer per tuple in the outer table; at this pace, we may as well have done a block nested loop join.
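Here is a minimal in-memory sketch of the classic hash join, assuming both tables are small lists of Python dicts (which is exactly the assumption that breaks down once the hash table no longer fits in memory):

from collections import defaultdict

def classic_hash_join(inner, outer, key):
    """Build a hash table on `inner`, then probe it with each row of `outer`."""
    # Build phase: group inner rows by their join-key value.
    table = defaultdict(list)
    for row in inner:
        table[row[key]].append(row)
    # Probe phase: for each outer row, emit all matching inner rows.
    for row in outer:
        for match in table.get(row[key], []):
            yield match, row

inner = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
outer = [{"id": 2, "total": 30}, {"id": 3, "total": 99}]
print(list(classic_hash_join(inner, outer, "id")))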
Luckily, the Grace hash join offers us a solution. Let's say we have two tables, denoted S and T. The Grace hash join begins by partitioning both S and T into buckets on disk, using the same hash function applied to the join key for both relations; any two tuples that could possibly join with each other are therefore guaranteed to land in buckets with the same number.
After partitioning our two relations and resizing our hash tables so that both relations end up with the same number of buckets, we then load corresponding pairs of buckets into memory and match up pairs of tuples. Concretely, we load the i-th bucket from S and the i-th bucket from T into memory together, and output every pair of tuples between them that agrees on the join key.
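The partitioning idea is easier to see in code. Below is a simplified, purely in-memory sketch of the Grace hash join; a real implementation would spill each bucket to disk as it fills and would typically use a second hash function for the per-bucket join, but the structure is the same.

def grace_hash_join(s_rows, t_rows, key, num_buckets=8):
    """Partition both inputs with the same hash function, then join bucket i with bucket i."""
    def partition(rows):
        buckets = [[] for _ in range(num_buckets)]
        for row in rows:
            buckets[hash(row[key]) % num_buckets].append(row)
        return buckets

    s_buckets, t_buckets = partition(s_rows), partition(t_rows)
    # Matching tuples always hash to the same bucket number, so it suffices to
    # join corresponding pairs of buckets, one pair at a time.
    for s_bucket, t_bucket in zip(s_buckets, t_buckets):
        lookup = {}
        for s in s_bucket:                      # build on one side of the pair
            lookup.setdefault(s[key], []).append(s)
        for t in t_bucket:                      # probe with the other side
            for s in lookup.get(t[key], []):
                yield s, t

rows_s = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
rows_t = [{"k": 2, "b": 10}, {"k": 3, "b": 20}]
print(list(grace_hash_join(rows_s, rows_t, "k")))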
The main issue to worry about is duplication: if we ever process the same pair of buckets twice, we may end up outputting duplicate join results. This is possible if we were to use extendible hashing and weren't careful about our pointers, since multiple directory entries can point to the same physical bucket. Thus, it is important to be sure that we never process the same pair of buckets twice; keeping track of which pairs of buckets we have already seen solves this problem.
Also notice that each pair of buckets has a disjoint result set (think about what it would mean if result sets weren't disjoint), indicating that there is room for parallelization. Clearly, this is a problem with many possible optimizations; we will explore in depth one such optimization through the use of a Bloom filter.
Bloom Filters
One inefficiency in a hash join is that you potentially have to iterate through an entire bucket to find a value. To do that for every search value is really expensive - if we could speed up our search time somehow, that would be a huge win! Unfortunately, maintaining a bucket ordering doesn't really make sense just for a one-time search, and building an index over such a small data structure is overkill.
Bloom filters are a lightweight solution to the sparse search problem. A Bloom filter is a probabilistic data structure that can tell us one of two things about an element: that it is definitely not in a set, or that it is possibly in the set.
A Bloom filter has two parts: a bitset (essentially an array of m bits, all initially set to 0) and a small set of k independent hash functions. To insert an element, we hash it with each of the k hash functions and set each of the corresponding bit positions to 1.
Next, when we want to check whether a particular element is in the set, we hash it with the same k hash functions and look at the corresponding bit positions. If any of those bits is 0, the element is definitely not in the set; if all of them are 1, the element is possibly in the set (or we may have hit a false positive caused by other insertions).
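A minimal Bloom filter sketch follows; the choice of 64 bits and 3 hash functions is arbitrary, and real deployments size both parameters from the expected number of insertions and the tolerable false-positive rate.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.m = num_bits
        self.k = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different "salts".
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("key-42")
print(bf.might_contain("key-42"))   # True
print(bf.might_contain("key-99"))   # almost certainly False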
Applying this to the Grace hash join, we construct a Bloom filter over each bucket of the inner relation; then, during the probing step, we first check each outer tuple's join key against the corresponding bucket's filter. If the filter says the key is definitely not present, we can skip searching that bucket for the tuple entirely, saving ourselves the scan.
Sort Merge Join and External Sort
In the case that our input relations are sorted on the join key, we can use a sort merge join. A sort merge join essentially keeps one pointer per relation and runs through both relations in order, outputting pairs that match on the join key. This join exploits the fact that, when joining over a sorted field, matching tuples will be in the same relative region of their respective sorted tables. Special care must be taken when there are duplicates, but this is overall a rather intuitive algorithm to describe and implement. The difficult part is sorting the relation.
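Before turning to sorting, here is a minimal sketch of the merge step itself, assuming both inputs are already lists of rows sorted on the join key; the inner while loops handle runs of duplicate keys so that every matching pair is emitted.

def sort_merge_join(left, right, key):
    """Join two lists of dicts, each already sorted on `key`."""
    results, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Found a matching key: collect the full run of equal keys on each
            # side, emit their cross product, then continue past both runs.
            k = left[i][key]
            i_end = i
            while i_end < len(left) and left[i_end][key] == k:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == k:
                j_end += 1
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    results.append((l_row, r_row))
            i, j = i_end, j_end
    return results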
If the relation isn't already sorted on the join key (for instance, when the join key isn't the primary key), then we have to do some work to sort it. Thankfully, the classic merge sort algorithm generalizes well to our setting, because we can treat each block like an array segment. However, instead of doing the entire sort in memory, we write out intermediate sorted blocks. This algorithm is known as external sort. The algorithm proceeds as follows:
- Sort each block in memory and write it back out to disk. Each block is now its own sorted chunk.
- On iteration i, read the first block of each group of adjacent sorted chunks from iteration i-1 into memory and merge them, using a block-sized output buffer. As this buffer fills up, write it out to a new, longer chunk, refilling each input buffer from its chunk as it runs out. Proceed for all chunks from iteration i-1.
- Proceed until the whole table is a single sorted chunk.
Notice that when we are on iteration i, each sorted chunk has grown several blocks long (doubling in length each iteration if chunks are merged in pairs), so only a logarithmic number of passes over the table are needed, and each pass holds only a handful of blocks in memory at a time.
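Here is a compact sketch of external sort over an in-memory list, using heapq.merge to stand in for the block-by-block merge of two chunks; a real system streams blocks through fixed-size buffers and writes runs to disk, but the structure of the passes is the same (the block_size of 4 is just for illustration).

import heapq

def external_sort(rows, block_size=4):
    """Sort `rows` the way an external sort would: sorted runs, then repeated merges."""
    # Pass 0: cut the input into block-sized chunks and sort each one independently
    # (on a real system, each sorted chunk would be written back out to disk).
    runs = [sorted(rows[i:i + block_size]) for i in range(0, len(rows), block_size)]
    # Each later pass merges adjacent pairs of runs into longer runs.
    while len(runs) > 1:
        runs = [list(heapq.merge(runs[i], runs[i + 1])) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []

print(external_sort([9, 3, 7, 1, 8, 2, 6, 4, 5]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]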
Cost-Based Optimization
To decide on a query plan, the optimizer repeatedly proposes a physical plan and estimates the runtime, or cost, of that physical plan until it reaches a point where it is satisfied. This is a delicate balancing act between finding a good physical plan and wasting time looking for a better physical plan; indeed, the query optimization problem is very difficult, and achieving optimality for large enough queries is often impossible. To see why, remember that joins are both commutative and associative; as a result of commutativity alone, a query joining n relations already has n! possible join orders, and associativity multiplies the number of distinct join trees even further.
The optimizer can't compute cost without some statistics about the database. Most databases collect metadata about the size of their tables, the relative density and distribution of the data, and other metrics that may be useful in query planning. These values can be very useful in estimating how long a join will take to run and how large its output will be, and with them the optimizer can recursively build up cost estimates to compare physical plans. The hardest challenge in optimizing a query is deciding on a join ordering, and then on which join algorithms to use. We would like joins to produce small intermediate tables as early as possible, but it is impossible to know the exact size of a join before executing it. However, using information about table sizes, we can compute a decent estimate and use it to choose an ordering.
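As an illustration of this kind of estimate (the formula below is the standard textbook approximation for an equi-join, and the statistics are made up), the optimizer might approximate a join's output size from table cardinalities and the number of distinct join-key values in each table:

def estimate_join_size(rows_r, rows_s, distinct_r, distinct_s):
    """Classic estimate for the output size of an equi-join on a shared attribute.

    rows_r, rows_s:         cardinalities of the two input tables
    distinct_r, distinct_s: number of distinct join-key values in each table
    """
    return (rows_r * rows_s) / max(distinct_r, distinct_s)

# Hypothetical statistics: 1M orders joining 10k customers on customer id.
print(estimate_join_size(1_000_000, 10_000, distinct_r=10_000, distinct_s=10_000))
# -> 1,000,000.0: each order matches (about) one customer, as we'd expect.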
There are a few difficulties with this approach. First, once we start making estimates based on other estimated values, our margin of error compounds rapidly. Indeed, cost estimation is an incredibly inexact science, and query plans often perform very differently than expected. Moreover, because it is nearly impossible to explore the entire search space, deciding which plan to evaluate next is also a challenge. We leave further exploration of the realm of query optimization for another text.
Code Generation
At the end of the road, we've decided on a query plan, and for each relational node, an algorithm to use. Now, we will convert this query plan into executable code for our database engine to ingest and run. As this is more a concern for compiler-writers, we eschew a lengthy discussion of code generation.