Following up on a series of articles that have been featured over the past few months, Peter Vogel takes a look at a variety of issues around subqueries — including those occasions when you shouldn’t use a Join.
Over the past year, Smart Access has printed a number of articles that described how subqueries worked and could be used. While the articles have covered a number of topics, there are some other things that you should know about subqueries. In this article, I’m going to look at subquery performance and when a subquery will solve problems that a Join creates. I’m also going to discuss the major problem with subqueries: performance.
The truth is that queries that include subqueries will almost always have worse performance than equivalent queries that don’t use subqueries. For instance, if you wanted to find all the items in your inventory that you had actually sold, you could create two different queries that would do the job. One query would join the Inventory table to the Orders table, based on the part number:
Select * From Inventory Inner Join Orders On Inventory.PartNo = Orders.PartNo
You could get the same results by using a subquery:
Select * From Inventory Where Inventory.PartNo In (Select Orders.PartNo From Orders)
The major difference between these two versions is that the subquery version will run more slowly. In the sample database, I’ve loaded Inventory and Orders tables with 32,000 records and have run these two queries against them. The subquery version of this problem typically took 25 percent to 33 percent longer than the Join version running against Jet 3.51. Your mileage may vary, of course, depending on your CPU, amount of memory, and database engine. In the sample database that’s available in the accompanying Download file, I’ve included the code to run this test and all of the other ones that follow in this article.
In fact, if you find that the Join and subquery versions of a query run in the same amount of time, it’s probably because your database management system’s query optimizer is rewriting your subqueries as Joins before executing them (Oracle does this, for instance). To quote Dr. E. F. Codd (the father of the relational theory), “nested versions of the same query put an unnecessary performance-oriented burden on users.”
Given that kind of performance difference, you might wonder why anyone would ever use a subquery. Well, there are at least a couple of classes of problems where using a subquery gets you results that you can actually use and a Join will not.
The relational theory defines a relation (a table) as having no duplicates. One of the purposes of the primary key in a table is to ensure that every record is unique because the field (or fields) that make up a primary key aren’t allowed to duplicate. Unfortunately, SQL wasn’t developed by Dr. Codd and isn’t relational: It’s all too easy to generate duplicate records with SQL. The sample data that appears in Table 1 and Table 2 provides a simple demonstration.
Table 1. The Inventory table.
Table 2. The Orders table.
A query to find out which parts have been sold, along with their descriptions, might look like this:
Select Inventory.PartNo, Description From Inventory Inner Join Orders On Inventory.PartNo = Orders.PartNo
Since the widget was so popular, the result (as shown in Table 3) contains duplicate records. In this case, because customer Vogel was so fond of Widgets that he bought two on two separate orders, duplicates can be eliminated only by including the OrderNo field. In SQL, to prevent duplicate records, you must use the keyword Distinct (except in Union queries, where you must use the All keyword to keep duplicates — isn’t SQL fun?).
Table 3. A recordset with duplicate records.
If duplicates are a problem for you, you’ll be happy to know that you can use a subquery to eliminate the problem. Since a subquery moves the second table into the Where clause, and since all a Where clause can do is remove records, a subquery-based query doesn’t generate duplicates. This query gives the results shown in Table 4:
Select Inventory.PartNo, Description From Inventory Where Inventory.PartNo In (Select Orders.PartNo From Orders)
Table 4. A recordset with no duplicate records.
What about performance? The Join version of the subquery still runs faster than the subquery — that is, until you add the Distinct keyword to the Join version to eliminate duplicate records. This version of the Join query runs twice as long as the subquery version:
Select Distinct Inventory.PartNo, Description From Inventory Inner Join Orders On Inventory.PartNo = Orders.PartNo
Another key feature of the subquery version of the query is that its result is always updateable. In a query that joins one table to another table in a one-to-many (or many-to-many) relationship, the resulting recordset might not be updateable. Jet and SQL Server will, on occasion, allow you to update both the “one” and the “many” side of the Join. Many database engines will allow you only to update the “many” side, and some database engines won’t allow you to update either. In this query, for instance, Jet will allow you to update the Customer Name and the Part Description, but other databases won’t:
Select Inventory.PartNo, Description, CustName From Inventory Inner Join Orders On Inventory.PartNo = Orders.PartNo
But, regardless of the database engine, the Distinct version of the Join query is never updateable. It’s not even updateable if no records are eliminated by the Distinct: The mere fact of inserting the Distinct keyword will make the query read-only.
In all of these versions of the query, the subquery version remains updateable. The reason is simple: Since the subquery only removes records from the table, the result is always the original table (with some records removed). Ignoring security, a table will always be updateable.
Making things faster
So, if subqueries are so useful, is there anything that can be done to speed them up? Frequently, the answer is yes, but it depends on what you’re trying to accomplish with your query. Unlike applying indexes, which speed up any query that depends on the indexed fields, a subquery must be rewritten to run faster.
You can classify subqueries into two groups. First, there are subqueries that may return multiple records. My subquery examples so far have been of that variety. Subqueries that return multiple records are typically joined to the main query with In, All, or Any. Here’s an example that finds the items in inventory that are more expensive than the items that have actually been sold:
Select * From Inventory Where Cost > All (Select Cost From Orders)
When you create a subquery that returns many records, the database engine must run the subquery at least once and store the results in a temporary table (probably unindexed). As each row is retrieved in the main query, this temporary table must be searched to satisfy the criteria of the Where clause.
The second set of subqueries, however, can return only one record. This single result can be held in memory and compared against each row of the main query without much cost. The preceding subquery could be rewritten to return only one record as follows:
Select * From Inventory Where Cost > (Select Max(Cost) From Orders)
This version of the query runs about twice as fast as the original version, using the sample data I generated for the queries at the beginning of this article. In general, using a subquery that returns only one record will give you faster performance than a subquery that returns many.
Another way to reduce the number of records generated is to use a co-related subquery. A co-related subquery ties the subquery to the main query through the use of aliases, so that the subquery is run once for each record in the main query. This might initially sound like there will be more time spent in the subquery rather than less. It’s worth remembering, though, that the main query might produce no records at all, in which case the co-related subquery won’t be run at all. Further, a co-related subquery only retrieves records related to the current record in the main query. There’s real potential for there to be fewer “selected” than the records retrieved for an unrestricted subquery.
As an example, I’ll use the subquery that started this article. This query finds all the items in inventory that have been sold by using a subquery:
Select * From Inventory Where Inventory.PartNo In (Select Orders.PartNo From Orders)
This query, on the other hand, uses a co-related subquery to check each item in inventory to see whether it has a match in the Orders table:
Select * From Inventory As I Where Exists (Select Orders.PartNo From Orders Where I.PartNo = Orders.PartNo)
To check the Where clause in this example, the database engine only needs to check to see whether there’s a PartNo in the Orders table that matches the current PartNo in the main query. If the PartNo field in the Orders table is indexed (and, in my sample database, it is), then it just means checking to see whether there’s a hit in the PartNo index. Since there’s a very good chance that the index is held in memory, this can be very quick indeed. My testing indicates that the Exists version runs 50 to 100 percent faster than the In version and, by the way, as fast or faster than the related Join version of the query.
While I’ve presented the difference between In and Exists as a cut-and-dried distinction based entirely on performance, it isn’t that simple. To account for the difference in performance between the In and the Exists, you’ll notice that I resorted to describing how the database engine satisfies the query. Every database engine will have its own special way of resolving subqueries and might not follow the scenario I’ve listed. The result might be that, on some DBMSs, the Exists will run faster, slower, or just the same as the In. Remember that Oracle, for instance, generally converts subqueries to Joins before executing, so that Joins and subqueries will end up with identical performance. In the end, you must do some simple testing on your own database system to find what the optimal code is. Fortunately, with SQL, generating a query or a table full of data to test it on isn’t hard to do.
Subqueries are tremendously useful tools but have a real performance cost when they’re used instead of a Join. On the other hand, subqueries can deliver the goods when a Join can’t. Knowing when to use a subquery, and how to use it efficiently, can mean the difference between lots of slow, awkward code and a tight, fast application.
Sidebar: “In” Doesn’t Equal “Exists”
The article’s examples might give you the impression that In and Exists can be used interchangeably, with the only difference being in their relative performance. That isn’t strictly true if you use the Not operator with In or Exists, where there’s a possibility of a Null in the result.
In the simplest possible situation, imagine that the Orders table contains one record that has a Null in the PartNo field. You first run this query to find all the items in Inventory that have never been sold:
Select * From Inventory Where PartNo Not In (Select PartNo From Orders)
You’ll get no records back, which is obviously wrong since none of the items in the Inventory table have ever been sold (remember, there’s one record in the Orders table, and it has a Null in the PartNo field).
Having read this article, you then decide to speed up the application by re-writing the query to use Exists and a co-related subquery:
Select * From Inventory As I Where Not Exists (Select PartNo From Orders Where I.PartNo = Orders.PartNo)
This query not only runs faster, it also returns all of the records in the Inventory table, which has the additional benefit of being the right answer.
Why the different results? The subquery in the In version of the query returns a list consisting of a single Null. When the In clause checks for a match against the Null, it returns “Unknown” (the only result that you can get when testing against Null). Since the In returns Unknown, Not In also returns Unknown. Because Unknown isn’t True, the Where clause isn’t True. When the Where clause isn’t True, no records are returned.
The Exists statement, however, just checks for a match in the index. Since Null doesn’t match any PartNo, no records are returned. When no records are returned, the Exists is False and the Not Exists is True. The result is that the Where clause that the Not Exists is part of is also True. When the Where clause is True, records are returned.
Don’t think that this will only occur when there’s a single record in the subquery table. As long as there’s even one record in the subquery that returns Null, the Not In will produce Unknown and return every record in the main query, no matter how many other valid records there are. It’s just a good idea to stay away from Not In.