SQL Query Optimization FAQ Part 1 (The SQL Plan)
Introduction and Goal
What is a SQL execution Plan?
Can we understand step by step how SQL Query plan is created and executed?
Ok, so where do I start from?
What is a Table Scan (Operator number 1)?
Can performance increase by adding a unique key for table scans?
What are physical and logical reads?
Table scan doesn't look to be efficient?
What is a seek scan (Operator number 2)?
Can we see a seek scan practically?
What are heap tables in SQL Server?
What is RID Lookup heap in SQL query plan? (Operator number 3)
What is a bookmark lookup?
Lookups are costly operation, how can we eliminate the same?
What are covering indexes?
Detail links which discuss Physical and logical read
In this article we first try to understand what is a SQL plan, how is it created and then we will move towards understanding how to read the SQL plan. As we read the SQL plan we will try to understand different operators like table scan, index seek scan, clustered scan etc. As we understand the logic of these operators we will also try to understand what the best practices are in different situations are.
Here's my small gift for all my .NET friends , a complete 400 pages FAQ Ebook which covers various .NET technologies like Azure , WCF , WWF , Silverlight , WPF , SharePoint and lot more. http://www.questpond.com
Every SQL query is broken down in to series of execution steps called as operators. Each operator performs basic operations like insertion, search, scan, updation, aggregation etc. There are 2 kinds of operators Logical operators and physical operators.
Logical operators describe how the execution will be executed at a conceptual level while physical operators are the actual logic / routine which perform the action.
On the query plan you always physical operators. One logical operator can map to multiple physical operator. It can also go vice versa but that's a rare scenario. Some operators are both physical as well as logical. You can check out all the logical and physical operators for SQL Server at http://msdn.microsoft.com/en-us/library/ms191158.aspx . Below table shows some sample mapping between logical and physical operators.
||. Merge join|
. Nested loops
There are three phases through which a SQL is parsed and executed.
Parse: - The first phase is to parse the SQL query for syntaxes and create a query processor tree which defines logical steps to execute the SQL. This process is also called as 'algebrizer'.
Optimize: - The next step is to find a optimized way of executing the query processor tree defined by the 'algebrizer'. This task is done by using 'Optimizer'.'Optimizer' takes data statistics like how many rows, how many unique data exist in the rows, do the table span over more than one page etc. In other words it takes information about data's data. These all statistics are taken, the query processor tree is taken and a cost based plan is prepared using resources, CPU and I/O. The optimizer generates and evaluates many plan using the data statistics, query processor tree, CPU cost, I/O cost etc to choose the best plan.
The optimizer arrives to an estimated plan, for this estimated plan it tries to find an actual execution plan in the cache. Estimated plan is basically which comes out from the optimizer and actual plan is the one which is generated once the query is actually executed.
Execute: - The final step is to execute the plan which is sent by the optimizer.
Till now we have understood operators and the way SQL plan is created and executed. So the first thing is to understand different operators and their logic. As discussed previously operator's form the basic unit of you SQL plan, if we are able to understand them we can optimize our SQL to a great extent. So let's understand the basic and important operators.
Before we try to understand table scan operator let's try to see how can we see a SQL plan. Once you are in your SQL management studio, click on new query and write the SQL for which you want to see the SQL plan. Click on the icon which is displayed on the query window as shown below. Once you hit the icon you will see the query plan.
Now that we have understood how to see a query plan let try to understand the first basic operator i.e. table scan. In order to understand the same we have created a simple companies table with company code and description. Do not create any primary key or indexes. Now write a simple SQL with a select clause on one of the properties, we have selected company code currently as shown in the above figure.
If you hit on the query plan icon you will see table scan in your query plan. Now let's try to understand what it signifies.
If there is a primary key table scan operator scans row by row every record until it finds the exact record. For instance in our search criteria we have given '500019' as the company code. In table scan it goes row by row until it finds the record. Once it gets the record it sends the same as output to the end client.
In case the table does not have a primary key, it will continue searching ahead to find more matches if there are any.
Yes, the performance increases if you add a unique key for table scan search criteria. Let's see an actual demonstration of the same. So right click on the table field and create a unique key on customer code field as shown in the below figure.
Now let's go to SQL Server analyzer and set the statistics to 'ON' using 'set statistics io on'. With the statistics we also execute our query with the customer code criteria as shown in the below code snippet.
set statistics io on
SELECT TOP 1000 [CustomerCode]
If you click on messages you will see logical reads as '3'.
(1 row(s) affected)
Table 'Customers'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Now let's go and remove the unique key the logical reads are 17, which means performance has degraded as compared with unique key.
(1 row(s) affected)
Table 'Customers'. Scan count 1, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The main work of database is to store and retrieve data. In other words lot of reads and writes to the disk. Read and writes consume lot of resources and take long time for completion. SQL server allocates virtual memory for cache to speed up I/O operations. Every instance of SQL server has its own cache.
When data is read it is stored in the SQL cache until it's not referenced or the cache is needed for some purpose. So rather than reading from physical disk its read from SQL cache. In the same way for writing, data is written back to disk only if it's modified.
So when data is fetched from SQL cache its terms as logical read and when its read from physical database its termed as physical read.
Table scan operators are good if the numbers of records in the tables are less. If the numbers of records in the tables are more, then table scan is very inefficient as it needs to scan row by row to get to the record.
So for big number of rows in a table rather than table scan operator, seek scan operator is preferred.
Seek scan does not scan all the rows to go to a record, it uses indexes and b-tree logic to get to a record. Below is a sample diagram, which explains how B-Tree fundamental works. The below diagram shows how index will work for number from 1-50:-
. Let us say you want to search 39. SQL Server will first start from the first node i.e. root node.
. It will see that the number is greater than 30, so it moves to the 50 node.
. Further in Non-Leaf nodes it compares is it more than 40 or less than 40. As it's less than 40 it loops through the leaf nodes which belong to 40 nodes.
In table scan it scans all rows while in seek scan it scans less number of rows comparatively. For instance to get to the record 39 it only scanned 9 records rather than travelling through all 50 records.
On the same above sample lets create a clustered index on the company code field and see the SQL plan again by hitting the SQL plan icon.
If you view the SQL plan for the select query with company code you will see a clustered index seek operator as shown in the below figure. In other words it says that it will use b-tree logic for searching rather than traversing row by row.
In case you have a non-clustered index on the company code it will show a non-clustered index operator as shown below. You may be wondering what the RID lookup and nested loop is in the query plan, we will come to that later on.
The below figure indicates that its using the non-clustered index logical operator.
A table with no clustered indexes is termed as heap tables. The data rows of a heap table are not sorted, due to this there is a huge overhead of accessing heap tables.
RID lookup will be seen when you use non-clustered indexes with join queries. For instance below is a simple join with customer and address table on 'custcode'.
SELECT Customers.CustomerName, Address.Address1
FROM Address INNER JOIN Customers ON Address.Custcode_fk = Customers.CustomerCode
If you do not put indexes on the primary table i.e. customer at this moment and you see the SQL plan, you should see the RID lookup heap as shown below. We will understand what it means, just try to get a glimpse of this operator first.
In order to understand the RID look up we need to first understand how non-clustered indexes work with clustered indexes and heap tables. Below is a simple figure which describes the working and their relationships.
Non-clustered indexes also use the B-tree structure fundamental to search data. In non-clustered indexes the leaf node is a 'Rowid' which points to different things depending on two scenarios:-
Scenario 1:- If the table which is having the primary key in the join has a clustered index on the join key (currently the join key is the custcode) then the leaf nodes i.e. 'rowid' will point to the index key of clustered index as shown below.
Scenario 2 :- if the table which is having the primary does not have a clustered index then the non-clustered index leaf node 'rowid' will point to actual row on the heap table. As the data is stored in a different heap table, it uses the lookup (i.e. RID lookup) to get to the actual row. Below is a query plan which has a join without clustered indexes.
As discussed in the previous section, when query searches on a column which is not a part of non-clustered index a lookup is required. As mentioned earlier either you need to lookup on the clustered index or you need to lookup on the heap tables i.e. RID lookup.
The old definition for these lookup was called as 'Bookmark' look up. If you are seeing SQL plan using SQL 2000 you should see a bookmark lookup in your query plan. In SQL 2000 bookmark loop up used a dedicated iterator to determine whether the table is heap table or index table and change the search logic accordingly.
So cutting short index lookup and RID lookup are nothing but types of bookmark lookup.
Yes, lookups are costly operations and should be avoided as far as possible. To avoid RID lookup or clustered lookup we can use covering indexes.
Below is a simple table which has two fields customer code and customer name. Clustered indexes are defined on Customer code and non-clustered index is defined on customer name as shown in the table figure below.
Lets fire the below and see the query plan for the same. Please note we are searching on the non-clustered index column.
FROM [CustomerNew].[dbo].[Customers] where CustomerName='20 microns'
As the non-clustered index needs to do a lookup on the clustered index it uses the RID lookup to get the clustered index. This lookup happens because the customer name non-clustered index does not have information about the clustered index data. If somehow we are able to make aware of clustered index data to non-clustered index the RID lookup can be avoided.
In order that the non-clustered index i.e. customer name is aware of the clustered index i.e. customer code we can create a combined non-clustered index on customer and customer doe as shown in the below figure.
Now if you generate the plan you can see that RID lookup is completely eliminated. The customer code key in this current scenario is called as the covering index.
As explained above. It helps to remove the lookup operation.
. Execution plan basics by Grant FritChey http://www.simple-talk.com/sql/performance/execution-plan-basics/
. The best guide on logical and physical operators http://msdn.microsoft.com/en-us/library/ms191158.aspx
. Inside Microsoft SQL Server 2005: Query Tuning and Optimization by Kalen Delaney