The SQL quiz for the week ending 2 December tested your knowledge of the circumstances under which Oracle would use an index to optimize a SQL statement.
Two players registered concerns about scoring the first choice as correct. This choice was:
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
My intention was to isolate a small enough sampling of the total data set so as to invoke the index - and to also point out (through other choices) that if you use the column inside an expression, the index will not be used.
Here are the comment; I look forward to your thoughts.
Steven Feuerstein
1. I think, that the first choice for this quiz is ambiguous. The problem is, that the query selects about 3% of the rows. With the index in place, the database can choose either table full scan or index range scan followed by table access by rowid to run the query. Based on the assumptions, it's impossible to tell, which execution plan will be more efficient and hence whether the index improves query performance. In real world situation, we could expect, that the table is "almost" ordered by hire_date, which means that the clustering factor of the index would be low and the index access path would be more efficient than full table scan. However, there was no such assumption specified for this quiz. It would be better to include last_name in the index or select just a single month or a week instead of a year.
2. I have performed some research around the topic of this quiz, just to verify a little bit more deeper the case of choice number 1, which selects about 1/31 of the rows ( 1 year out of 31 years ) and it seemed to be the most obvious among all the choices. But query optimization is a huge topic, never ending and always dynamic, sometimes very frustrating and always exciting to research. I have performed some research around the topic of this quiz, just to verify a little bit more deeper the case of choice number 1, which selects about 1/31 of the rows ( 1 year out of 31 years ).
That is, though the query returns about 3% of the rows only, this is still not a complete guaranty that the index will be used in any case. It highly depends on the clustering factor of the index, or, in other words, it depends on how the rows were loaded into the table, in which degree they were loaded ordered by the HIRE_DATE column.
Below are two test cases, one in which the data is loaded with minimum clustering of the hire_date values (values generated in a circular fashion) and the other in which the values are highly clustered.
In both cases I used a direct-path insert for creating the test data faster.Then, for each case, I created the index and checked the execution plan:
a. without any statistics collected, which means that dynamic sampling will be used
b. with statistics collected using 20% sampling
c. with complete statistics collected.
Also, I use two slightly different queries:
1. One query identical with that of choice 1 in the quiz:
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
2. One that selects the HIRE_DATE instead of the LAST_NAME
SELECT hire_date
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
The conclusions are as follows:
1. When the rows are loaded cyclically by HIRE_DATE (not clustered, but dispersed), then the index is NOT used. Here dynamic sampling accurately estimates cardinality.
2. When the rows are loaded clustered by HIRE_DATE, then the index is always used, even for high cardinality estimated. Here dynamic sampling highly underestimates cardinality.
3. The index is always used when the query can be resolved entirely using the index only ( ex. selecting HIRE_DATE instead of LAST_NAME ).
See below the two test cases:
SET TIMING ON
SET LINESIZE 130
SET PAGESIZE 0
----------------------------------------------------------
CASE 1 -- data loaded in a circular fashion
----------------------------------------------------------
CREATE TABLE plch_employees
(
employee_id INTEGER
, last_name VARCHAR2 (100)
, first_name VARCHAR2 (100)
, hire_date DATE
, salary NUMBER
)
/
Table created.
Elapsed: 00:00:02.02
-- create a test case (10,000,000 rows) using smaller steps to avoid error
-- ORA-30009: Not enough memory for CONNECT BY operation
DECLARE
K PLS_INTEGER;
BEGIN
FOR I IN 1 .. 1000
LOOP
K := (I-1)*10000;
INSERT /*+APPEND */ INTO plch_employees
SELECT K+ROWNUM,
'ABC '||(K+ROWNUM),
'XYZ '||(K+ROWNUM),
TO_DATE('01011980','DDMMYYYY') + MOD(K+ROWNUM, 31*365),
10000 + MOD(K+ROWNUM, 20000)
FROM DUAL
CONNECT BY LEVEL <= 10000 ;
COMMIT;
END LOOP;
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:01:59.05
CREATE INDEX i_plch_employees
ON plch_employees (hire_date)
/
Index created.
Elapsed: 00:00:50.02
-- check execution plans without any statistics (will use dynamic sampling)
/* TEST1: selecting LAST_NAME */
EXPLAIN PLAN SET STATEMENT_ID='T1' FOR
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.07
select plan_table_output
from table(dbms_xplan.display('plan_table','T1','serial'))
/
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 326K| 50M| 6883 (10)| 00:00:27 |
|* 1 | TABLE ACCESS FULL| PLCH_EMPLOYEES | 326K| 50M| 6883 (10)| 00:00:27 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
Note
-----
- dynamic sampling used for this statement
19 rows selected.
Elapsed: 00:00:00.00
/* TEST2: selecting HIRE_DATE instead of LAST_NAME - index only access suffices */
EXPLAIN PLAN SET STATEMENT_ID='T2' FOR
SELECT hire_date
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
select plan_table_output
from table(dbms_xplan.display('plan_table','T2','serial'))
/
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 326K| 2866K| 869 (2)| 00:00:04 |
|* 1 | INDEX RANGE SCAN| I_PLCH_EMPLOYEES | 326K| 2866K| 869 (2)| 00:00:04 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
Note
-----
- dynamic sampling used for this statement
19 rows selected.
Elapsed: 00:00:00.00
-- gather statistics with 20% sampling
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(USER,'PLCH_EMPLOYEES',
cascade => TRUE,
method_opt => 'FOR ALL INDEXED COLUMNS',
estimate_percent => 20);
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:00:46.00
/* TEST3: selecting LAST_NAME */
EXPLAIN PLAN SET STATEMENT_ID='T3' FOR
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:01.04
select plan_table_output
from table(dbms_xplan.display('plan_table','T3','serial'))
/
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 315K| 12M| 6867 (10)| 00:00:26 |
|* 1 | TABLE ACCESS FULL| PLCH_EMPLOYEES | 315K| 12M| 6867 (10)| 00:00:26 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.03
/* TEST4: selecting HIRE_DATE instead of LAST_NAME - index only access suffices */
EXPLAIN PLAN SET STATEMENT_ID='T4' FOR
SELECT hire_date
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
select plan_table_output
from table(dbms_xplan.display('plan_table','T4','serial'))
/
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 315K| 2463K| 907 (2)| 00:00:04 |
|* 1 | INDEX RANGE SCAN| I_PLCH_EMPLOYEES | 315K| 2463K| 907 (2)| 00:00:04 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.00
-- gather full statistics - almost same results as with 20%
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(USER,'PLCH_EMPLOYEES',
cascade => TRUE,
method_opt => 'FOR ALL INDEXED COLUMNS',
estimate_percent => 99);
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:02:52.03
/* TEST5: selecting LAST_NAME */
EXPLAIN PLAN SET STATEMENT_ID='T5' FOR
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
select plan_table_output
from table(dbms_xplan.display('plan_table','T5','serial'))
/
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 315K| 12M| 6867 (10)| 00:00:26 |
|* 1 | TABLE ACCESS FULL| PLCH_EMPLOYEES | 315K| 12M| 6867 (10)| 00:00:26 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.03
/* TEST6: selecting HIRE_DATE instead of LAST_NAME - index only access suffices */
EXPLAIN PLAN SET STATEMENT_ID='T6' FOR
SELECT hire_date
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
select plan_table_output
from table(dbms_xplan.display('plan_table','T6','serial'))
/
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 315K| 2462K| 853 (2)| 00:00:04 |
|* 1 | INDEX RANGE SCAN| I_PLCH_EMPLOYEES | 315K| 2462K| 853 (2)| 00:00:04 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.00
DROP TABLE plch_employees
/
----------------------------------------------------------
CASE 2 -- data loaded clustered by HIRE_DATE
----------------------------------------------------------
CREATE TABLE plch_employees
(
employee_id INTEGER
, last_name VARCHAR2 (100)
, first_name VARCHAR2 (100)
, hire_date DATE
, salary NUMBER
)
/
Table created.
Elapsed: 00:00:02.02
DECLARE
K PLS_INTEGER;
N PLS_INTEGER := TRUNC(10000000 / (31*365));
BEGIN
FOR I IN 1 .. 1000
LOOP
K := (I-1)*10000;
INSERT /*+APPEND */ INTO plch_employees
SELECT K+ROWNUM,
'ABC '||(K+ROWNUM),
'XYZ '||(K+ROWNUM),
TO_DATE('01011980','DDMMYYYY') + FLOOR( (K+ROWNUM)/ N ),
10000 + MOD(K+ROWNUM, 20000)
FROM DUAL
CONNECT BY LEVEL <= 10000 ;
COMMIT;
END LOOP;
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:01:51.00
-- check correct (expected) distribution
SELECT COUNT(*)
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
COUNT(*)
----------
322295
1 row selected.
Elapsed: 00:00:13.04
CREATE INDEX i_plch_employees ON plch_employees (hire_date)
/
Index created.
Elapsed: 00:00:47.01
/* TEST1: selecting LAST_NAME --
now index is used, though wrong cardinality estimation ! */
EXPLAIN PLAN SET STATEMENT_ID='T1' FOR
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2500 | 393K| 26 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| PLCH_EMPLOYEES | 2500 | 393K| 26 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | I_PLCH_EMPLOYEES | 2500 | | 9 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
Note
-----
- dynamic sampling used for this statement
19 rows selected.
Elapsed: 00:00:00.04
/* TEST2: selecting HIRE_DATE instead of LAST_NAME - index only access suffices */
EXPLAIN PLAN SET STATEMENT_ID='T2' FOR
SELECT hire_date
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2500 | 22500 | 9 (0)| 00:00:01 |
|* 1 | INDEX RANGE SCAN| I_PLCH_EMPLOYEES | 2500 | 22500 | 9 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
Note
-----
- dynamic sampling used for this statement
19 rows selected.
Elapsed: 00:00:00.01
-- gather statistics with 20% sampling
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(USER,'PLCH_EMPLOYEES',
cascade => TRUE,
method_opt => 'FOR ALL INDEXED COLUMNS',
estimate_percent => 20);
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:00:48.02
/* TEST3: selecting LAST_NAME - now with correct (much higher) cardinality, index still used ! */
EXPLAIN PLAN SET STATEMENT_ID='T3' FOR
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 318K| 12M| 3160 (2)| 00:00:12 |
| 1 | TABLE ACCESS BY INDEX ROWID| PLCH_EMPLOYEES | 318K| 12M| 3160 (2)| 00:00:12 |
|* 2 | INDEX RANGE SCAN | I_PLCH_EMPLOYEES | 318K| | 861 (2)| 00:00:04 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.00
/* TEST4: selecting HIRE_DATE instead of LAST_NAME - index only access suffices */
EXPLAIN PLAN SET STATEMENT_ID='T4' FOR
SELECT hire_date
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 318K| 2490K| 861 (2)| 00:00:04 |
|* 1 | INDEX RANGE SCAN| I_PLCH_EMPLOYEES | 318K| 2490K| 861 (2)| 00:00:04 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.00
-- gather full statistics - almost same results as with 20%
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(USER,'PLCH_EMPLOYEES',
cascade => TRUE,
method_opt => 'FOR ALL INDEXED COLUMNS',
estimate_percent => 99);
END;
/
PL/SQL procedure successfully completed.
Elapsed: 00:02:42.05
/* TEST5: selecting LAST_NAME -
now with correct (much higher) cardinality, index still used ! */
EXPLAIN PLAN SET STATEMENT_ID='T5' FOR
SELECT last_name
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 321K| 12M| 3035 (2)| 00:00:12 |
| 1 | TABLE ACCESS BY INDEX ROWID| PLCH_EMPLOYEES | 321K| 12M| 3035 (2)| 00:00:12 |
|* 2 | INDEX RANGE SCAN | I_PLCH_EMPLOYEES | 321K| | 868 (2)| 00:00:04 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.00
/* TEST6: selecting HIRE_DATE instead of LAST_NAME - index only access suffices */
EXPLAIN PLAN SET STATEMENT_ID='T6' FOR
SELECT hire_date
FROM plch_employees
WHERE hire_date BETWEEN TO_DATE ('01-01-2010', 'MM-DD-YYYY')
AND TO_DATE ('12-31-2010', 'MM-DD-YYYY')
/
Explained.
Elapsed: 00:00:00.00
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 321K| 2511K| 868 (2)| 00:00:04 |
|* 1 | INDEX RANGE SCAN| I_PLCH_EMPLOYEES | 321K| 2511K| 868 (2)| 00:00:04 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("HIRE_DATE">=TO_DATE(' 2010-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "HIRE_DATE"<=TO_DATE(' 2010-12-31 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
15 rows selected.
Elapsed: 00:00:00.00