Understanding Oracle Histograms

Histograms are a feature of the Cost Based Optimizer (CBO) that helps to know the data distribution in a column. Histograms should be created in columns with highly skew data and are included in the filter clause (WHERE) of a SQL statement.
Histograms play a very important role in the creation of the execution plan and we must be careful in the method used to gather statistics in the DB objects.
Let me show you how histograms work, for this test I have created a table T_HIST with some data in it:
 SQL> CREATE TABLE T_HIST AS SELECT LEVEL C1, mod(round(dbms_random.value(1,100000)),254) c2,mod(round(dbms_random.value(1,100000)),254) c3, ‘C’ C4 from dual connect by level < 529101;

SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,100000)),254) c2,
mod(round(dbms_random.value(1,100000)),254) c3, ‘P’ from dual connect by level < 301;

300 rows created.

SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,100000)),254) c2,
mod(round(dbms_random.value(1,100000)),254) c3, ‘R’ from dual connect by level < 301;

300 rows created.

SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,100000)),254) c2,
mod(round(dbms_random.value(1,100000)),254) c3, ‘T’ from dual connect by level < 301;

300 rows created.

SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,100000)),254) c2,
  2  mod(round(dbms_random.value(1,100000)),254) c3, ‘A’ from dual connect by level < 500001;

500000 rows created.

SQL> COMMIT;

Commit complete.

SQL>

We will now create an index in the column C4:
SQL> CREATE INDEX T_HIST_IDX1 ON T_HIST(C4);

Index created.

SQL>

Let us see how the data is distributed inside column C4:
SQL> SELECT C4,COUNT(*) FROM T_HIST GROUP BY C4 ORDER BY C4;
C   COUNT(*)
– ———-
A     500000
C     529100
P        300
R        300
T        300
You can see that we have 5 distinct values in column C4 (A, C, P, R & T) and the number of values that correspond to each. Let us take value ‘R’ as example, if we query the table the expected execution plan is to use the Index created, because we need only 300 rows of the 1030000 rows total. Let’s query:
SQL> explain plan for select /* test23 */ *
  2  from T_HIST
  3  where      C4 = ‘R’
  4  order by
  5      C1
  6  ;
Explained.
SQL>
SET LINESIZE 130
SET PAGESIZE 0
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY);  2    3    4    5    6
Explained.

SQL> SQL> SQL> SQL>   2
Plan hash value: 3146893614

——————————————————————————————–
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
——————————————————————————————–
|   0 | SELECT STATEMENT             |             |    30 |  1260 |   387   (1)| 00:00:05 |
|   1 |  SORT ORDER BY               |             |    30 |  1260 |   387   (1)| 00:00:05 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T_HIST      |    30 |  1260 |   386   (0)| 00:00:05 |
|*  3 |    INDEX RANGE SCAN          | T_HIST_IDX1 |  3607 |       |   375   (0)| 00:00:05 |
——————————————————————————————–

Predicate Information (identified by operation id):
—————————————————

   3 – access(“C4″=’R’)

Note
—–
   – dynamic sampling used for this statement (level=2)

What happened? Execution plan is saying that our query will return only 30 rows! obviously this is totally wrong; also look at the estimated elapsed time (5 sec). Ok, we are being unfair to the CBO as we are not giving all the information required, so let’s gather statistics but without histograms:
SQL> exec DBMS_STATS.GATHER_TABLE_STATS (ownname => ‘ALFRED’, tabname=>’T_HIST’, estimate_percent=>dbms_stats.auto_sample_size,method_opt=>’for all columns size 1′);
PL/SQL procedure successfully completed.
SQL> explain plan for select /* test23 */ *
  2  from T_HIST
  3  where      C4 = ‘R’
  4  order by
  5      C1
  6  ;
Explained.
SQL>
SQL> SET LINESIZE 130
SQL> SET PAGESIZE 0
SQL> SELECT *
  2  FROM   TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 2911980760
————————————————————————————-
| Id  | Operation          | Name   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
————————————————————————————-
|   0 | SELECT STATEMENT   |        |   206K|  3017K|       |  1832   (2)| 00:00:22 |
|   1 |  SORT ORDER BY     |        |   206K|  3017K|  5672K|  1832   (2)| 00:00:22 |
|*  2 |   TABLE ACCESS FULL| T_HIST |   206K|  3017K|       |   774   (2)| 00:00:10 |
————————————————————————————-
Predicate Information (identified by operation id):
—————————————————
   2 – filter(“C4″=’R’)
What happened? Now the CBO decided to choose a Full Table Scan, it also states that will return 206k rows and the estimated elapsed time is 22 sec. That’s because we used AUTO_SAMPLE_SIZE so Oracle we decide how many rows to sample and for sure values of ‘A’ & ‘C’ represent more than 99% of the samples. But how can we tell the CBO that our data distribution is skewed? that’s where histograms enter into play. Let’s now gather statistics using histogram for column C4:
SQL> exec DBMS_STATS.GATHER_TABLE_STATS (ownname => ‘ALFRED’, tabname=>’T_HIST’, estimate_percent=>dbms_stats.auto_sample_size,method_opt=>’for columns C4 size 254′);
PL/SQL procedure successfully completed.
SQL> explain plan for select /* test23 */ *
  2  from T_HIST
  3  where      C4 = ‘R’
  4  order by
  5      C1
  6  ;
Explained.
SQL>
SQL> SET LINESIZE 130
SQL> SET PAGESIZE 0
SQL> SELECT *
  2  FROM   TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 3146893614
——————————————————————————————–
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
——————————————————————————————–
|   0 | SELECT STATEMENT             |             |   941 | 14115 |     8  (13)| 00:00:01 |
|   1 |  SORT ORDER BY               |             |   941 | 14115 |     8  (13)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T_HIST      |   941 | 14115 |     7   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T_HIST_IDX1 |   941 |       |     4   (0)| 00:00:01 |
——————————————————————————————–
Predicate Information (identified by operation id):
—————————————————
   3 – access(“C4″=’R’)
Now is using the Index again, states that will return 941 rows but look how the cost decreased and the estimated elapsed time to 1 sec! Even if the CBO wasn’t able to calculate the real number of rows being returned the cost was fixed, in order to have the most accurate execution plan we must increase the sample size. Let us make a test with 100% of the samples:
SQL> exec DBMS_STATS.GATHER_TABLE_STATS (ownname => ‘ALFRED’, tabname=>’T_HIST’, estimate_percent=>100,method_opt=>’for columns C4 size 254′);
PL/SQL procedure successfully completed.
SQL> explain plan for select /* test23 */ *
  2  from T_HIST
  3  where      C4 = ‘R’
  4  order by
  5      C1
  6  ;
Explained.
SQL>
SQL> SET LINESIZE 130
SQL> SET PAGESIZE 0
SQL> SELECT *
  2  FROM   TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 3146893614
——————————————————————————————–
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
——————————————————————————————–
|   0 | SELECT STATEMENT             |             |   300 |  4500 |     5  (20)| 00:00:01 |
|   1 |  SORT ORDER BY               |             |   300 |  4500 |     5  (20)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T_HIST      |   300 |  4500 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T_HIST_IDX1 |   300 |       |     3   (0)| 00:00:01 |
——————————————————————————————–
Predicate Information (identified by operation id):
—————————————————
   3 – access(“C4″=’R’)
Now is totally accurate! Right? The execution plan generated by the CBO is using the Index, the estimated rows are 300 which is correct and the cost is even better. This is an example of a very small table, but is really an option to gather statistics with estimate of 100% to a big table?, I don’t think so; that’s why AUTO_SAMPLE_SIZE can be a good balance between sample size and accuracy.
Column C4 now is having a frequency histogram because is having less than 254 distinct values (NDV), only 5; that means that frequency histogram has the restriction of 254 buckets (254 NDV’s), seriously?. Let test it, we will gather statistics to all columns of the table using AUTO_SAMPLE_SIZE and AUTO for histograms:
SQL> exec dbms_stats.gather_table_stats(ownname=>’ALFRED’,tabname=>’T_HIST’, estimate_percent=>null, method_opt=>’FOR ALL COLUMNS SIZE AUTO’);
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = ‘T_HIST’;
C1                                   529100  .00000189           1 NONE
C2                                      254 .003937008           1 NONE
C3                                      254 4.8544E-07         254 FREQUENCY
C4                                        5 4.8544E-07           5 FREQUENCY
What happened?, well column C3 was also candidate for a histogram because it’s density (skewed) but look at the number of buckets is 254; that means that we are on the edge of the limit in order to have a histogram in this column. Don’t believe me? let’s make a quick test, we will insert a new distinct value into this column, gather statistics again and see what happen:
SQL> insert into T_HIST values (100001,42, 99999999999,’B’);
1 row created.
SQL> COMMIT;
Commit complete.
SQL> exec dbms_stats.gather_table_stats(ownname=>’ALFRED’,tabname=>’T_HIST’, estimate_percent=>null, method_opt=>’FOR ALL COLUMNS SIZE AUTO’);
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = ‘T_HIST’;
C1                                   529100  .00000189           1 NONE
C2                                      254 .003937008           1 NONE
C3                                      255 .003921569           1 NONE
C4                                        6 4.8544E-07           6 FREQUENCY
Can you see how the number of distinct values (NVD) changed? now we can’t have a histogram in that column, really?; well we can force it but look how low is our density. Let’s try to force it and specify to the gather stats that we want all columns with maximum of 254 buckets:
SQL> exec dbms_stats.gather_table_stats(ownname=>’ALFRED’,tabname=>’T_HIST’, estimate_percent=>null, method_opt=>’FOR ALL COLUMNS SIZE 254′);
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = ‘T_HIST’;
C1                                   529100 1.9186E-06         254 HEIGHT BALANCED
C2                                      254 4.8544E-07         254 FREQUENCY
C3                                      255 .003935563         254 HEIGHT BALANCED
C4                                        6 4.8544E-07           6 FREQUENCY
Look how it works? it can create frequency histogram on columns with less than 254 NDV’s and height balanced histogram in columns with more than 254 NDV’s. Also note the first example, we used size of 1 (1 bucket) that means no histogram because all values are inside only one bucket.
This is a basic example of how histograms work, honestly I was having a lot of doubts before making this test and some of them are clarified reading Jonathan Lewis new post in allthinngsoracle.com, however there’s a lot of path to walk through and even difficult as 12c have 2 more types of histograms to help CBO.

Thanks,
Alfredo