程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
 程式師世界 >> 數據庫知識 >> SyBase數據庫 >> SyBase教程 >> 解釋直方圖信息

解釋直方圖信息

編輯:SyBase教程

解釋直方圖信息


解釋直方圖信息

 

適用於:
Oracle Database - Enterprise Edition - Version 7.3.0.0 and later
Oracle Database - Standard Edition - Version 7.3.0.0 and later
Oracle Database - Personal Edition - Version 7.3.0.0 and later
Information in this document applies to any platform.

目的:
直方圖信息是怎麼被存儲的,是怎麼被解釋的。

范圍:
其他有用的直方圖參考:
Document 1445372.1 Histograms: An Overview (10g and Above)

細節:
直方圖是一種機制,該機制用來存儲 列數據(column data)的詳細信息。該數據被CBO使用,用來決定一個查詢語句最優化的訪問路徑(access path).
沒有直方圖時,優化器依靠的所有信息是:一個列的高值和低值,該列的不同值個數,該列的空值個數,該table的記錄總數。
(實際上列的高值和低值是以raw 格式存儲的,因此不是特別有用),其他的信息可以從dictionary views中查詢到。

沒有列的統計信息時,優化器假設數據是均衡分布的,對於等值謂詞,生成以一個選擇率(column selectivity),該選擇率是如下計算的:1/NVD(Number of Distinct Values)

有直方圖時,你可以訪問行數據的更多分布信息。

當一個列的數據分布不均衡時(即:列的數據分布 高度傾斜--數據分布傾斜的很厲害),Oracle 可以存儲列的直方圖以給出更好的選擇率.這會產生比使用標准的統計信息(high and low values plus Number of Distinct Values)更好的執行計劃

就具體實現而言(In terms of implementation),我們可以選擇 將 每個不同值和該值的記錄數存放在一起,對於值很少的記錄數是有效的,此時,'width balanced' histograms 被使用。

隨著不同值數量的增長,存儲數據的數量變得過高,我們需要使用一個不同的方法來存儲直方圖數據。此時,我們可以選擇 height balanced histograms.

使用如上兩種方法,列直方圖提供了一個有效和集中的方法來展現數據分布。當建立直方圖時,存儲的信息依靠“不同值的數量是否小於等於bucket(默認75個,最大254個)的數量”進行不同的解釋。

如果不同值的數量小於等於直方圖bucket的數量(bucket最多254個),那麼 Frequency Histogram 被建立
如果不同值的數量大於直方圖bucket的數量, Height Balanced Histogram 被建立。


Frequency Histogram
Frequency Histogram 使用bucket來記錄每一個不同值的記錄個數

Height Balanced Histogram
Height Balanced Histogram 通過把數據分割到不同bucket中來實現。每個bucket 包括相同數量的列值。每個bucket中的最高值(or END_POINT)和最低值被記錄在零號bucket中。

一旦數據被存儲於bucket中,我們可以識別兩個類型的data value--- Non-popular values and popular values

Non-popular values--are those that do not occur multiple times as end points.不會出現多次
Popular values--occur multiple times as end points.會出現多次。

We can use Popular and Non-Popular Values to provide use with various statistics.Since we know how many values there are in a bucket we can use this information to estimate the number of rows in total that are covered by Popular and Non-Popular values.
?The selectivity for popular values can be obtained by calculation the proportion of bucket endpoints filled by that popular value.
?The selectivity for non popular values can now be calculated as 1/number non-popular bucket endpoints, so we can now be more accurate about selectivities than the original 1/NDV, because we have removed the popular values from the equation.


How histograms are used
直方圖被用來得到column predicate 更好的selectivity 估算

Where there are fewer distinct values than buckets, the selectivity is simply calculated as we have accurate row information for each value. For the case where we have more distinct values than buckets, the following outlines how these selectivities are obtained.

Equality Predicate Selectivity calculated from:
?Popular Value:
Number of buckets for value / Total Number of buckets
?Non-Popular Value:
Density see:


Document 43041.1 Query Optimizer: What is Density?


Less than < (Same principle applies for > & >= )
?All Values:
Buckets with endpoints < value / Total No. of buckets

 

Histogram Examples

 

 Table TAB1

SQL> desc tab1
 Name                            Null?    Type
 ------------------------------- -------- ----
 A                                        NUMBER(6)
 B                                        NUMBER(6)

 

Column A contains unique values from 1 to 10000.
Column B contains 10 distinct values.

The value '5' occurs 9991 times.
Values '1, 2, 3, 4, 9996, 9997, 9998, 9999, 10000' occur only once.

i.e.

 

select distinct B , count(*)
from HTAB1
group by B
order by B
;

         B   COUNT(*)
---------- ----------
         1          1
         2          1
         3          1
         4          1
         5       9991
      9996          1
      9997          1
      9998          1
      9999          1
     10000          1

10 rows selected.

 

There is an index on Column B.
Statistics are gathered without Histograms using:

 

exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE 1');

 

Setup:

 

drop table HTAB1;
create table HTAB1 (a number, b number);

  Insert into HTAB1 ( A,B) values ( 1,1);
  Insert into HTAB1 ( A,B) values ( 2,2);
  Insert into HTAB1 ( A,B) values ( 3,3);
  Insert into HTAB1 ( A,B) values ( 4,4);
  Insert into HTAB1 ( A,B) values ( 9996,9996);
  Insert into HTAB1 ( A,B) values ( 9997,9997);
  Insert into HTAB1 ( A,B) values ( 9998,9998);
  Insert into HTAB1 ( A,B) values ( 9999,9999);
  Insert into HTAB1 ( A,B) values ( 10000,10000);

commit;
begin
 for i in 5 .. 9995 loop
  Insert into HTAB1 ( A,B)
values ( i,5);
  if (mod(i,100) = 0) then
     commit;
  end if;
 end loop;
 commit;
end;
/
commit;


create index HTAB1_B on HTAB1(b);
exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE 1');

alter session set OPTIMIZER_DYNAMIC_SAMPLING = 0;

 

Function to convert raw data in to numeric data:

create or replace function raw_to_number(my_input raw)
return number
as
    my_output number;
begin
    dbms_stats.convert_raw_value(my_input,my_output);
    return my_output;
end;
/  

This results in statistics as follows:

 

column COLUMN_NAME format a5 heading COL
column NUM_DISTINCT format 99990
column LOW_VALUE format 99990
column HIGH_VALUE format 99990
column DENSITY format 99990
column NUM_NULLS format 99990
column NUM_BUCKETS format 99990
column SAMPLE_SIZE format 99990
select COLUMN_NAME,NUM_DISTINCT,raw_to_number(LOW_VALUE) Low,raw_to_number(HIGH_VALUE) High,DENSITY,NUM_NULLS,
       NUM_BUCKETS,LAST_ANALYZED,SAMPLE_SIZE,HISTOGRAM
from user_tab_columns
where table_name = 'HTAB1';

COL   NUM_DISTINCT        LOW       HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ---------- ---------- ------- --------- ----------- -------------------- ----------- ---------------
A            10000          1      10000       0         0           1 31-jan-2013 09:32:08       10000 NONE
B               10          1      10000       0         0           1 31-jan-2013 09:32:08       10000 NONE


select lpad(TABLE_NAME,10) TAB, lpad(COLUMN_NAME, 10) COL,
 ENDPOINT_NUMBER, ENDPOINT_VALUE
from user_histograms
where table_name='HTAB1'
order by COL, ENDPOINT_NUMBER;


TAB        COL        ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
     HTAB1          A               0              1
     HTAB1          A               1          10000
     HTAB1          B               0              1
     HTAB1          B               1          10000

 

In the above you can see that the statistics gathering has not created a histogram. There is a single bucket and high and a low ENDPOINT_NUMBER for each column value ( you will always get 2 entries in USER_HISTOGRAMS for each column, for the high and low values respectively).

 

Test queries:

select * from htab1 where b=5;
select * from htab1 where b=3;

To replicate the tests you will need to disable OPTIMIZER_DYNAMIC_SAMPLING

 

alter session set OPTIMIZER_DYNAMIC_SAMPLING = 0;

 

See:

Document 336267.1 Optimizer Dynamic Sampling (OPTIMIZER_DYNAMIC_SAMPLING)

Without Histograms, both queries do an INDEX RANGE SCAN because the optimizer believes that the data is uniformly distributed in column B and that each predicate with return 1/10th of the values because there are 10 distinct values:

 

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |  1111 |  6666 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| HTAB1   |  1111 |  6666 |     5   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | HTAB1_B |  1111 |       |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

 

In fact it may be preferable to use a Full Table Scan for the select where b=5 and index lookups for the others.

 

Gathering Histogram Statistics

If we collect histogram statistics with the recommended settings:

 

exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE AUTO'); 

 

The b=5 query now does a Full Table Scan

 

 select * from htab1 where b=5;

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |  9991 | 69937 |     7   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| HTAB1 |  9991 | 69937 |     7   (0)| 00:00:01 |
---------------------------------------------------------------------------

 

The query where B is 3 still uses an index:

 

 select * from htab1 where b=3;

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     7 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| HTAB1   |     1 |     7 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | HTAB1_B |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

 

This is because a FREQUENCY Histogram has been created:

 

COL   NUM_DISTINCT        LOW       HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ---------- ---------- ------- --------- ----------- -------------------- ----------- ---------------
A            10000          1      10000       0         0           1 31-jan-2013 09:58:01       10000 NONE
B               10          1      10000       0         0          10 31-jan-2013 09:58:01       10000 FREQUENCY

TAB        COL        ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
     HTAB1          A               0              1
     HTAB1          A               1          10000
     HTAB1          B               1              1
     HTAB1          B               2              2
     HTAB1          B               3              3
     HTAB1          B               4              4
     HTAB1          B            9995              5
     HTAB1          B            9996           9996
     HTAB1          B            9997           9997
     HTAB1          B            9998           9998
     HTAB1          B            9999           9999
     HTAB1          B           10000          10000

12 rows selected.

 

On Column B there are 10 buckets matching up with the 10 distinct values.

The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER shows the cumulative number of rows. So the number of rows for ENDPOINT_VALUE 2, it has an ENDPOINT_NUMBER 2, the previous ENDPOINT_NUMBER is 1, hence the number of rows with value 2 is 1. Another example is ENDPOINT_VALUE 5. Its ENDPOINT_NUMBER is 9995. The previous bucket ENDPOINT_NUMBER is 4, so 9995 - 4 = 9991 rows containing the value 5.

Frequency histograms work fine with a low number of distinct values, but when the number exceeds the maximum number of buckets, you cannot create a bucket for each value. In this case the Optimizer creates Height balanced histograms.

 

Height Balanced Histograms

You can demonstrate this situation by forcing the optimizer to create fewer buckets than the Number of Distinct Values. i.e. using 8 buckets for 10 Distinct Values:

 

exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR COLUMNS B SIZE 8'); 

 

So now we have gathered a HEIGHT BALANCED HISTOGRAM for Column B:

 

COL   NUM_DISTINCT        LOW       HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ---------- ---------- ------- --------- ----------- -------------------- ----------- ---------------
A            10000          1      10000       0         0           1 31-jan-2013 09:58:01       10000 NONE
B               10          1      10000       0         0           8 31-jan-2013 09:59:09       10000 HEIGHT BALANCED

TAB        COL        ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
     HTAB1          A               0              1
     HTAB1          A               1          10000
     HTAB1          B               0              1
     HTAB1          B               7              5
     HTAB1          B               8          10000

 


Notice that there are 8 Buckets against B now.

Oracle puts the same number of values in each bucket and records the endpoint of each bucket.

With HEIGHT BALANCED Histograms, the ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is the endpoint value of the bucket determined by the column value.

From the above, bucket 0 holds the low value for the column.

Because buckets 1-7 have the same endpoint, Oracle does not store all these rows to save space. But we have: bucket 1 with an endpoint of 5, bucket 2 with an endpoint of 5, bucket 3 with an endpoint of 5, bucket 4 with an endpoint of 5, bucket 5 with an endpoint of 5, bucket 6 with an endpoint of 5, bucket 7 with an endpoint of 5 AND bucket 8 with an endpoint of 10000 So bucket 1 contains values between 1 and 5, bucket 8 contains values between 5 and 10000.

All buckets contain the same number of values (which is why they are called height-balanced histograms), except the last bucket may have fewer values then the other buckets.

 

Storing Character Values in Histograms

For character columns, Oracle only stores the first 32 bytes of any string (there are also limits on numeric columns, but these are less frequently an issue since the majority of numbers are insufficiently large to encounter any problems). See:

Document 212809.1 Limitations of the Oracle Cost Based Optimizer

Any predicates that contain strings greater than 32 characters will not use histogram information and the selectivity will be 1 / Number of DISTINCT Values. Data in histogram endpoints is normalized to double precision floating point arithmetic.

 

For Example

 

SQL> select * from example;

A
----------
a
b
c
d
e
e
e
e
  

 

The table contains 5 distinct values. There is one occurence of 'a', 'b', 'c' and 'd' There are 4 occurrences of 'e'. If we create a histogram: Looking in user_histograms:

 

TABLE      COL   ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
   EXAMPLE     A               1     5.0365E+35
   EXAMPLE     A               2     5.0885E+35
   EXAMPLE     A               3     5.1404E+35
   EXAMPLE     A               4     5.1923E+35
   EXAMPLE     A               8     5.2442E+35

 

So:

 

 ENDPOINT_VALUE 	5.0365E+35 represents a
			5.0885E+35 represents b
			5.1404E+35 represents c
			5.1923E+35 represents d
			5.2442E+35 represents e

 

Then, if you look at the cumulative values for ENDPOINT_NUMBER, the corresponding ENDPOINT_VALUE's are correct.

 

 

  1. 上一頁:
  2. 下一頁:
Copyright © 程式師世界 All Rights Reserved