DB2 Query Optimization: Custom Stats Collection

One of the challenges in query optimization is to make sure that the query optimizer works with accurate estimates on cost of the different possible decision branches in the optimization process. The accuracy of the estimates depends on the efficiency of the stats collection.

DB2 assumes the data distribution is constant distribution. And that’s why there’s numfreq and numquantile options in RUNSTATS:

RUNSTATS ON TABLE table_name WITH DISTRIBUTION ON COLUMNS ( col1 NUM_FREQVALUES N NUM_QUANTILES N, ...);

with numfreq, we pickup the top N values with most occurance (ex, if the data distribution is 1,1,1,1,1,1,2,2,2,3,3,5,6,7,8. with numfreq=3, we know “1” happened 6 times, “2” happened 3 times and “3” happened 2 times. So if the query predicate is looking for “1” or “2”, we’ll be get the correct estimation if stats is up to date).
with numquantile, it breaks the entire data distribution into N blocks, for example, if the distribution is “1,1,1,1,1,1,2,2,2,3,3,5,6,7,8”, when there’s no numfreq defined, and numquantile defines to 3, so it will be “1,1,1,1,1 | 1,2,2,2,3 | 3,5,6,7,8” We record the min/max for each block, and assume normal distribution inside each block. So if we are looking for c1=1, then we know the entire block 0 has 1, and lowest value in block 1 is 1, highest is 3, then the estimation will be 5 + 5/(3-1)=7.5. Same idea applied if c1=6, the search will be in block2, so the estimation could be 5/(8-3)=1

Once a good set of parameters is defined for RUNSTATS, you need to make sure that the automated maintenance doesn’t override the correct distributions with the default stats collection parameters. You need to set up RUNSTATS PROFILE to define the autorunstats behavior. Otherwise the default behavior will be

RUNSTATS ON TABLE DB2USER.EMPLOYEE WITH DISTRIBUTION AND SAMPLED DETAILED INDEXES ALL

You can refer the following paper about RUNSTATS PROFILE:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0509poon/
At this point, you might think: is there a way to automatically calculate based on statistics the correct parameters for the RUNSTATS sampling? According to DB2 engineer Tao E Wang: “Usually I don’t have good method to get the exact value of numfreq/numquantile just based on calculation. There are too many variables in optimizer to consider, so usually if we want the optimizer pickup a specific index, we need to study the plan and understand why/how each filter factor coming from, and sometime we need to even setup optimizer profile to force the plan in order to study what the cost estimation looks like if it pickup the specified index… So basically I’d say numfreq/numquantile are the variables to “play with”. Once we understand the bad plan is caused by incorrect estimation, we can try to increase those 2 values until optimizer successfully get accurate estimation…”

More information is available here:
num_freqvalues – Number of frequent values retained configuration parameter
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.config.doc/doc/r0000331.html

num_quantiles – Number of quantiles for columns configuration parameter
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.config.doc/doc/r0000334.html

Collecting distribution statistics for specific columns
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.perf.doc/doc/t0005073.html

Videos

DB2 Query Optimization: Custom Stats Collection

You may also like