Questions posed 25 years ago remain unsolved; researchers keep attacking the problems they *can* attack rather than the ones they *should*. Interesting!
The post argues that cardinality estimation is the Achilles' heel of query optimization, yet much prior research has focused on new structures for estimating individual local predicates, which the author considers largely wasted effort.
He then lays out which problems actually deserve attention.
Is Query Optimization a “solved” problem? If not, are we attacking the “right” problems?
How should we identify the “right” problems to solve?
I asked these same questions almost exactly 25 years ago, in an extended abstract for a Workshop on Database Query Optimization
that was organized by the then-Professor Goetz Graefe at the Oregon Graduate Center [Grae 89a].
Remarkably and quite regrettably, most of the issues and critical unsolved problems I identified in that brief rant remain true today.
Researchers continue to attack the wrong problems,
IMHO: they attack the ones that they can, i.e., that they have ideas for, rather than the ones that they should, i.e., that are critical to successfully modeling the true cost of plans and choosing a good one. Perhaps more importantly, that will avoid choosing a disastrous plan!
At the risk of repeating myself, I’d like to revisit these issues, because I’m disappointed that few in the research community have taken up my earlier challenge.
The root of all evil, the Achilles Heel of query optimization, is the estimation of the size of intermediate results, known as cardinalities.
Everything in cost estimation depends upon how many rows will be processed, so the entire cost model is predicated upon the cardinality model.
In my experience, the cost model may introduce errors of at most 30% for a given cardinality, but the cardinality model can quite easily introduce errors of many orders of magnitude!
I’ll give a real-world example in a moment.
With such errors, the wonder isn’t “Why did the optimizer pick a bad plan?” Rather, the wonder is “Why would the optimizer ever pick a decent plan?”
“Well,” you say, “we’ve seen lots of improvements in histograms, wavelets, and other statistics since 1989. Surely we do better now.”
There’s been no shortage of such papers, it’s true, but the wealth of such papers precisely illustrates my point.
Developing new histograms that improve selectivity estimation for individual local predicates of the form “Age BETWEEN 47 AND 63” by a few percent doesn’t really matter,
when other, much larger errors that are introduced elsewhere in cardinality estimation dwarf those minor improvements.
It’s simple engineering, folks. If I have to review one more such paper on improved histograms for local predicates, I’ll scream (and reject it)! It just doesn’t matter!
What we have now is good enough.
What still introduces the most error in cardinality estimation is
(a) host variables and parameter markers,
(b) the selectivity of join predicates, and, even more significantly,
(c) how we combine selectivities to estimate the cardinality.
Amazingly, these three topics have also enjoyed the least research attention, or at least the fewest papers attempting to solve them, unless I’ve missed some major contributions lately.
I’ll visit each of these topics in turn, describing the fundamental causes of errors, and why those errors can easily reach disastrous proportions, illustrated by war stories from real customer situations.
Host variables and parameter markers
This is the problem of templated (parameterized) statements.
My understanding: if each SQL statement were optimized on arrival, there would be no issue, since the statement text contains the concrete parameter values.
The problem arises when estimating from an SQL template, e.g. a PREPAREd statement, before the parameter values are known.
If the optimizer then assumes a uniform distribution and uses an average cardinality, skewed data will produce enormous errors.
Host variables, parameter markers, and special registers occur in SQL queries because applications on top of the DBMS, not humans, invoke most queries, unlike in academic papers.
Such applications typically get the constants used in predicates from a “fill in the blank” field on a web page, for example. The SQL predicate then looks like “Age BETWEEN :hv1 AND :hv2”.
At compile time, the optimizer has no clue what :hv1 and :hv2 are, so it cannot look them up in those wonderful histograms that we all have polished.
Instead, it must make a wild guess at the average or likely values, which may be far off from the actual run-time values, vary from one execution to the next, or be badly skewed.
A war story illustrates this point.
One of our major ISVs retrofitted a table with a field that identified which subsystem each row came from.
It had 6 distinct values, but 99.99% of the rows had the value of the founding subsystem, i.e., when there was only one.
A predicate on this subsystem column was added to every query, with the value being passed as a host variable.
Not knowing that value a priori, DB2’s optimizer used the average value of 1/|distinct values| = 0.167,
though that predicate’s true selectivity was usually 0.9999 (not selective at all) and once in a blue moon was 0.0001 (extremely selective).
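The magnitude of that mis-estimate is easy to work out. A minimal sketch (the function names are mine, purely illustrative; the numbers come from the war story above):

```python
# Gap between the optimizer's uniform-distribution guess for a host
# variable and the skewed reality of the subsystem column.

def default_guess(num_distinct_values: int) -> float:
    """With no host-variable value at compile time, assume uniformity: 1/NDV."""
    return 1.0 / num_distinct_values

def error_factor(estimate: float, actual: float) -> float:
    """How far off the estimate is, as a multiplicative factor."""
    return max(estimate, actual) / min(estimate, actual)

guess = default_guess(6)                      # DB2's 1/6 ~= 0.167
print(round(guess, 3))                        # 0.167
print(round(error_factor(guess, 0.9999), 1))  # ~6x off for the common value
print(round(error_factor(guess, 0.0001)))     # ~1667x off for a rare value
```

Either way the guess is wrong; the only question is by how much, and in which direction.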
There has been some work on this so-called Parametric Query Optimization (PQO),
though it sometimes attacks the problem of other parameters that are unknown at compilation time (e.g., the number of buffer pages available), or is limited to discrete values [Ioan 97].
One of my favorites is a fascinating empirical study by Reddy and Haritsa [Redd 05] of plan spaces for several commercial query optimizers as the selectivity of multiple local predicates are varied.
It demonstrated (quite colorfully!) that regions in which a particular plan is optimal may not be convex and may even be disconnected!
Graefe suggested keeping a different plan for each possible value. That is an intuitive idea, but it is hard to implement when there are many values; grouping values and keeping one plan per group is more practical.
Graefe suggested keeping a different plan for each possible value of each host variable [Grae 89b],
but with multiple host variables and a large number of possible values,
Graefe’s scheme quickly becomes impractical to optimize, store, and decide among at run time across the large cross-product of possible plans, without grouping them into regions having the same plan [Stoy 08].
DB2's approach is to re-optimize for different values, with two options available: REOPT(ALWAYS) and REOPT(ONCE).
Version 5 of DB2 for OS/390 (shipped June 1997) developed a practical approach to force re-optimization for host variables, parameter markers,
and special registers by adding new SQL bind options REOPT(ALWAYS) and REOPT(ONCE).
The latter re-optimizes the first time that the statement is executed with actual values for the parameters, and assumes that those values will be “typical”,
whereas the former forces re-optimization each time the statement is run.
Later, a REOPT(AUTO) option was added to autonomically determine if re-optimization is needed, based upon the change in the estimated filter factors from the last re-optimization’s plan.
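The REOPT(AUTO) decision could be sketched, very loosely, as a drift check on filter factors. This is my own illustrative heuristic, not DB2's actual algorithm:

```python
# Hypothetical REOPT(AUTO)-style heuristic: re-optimize when any
# predicate's estimated filter factor has drifted by more than
# `threshold`x since the current plan was chosen.

def should_reoptimize(prev_ffs, new_ffs, threshold=10.0):
    for prev, new in zip(prev_ffs, new_ffs):
        lo, hi = min(prev, new), max(prev, new)
        if lo > 0 and hi / lo > threshold:
            return True
    return False

# Plan was built assuming the 1/6 uniform guess for the subsystem column:
print(should_reoptimize([1/6], [0.9999]))  # False: only ~6x drift, keep the plan
print(should_reoptimize([1/6], [0.0001]))  # True: ~1667x drift, re-optimize
```

The point of such an option is to pay the re-optimization cost only when the new parameter values would plausibly change the plan.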
Selectivity of join predicates
The problem of join predicates: predicates spanning columns of multiple tables?
The paucity of innovation in calculating join predicate selectivities is truly astounding.
Most extant systems still use the techniques pioneered by System R for computing the selectivity of a join
as the inverse of the maximum of the two join-column cardinalities [Seli 79], which essentially assumes that the domain of one column is a subset of the other. (Note: this is the containment assumption; without it one would need the actual overlap of the two domains, which is too expensive to compute.)
While this assumption is valid for key domains having referential integrity constraints, in general the overlap of the two sets may vary greatly depending upon the semantics of the joined domains.
Furthermore, the common practice of pre-populating a dimension table such as “Date” with 100 years of dates into the future can incorrectly bias this calculation when the fact table initially has only a few months of data.
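The System R formula, and how badly the containment assumption can miss when the two domains barely overlap, can be sketched as follows (illustrative code, not any product's implementation):

```python
from collections import Counter

def estimate_join_rows(rows_l: int, rows_r: int, ndv_l: int, ndv_r: int) -> float:
    """System R estimate: |L| * |R| / max(NDV(L.a), NDV(R.b))."""
    return rows_l * rows_r / max(ndv_l, ndv_r)

def actual_join_rows(col_l, col_r) -> int:
    """Brute-force count of matching pairs for the join predicate L.a = R.b."""
    counts = Counter(col_r)
    return sum(counts[v] for v in col_l)

# Two columns with 10 distinct values each, but only ONE value in common:
col_l = list(range(0, 10)) * 100    # values 0..9, 1000 rows
col_r = list(range(9, 19)) * 100    # values 9..18, 1000 rows

est = estimate_join_rows(len(col_l), len(col_r), 10, 10)
act = actual_join_rows(col_l, col_r)
print(est, act)   # 100000.0 10000 -- the containment assumption overestimates 10x
```

The formula is exact when one domain really is contained in the other (e.g., a foreign key with referential integrity); the error grows as the true overlap shrinks.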
Statistics on join indexes [Vald 87] would give much more precise selectivities for join predicates (note: join indexes are one approach, though they still do not address correlations across columns),
if we were willing to suffer the added costs of maintenance and lock contention of updates to these join indexes.
However, join indexes would not solve the intersection problem typical of star schemas.
For example, suppose a fact table of Transactions has dimensions for the Products, Stores, and Dates of each transaction.
Though current methods provide accurate selectivity estimates for predicates local to each dimension, e.g., ProductName = ‘Dockers’ and StoreName = ‘San Jose’ and Date = ’23Feb2013’,
it is impossible to determine the effect of the intersection of these predicates on the fact table.
Perhaps the San Jose store had a loss-leader sale on Dockers that day that expired the next day, and a similar sale on some other product the next day,
so that the individual selectivities for each day, store, and product appear identical, but the actual sales of Dockers on the two days would be significantly different.
It is the interaction of these predicates, through join predicates in this case, that research to date doesn’t address.
This leads naturally to the final and most challenging problem in our trifecta.
Correlation of columns
The column-correlation problem:
questioning the independence assumption between columns.
With few exceptions ([Chri 83], [Vand 86]), query optimizers since [Seli 79] have modeled selectivities as probabilities that a predicate on any given row will be satisfied,
then multiplied these individual selectivities together.
Effectively, this assumes that Prob{ColX = Value1, ColY = Value2} = Prob{ColX = Value1} * Prob{ColY = Value2}, i.e., that the two predicates are probabilistically independent,
if you recall your college probability.
Fortunately for the database industry, this assumption is often valid. However, occasionally it is not.
The famous Honda and Accord example.
My favorite example, which occurred in a customer database, is Make = ‘Honda’ and Model = ‘Accord’.
To simplify somewhat, suppose there are 10 Makes and 100 Models.
Then the independence (and uniformity) assumption gives us a selectivity of 1/10 * 1/100 = 0.001.
But since only Honda makes Accords, by trademark law, the real selectivity is 0.01.
So we will underestimate the cardinality by an order of magnitude.
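That arithmetic, sketched with a hypothetical helper and the numbers from the example above:

```python
# The independence assumption multiplies individual selectivities;
# the functional dependency Model -> Make makes that wrong here.

def independent_selectivity(*sels: float) -> float:
    """Combined selectivity under the independence assumption."""
    result = 1.0
    for s in sels:
        result *= s
    return result

est_sel = independent_selectivity(1 / 10, 1 / 100)  # optimizer: 0.001
true_sel = 1 / 100   # Model = 'Accord' already implies Make = 'Honda': 0.01
print(est_sel, true_sel, round(true_sel / est_sel))  # underestimated by 10x
```

Each extra correlated predicate multiplies in another unwarranted factor, so the underestimate compounds with every conjunct.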
Such optimistic errors are much worse than pessimistic overestimation errors,
because they cause the optimizer to think that certain operations will be cheaper than they really are, causing nasty surprises at run time. (Note: an optimistic underestimate may, for example, lead the optimizer to pick a nested-loop join that turns out to be disastrous.)
The only way to avoid such errors is for the database administrator to be aware of the semantic relationship (a functional dependency, in this case) between those two columns and its consequences,
and to collect column group statistics, as DB2 and other database products now allow.
To identify these landmines in the schema automatically, Stillger et al. [Stil 01] developed the LEarning Optimizer (LEO),
which opportunistically and automatically compared runtime actual cardinalities to optimizer estimates, to identify column combinations exhibiting such correlation errors.
Ilyas et al. [Ilya 04] attacked the problem more proactively in CORDS (CORrelation Detection by Sampling),
searching somewhat exhaustively for such correlations between any two columns in samples from the data before running any queries.
And Markl and colleagues [Mark 05], [Mark 07] have made groundbreaking advances on a consistent way to combine the selectivities of conjuncts in partial results.
All great progress on this problem, but none yet solves the problem of redundant predicates that can be inadvertently introduced by the query writer who typically believes that “more is better”,
that providing more predicates helps the DBMS do its job better. It’s as American as Apple Pie! (Note: the redundant-predicate problem is still unsolved; a war story follows.)
Let me illustrate with one of my favorite war stories.
At a meeting of the International DB2 User’s Group, a chief database administrator for a major U.S. insurance company
whom I’d helped with occasional bad plans asked me to conduct a class onsite.
I suggested it include an exercise on a real problem, unrehearsed.
After my class, she obliged me by presenting two 1-inch stacks of paper, each detailing the EXPLAIN of a plan for a query.
I feared I was going to embarrass myself and fail miserably under the gun.
The queries differed in only one predicate, she said, but the original ran in seconds whereas the one with the extra predicate took over an hour.
I instinctively examined first the cardinality estimates for the results of the two, and the slower one had a cardinality estimate 7 orders of magnitude less than the fast one.
When asked what column the extra predicate was on, my host explained that it was a composite key constructed of
the first four letters of the policyholder’s last name, the first and middle initials, the zip code, and last four digits of his/her Social Security Number.
Did the original query have predicates on all those columns?
Of course! And how many rows were there in the table? Ten million. Bingo!
I explained that that predicate was completely redundant of the others, and its selectivity, 1/10,000,000,
when multiplied by the others, underestimated the cardinality by 7 orders of magnitude, wreaking havoc with the plan.
It took me maybe 5 minutes to figure this out, and I was immediately dubbed a “genius”, but it really was straightforward:
the added predicate might help the runtime, especially if they had an index on that column, but it totally threw off the optimizer,
which couldn’t possibly have detected that redundancy without LEO or CORDS.
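A minimal reconstruction of that arithmetic, assuming the four original predicates together already identify roughly one row in ten million:

```python
# Redundant-predicate war story: the composite-key predicate repeats
# information the other predicates already carry, but the independence
# assumption multiplies its selectivity in anyway.

rows = 10_000_000
original_preds_sel = 1 / rows   # the four original predicates: ~1 qualifying row
redundant_sel = 1 / rows        # the composite key: the SAME information again

true_card = rows * original_preds_sel                  # ~1 row actually qualifies
est_card = rows * original_preds_sel * redundant_sel   # estimate collapses to ~1e-7

print(true_card)   # 1.0
print(est_card)    # 1e-07 -- seven orders of magnitude too low
```

The extra predicate may well help the runtime (e.g., via an index on the composite column), but the cardinality model has no way to see the redundancy, so the plan collapses around a wildly wrong estimate.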