当前位置：文档库 › On sampling and relational operators

On sampling and relational operators

On Sampling and Relational Operators

Surajit Chaudhuri Rajeev Motwani

Microsoft Research Stanford University

surajitc@https://www.wendangku.net/doc/7b8368094.html, rajeev@https://www.wendangku.net/doc/7b8368094.html,

Abstract

A major bottleneck in implementing sampling as a primitive relational operation is the inef?ciency of

sampling the output of a query.We highlight the primary dif?culties,summarize the results of some recent work in this area,and indicate directions for future work.

1Introduction

Data warehouses based on relational databases are becoming popular.The investment in data warehouses is tar-geted towards developing decision support applications that leverage the massive amount of data stored in data warehouses for a variety of business applications.On Line Analytical Processing(OLAP)and data mining are tools for analyzing large databases that are gaining popularity.Many of these tools serve as middleware or appli-cation servers that use a SQL database system as the backend data warehouse.They communicate data retrieval requests to the backend database through a relational(SQL)query.On a large database,the cost of executing such ad-hoc queries against the relational backend can be expensive.Fortunately,many data mining applications and statistical analysis techniques can use a sample of the data requested in the SQL query without compromising the results of the analysis.Likewise,OLAP servers that answer queries involving aggregation(e.g.,“?nd total sales for all products in the NorthWest region between1/1/98and1/15/98”)can signi?cantly bene?t from the ability to present to the user an approximate answer computed from a sample of the result of the query posed to the rela-tional database.It is well-known that for results of aggregation,sampling can be used accurately and ef?ciently. However,it is important to recognize that whether for data mining,OLAP,or other applications,sampling must be supported on the result of an arbitrary SQL query,not just on stored relations.For example,the preceding example of the OLAP query uses a star join between three tables(date,product,and sales).

This paper is concerned with supporting random sampling as a primitive operation in relational databases. In principle,this is easy—introduce into SQL an operation SAMPLE which produces a uniform random sample that is an-fraction of a relation.While producing a random sample from a relation is not entirely trivial,it is a well-studied problem and ef?cient strategies are available[15].However,these techniques are not effective if sampling needs to be applied to a relation produced by a query rather than to a base relation. It seems grossly inef?cient to evaluate,computing the entire relation,only to throw away most of it when applying SAMPLE.It would be much more desirable and ef?cient to partially evaluate so as to generate only the sample of.

For this purpose,it suf?ces to consider the case where we are given a query tree with SAMPLE only at the root.In this setting,it seems plausible that tremendous gains in ef?ciency can be achieved by“pushing”the sample operation down the tree towards the leaves,since then we would be feeding only a small(random)fraction of the relations(stored as well as intermediate relations)into the query tree and thereby minimizing the cost of query evaluation.To this end,we need to be able to“commute”the sample operation with standard relational operations.

There has not been much past work on supporting sampling as an operation for the end-user of a database system.While random sampling has been proposed and used in many different ways in databases[14,15],the main focus has been on the use of random sampling for the purposes of estimating query result size,aggregate values,and parameters for query optimization[16,13,10,11,7,6,3].

2Dif?culty of Sampling and Possible Solutions

In this section we elucidate some of the main dif?culties in ef?ciently implementing sampling in a relational system and suggest some avenues for circumventing these dif?culties.

2.1Join

We focus primarily on developing a technique for commuting sampling with a single join operation,since we could apply this technique repeatedly to push down the sample operator from the root to the leaves of a join tree. We establish that it is not possible to produce a sample of the result of even a single join from random samples of the two relations participating in the join,and show how we can leverage database statistics to circumvent this impossibility result.

Example1:Suppose that we have the relations

That is,is de?ned over the attributes and;amongst its tuples,one tuple has the-value and tuples have the-value,but all have distinct-values.Similarly,is de?ned over the attributes and;amongst its tuples,tuples have the-value and one tuple has the-value,but all have distinct-values.Observe that their join over,,is of size and has tuples with -value and tuples with-value.

Assume that we wish to choose a random sample.Consider a random sample.We expect that roughly half of the tuples in have-value,and roughly half of the tuples in have-value.

Suppose we pick random samples and.It is quite unlikely that will contain the tuple ,or that will contain the tuple.Thus,given the samples and,it is impossible to generate a random sample of for any reasonable sampling fraction.Note that this conclusion holds even if we allow(say)to be all of but require that be a proper subset of.In fact,in all these cases we would expect to be empty.

This suggests that SAMPLE does not commute with join since SAMPLE SAMPLE may not even contain any non-trivial size subset of,and so further computation or sampling from it cannot be used to ex-tract a sample of.We formalize this intuition as follows.Suppose we are given samples SAMPLE

and SAMPLE.Given only,,and any desired set of statistics for and,but without di-rect access to the tuples in and,we are required to produce a sample SAMPLE for some value of.The theorem below states the extremely negative result implicit in Example1—even when we

are given arbitrarily large samples from and,as well as arbitrarily detailed statistics,it is still not possible

to generate any non-empty random sample of

Theorem1:Suppose that at least one of,is strictly less than1.Then,it is not possible to generate the

sample SAMPLE from and for any where SAMPLE and SAMPLE.

Despite this strong negative result,it is still possible to exploit the power of sampling.Our key observation

is that given some partial statistics(e.g.,histograms)on one of the operand relations,we can use the statistics

to bias the sampling from the second relation in such a way that it becomes possible to produce a sample of the

join.We devise a variety of sampling schemes based on these observations,improving the state-of-the-art for join

sampling.In the context of a join tree,our work shows that it is possible to push down the sampling operation to

one of the two operand relations.At the same time,our negative results above show that it is inherently dif?cult

to achieve greater ef?ciency by pushing sampling down to both operands of a join in a query tree.

For the rest of the discussion,we will need to use some notation.Assume that the two relations and

are of size and,respectively,and that we are interested in an equi-join with respect to an attribute.We

denote the domain of the attribute by.For each value,let and be the number of distinct tuples in and,respectively,that contain value in attribute.

Observe that the impossibility of commuting SAMPLE with join does not preclude the possibility of some-

how obtaining SAMPLE from non-uniform samples of and.To better understand this point, consider the tuple and its in?uence on.While,the set has size .Thus,even though a random sample of is unlikely to pick up the tuple with-value,half of the tuples in the join have-value.This suggests that we sample a tuple of join attribute value with probability proportional to,in the hope that the resulting sample is more likely to re?ect the structure of. This is the basic insight behind most of our strategies in[4].As an example,consider the Frequency-Partition-Sample strategy[4],based on the insight that the naive sampling strategy performs badly only when the average frequency of attribute values is high enough to make the join size signi?cantly larger than the size of the operand relations.The idea is to partition the operands into two subrelations,one with the high-frequency values and the other with the low-frequency values.For the low-frequency values,we can use the naive sampling strategy,but for the high-frequency values we need to develop more re?ned approaches as this is precisely the set of values for which computing the full join is expensive.

The preceding discussion suggests that we sample tuples from based on frequency statistics for.This

leads us to a natural classi?cation of the problem and applicability of sampling techniques based on availability

of indexes and statistics on operand relations.We remark that the sampling strategy due to Olken et al[14,15]

applies only to the case where indexes and statistics are available on both operands of the join.In contrast,our

hybrid sampling technique does not need a full index on but merely some partial statistics(on the high fre-

quency values).

The preceding discussion was primarily concerned with the issue of sampling from a single join operation.

We now turn to the question of implementing sampling as a primitive relational operation in the context of a join

tree.Suppose we have a query generating a join tree with a sampling operation applied to its result.The

earlier negative results state that it is essentially impossible to push down the sampling operation to both operands

of a join operation,shedding some light on the dif?culty of dealing with linear and arbitrary join trees.Consider

sampling from-we cannot just select a uniform random sample of but have to pick a non-uniform sample whose distribution depends on the statistics of.But then,what about pushing the sampling further down the tree to?Now,we will have to sample from using statistics for both and. In principle,this can be done(e.g.,using multidimensional histograms on),since the operand relations are all base relations and their statistics can be precomputed.However,such requirements provide a high bar on being able to exploit sampling“on-the-?y”,i.e.,on ad-hoc queries.

Finally,consider the important special case where the query consists of foreign key joins,e.g.,

where is the dimension(referenced)table.In such cases,each tuple from joins with at most one tuple in

.This allows us to uniformly sample to create a sample for.Unfortunately,in most cases,such

queries would contain selection conditions on.As the next subsection shows,presence of selections limits

our ability to use sampling.

2.2Select and Group By

Since most queries involve selection conditions,it is important to study the interaction of sampling with selection.

The main insight here is that if the selectivity of a query is low,then it dramatically and adversely impacts the

accuracy of sampling-based estimation.Consider a simple query select count(*)from lineitem.Suppose we wish

to use sampling to approximate the query result.Consider a sample where each tuple is included with probability (the sampling fraction).If the number of tuples in the table is large,then the expected number of tuples in the sample(i.e.,)is large and we know from Chernoff bounds[12]that the actual number of tuples in the sample

is very close to the expected value with very high probability.As a result,we can accurately estimate the size of

the table from the sample.

Suppose,however,we were faced with the following query:select count(*)from lineitem where city=“Palo

Alto”.Let be the number of tuples that satisfy the selection condition in the query.If is not large enough,

then the expected value of the number of tuples from the sample that satisfy the selection condition(i.e.,)is low and the variance is large compared to this expected value.In that case,we will incur a large error in query

estimation based on the number of sampled tuples passing the selection condition.This intuition is formalized in

the following theorem due to Chaudhuri,Datar,Motwani,and Narasayya[5]where we show that the relative error

for a query is proportional to the inverse square root of the selectivity of the conditions in the query.Therefore,we

need a signi?cantly larger sample for queries with low selectivity.Earlier work has not satisfactorily addressed

this issue.Acharya et al[1]maintain samples of varying size and use different samples for different queries.

This requires estimating the selectivity of a query accurately.If we underestimate the selectivity,we will end

up using a bigger sample than required,thereby decreasing ef?ciency.On the other hand,if we overestimate

the selectivity,we will end up using a smaller sample and incurring a loss in accuracy.Likewise,the technique

adopted by Hellerstein et al[9]will fail to obtain good con?dence bounds unless we have looked at a large part

of the relation.Thus,low selectivity is a serious problem that limits the use of sampling as a robust technique.

Although we have focussed exclusively on Selection,a careful reader will note that the effect of Group By is

no different from that of selection since the effect of Group By is to create multiple selection queries that differ

by virtue of their qualifying values on the grouping attribute.Therefore,similar issues as has been studied for

Selection arises.

2.3Select Distinct

We now turn to the problem of estimating the number of distinct values in a column of a table.Our objective

is to provide estimation for such aggregates without scanning the entire table.A natural idea is to use a random

sample of the data.Unfortunately,we[3]proved that large errors are unavoidable for estimates derived only

from uniform random samples unless the sample size is very close to the size of the table itself.Indeed,we have

generalized this result and proved a stronger negative result for the the most general possible class of estima-

tors that,instead of being restricted to merely a random sampling,are allowed to use any(possibly randomized)

strategy to select a sequence of values to examine in the the input table.Such estimators could even pick a se-

quence of tuples adaptively,i.e.,pick the next tuple to examine based on the values of the previously-examined

tuples.This explains the observation by Haas et al[8]that no known estimator behaves well on all data distri-

butions primarily because their performance is sensitive to the skew of the data.Indeed,our lower bound on the

error stems in part from the dif?culty in distinguishing between low-skew and high-skew data.In fact,the proof

hinges on the inability to distinguish between two speci?c scenarios:one where there are few distinct values of very high frequency,and another where there a few high-frequency values together with a large number of very low-frequency values(thus having a large number of distinct values).The?rst scenario has low skew with few distinct values and the second one has high skew with a large number of distinct values.The following theorem, due to Charikar,Chaudhuri,Motwani,and Narasayya[2],states that any possible estimator must incur large error on at least one of these two scenarios.

Theorem2:Consider any(possibly adaptive and randomized)estimator for the number of distinct values that examines at most rows in a table with rows.Then,for any,there exists a choice of the input data such that with probability at least,

error()

table. However,in many cases,we are interested in other aggregating functions in addition to count,e.g.,select sum(sales) from sales

3Practicality of Sampling

Random sampling is an attractive data-reduction operation that reduces the cost of executing complex ad-hoc queries on large databases.But the preceding discussion shows that there are non-trivial barriers to implementing sampling as a relational operator.While the techniques we reviewed are able to circumvent some of these barriers, the goal of pushing sampling down to the leaf nodes of a complex query tree requires a substantial increase in the sample size to keep the error within bounds.In effect,sampling the results of an ad-hoc query is an inherently dif?cult and unsolved problem.Our work has demonstrated that auxiliary information such as histograms can be leveraged to facilitate ef?cient sampling.It remains an interesting question to determine what other auxiliary information may be employed for this purpose and more broadly for computing approximate answers to relational queries.

References

[1]S.Acharya,P.Gibbons,V.Poosala,and S.Ramaswamy.Join Synopses for Approximate Query Answering.In Proc.

ACM SIGMOD Conference,1999.

[2]M.Charikar,S.Chaudhuri,R.Motwani,and V.Narasayya.Towards Estimation Error Guarantees for Distinct Values.

Submitted for publication,1999.

[3]S.Chaudhuri,R.Motwani,and https://www.wendangku.net/doc/7b8368094.html,ing Random Sampling for Histogram Construction.In Proc.ACM

SIGMOD Conference,pages436–447,1998.

[4]S.Chaudhuri,R.Motwani,and V.Narasayya.On Random Sampling Over Joins.In Proc.ACM SIGMOD Conference,

pages263–274,1999.

[5]S.Chaudhuri,M.Datar,R.Motwani,and V.Narasayya.Overcoming Limitations of Sampling for Aggregation

Queries.Submitted for publication,1999.

[6]S.Ganguly,P.B.Gibbons,Y.Matias,and A.Silberschatz.Bifocal Sampling for Skew-Resistant Join Size Estimation.

In Proc.ACM SIGMOD Conference,pages271–281,1996.

[7]P.J.Haas,J.F.Naughton,and A.N.Swami.On the Relative Cost of Sampling for Join Selectivity Estimation.In

Proc.13th ACM PODS,pages14–24,1994.

[8]P.J.Haas,J.F.Naughton,S.Seshadri,and L.Stokes.Sampling-based estimation of the number of distinct values of

an attribute.In Proc.21st VLDB,pages311–322,1995.

[9]J.M.Hellerstein,P.J.Haas,and H.J.Wang.Online Aggregation.In Proc.ACM SIGMOD Conference,pages171–182,

1997.

[10]W.Hou,G.Ozsoyoglu,and E.Dogdu.Error-Constrained COUNT Query Evaluation in Relational Databases.In

Proc.ACM SIGMOD Conference,pages278–287,1991.

[11]R.J.Lipton,J.F.Naughton,D.A.Schneider,and S.Seshadri.Ef?cient Sampling Strategies for Relational Database

Operations.Theoretical Computer Science116(1993):195–226.

[12]R.Motwani and P.Raghavan.Randomized Algorithms.Cambridge University Press,1995.

[13]J.F.Naughton and S.Seshadri.On Estimating the Size of Projections.In Proc.Third International Conference on

Database Theory,pages499–513,1990.

[14]F.Olken and D.Rotem.Simple random sampling from relational databases.In Proc.12th VLDB,pages160–169,

1986.

[15]F.Olken.Random Sampling from Databases.PhD Dissertation,Computer Science,University of California at Berke-

ley,1993.

[16]G.Piatetsky-Shapiro and C.Connell.Accurate estimation of the number of tuples satisfying a condition.In Proc.ACM

SIGMOD Conference,pages256–276,1984.