Skyline with presorting Theory and optimization--688IT编程网

Skyline with Presorting:Theory and Optimizations

Jan Chomicki1,Parke Godfrey2,Jarek Gryz2,and Dongming Liang2

1University at Buﬀalo,USA

2York University,Canada

Abstract.There has been interest recently in skyline queries,also called Pareto queries,on relational databases.Relational query languages do not support search for“best”tuples,beyond the order by statement.The proposed skyline operator allows one to query for best tuples with respect to any number of attributes as preferences.In this work,we explore what the skyline means,and why skyline queries are useful,particularly for expressing preference.We describe the theoretical aspects and possible optimizations of an eﬃciant algorithm for computing skyline queries presented in[6].

1Introduction and Motivation

Often one would like to query a relational database in search of a“best”match,or tuples that best match one’s preferences.Relational query lan-guages provide only limited support for this:the min and max aggregation operators,which act over a single column;and the ability to order tuples with respect to thei

r attribute values.In SQL,this is done with the order by clause.This is suﬃcient when one’s preference is synonymous with the val-ues of one of the attributes,but is far from suﬃcient when one’s preferences are more complex,involving more of the attributes.

Consider a table of restaurant guide information,as in Figure1.Column S stands for service,F for food,and D for decor.Each is scored from1to 30,with30as the best.We are interested in choosing a restaurant from the guide,and we are looking for a best choice,or best choices from which to choose.Ideally,we would like the choice to be the best for service,food,and decor,and be the lowest priced.However,there is no restaurant that is better than all others on every criterion individually,as is usually the case in real life,and in real data.No one restaurant“trumps”all others.For instance, Summer Moon is best on food,but Zakopane is best on service.

While there is no one best restaurant with respect to our criteria,we want at least to eliminate from consideration those restaurants which are worse on all criteria than some other.Thus,the Briar Patch BBQ should be eliminated because the Fenton&Pickle is better on all our criteria and is thus a better choice.The Brearton Grill is in turn eliminated because Zakopane is better than it on all criteria.Meanwhile the Fenton&Pickle is worse on every criterion than every other(remaining)restaurant,except on price,where it

restaurant S F D price

Summer Moon21251947.50

Zakopane24202156.00

Yamanote22221751.50

Fenton&Pickle16141017.50

Fig.2.Restaurants in the skyline.

In[3],a new relational operator is proposed which they name the sky-line operator.They propose an extension to SQL with a skyline of clause as counterpart to this operator that would allow the easy expression of the restaurant query we imagined above.In[9]and elsewhere,this is called the Pareto operator.Indeed,the notion of Pareto optimality with respect to mul-tiple parameters is equivalent to that of choosing the non-dominated tuples, designated as the skyline.

[min|max|diﬀ]

skyline of a[1][min|max|diﬀ],...,a[n]

Fig.3.A proposed skyline operator for SQL.

The skyline of clause is shown in Figure3.Syntactically,it is similar to an order by clause.The columns a1,...,a n are the attributes that our preferences range over.They must be of domains that have a natural total ordering,as integers,ﬂoats,and dates.The directives min and max specify whether we prefer low or high values,respectively.The directive diff says that we are interested in retaining best choices with respect to every distinct value of that attribute.Let max be the default directive if none is stated.The

skyline query in Figure4over the table GoodEats in Figure1expresses what we had in mind above for choosing“best”restaurants,and would result in the answer set in Figure2.

select*from GoodEats

skyline of S max,F max,D max,price min

Fig.4.Skyline query to choose restaurants.

Skyline queries are not outside the expressive power of current SQL.The query in Figure5shows how we can write an arbitrary skyline query in present SQL.The c i’s are attributes of OurTable that we are interested to retain in our query,but are not skyline criteria.The s i are the attributes that are our skyline criteria to be maximized,and would appear in skyline of as s i max.(Without loss of generality,let us only consider max and not min.) The d i are the attributes that are the skyline criteria to diﬀer,and would appear in skyline of as d i diff.

select c1,...,c k,s1,...,s m,d1,...,d n

from OurTable

except

sorting outselect D.c1,...,D.c k,D.s1,...,D.s m,D.d1,...,D.d n

from OurTable T,OurTable D

where D.s1≤D.s m≤T.s m and

(D.s1<D.s m<T.s m)and

D.d1=D.d n=T.d n

Fig.5.SQL for generating the skyline set.

Certainly it would be cumbersome to need to write skyline-like queries in this way.The skyline clause is a useful syntactic addition.More important than ease of expression,however,is the expense of evaluation.The query in Figure5can be quite expensive.It involves a self-join over a table,and this join is aθ-join,not an equality-join.The self-join eﬀectively computes the tuples that are trumped—or dominated—by other tuples.The tuples that remain,that were never trumped,are then the skyline tuples.It is known that the size of the skyline tends to be small,with certain provisos,with respect to the size of the table[7].Thus,the intermediate result-set before the except can be enormous.

No current query optimizer would be able to do much with the query in Figure5to improve performance.If we want to support skyline queries, it is necessary to develop an eﬃcient algorithm for computing skyline.And if we want the skyline operator as part of SQL,this algorithm must be

easy to integrate in relational query plans,be well-behaved in a relational context,work in all cases(without special provisions in place),and be easily accommodated by the query optimizer.

Recent years have brought new interest in expressing preference queries in the context of relational databases and the World Wide Web.Two competing approaches have emerged so far.In theﬁrst approach[1,8],preferences are expressed by means of preference(utility)functions.The second approach uses logical formulas[4,9]and,in particular,the skyline operator[3,12], described in the previous section.Skyline computation is similar to the max-imal vector problem studied in[2,10,11].These consider algorithmic solutions to the problem and address the issue of skyline size.None of these works addresses the problem in a database context,however.In[7],we address the question of skyline query cardinality more concretely.

In this paper,we explore what the skyline means,and why skyline queries are useful,particularly for expressing preference.We describe a well-behaved, eﬃcient algorithm for computing skyline queries.Our algorithm improves on exisiting approaches in eﬃciency,pipelinabilty of output(of the skyline tu-ples),stability of run-time performance,and being applicable in any context. 2Skyline versus Ranking

The skyline of a relation in essence represents the best tuples of the relation, the Pareto optimal“solutions”,with respect to the skyline criteria.Another way toﬁnd“best”tuples is to score each tuple with respect to one’s prefer-ences,and then choose those tuples with the best score(ranking).The

latter could be done eﬃciently in a relational setting.In one table scan,one can score the tuples and collect the best scoring tuples.

How is skyline related to ranking then?It is known that the skyline repre-sents the closure over the maximum scoring tuples of the relation with respect to all monotone scoring functions.For example,in choosing a restaurant as in the example in Section1,say that one values service quality twice as much as food quality,and food quality twice as much as decor,those restaurants that are best with respect to this“weighting”will appear in the skyline.Fur-thermore,the skyline is the least-upper-bound closure over the maximums of the monotone scoring functions[3].

This means that the skyline can be used instead of ranking,or it can be used in conjunction with ranking.First,since the best tuples with respect to any(monotone)scoring are in the skyline,one only needs eﬀectively to query the skyline with one’s preference queries,and not the original table itself. The skyline is(usually)signiﬁcantly smaller than the table itself[7],so this would be much more eﬃcient if one had many preference queries to try over the same dataset.Second,as deﬁning one’s preferences in a preference query can be quite diﬃcult,while expressing a skyline query is relatively easy,users mayﬁnd skyline queries beneﬁcial.The skyline over-answers with respect to

the users’intent in a way,since it includes the best tuples with respect to any preferences.So there will be some choices (tuples)among the skyline that are not of interest to the user.However,every best choice with respect to the user’s implicit preferences shows up too.

While in [3],they observe this relation of skyline with monotone scoring functions,they did not oﬀer proof nor did they discuss linear scoring func-tions ,to which much work restricts focus.Let us investigate this more closely,and more formally,then,for the following reasons:

•to relate skyline to preference queries,and to illustrate that expressing preferences by scoring is more diﬃcult than one might initially expect;•to rectify some common misconceptions regarding scoring for the pur-poses of preference queries,and regarding the claim for skyline;and •to demonstrate a useful property of monotone scoring that we can exploit for an eﬃcient algorithm to compute the skyline.

Let attributes a 1,...,a k of schema R be the skyline criteria,without loss of generality,with respect to “max”.Let the domains of the a i ’s be real,without loss of generality.Let R be a relation of schema R,and so represents a given instance.

Deﬁnition 1.Deﬁne a monotone scoring function S with respect to R as a function that takes as its input domain tuples of R,and maps them onto the range of reals.S is composed of k monotone increasing fu

nctions,f 1,...,f k .For any tuple t ∈R ,S (t )= k i =1f i (t [a i ]).

Lemma 1.Any tuple that has the best score over R with respect to any monotone scoring function S with respect to R must be in the skyline.1

It is more diﬃcult to show that every tuple of the skyline is the best score of some monotone scoring.Most restrict attention to linear weightings when considering scoring,though,so let us consider this ﬁrst.

Deﬁnition 2.Deﬁne a positive,linear scoring function ,W ,as any function over a table R ’s tuples of the form W (t )= k i =1w i t [a i ],in which the w i ’s are positive,real constants.

As we insist that the w i ’s are positive,the class of the positive,linear scoring functions is a proper sub-class of the monotone scoring functions.Commonly in preference query work,as in [8],the focus is restricted to linear scoring.It is not true,however,that every skyline tuple is the best with respect to some positive,linear scoring.

Theorem 1.It is possible for a skyline tuple to exist on R such that,for every positive,linear scoring function,the tuple does not have the maximum score with respect to the function over table R .

Consider R=((4,1),(2,2),(1,4)).All three tuples are in the skyline (skyline of a1,a2).Linear scorings that choose(4,1)and(1,4)are obvious, but there is no positive,linear scoring that scores(2,2)best.Note that(2,2) is an interesting choice.Tuples(4,1)and(1,4)represent in a way outliers. They make the skyline because each has an attribute with an extrema value. Whereas(2,2)represents a balance between the attributes(and hence,pref-erences).For example,if we are conducting a house hunt,a1may represent the number of bathrooms,and a2,the number of bedrooms.Neither a house with four bathrooms and one bedroom,nor one with one bathroom and four bedrooms,seem very appealing,whereas a2bth/2bdrm house might. Theorem2.The skyline contains all,and only,tuples yielding maximum values of monotone scoring functions.

While there exists a monotone scoring function that chooses—assigns the highest score to—any given skyline tuple,it does not mean anyone would everﬁnd this function.In particular,this is because,in many cases,any such function is a contrivance based upon that skyline’s values.The user is searching for“best”tuples and has not seen them yet.Thus,it is unlikely anyone would discover a tuple like(2,2)above with any preference query. Yet,the2bth/2bdrm house might be exactly what we wanted.

For the algorithm for skyline computation we are to develop,we can exploit our observations on the monotone scoring functions.Let us deﬁne the dominance relation,“ ”,as follows:for tuples any r,t∈R,r t i

ﬀr[a i]≤t[a i],for all i∈1,...,k.Further deﬁne that r≺t iﬀr t and r[a i]<t[a i],for some i∈1,...,k.

Theorem3.Any total order of the tuples of R with respect to any monotone scoring function(ordered from highest to lowest score)is a topological sort with respect to the skyline dominance partial relation(“ ”).

Consider the total ordering on R provided by the basic SQL order by as in the query in Figure6.This total order is a topological sort with respect to dominance.

select*from R

order by a1desc,...,a k desc;

Fig.6.An order by query that produces a total monotone order.

The following proposition is fairly obvious and it is used to build a better skyline algorithm.

Theorem4.Any nested sort of R over the skyline attributes(sorting in descending order on each),as in the query in Figure6,is a topological sort with respect to the dominance partial order.

688IT编程网

Skyline with presorting Theory and optimization

发表评论

推荐文章

Linux怎么直接执行PHP脚本文件

php文件写入或追加数据

php中实现文件上传的函数

php文件上传类程序代码

413 request entity too large 解决方法 -回复

热门文章

php中用来导入其他文件的语句

php获取文件后缀名的方法

创建php文件方法

国家电网公司电子商务平台常见问题

【2018-2019】别克英朗说明书-实用word文档 (12页)

诺基亚E71常见问题以及解决方法

HXD3型电力机车故障应急处理

卫星电视中星9号解密方法及节目参数,长期可用

硬盘U盘等启动奶瓶beini详细步骤教程

BT3使用教程

破解网通铁通电信封路由器的几种方法

手把手教你WPA2加密无线网络

教你如何破解搜索到的无线网络

Get清风OD入门系列图文详细教程、破解做辅助起步

java rar破解原理

同余方程在密码学中的应用与破解

无限网络解码

winrar破解方法

macOS终端中的文件加密和解密技巧

rar加密原理

最新文章

php中实现文件上传的函数

413 request entity too large 解决方法 -回复

php实现编辑和保存文件的方法

php 配置文件的用法 -回复

突破php网站上传文件大小限制

php(实现url重写)

标签列表