Skyline with Presorting:Theory and Optimizations
Jan Chomicki1,Parke Godfrey2,Jarek Gryz2,and Dongming Liang2
1University at Buffalo,USA
2York University,Canada
Abstract.There has been interest recently in skyline queries,also called Pareto queries,on relational databases.Relational query languages do not support search for“best”tuples,beyond the order by statement.The proposed skyline operator allows one to query for best tuples with respect to any number of attributes as preferences.In this work,we explore what the skyline means,and why skyline queries are useful,particularly for expressing preference.We describe the theoretical aspects and possible optimizations of an efficiant algorithm for computing skyline queries presented in[6].
1Introduction and Motivation
Often one would like to query a relational database in search of a“best”match,or tuples that best match one’s preferences.Relational query lan-guages provide only limited support for this:the min and max aggregation operators,which act over a single column;and the ability to order tuples with respect to thei
r attribute values.In SQL,this is done with the order by clause.This is sufficient when one’s preference is synonymous with the val-ues of one of the attributes,but is far from sufficient when one’s preferences are more complex,involving more of the attributes.
Consider a table of restaurant guide information,as in Figure1.Column S stands for service,F for food,and D for decor.Each is scored from1to 30,with30as the best.We are interested in choosing a restaurant from the guide,and we are looking for a best choice,or best choices from which to choose.Ideally,we would like the choice to be the best for service,food,and decor,and be the lowest priced.However,there is no restaurant that is better than all others on every criterion individually,as is usually the case in real life,and in real data.No one restaurant“trumps”all others.For instance, Summer Moon is best on food,but Zakopane is best on service.
While there is no one best restaurant with respect to our criteria,we want at least to eliminate from consideration those restaurants which are worse on all criteria than some other.Thus,the Briar Patch BBQ should be eliminated because the Fenton&Pickle is better on all our criteria and is thus a better choice.The Brearton Grill is in turn eliminated because Zakopane is better than it on all criteria.Meanwhile the Fenton&Pickle is worse on every criterion than every other(remaining)restaurant,except on price,where it
restaurant S F D price
Summer Moon21251947.50
Zakopane24202156.00
Yamanote22221751.50
Fenton&Pickle16141017.50
Fig.2.Restaurants in the skyline.
In[3],a new relational operator is proposed which they name the sky-line operator.They propose an extension to SQL with a skyline of clause as counterpart to this operator that would allow the easy expression of the restaurant query we imagined above.In[9]and elsewhere,this is called the Pareto operator.Indeed,the notion of Pareto optimality with respect to mul-tiple parameters is equivalent to that of choosing the non-dominated tuples, designated as the skyline.
<
[min|max|diff]
skyline of a[1][min|max|diff],...,a[n]
Fig.3.A proposed skyline operator for SQL.
The skyline of clause is shown in Figure3.Syntactically,it is similar to an order by clause.The columns a1,...,a n are the attributes that our preferences range over.They must be of domains that have a natural total ordering,as integers,floats,and dates.The directives min and max specify whether we prefer low or high values,respectively.The directive diff says that we are interested in retaining best choices with respect to every distinct value of that attribute.Let max be the default directive if none is stated.The
skyline query in Figure4over the table GoodEats in Figure1expresses what we had in mind above for choosing“best”restaurants,and would result in the answer set in Figure2.
select*from GoodEats
skyline of S max,F max,D max,price min
Fig.4.Skyline query to choose restaurants.
Skyline queries are not outside the expressive power of current SQL.The query in Figure5shows how we can write an arbitrary skyline query in present SQL.The c i’s are attributes of OurTable that we are interested to retain in our query,but are not skyline criteria.The s i are the attributes that are our skyline criteria to be maximized,and would appear in skyline of as s i max.(Without loss of generality,let us only consider max and not min.) The d i are the attributes that are the skyline criteria to differ,and would appear in skyline of as d i diff.
select c1,...,c k,s1,...,s m,d1,...,d n
from OurTable
except
sorting outselect D.c1,...,D.c k,D.s1,...,D.s m,D.d1,...,D.d n
from OurTable T,OurTable D
where D.s1≤D.s m≤T.s m and
(D.s1<D.s m<T.s m)and
D.d1=D.d n=T.d n
Fig.5.SQL for generating the skyline set.
Certainly it would be cumbersome to need to write skyline-like queries in this way.The skyline clause is a useful syntactic addition.More important than ease of expression,however,is the expense of evaluation.The query in Figure5can be quite expensive.It involves a self-join over a table,and this join is aθ-join,not an equality-join.The self-join effectively computes the tuples that are trumped—or dominated—by other tuples.The tuples that remain,that were never trumped,are then the skyline tuples.It is known that the size of the skyline tends to be small,with certain provisos,with respect to the size of the table[7].Thus,the intermediate result-set before the except can be enormous.
No current query optimizer would be able to do much with the query in Figure5to improve performance.If we want to support skyline queries, it is necessary to develop an efficient algorithm for computing skyline.And if we want the skyline operator as part of SQL,this algorithm must be
easy to integrate in relational query plans,be well-behaved in a relational context,work in all cases(without special provisions in place),and be easily accommodated by the query optimizer.
Recent years have brought new interest in expressing preference queries in the context of relational databases and the World Wide Web.Two competing approaches have emerged so far.In thefirst approach[1,8],preferences are expressed by means of preference(utility)functions.The second approach uses logical formulas[4,9]and,in particular,the skyline operator[3,12], described in the previous section.Skyline computation is similar to the max-imal vector problem studied in[2,10,11].These consider algorithmic solutions to the problem and address the issue of skyline size.None of these works addresses the problem in a database context,however.In[7],we address the question of skyline query cardinality more concretely.
In this paper,we explore what the skyline means,and why skyline queries are useful,particularly for expressing preference.We describe a well-behaved, efficient algorithm for computing skyline queries.Our algorithm improves on exisiting approaches in efficiency,pipelinabilty of output(of the skyline tu-ples),stability of run-time performance,and being applicable in any context. 2Skyline versus Ranking
The skyline of a relation in essence represents the best tuples of the relation, the Pareto optimal“solutions”,with respect to the skyline criteria.Another way tofind“best”tuples is to score each tuple with respect to one’s prefer-ences,and then choose those tuples with the best score(ranking).The
latter could be done efficiently in a relational setting.In one table scan,one can score the tuples and collect the best scoring tuples.
How is skyline related to ranking then?It is known that the skyline repre-sents the closure over the maximum scoring tuples of the relation with respect to all monotone scoring functions.For example,in choosing a restaurant as in the example in Section1,say that one values service quality twice as much as food quality,and food quality twice as much as decor,those restaurants that are best with respect to this“weighting”will appear in the skyline.Fur-thermore,the skyline is the least-upper-bound closure over the maximums of the monotone scoring functions[3].
This means that the skyline can be used instead of ranking,or it can be used in conjunction with ranking.First,since the best tuples with respect to any(monotone)scoring are in the skyline,one only needs effectively to query the skyline with one’s preference queries,and not the original table itself. The skyline is(usually)significantly smaller than the table itself[7],so this would be much more efficient if one had many preference queries to try over the same dataset.Second,as defining one’s preferences in a preference query can be quite difficult,while expressing a skyline query is relatively easy,users mayfind skyline queries beneficial.The skyline over-answers with respect to
the users’intent in a way,since it includes the best tuples with respect to any preferences.So there will be some choices (tuples)among the skyline that are not of interest to the user.However,every best choice with respect to the user’s implicit preferences shows up too.
While in [3],they observe this relation of skyline with monotone scoring functions,they did not offer proof nor did they discuss linear scoring func-tions ,to which much work restricts focus.Let us investigate this more closely,and more formally,then,for the following reasons:
•to relate skyline to preference queries,and to illustrate that expressing preferences by scoring is more difficult than one might initially expect;•to rectify some common misconceptions regarding scoring for the pur-poses of preference queries,and regarding the claim for skyline;and •to demonstrate a useful property of monotone scoring that we can exploit for an efficient algorithm to compute the skyline.
Let attributes a 1,...,a k of schema R be the skyline criteria,without loss of generality,with respect to “max”.Let the domains of the a i ’s be real,without loss of generality.Let R be a relation of schema R,and so represents a given instance.
Definition 1.Define a monotone scoring function S with respect to R as a function that takes as its input domain tuples of R,and maps them onto the range of reals.S is composed of k monotone increasing fu
nctions,f 1,...,f k .For any tuple t ∈R ,S (t )= k i =1f i (t [a i ]).
Lemma 1.Any tuple that has the best score over R with respect to any monotone scoring function S with respect to R must be in the skyline.1
It is more difficult to show that every tuple of the skyline is the best score of some monotone scoring.Most restrict attention to linear weightings when considering scoring,though,so let us consider this first.
Definition 2.Define a positive,linear scoring function ,W ,as any function over a table R ’s tuples of the form W (t )= k i =1w i t [a i ],in which the w i ’s are positive,real constants.
As we insist that the w i ’s are positive,the class of the positive,linear scoring functions is a proper sub-class of the monotone scoring functions.Commonly in preference query work,as in [8],the focus is restricted to linear scoring.It is not true,however,that every skyline tuple is the best with respect to some positive,linear scoring.
Theorem 1.It is possible for a skyline tuple to exist on R such that,for every positive,linear scoring function,the tuple does not have the maximum score with respect to the function over table R .
Consider R=((4,1),(2,2),(1,4)).All three tuples are in the skyline (skyline of a1,a2).Linear scorings that choose(4,1)and(1,4)are obvious, but there is no positive,linear scoring that scores(2,2)best.Note that(2,2) is an interesting choice.Tuples(4,1)and(1,4)represent in a way outliers. They make the skyline because each has an attribute with an extrema value. Whereas(2,2)represents a balance between the attributes(and hence,pref-erences).For example,if we are conducting a house hunt,a1may represent the number of bathrooms,and a2,the number of bedrooms.Neither a house with four bathrooms and one bedroom,nor one with one bathroom and four bedrooms,seem very appealing,whereas a2bth/2bdrm house might. Theorem2.The skyline contains all,and only,tuples yielding maximum values of monotone scoring functions.
While there exists a monotone scoring function that chooses—assigns the highest score to—any given skyline tuple,it does not mean anyone would everfind this function.In particular,this is because,in many cases,any such function is a contrivance based upon that skyline’s values.The user is searching for“best”tuples and has not seen them yet.Thus,it is unlikely anyone would discover a tuple like(2,2)above with any preference query. Yet,the2bth/2bdrm house might be exactly what we wanted.
For the algorithm for skyline computation we are to develop,we can exploit our observations on the monotone scoring functions.Let us define the dominance relation,“ ”,as follows:for tuples any r,t∈R,r t i
ffr[a i]≤t[a i],for all i∈1,...,k.Further define that r≺t iffr t and r[a i]<t[a i],for some i∈1,...,k.
Theorem3.Any total order of the tuples of R with respect to any monotone scoring function(ordered from highest to lowest score)is a topological sort with respect to the skyline dominance partial relation(“ ”).
Consider the total ordering on R provided by the basic SQL order by as in the query in Figure6.This total order is a topological sort with respect to dominance.
select*from R
order by a1desc,...,a k desc;
Fig.6.An order by query that produces a total monotone order.
The following proposition is fairly obvious and it is used to build a better skyline algorithm.
Theorem4.Any nested sort of R over the skyline attributes(sorting in descending order on each),as in the query in Figure6,is a topological sort with respect to the dominance partial order.