
Efficient PageRank Approximation via Graph Aggregation

Andrei Z. Broder
IBM T.J. Watson Research Center, Hawthorne, NY, USA

Ronny Lempel
IBM Haifa Research Labs, Haifa, Israel

Farzin Maghoul
Yahoo! Inc.

Jan Pedersen
Yahoo! Inc.

ABSTRACT

We present a framework for approximating random-walk based probability distributions over Web pages using graph aggregation. We (1) partition the Web's graph into classes of quasi-equivalent vertices, (2) project the page-based random walk to be approximated onto those classes, and (3) compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. In particular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host. We experimented on a Web graph containing over 1.4 billion pages, and were able to produce a ranking that has a Spearman rank-order correlation of 0.95 with respect to PageRank. A simplistic implementation of our method required less than half the running time of a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Performance, Experimentation, Measurements

Keywords

Link analysis, search engines, Web information retrieval

1.INTRODUCTION

Since the late nineties, Web search engines have come to rely more and more on off-page, Web-specific data such as link analysis, anchor text, and click-through data. One particular form of link-based ranking factor is the static score: a query-independent importance score assigned to every Web page. The most famous algorithm for producing such scores is PageRank, devised by Brin and Page while developing the ranking module for the prototype of the search engine Google [1]. PageRank can be described as the stationary probability distribution of a certain random walk on the Web graph, the graph whose nodes are Web pages and whose edges are the hyperlinks between them.

3. THE GRAPH AGGREGATION METHOD

Let T be a random walk on a graph with n nodes. T will denote both the random walk and the stochastic matrix that governs it. Let the n nodes be partitioned into m classes H_1, ..., H_m. We concentrate on the case where T denotes PageRank and the partitioning of Web pages is according to their host; hence, a class H_i contains all the nodes (pages) on a certain host. We develop an alternative random walk T', derived from T, whose stationary distribution can be computed more efficiently. In T', a state transition departing from a node x ∈ H_i is the following two-stage process:

• Move to some node y ∈ H_i according to a distribution π_{H_i}. Note that π_{H_i} depends only on the class of x.

• From y, perform a transition according to T.
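To make the two-stage transition concrete, here is a hypothetical one-step simulator of T'. It assumes T is a dense row-stochastic matrix, host maps each node to its class index, members[i] lists the nodes of H_i, and pi_H[i] holds the distribution π_{H_i}; these names are ours, not the paper's.

```python
import numpy as np

def t_prime_step(x, T, host, members, pi_H, rng):
    """One transition of T' from node x: first hop inside x's class
    according to pi_{H_i}, then take one ordinary T-step from y."""
    i = host[x]
    y = rng.choice(members[i], p=pi_H[i])    # stage 1: y ~ pi_{H_i}
    return rng.choice(T.shape[0], p=T[y])    # stage 2: next ~ T[y, :]
```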

The algorithm for calculating the stationary distribution of T':

1. Define an m × m stochastic matrix $\hat{T} = [\hat{t}_{i,j}]$ as follows:

$$\hat{t}_{H_i,H_j} = \sum_{q \in H_i} \pi_{H_i}(q) \cdot \sum_{p \in H_j} t_{q,p}.$$

2. Calculate the principal eigenvector of $\hat{T}$: compute an m-dimensional probability distribution $\hat{\alpha}$ satisfying $\hat{\alpha}\hat{T} = \hat{\alpha}$.

3. Compute an n-dimensional probability distribution $\gamma$, where for each node p, $\gamma(p) = \hat{\alpha}_{h(p)} \cdot \pi_{h(p)}(p)$ (here $h(p)$ denotes the class to which p belongs).

4. The stationary distribution of T' is the vector $\beta' = \gamma T$.

Computational demands: the standard power-iteration computation of PageRank converges in a few dozen iterations, each of which requires one pass over the complete list of links of the entire Web graph. In contrast, our algorithm needs only two passes over the entire set of links (steps 1 and 4). Each power iteration required by our algorithm (in step 2) is linear in the number of links of the host graph, which is typically much smaller, perhaps by a factor of 20, than the number of links in the page graph. This also has implications for the memory demands of our algorithm: search engines compute connectivity-based measures over tens of billions of links, a scale of data that exceeds the RAM capacity of most single-machine platforms. Our algorithm, however, performs most of its computations on the much smaller host graph, which may fit in the RAM of a single machine.
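As a sanity check of steps 1 through 4, the toy sketch below carries out the whole computation with dense NumPy arrays and uniform intra-class distributions π_H; a production version would stream sparse links from disk in the two page-level passes, which this sketch deliberately ignores. All identifiers are ours.

```python
import numpy as np

def aggregated_stationary(T, host, m, iters=50):
    """Steps 1-4: approximate the stationary distribution of T'
    for uniform intra-class distributions pi_H (dense toy version)."""
    n = T.shape[0]
    pi = 1.0 / np.bincount(host, minlength=m)[host]  # pi_{h(p)}(p) = 1/|H_{h(p)}|

    # Step 1: T_hat[i, j] = sum_{q in H_i} pi(q) * sum_{p in H_j} t_{q,p}
    C = np.zeros((n, m))
    C[np.arange(n), host] = 1.0                      # n x m class-indicator matrix
    T_hat = C.T @ (pi[:, None] * (T @ C))            # m x m, row-stochastic

    # Step 2: principal left eigenvector of T_hat by power iteration
    alpha = np.full(m, 1.0 / m)
    for _ in range(iters):
        alpha = alpha @ T_hat

    # Step 3: lift back to pages: gamma(p) = alpha[h(p)] * pi_{h(p)}(p)
    gamma = alpha[host] * pi

    # Step 4: one final page-level pass gives beta' = gamma T
    return gamma @ T
```

Note how only steps 1 and 4 touch page-level links; the power iteration in step 2 works entirely on the m × m class matrix.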

This algorithm is related to the BlockRank algorithm [5]. BlockRank accelerates PageRank by deriving a distribution vector v from which, empirically, fewer power iterations are required until convergence. The vector v is closely related to the vector γ produced by step 3 above when each distribution π_H is chosen as the intra-host PageRank vector of H. Since BlockRank multiplies v by T several times (until convergence) while we multiply γ by T only once (step 4), our approach is faster than BlockRank.

Experiments: we conducted experiments on a Web graph containing over 1,446 million pages with almost 6,650 million links, from a crawl of the Web conducted by AltaVista in September 2003. The number of unique hosts in this graph was about 31 million, with 241 million host-to-host edges. We set all intra-host distributions π_H to be uniform; hence, we called the resulting random-walk flavor the "U-model". Computing PageRank on an Alpha server with four 667 MHz CPUs required 12.5 hours, while computing the U-model scores took 5.8 hours (a speedup factor of about 2.1).

The PageRank computation used a robust and optimized infrastructure, whereas our modified algorithm was written in an ad-hoc, non-optimized manner. We predict that by optimizing our implementation, speedup factors can approach the full potential indicated by the ratio between the number of links in the Web graph and the corresponding figure in the host graph.

We measured the correlation between PageRank and the U-model using a sample of 1298 pages. The sample was not a uniform random sample since, by virtue of the power-law distribution of PageRank, the bulk of Web pages have few or no inlinks; such pages would be ranked similarly by practically any static-score algorithm. Instead, we sorted pages according to their PageRank and sampled as follows: each of the top 1000 pages was chosen with probability 0.2. Then, for $j = 3, \ldots, 8$, the pages in places $10^j + 1, \ldots, 10^{j+1}$ were sampled with probability $0.2 \cdot 10^{2-j}$. Thus, the sample covered the full range of PageRank values, and included many more pages with high PageRank values than a uniform random sample would have produced. In this sample, the U-model had a Pearson correlation of 0.81 with PageRank. Furthermore, we compared the rankings induced by the two score flavors. The Spearman rank-order correlation between the U-model and PageRank was quite high, at 0.95. This suggests that the U-model can be used as an effective approximation of PageRank in relevance ranking.
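For illustration, the stratified sampling and the two correlation measures can be reproduced as in the sketch below. The bucket boundaries follow the text above; the array names and the use of scipy.stats are our assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def stratified_sample(order, rng):
    """order: page ids sorted by decreasing PageRank. Keeps each of the
    top 1000 pages w.p. 0.2, and pages in places 10^j+1..10^(j+1)
    w.p. 0.2 * 10^(2-j) for j = 3..8."""
    keep = [order[:1000][rng.random(1000) < 0.2]]
    for j in range(3, 9):
        bucket = order[10**j : 10**(j + 1)]          # short slice if out of range
        keep.append(bucket[rng.random(len(bucket)) < 0.2 * 10.0 ** (2 - j)])
    return np.concatenate(keep)

# Hypothetical usage, given per-page score arrays pr and um:
# idx = stratified_sample(np.argsort(-pr), np.random.default_rng(0))
# pearsonr(pr[idx], um[idx]); spearmanr(pr[idx], um[idx])
```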

Whether the (small) differences between the U-model and PageRank imply better or worse search quality is currently unknown.

4. CONCLUSIONS

This paper introduced a new framework for calculating random-walk based static ranks of Web pages. The framework approximates the behavior of a given random walk on the Web's graph in a scalable manner by performing most of its calculations on a more compact representation of the graph, a representation that follows from aggregating multiple pages onto a single node.

We experimented with a model of random Web browsing that decouples intra-host and inter-host steps. In our model, departing from a page consists of a random transition to another page of the same host, followed by a PageRank-like step from that page.

Whereas PageRank requires repeated scans of the Web graph's links, most of the computation required by our approach involves scanning the edges of the Web's host graph.

Future research should experiment with different aggregations of the Web's graph and with different stochastic behaviors within those aggregates. Each such flavor should be evaluated in terms of the speedup factor it achieves and of the search quality it induces when used in the ranking core of a search engine.

5. REFERENCES

[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. 7th International WWW Conference, pages 107–117, 1998.

[2] T. H. Haveliwala. Efficient computation of PageRank. Technical report, Stanford University, October 1999.

[3] T. H. Haveliwala. Topic-sensitive PageRank. In Proc. 11th International WWW Conference (WWW2002), 2002.

[4] G. Jeh and J. Widom. Scaling personalized web search. In Proc. 12th International WWW Conference (WWW2003), Budapest, Hungary, pages 271–279, 2003.

[5] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the block structure of the web for computing PageRank. Technical report, Stanford University, March 2003.

