Posted to user@nutch.apache.org by Otis Gospodnetić <ot...@gmail.com> on 2016/01/15 22:04:55 UTC

Handling large scale incremental PageRank updates

Hello,

We are working on a very large scale crawl (many billions of web pages)
that needs to make use of link/page rank.  Because page rank for a page P
changes as more links to page P are discovered, one really ought to
periodically update the rank of the previously indexed page P.

This is not a problem for small crawls, but it is for large ones if one
tries to just reindex previously existing pages - reindexing is not cheap,
and if you've indexed hundreds of millions or billions of pages, reindexing
them will take a long time and require a lot of resources.

How do people normally handle that with Solr or Elasticsearch at large
scale?

With Solr, do people stick the rank in the External File Field, for example?
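
For concreteness, this is the kind of setup I mean - a minimal sketch
(the core name, paths, and field names are made up; the mechanics are
Solr's: ranks live in a plain external_<fieldname> file of key=value
lines next to the index, get reloaded on commit when
ExternalFileFieldReloader is configured as a newSearcher listener, and
are usable only from function queries, so refreshing them never
reindexes a document):

  # Sketch only: the "crawl" core, data dir, and field names are
  # hypothetical.  Schema side, for reference:
  #   <fieldType name="rankType" class="solr.ExternalFileField"
  #              keyField="id" defVal="0"/>
  #   <field name="rank" type="rankType"/>
  import requests

  DATA_DIR = "/var/solr/data/crawl/data"   # hypothetical index data dir

  def write_ranks(ranks):
      # Solr reads a file named external_<fieldname> from the data dir.
      with open(DATA_DIR + "/external_rank", "w") as f:
          for page_id, rank in ranks.items():
              f.write("%s=%s\n" % (page_id, rank))

  def query_with_rank(q):
      # An empty commit fires the newSearcher event, which (with the
      # reloader configured) picks up the fresh rank file.
      requests.get("http://localhost:8983/solr/crawl/update",
                   params={"commit": "true"})
      # External file fields only work in function queries, e.g.
      # boosting the text score by the current rank:
      return requests.get("http://localhost:8983/solr/crawl/select",
                          params={"q": "{!boost b=field(rank)}" + q,
                                  "fl": "id,score"}).json()

  write_ranks({"http://example.com/": 0.8, "http://example.org/": 0.3})
  print(query_with_rank("nutch"))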

With Elasticsearch, do people store pageID => pageRank info in an external
store (e.g. Redis) and pull it from there to use when scoring search
results?  Or maybe that, too, would be too slow when the number of matches
is high?  Elasticsearch rescore to the rescue?
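
Something like this rescore sketch is what I have in mind - the index,
type, and field names are made up, and it assumes the rank is at least
stored as a numeric field.  Rescore bounds the cost because only the top
window_size hits per shard pay for the rank-based scoring:

  # Sketch only: cheap first-pass match, then rank applied to the top
  # window_size hits per shard.  Names are hypothetical.
  import requests

  body = {
      "query": {"match": {"content": "nutch"}},   # cheap first pass
      "rescore": {
          "window_size": 200,                     # only top 200/shard pay
          "query": {
              "rescore_query": {
                  "function_score": {
                      "field_value_factor": {"field": "rank",
                                             "missing": 0}
                  }
              },
              "query_weight": 1.0,
              "rescore_query_weight": 2.0
          }
      }
  }
  hits = requests.get("http://localhost:9200/pages/_search",
                      json=body).json()["hits"]["hits"]
  print(hits[:3])

  # Keeping "rank" fresh via the partial-update API touches one field
  # from the client's point of view (though Elasticsearch still rewrites
  # the document internally, so it's cheaper than reindexing content,
  # not free):
  requests.post("http://localhost:9200/pages/page/42/_update",
                json={"doc": {"rank": 0.37}})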

Or are there better, more scalable ways to handle this?

Thanks,
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

Re: Handling large scale incremental PageRank updates

Posted by Dennis Kubes <ku...@apache.org>.
When we were doing billion page crawls awhile back, in 2006-2008 we had 
the following setup.

 1. Have a given number of shards to handle the full index; at that time
    this was 25 million pages per shard for 40 shards, for a total of 1
    billion pages.
 2. Crawl the pages for 1 shard.  Update the WebGraph and LinkRank as
    described here: https://wiki.apache.org/nutch/NewScoring (see the
    sketch after this list).  Don't use Loops.  It was a bad program with
    a bad algorithm and I never should have put it in.  Live and learn.
 3. Do the same for shards 2..n, updating each time.  Each crawl should
    get the highest ranked pages that haven't already been crawled
    within the recrawl interval.
 4. Once you reach the maximum amount you can crawl, reset the crawl
    intervals for all documents and start over with shard 1, replacing
    the original shard index with the new one.
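
Roughly, the update part of steps 2-3 looked like this.  Sketch only:
the paths and shard layout are made up, but the webgraph, linkrank, and
scoreupdater jobs are the ones documented on that wiki page:

  # Per-shard update loop driven through the Nutch 1.x CLI.
  import subprocess

  NUTCH = "bin/nutch"                       # Nutch 1.x launcher script
  CRAWLDB = "crawl/crawldb"                 # hypothetical paths
  WEBGRAPHDB = "crawl/webgraphdb"

  def nutch(*args):
      subprocess.run([NUTCH] + list(args), check=True)

  def update_shard(segment):
      # Fold the shard's newly discovered links into the global WebGraph.
      nutch("webgraph", "-segment", segment, "-webgraphdb", WEBGRAPHDB)
      # Recompute LinkRank over the whole graph, not just this shard.
      nutch("linkrank", "-webgraphdb", WEBGRAPHDB)
      # Push the new scores back into the crawldb so the next generate
      # pass picks the highest-ranked pages not yet crawled.
      nutch("scoreupdater", "-crawldb", CRAWLDB, "-webgraphdb", WEBGRAPHDB)

  for shard in range(40):                   # steps 2-3: same loop per shard
      update_shard("crawl/segments/shard%02d" % shard)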

With this type of setup you will have possible duplicates, and it is 
batch, so you don't get the fast updates you might be looking for.  It 
should, however, give you an increasingly better index as crawls continue 
and more links are added to the WebGraph.

Ways to improve this might be:

 1. Change the algorithm for when pages get recrawled based on how often
    they change.  This would require determining change rates (see the
    sketch after this list).
 2. Move fast-changing pages to a separate index and only reindex those
    after each shard run.  This fast index then just becomes another shard.
 3. Put many realtime or NRT search servers behind a partitioning
    algorithm.  Do a shard crawl, update the WebGraph, and then reindex
    the top X pages whose links changed most.
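
For item 1, Nutch's AdaptiveFetchSchedule does roughly this already.  As
a standalone sketch of the rule (the rates and bounds here are
illustrative, not Nutch's defaults): shrink the interval when a page
changed since the last fetch, grow it when it didn't:

  MIN_DAYS, MAX_DAYS = 1.0, 90.0            # illustrative bounds

  def next_interval(days, changed, inc_rate=0.4, dec_rate=0.2):
      if changed:
          days *= 1.0 - dec_rate            # volatile page: recrawl sooner
      else:
          days *= 1.0 + inc_rate            # stable page: back off
      return max(MIN_DAYS, min(MAX_DAYS, days))

  interval = 30.0
  for changed in (True, True, False, False, False):
      interval = next_interval(interval, changed)
      print("changed=%s -> next fetch in %.1f days" % (changed, interval))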

This was all done using the Nutch SearchServer back when there was one.  
Not sure how that setup would translate to a Solr or Elasticsearch 
setup.  Hope this helps.

Dennis

