Posted to user@nutch.apache.org by Tobias Marx <tm...@uni-wuppertal.de> on 2014/02/21 16:24:46 UTC

PageRank or Opic?

Hi!

We're using Nutch 1.7 and Solr 3.6 to index about 80k pages spread over several hundred hosts.

This works quite well, but there is still room for improvement in search result ranking and "relevancy".

When using Nutch and Solr there are basically two values that influence the score of a query result (correct me if I'm wrong): the score from Nutch, which becomes the "boost" value in Solr, and the boost applied by Solr itself, which is e.g. calculated at query time.
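
To illustrate the query-time side, here is a minimal sketch of building request parameters for Solr's edismax parser with a multiplicative boost. The field name "linkrank", the "qf" weights, and the boost function are assumptions made for illustration, not values taken from this thread; the score field would have to exist in the schema and be usable in function queries.

```python
from urllib.parse import urlencode


def build_boosted_query(user_query, score_field="linkrank"):
    """Return a URL-encoded parameter string for a boosted edismax query.

    score_field is a hypothetical numeric field holding a link-based score.
    """
    params = {
        "defType": "edismax",
        # The user's search terms.
        "q": user_query,
        # Assumed field weights; adjust to the fields in your schema.
        "qf": "title^2 content",
        # Multiply the text relevance score by a dampened function of the
        # stored score. Adding 10 before log (base 10) keeps the factor at
        # 1 or above for non-negative scores.
        "boost": "log(sum(%s,10))" % score_field,
    }
    return urlencode(params)


if __name__ == "__main__":
    print(build_boosted_query("apache nutch"))
```

The resulting string can be appended to a select URL. Damping the raw score with a logarithm is one common choice, so that a few very high scores do not swamp text relevance.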

The score in Nutch is calculated either by the "scoring-opic" plugin or with the "webgraph" toolchain described here: http://wiki.apache.org/nutch/NewScoringIndexingExample which gives the PageRank/LinkRank score (by the way, what about the "scoring-link" plugin? Does it do anything at all? What is its role in this?).

We've been playing around with PageRank lately and its scores look a little better than with OPIC, but on the downside, the calculation takes a very long time and is very CPU intensive.
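
For readers weighing that cost, here is a minimal sketch of the kind of iterative computation behind PageRank-style scores. It is a textbook power iteration, not Nutch's LinkRank implementation; the damping factor, iteration count, and in-memory graph representation are assumptions for illustration. Every iteration visits every link in the graph, which is why dense graphs are expensive.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Compute PageRank-style scores by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping page -> score; scores sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the "random jump" share.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly over its outlinks.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Dangling page: spread its score over all pages.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for page, score in sorted(pagerank(graph, iterations=50).items()):
        print(page, round(score, 4))
```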

Well, to cut a long story short, what is your opinion on this? Which ranking do you use? Is PageRank worth the trouble? How do you boost Solr queries (if you use Solr at all)?

BR,
-- 
Tobias Marx

Zentrum für Informations- und Medienverarbeitung - ZIM

Bergische Universität Wuppertal

Büro: T.11.08
+49 202 439 2237
tmarx@uni-wuppertal.de


Re: PageRank or Opic?

Posted by Mateusz Zakarczemny <ma...@up2data.pl>.
Note that in the Nutch 2 branch only OPIC is implemented. If you want to move
to it in the future, that might be a problem.

On Fri, 21 Feb 2014, 16:52:35 Markus Jelsma wrote:
> Hi - you can safely forget about OPIC; it is useless in continuous crawls. LinkRank, however, only works well on very large crawls with many hosts. It can work for single hosts (do not ignore internal links), but the graph will become very dense; that's where the IO and CPU time comes from. We don't use the LinkRank score in Solr at all because results are already very relevant due to other (less costly) measures.
>
> You can do it, but you will need some serious hardware. Also, there is the problem of scores that change frequently while you are not frequently updating all documents in Solr; using ExternalFileField may help.

RE: PageRank or Opic?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - you can safely forget about OPIC; it is useless in continuous crawls. LinkRank, however, only works well on very large crawls with many hosts. It can work for single hosts (do not ignore internal links), but the graph will become very dense; that's where the IO and CPU time comes from. We don't use the LinkRank score in Solr at all because results are already very relevant due to other (less costly) measures.

You can do it, but you will need some serious hardware. Also, there is the problem of scores that change frequently while you are not frequently updating all documents in Solr; using ExternalFileField may help.
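
As a rough illustration of the ExternalFileField idea: Solr reads the field's values from a plain text file of key=value lines named external_<fieldname> in the index data directory, so scores can be refreshed without reindexing documents. The sketch below writes such a file; the field name "linkrank", the use of URLs as keys, and the target directory are assumptions for illustration, and the matching field definition in schema.xml is not shown.

```python
import os
import tempfile


def write_external_scores(scores, directory, field_name="linkrank"):
    """Write scores as key=value lines to external_<field_name>.

    scores maps a document key (e.g. its unique id) to a numeric score.
    field_name is a hypothetical ExternalFileField name.
    Returns the path of the written file.
    """
    path = os.path.join(directory, "external_" + field_name)

    # Write to a temporary file first, then rename, so a reader never
    # sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        # Sorted keys keep the file stable between runs.
        for key in sorted(scores):
            out.write("%s=%s\n" % (key, float(scores[key])))
    os.replace(tmp_path, path)
    return path


if __name__ == "__main__":
    target = tempfile.mkdtemp()
    example = {"http://example.org/a": 0.75, "http://example.org/b": 0.25}
    print(write_external_scores(example, target))
```

Solr picks up a new file when a new searcher is opened, so a score refresh would typically be followed by a commit or a core reload.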