Posted to user@nutch.apache.org by John Lafitte <jl...@brandextract.com> on 2014/03/25 20:31:51 UTC

Freegen and Solr score

I set up a script that uses freegen to manually index new/updated URLs.  I
thought it was working great, but now I'm realizing that Solr returns
a score of 0 for these new documents.  I thought the score was calculated
independently of what Nutch does, using only the content and other metadata,
but that doesn't seem to be the case.  Does anyone have a
clue what might be causing this?  The content and other metadata look
normal, and I reloaded the core to no avail.

Re: Freegen and Solr score

Posted by Sebastian Nagel <wa...@googlemail.com>.
AFAIK, it's not possible to do this via properties or from the command line.
It could be done in a custom scoring filter because FreeGenerator calls
injectedScore() for all active scoring filter plugins. We could also add
such functionality to FreeGenerator itself. Feel free to open an issue
for that.


2014-03-26 15:57 GMT+01:00 John Lafitte <jl...@brandextract.com>:

> Thanks Sebastian,
>
> That did work when I set both of those to false, but now the URL I'm
> inserting has an abnormally high score.  You mentioned two options; the
> first was to use FreeGenerator with an initial score, but I cannot find
> any documentation on how to do that.  The only parameters I see are
> normalize and filter, and they don't take values.  Can you point me in the
> right direction for that?

Re: Freegen and Solr score

Posted by John Lafitte <jl...@brandextract.com>.
Thanks Sebastian,

That did work when I set both of those to false, but now the URL I'm
inserting has an abnormally high score.  You mentioned two options; the
first was to use FreeGenerator with an initial score, but I cannot find
any documentation on how to do that.  The only parameters I see are
normalize and filter, and they don't take values.  Can you point me in the
right direction for that?


On Wed, Mar 26, 2014 at 6:59 AM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> There may be no relevant links if all documents are from one single host
> (or domain) and
>  (link.ignore.internal.host == true)
> resp.
>  (link.ignore.internal.domain == true)
> cf. explanations about that in the wiki.

Re: Freegen and Solr score

Posted by Sebastian Nagel <wa...@googlemail.com>.
There may be no relevant links if all documents are from a single host
(or domain) and
 link.ignore.internal.host == true
or, respectively,
 link.ignore.internal.domain == true
because links within the same host (or domain) are then ignored.
See the explanations about that in the wiki.
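
To illustrate why a single-host crawl can end up with nothing for LinkRank to process, here is a minimal Python sketch. It is a simplified model, not Nutch's WebGraph code: the function `filter_links`, its `ignore_internal_host` argument, and the example.com URLs are hypothetical names chosen for this illustration. It drops every link whose source and target share a host, which is the behaviour the property name describes.

```python
from urllib.parse import urlparse


def filter_links(links, ignore_internal_host=True):
    """Drop links whose source and target share a host.

    Hypothetical, simplified stand-in for the effect of
    link.ignore.internal.host; not Nutch's implementation.
    """
    kept = []
    for source, target in links:
        same_host = urlparse(source).hostname == urlparse(target).hostname
        if ignore_internal_host and same_host:
            continue  # internal link: ignored
        kept.append((source, target))
    return kept


# All documents come from one host, so every link is internal.
single_host_links = [
    ("http://example.com/", "http://example.com/about"),
    ("http://example.com/about", "http://example.com/contact"),
]

print(filter_links(single_host_links, ignore_internal_host=True))   # []
print(len(filter_links(single_host_links, ignore_internal_host=False)))  # 2
```

With the filter enabled, the link list is empty, which matches the "No links to process" situation reported in this thread.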


2014-03-26 4:09 GMT+01:00 John Lafitte <jl...@brandextract.com>:

> Thanks for that Sebastian.  So given the hint you've given me, I'm trying
> to generate the scoring using this example:
> https://wiki.apache.org/nutch/NewScoringIndexingExample
>
> But when it gets to the LinkRank part I get:
>
> 2014-03-26 02:57:14,208 INFO  webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
> 2014-03-26 02:57:14,913 INFO  webgraph.LinkRank - Starting link counter job
> 2014-03-26 02:57:17,927 INFO  webgraph.LinkRank - Finished link counter job
> 2014-03-26 02:57:17,928 INFO  webgraph.LinkRank - Reading numlinks temp file
> 2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis: java.io.IOException: No links to process, is the webgra$
>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)
>
> I can see the webgraph directory got created and there are directories and
> files in there, but I'm guessing something is not getting populated
> correctly.  Any clue what I may be doing wrong?

Re: Freegen and Solr score

Posted by John Lafitte <jl...@brandextract.com>.
Thanks for that Sebastian.  So given the hint you've given me, I'm trying
to generate the scoring using this example:
https://wiki.apache.org/nutch/NewScoringIndexingExample

But when it gets to the LinkRank part I get:

2014-03-26 02:57:14,208 INFO  webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
2014-03-26 02:57:14,913 INFO  webgraph.LinkRank - Starting link counter job
2014-03-26 02:57:17,927 INFO  webgraph.LinkRank - Finished link counter job
2014-03-26 02:57:17,928 INFO  webgraph.LinkRank - Reading numlinks temp file
2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis: java.io.IOException: No links to process, is the webgra$
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)

I can see the webgraph directory got created and there are directories and
files in there, but I'm guessing something is not getting populated
correctly.  Any clue what I may be doing wrong?


On Tue, Mar 25, 2014 at 4:15 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi John,
>
> FreeGenerator, unlike Injector, does not use db.score.injected (default = 1.0)
> but sets the initial score to 0.0. If all URLs stem from FreeGenerator, the total
> score in the link graph is also 0.0, and no linked document can get a score
> higher than 0.0.
> Possible solutions:
> - use FreeGenerator with an initial score > 0.0
>   (but don't add thousands of URLs with a score of 1.0:
>    if the total score is too high, some pages may get unreasonably
>    high scores)
> - use LinkRank (https://wiki.apache.org/nutch/NewScoring) to get the scores:
>   the default scoring, OPIC, has the advantage of calculating scores online
>   while following links. It gives good and plausible scores if the crawl is started
>   from a few authoritative seeds. But sometimes, esp. in continuous crawls,
>   OPIC scores run out of control.
>
> Sebastian

Re: Freegen and Solr score

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi John,

FreeGenerator, unlike Injector, does not use db.score.injected (default = 1.0)
but sets the initial score to 0.0. If all URLs stem from FreeGenerator, the total
score in the link graph is also 0.0, and no linked document can get a score
higher than 0.0.
Possible solutions:
- use FreeGenerator with an initial score > 0.0
  (but don't add thousands of URLs with a score of 1.0:
   if the total score is too high, some pages may get unreasonably
   high scores)
- use LinkRank (https://wiki.apache.org/nutch/NewScoring) to get the scores:
  the default scoring, OPIC, has the advantage of calculating scores online
  while following links. It gives good and plausible scores if the crawl is started
  from a few authoritative seeds. But sometimes, esp. in continuous crawls,
  OPIC scores run out of control.

Sebastian
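
The point about an all-zero link graph can be illustrated with a deliberately simplified, OPIC-like model. This is only a sketch under stated assumptions, not Nutch's scoring code: the function `propagate`, the three-page link graph, and the page names are hypothetical. Each page passes its current score in equal shares to the pages it links to, so if every seed starts at 0.0 there is nothing to pass on.

```python
def propagate(links, seed_scores):
    """Toy OPIC-like pass: each page splits its score evenly among
    its outlinks. Hypothetical model, not Nutch's implementation."""
    scores = dict(seed_scores)
    for source, targets in links.items():
        if not targets:
            continue  # no outlinks, nothing to distribute
        share = scores.get(source, 0.0) / len(targets)
        for target in targets:
            scores[target] = scores.get(target, 0.0) + share
    return scores


# Hypothetical link graph: seed -> a, b and a -> b
links = {"seed": ["a", "b"], "a": ["b"], "b": []}

# Seed starts at 0.0 (as described for FreeGenerator): everything stays 0.0
print(propagate(links, {"seed": 0.0}))  # {'seed': 0.0, 'a': 0.0, 'b': 0.0}

# Seed starts at 1.0 (the db.score.injected default): linked pages get a score
print(propagate(links, {"seed": 1.0}))  # {'seed': 1.0, 'a': 0.5, 'b': 1.0}
```

The same model also hints at the warning about too many high-scoring seeds: the more initial score is put into the graph, the more there is to accumulate on heavily linked pages.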

On 03/25/2014 08:31 PM, John Lafitte wrote:
> I set up a script that uses freegen to manually index new/updated URLs.  I
> thought it was working great, but now I'm realizing that Solr returns
> a score of 0 for these new documents.  I thought the score was calculated
> independently of what Nutch does, using only the content and other metadata,
> but that doesn't seem to be the case.  Does anyone have a
> clue what might be causing this?  The content and other metadata look
> normal, and I reloaded the core to no avail.
> 


Re: Freegen and Solr score

Posted by Gora Mohanty <go...@mimirtech.com>.
On Mar 26, 2014 1:02 AM, "John Lafitte" <jl...@brandextract.com> wrote:
>
> I set up a script that uses freegen to manually index new/updated URLs.  I
> thought it was working great, but now I'm realizing that Solr returns
> a score of 0 for these new documents.  I thought the score was calculated
> independently of what Nutch does, using only the content and other metadata,
> but that doesn't seem to be the case.  Does anyone have a
> clue what might be causing this?  The content and other metadata look
> normal, and I reloaded the core to no avail.

Please try adding debugQuery=on to your Solr search URL. This will return
an explanation of how the score is calculated. Also see
http://wiki.apache.org/solr/CommonQueryParameters for other debugging
options.

Regards,
Gora
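
As a small illustration of the debugQuery suggestion, the following Python sketch builds such a search URL. The host, port, core name (`collection1`), and search term are assumptions made for this example, and `build_debug_query_url` is a hypothetical helper; only the `debugQuery=on` parameter comes from the advice in this thread.

```python
from urllib.parse import urlencode


def build_debug_query_url(base_url, query):
    """Return a Solr select URL with score explanations enabled.

    base_url is assumed to point at a Solr core; adjust it to your setup.
    """
    params = {
        "q": query,
        "debugQuery": "on",  # ask Solr to explain how each score is computed
        "wt": "json",        # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core and search term
url = build_debug_query_url("http://localhost:8983/solr/collection1", "nutch")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the normal results plus a debug section containing the score explanation for each document.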