Posted to user@nutch.apache.org by Eugen Kochuev <eu...@lan23.net> on 2006/05/16 16:42:33 UTC

changing ranking

Hi guys,

  I have a catalogue of sites in which domains are ranked by human
  experts. Is it possible to tweak the scores of pages belonging to the
  domains listed in the catalogue according to their catalogue rank?

  In other words, I'd like to be able to change the scores of some URLs.

-- 
Best regards,
 Eugen                          mailto:eugen@lan23.net


Re: changing ranking

Posted by Ken Krugler <kk...@transpac.com>.
>If someone adopts the plugin, it seems to require a new crawl.
>Is there a way to apply these scoring mechanisms to pages that
>have already been fetched, indexed, and merged?
>Can you please shed some light?

I think it would be possible to write a map-reduce job that simulated 
the crawl of all current pages, to the extent necessary to get 
reasonable history/page cash values for OPIC. But that's just a guess 
until the actual implementation is at least sketched out.
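
To make the history/cash idea concrete, here is a minimal in-memory sketch of the OPIC bookkeeping as the paper describes it. It is not Nutch code and not a map-reduce job; the class name OpicSketch, its fields, and the handling of pages without outlinks are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

/** In-memory illustration of OPIC cash/history bookkeeping; hypothetical, not Nutch code. */
final class OpicSketch {
    final double[] cash;      // cash currently held by each page
    final double[] history;   // cash each page has held and passed on so far
    final List<List<Integer>> outlinks;

    OpicSketch(List<List<Integer>> outlinks) {
        this.outlinks = outlinks;
        int n = outlinks.size();
        cash = new double[n];
        history = new double[n];
        Arrays.fill(cash, 1.0 / n); // every page starts with an equal share
    }

    /** Simulates "crawling" page i: record its cash, then hand it to its outlinks. */
    void visit(int i) {
        double c = cash[i];
        history[i] += c;
        cash[i] = 0.0;
        List<Integer> targets = outlinks.get(i);
        if (targets.isEmpty()) {
            // Assumption: a page without outlinks spreads its cash over all pages.
            for (int j = 0; j < cash.length; j++) {
                cash[j] += c / cash.length;
            }
        } else {
            for (int j : targets) {
                cash[j] += c / targets.size();
            }
        }
    }

    /** Importance estimate: (history + cash) / (total history + 1). */
    double importance(int i) {
        double total = 0.0;
        for (double h : history) {
            total += h;
        }
        return (history[i] + cash[i]) / (total + 1.0);
    }
}
```

Replaying visit() over the already-fetched pages, in any order that reaches every page repeatedly, is the kind of simulated crawl meant above; the total cash stays at 1 throughout, so the importance estimates always sum to 1.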

-- Ken

>Andrzej Bialecki <ab...@getopt.org> wrote: Ken Krugler wrote:
>>>  Eugen Kochuev wrote:
>>>>  Hello Andrzej,
>>>>
>>>>>  Please see the scoring API - you can write a plugin that manipulates
>>>>>  page scores according to your own idea.
>>>>
>>>>  Thanks a lot for your answer, but could you please shed some more
>>>>  light onto scoring technique used in the Nutch?
>>>>  As I can see from the source code Nutch uses something similar to the
>>>>  pagerank algorithm propagating page scores through outlinks, but
>>>>  only one
>>>>  iteration is used (while pagerank requires several iterations to
>>>>  converge).
>>>
>>>  That's a bit complicated subject - I could either explain this in
>>>  very general terms, or suggest that you read the paper that underlies
>>>  the current Nutch implementation (with a twist). Please see the
>>>  comment in OPICScoringFilter.java for the link to the paper.
>>
>>  I've started writing up a description of the changes that I think need
>>  to be made to Nutch to really implement the OPIC algorithm, as
>>  described by the "Adaptive On-Line Page Importance Computation"
>>  paper (ACM 1-58113-680-3/03/0005).
>>
>>  Should I just open a JIRA issue, and dump what might be a pretty long
>>  write-up into it?
>
>Yes, please do - I'd love to implement this in that original form, even
>if it would go into another plugin ...
>
>--
>Best regards,
>Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
>[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>___|||__||  \|  ||  |  Embedded Unix, System Integration
>http://www.sigram.com  Contact: info at sigram dot com
>


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: changing ranking

Posted by sudhendra seshachala <su...@yahoo.com>.

If someone adopts the plugin, it seems to require a new crawl.
Is there a way to apply these scoring mechanisms to pages that
have already been fetched, indexed, and merged?
Can you please shed some light?

Thanks


Andrzej Bialecki <ab...@getopt.org> wrote: Ken Krugler wrote:
>> Eugen Kochuev wrote:
>>> Hello Andrzej,
>>>
>>>> Please see the scoring API - you can write a plugin that manipulates
>>>> page scores according to your own idea.
>>>
>>> Thanks a lot for your answer, but could you please shed some more
>>> light onto scoring technique used in the Nutch?
>>> As I can see from the source code Nutch uses something similar to the
>>> pagerank algorithm propagating page scores through outlinks, but 
>>> only one
>>> iteration is used (while pagerank requires several iterations to
>>> converge).
>>
>> That's a bit complicated subject - I could either explain this in 
>> very general terms, or suggest that you read the paper that underlies 
>> the current Nutch implementation (with a twist). Please see the 
>> comment in OPICScoringFilter.java for the link to the paper.
>
> I've started writing up a description of the changes that I think need 
> to be made to Nutch to really implement the OPIC algorithm, as 
> described by the "Adaptive On-Line Page Importance Computation"
> paper (ACM 1-58113-680-3/03/0005).
>
> Should I just open a JIRA issue, and dump what might be a pretty long 
> write-up into it?

Yes, please do - I'd love to implement this in that original form, even 
if it would go into another plugin ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: changing ranking

Posted by Andrzej Bialecki <ab...@getopt.org>.
Ken Krugler wrote:
>> Eugen Kochuev wrote:
>>> Hello Andrzej,
>>>
>>>> Please see the scoring API - you can write a plugin that manipulates
>>>> page scores according to your own idea.
>>>
>>> Thanks a lot for your answer, but could you please shed some more
>>> light onto scoring technique used in the Nutch?
>>> As I can see from the source code Nutch uses something similar to the
>>> pagerank algorithm propagating page scores through outlinks, but 
>>> only one
>>> iteration is used (while pagerank requires several iterations to
>>> converge).
>>
>> That's a bit complicated subject - I could either explain this in 
>> very general terms, or suggest that you read the paper that underlies 
>> the current Nutch implementation (with a twist). Please see the 
>> comment in OPICScoringFilter.java for the link to the paper.
>
> I've started writing up a description of the changes that I think need 
> to be made to Nutch to really implement the OPIC algorithm, as 
> described by the "Adaptive On-Line Page Importance Computation"
> paper (ACM 1-58113-680-3/03/0005).
>
> Should I just open a JIRA issue, and dump what might be a pretty long 
> write-up into it?

Yes, please do - I'd love to implement this in that original form, even 
if it would go into another plugin ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: changing ranking

Posted by Andrzej Bialecki <ab...@getopt.org>.
Insurance Squared Inc. wrote:
> Could I trouble anyone to post a link to the scoring API 
> documentation, as well as the "paper that underlies the current Nutch 
> implementation"?  I've dipped into the docs in a few places and 
> haven't bumped into either of these documents.

There's a link to the paper in the Javadoc for OPICScoringFilter.java.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: changing ranking

Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
Could I trouble anyone to post a link to the scoring API documentation, 
as well as the "paper that underlies the current Nutch implementation"?  
I've dipped into the docs in a few places and haven't bumped into either 
of these documents.

Thanks,
g.


Ken Krugler wrote:

>> Eugen Kochuev wrote:
>>
>>> Hello Andrzej,
>>>
>>>> Please see the scoring API - you can write a plugin that manipulates
>>>> page scores according to your own idea.
>>>
>>>
>>> Thanks a lot for your answer, but could you please shed some more
>>> light onto scoring technique used in the Nutch?
>>> As I can see from the source code Nutch uses something similar to the
>>> pagerank algorithm propagating page scores through outlinks, but 
>>> only one
>>> iteration is used (while pagerank requires several iterations to
>>> converge).
>>
>>
>> That's a bit complicated subject - I could either explain this in 
>> very general terms, or suggest that you read the paper that underlies 
>> the current Nutch implementation (with a twist). Please see the 
>> comment in OPICScoringFilter.java for the link to the paper.
>
>
> I've started writing up a description of the changes that I think need 
> to be made to Nutch to really implement the OPIC algorithm, as 
> described by the "Adaptive On-Line Page Importance Computation"
> paper (ACM 1-58113-680-3/03/0005).
>
> Should I just open a JIRA issue, and dump what might be a pretty long 
> write-up into it?
>
> Thanks,
>
> -- Ken


Re: changing ranking

Posted by Ken Krugler <kk...@transpac.com>.
>Eugen Kochuev wrote:
>>Hello Andrzej,
>>
>>>Please see the scoring API - you can write a plugin that manipulates
>>>page scores according to your own idea.
>>
>>Thanks a lot for your answer, but could you please shed some more
>>light onto scoring technique used in the Nutch?
>>As I can see from the source code Nutch uses something similar to the
>>pagerank algorithm propagating page scores through outlinks, but only one
>>iteration is used (while pagerank requires several iterations to
>>converge).
>
>That's a bit complicated subject - I could either explain this in 
>very general terms, or suggest that you read the paper that 
>underlies the current Nutch implementation (with a twist). Please 
>see the comment in OPICScoringFilter.java for the link to the paper.

I've started writing up a description of the changes that I think 
need to be made to Nutch to really implement the OPIC algorithm, as 
described by the "Adaptive On-Line Page Importance Computation"
paper (ACM 1-58113-680-3/03/0005).

Should I just open a JIRA issue, and dump what might be a pretty long 
write-up into it?

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: changing ranking

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eugen Kochuev wrote:
> Hello Andrzej,
>
>   
>> Please see the scoring API - you can write a plugin that manipulates
>> page scores according to your own idea.
>>     
>
> Thanks a lot for your answer, but could you please shed some more
> light on the scoring technique used in Nutch?
> As far as I can see from the source code, Nutch uses something similar
> to the PageRank algorithm, propagating page scores through outlinks,
> but only one iteration is used (while PageRank requires several
> iterations to converge).
>   

That's a rather complicated subject - I could either explain this in very
general terms, or suggest that you read the paper that underlies the
current Nutch implementation (with a twist). Please see the comment in
OPICScoringFilter.java for the link to the paper.

> Another question is about the db.score.injected and
> db.score.link.internal parameters. They are listed in
> nutch-default.xml, but are never referenced in the code.
>   

db.score.injected is used in the above-mentioned OPIC scoring plugin, 
and in CrawlDbReducer. db.score.link.internal might be used in these 
places, but isn't - please file a bug report, this needs to be fixed (if 
we really want it to be fixed, i.e. if we really want to distinguish 
between internal/external links when calculating score contributions and 
setting initial scores).
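
For illustration, distinguishing internal from external links when computing a score contribution could look roughly like the sketch below. This is not how Nutch does it today (as noted, the parameter is currently unused); the class LinkScoreSketch, its method names, and the internalFactor/externalFactor arguments are hypothetical, with internalFactor standing in for whatever value db.score.link.internal would carry.

```java
import java.net.URI;

/** Hypothetical sketch of weighting a score contribution by link type; not Nutch code. */
final class LinkScoreSketch {

    /** True when both URLs have the same host (compared case-insensitively). */
    static boolean isInternal(String fromUrl, String toUrl) {
        String from = URI.create(fromUrl).getHost();
        String to = URI.create(toUrl).getHost();
        return from != null && from.equalsIgnoreCase(to);
    }

    /**
     * Splits the page score evenly over its outlinks, then scales the share
     * by the factor that matches the link type.
     */
    static float contribution(float pageScore, int outlinkCount,
                              String fromUrl, String toUrl,
                              float internalFactor, float externalFactor) {
        if (outlinkCount <= 0) {
            throw new IllegalArgumentException("outlinkCount must be positive");
        }
        float share = pageScore / outlinkCount;
        return share * (isInternal(fromUrl, toUrl) ? internalFactor : externalFactor);
    }
}
```

With an internal factor below 1, links within the same host pass on less score than links to other hosts, which is one way to dampen self-promotion inside a site.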

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re[2]: changing ranking

Posted by Eugen Kochuev <eu...@lan23.net>.
Hello Andrzej,

> Please see the scoring API - you can write a plugin that manipulates
> page scores according to your own idea.

Thanks a lot for your answer, but could you please shed some more
light on the scoring technique used in Nutch?
As far as I can see from the source code, Nutch uses something similar
to the PageRank algorithm, propagating page scores through outlinks,
but only one iteration is used (while PageRank requires several
iterations to converge).

Another question is about the db.score.injected and
db.score.link.internal parameters. They are listed in
nutch-default.xml, but are never referenced in the code.





-- 
Best regards,
 Eugen                            mailto:eugen@lan23.net


Re: changing ranking

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eugen Kochuev wrote:
> Hi guys,
>
>   I have a catalogue of sites in which domains are ranked by human
>   experts. Is it possible to tweak the scores of pages belonging to the
>   domains listed in the catalogue according to their catalogue rank?
>
>   In other words, I'd like to be able to change the scores of some URLs.
>
>   


Please see the scoring API - you can write a plugin that manipulates 
page scores according to your own idea.
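
As a rough idea of the logic such a plugin might wrap, the sketch below multiplies a page's score by a boost looked up from a catalogue keyed by host. It deliberately does not use the Nutch scoring API; the class CatalogueBoostSketch, the adjust method, and the choice to match on the exact host rather than the registered domain are all assumptions for illustration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

/** Hypothetical catalogue-based score adjustment; not part of the Nutch scoring API. */
final class CatalogueBoostSketch {
    // Lower-case host name -> multiplier derived from the expert catalogue rank.
    private final Map<String, Float> boostByHost;

    CatalogueBoostSketch(Map<String, Float> boostByHost) {
        this.boostByHost = boostByHost;
    }

    /** Returns the score scaled by the catalogue boost, or unchanged if the host is not listed. */
    float adjust(String url, float score) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return score; // no host to look up, leave the score alone
        }
        Float boost = boostByHost.get(host.toLowerCase(Locale.ROOT));
        return boost == null ? score : score * boost;
    }
}
```

A real plugin would call something like adjust() from whichever scoring hook fits, and would need to decide how catalogue ranks map to multipliers.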

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com