Posted to user@nutch.apache.org by Yongyao Jiang <j....@gmail.com> on 2017/04/17 22:31:05 UTC

Why "generate.min.score" does not work?

Hi,

I am using the scoring-similarity plugin. After setting generate.min.score
to 0.05 and indexing all the pages (with their scores) into Elasticsearch, I
still observe many web pages whose scores are below 0.05.

<property>
  <name>generate.min.score</name>
  <value>0.05</value>
  <description>Select only entries with a score larger than
  generate.min.score.</description>
</property>

Below is the result of a simple aggregation of "score" in ES:
{
   "key": "20170417215917",
   "doc_count": 200,
   "Stats": {
      "count": 200,
      "min": 0,
      "max": 0.019184709,
      "avg": 0.0012828724450000002,
      "sum": 0.256574489
   }
}

Thanks,
Yongyao

Re: Why "generate.min.score" does not work?

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Yongyao,

I haven't tried to combine both scoring filter plugins, and I don't know whether they work well
together. The ScoringFilter interface is designed so that all methods have access to the score
previously calculated by the filters in the chain. When in doubt, check the implementations and
try it; we would be glad to hear how to make the scoring filters cooperate. A focused crawler which
is based on both content and link structure may be even better, but I expect that tuning everything
will be subtle.

> of indexing (indexerscore())?

Here I see no problem. You have to look especially at the methods
  passScoreAfterParsing(...)
  distributeScoreToOutlinks(...)
It looks like the plugin called second (cf. scoring.filter.order) overwrites any values
set before.
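
As a rough illustration of the overwriting problem, here is a minimal sketch of a filter chain in which each filter receives the score left by the previous one. The class FilterChainSketch and the three example filters are invented for this sketch and are not the Nutch ScoringFilter API; the numbers are arbitrary.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified model of a filter chain; not the Nutch ScoringFilter API.
class FilterChainSketch {

  /** Feeds the score through the filters in order, each seeing the previous result. */
  static float runChain(List<UnaryOperator<Float>> filters, float initial) {
    float score = initial;
    for (UnaryOperator<Float> filter : filters) {
      score = filter.apply(score);
    }
    return score;
  }

  public static void main(String[] args) {
    UnaryOperator<Float> linkBased = s -> s + 0.5f;     // builds on the incoming score
    UnaryOperator<Float> overwriting = s -> 0.25f;      // ignores the incoming score
    UnaryOperator<Float> cooperating = s -> s * 0.25f;  // combines with the incoming score

    // With the overwriting filter last, the link-based contribution is lost.
    System.out.println(runChain(List.of(linkBased, overwriting), 1.0f));
    // With the cooperating filter last, both contributions survive.
    System.out.println(runChain(List.of(linkBased, cooperating), 1.0f));
  }
}
```

Whether multiplying is the right way to combine two scores is a separate question; the sketch only shows that a filter which ignores its input discards whatever ran before it.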

Best,
Sebastian


On 04/25/2017 09:41 PM, Yongyao Jiang wrote:
> Thanks, Sebastian. That makes sense. Just a follow-up question, if I want
> to combine the OPIC score and the similarity score, how shall I do it?
> Maybe I am wrong, I don't think just putting
> scoring-opic|scoring-similarity can do this trick as there is a chance they
> will be mixed up, or one gets overwritten by the other at various scoring
> steps. Do I have to create two attributes for them and combine them at end
> of indexing (indexerscore())?
> 
> Yongyao
> 

Re: Why "generate.min.score" does not work?

Posted by Yongyao Jiang <j....@gmail.com>.

Thanks, Sebastian. That makes sense. Just a follow-up question: if I want
to combine the OPIC score and the similarity score, how should I do it?
Maybe I am wrong, but I don't think just putting
scoring-opic|scoring-similarity does the trick, as there is a chance they
will be mixed up, or one gets overwritten by the other at various scoring
steps. Do I have to create two attributes for them and combine them at the end
of indexing (indexerScore())?

Yongyao

On Sat, Apr 22, 2017 at 1:15 PM, Sebastian Nagel <wastl.nagel@googlemail.com
> wrote:

> Hi Yongyao,
>
> yes, that sounds reasonable. A simple
>   return datum.getScore() * initSort;
> would do the job. That should be enough, as the similarity score is
> calculated after parsing and distributed to the outlinks. However,
> also
>   updateDbScore(...)
> needs to be implemented accordingly. Otherwise the scores from outlinks
> are never aggregated in the CrawlDb, and the similarity score is used only
> for newly found links. The question is whether scoring-similarity wasn't designed
> to be used in combination with another scoring plugin (e.g., scoring-opic)
> which really implements these methods.
>
> Please, open an issue on Jira to discuss any questions and for
> documentation
> and release report, a PR is also welcome!
>
> Thanks,
> Sebastian
>


-- 
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University

Re: Why "generate.min.score" does not work?

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Yongyao,

yes, that sounds reasonable. A simple
  return datum.getScore() * initSort;
would do the job. That should be enough, as the similarity score is
calculated after parsing and distributed to the outlinks. However,
  updateDbScore(...)
also needs to be implemented accordingly. Otherwise the scores from outlinks
are never aggregated in the CrawlDb, and the similarity score is used only
for newly found links. The question is whether scoring-similarity was in fact
designed to be used in combination with another scoring plugin (e.g., scoring-opic)
which really implements these methods.
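
A minimal sketch of the two pieces mentioned above, using stand-in types instead of the real Nutch classes. DatumSketch and SimilarityFixSketch are hypothetical names. The aggregation in updateDbScore (adding the inlink contributions to the previous score) is only one possible choice, picked for illustration, and is not taken from this thread.

```java
import java.util.List;

// Hypothetical stand-in for a CrawlDb entry; only the score matters here.
class DatumSketch {
  private float score;

  DatumSketch(float score) {
    this.score = score;
  }

  float getScore() {
    return score;
  }

  void setScore(float score) {
    this.score = score;
  }
}

class SimilarityFixSketch {

  // The one-line suggestion: scale the incoming sort value by the stored score.
  static float generatorSortValue(DatumSketch datum, float initSort) {
    return datum.getScore() * initSort;
  }

  // Assumed aggregation: add the scores contributed by inlinks to the
  // previous score (or to the new entry's own score if there is no old one).
  static void updateDbScore(DatumSketch old, DatumSketch datum, List<DatumSketch> inlinked) {
    float adjust = 0.0f;
    for (DatumSketch linked : inlinked) {
      adjust += linked.getScore();
    }
    float base = (old != null) ? old.getScore() : datum.getScore();
    datum.setScore(base + adjust);
  }

  public static void main(String[] args) {
    DatumSketch datum = new DatumSketch(0.25f);
    System.out.println("sort value: " + generatorSortValue(datum, 1.0f));

    updateDbScore(new DatumSketch(0.25f), datum,
        List.of(new DatumSketch(0.5f), new DatumSketch(0.25f)));
    System.out.println("aggregated score: " + datum.getScore());
  }
}
```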

Please open an issue on Jira to discuss any questions, and for documentation
and the release report. A PR is also welcome!

Thanks,
Sebastian

On 04/18/2017 09:05 PM, Yongyao Jiang wrote:
> Hi Sebastian,
> 
> Yes, I understand. But when people use the similarity-scoring plugin, they
> intend to do domain-specific crawling in most cases. It also means that
> they want to control how the crawler works by adjusting the
> generate.min.score.
> 
> I just figured out the reason that adjusting the min value does not change
> the results is that the "sort" variable in the code below always equals 1.0
> when using the scoring-similarity plugin, because this plugin doesn't
> implement the "generatorSortValue()" function.
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L211
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java#L40
> 
> I think this is supposed to be a bug. Please correct me if I am wrong. I
> can also submit a PR if needed.
> 
> Thanks,
> Yongyao
> 

Re: Why "generate.min.score" does not work?

Posted by Yongyao Jiang <j....@gmail.com>.

Hi Sebastian,

Yes, I understand. But when people use the scoring-similarity plugin, they
intend to do domain-specific crawling in most cases. It also means that
they want to control how the crawler works by adjusting
generate.min.score.

I just figured out why adjusting the min value does not change
the results: the "sort" variable in the code below always equals 1.0
when using the scoring-similarity plugin, because this plugin doesn't
implement the "generatorSortValue()" function.
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L211
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java#L40
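
The effect can be sketched roughly as follows. This is not the code behind the links above: SortValueSketch and GeneratorThresholdSketch are simplified stand-ins, and treating a sort value equal to the threshold as "kept" is an assumption.

```java
// Stand-in types for illustration only; these are not the real Nutch classes.
interface SortValueSketch {
  float generatorSortValue(float crawlDbScore, float initSort);
}

class GeneratorThresholdSketch {

  // Behaves like a filter that does not override the method:
  // the incoming sort value is passed through and the score is ignored.
  static final SortValueSketch PASS_THROUGH = (crawlDbScore, initSort) -> initSort;

  /**
   * True if an entry would be kept by a minimum-score check that is applied
   * to the sort value. Keeping entries equal to the threshold is an assumption.
   */
  static boolean isSelected(SortValueSketch filter, float crawlDbScore, float minScore) {
    float sort = filter.generatorSortValue(crawlDbScore, 1.0f);
    return sort >= minScore;
  }

  public static void main(String[] args) {
    float minScore = 0.05f;
    // Every entry is selected, whatever its score, because sort stays at 1.0.
    for (float score : new float[] {0.0f, 0.019f, 0.5f}) {
      System.out.println(score + " selected: " + isSelected(PASS_THROUGH, score, minScore));
    }
  }
}
```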

I think this is a bug. Please correct me if I am wrong. I
can also submit a PR if needed.

Thanks,
Yongyao

On Tue, Apr 18, 2017 at 1:13 PM, Sebastian Nagel <wastl.nagel@googlemail.com
> wrote:

> Hi,
>
> the scores in the index are not relevant for generating, only the scores in
> CrawlDb.
> The ScoringFilter interface defines a method indexerScore(...), some
> scoring filters
> return a modified (normalized) indexer score (cf. indexer.score.power).
> Also, changes to
> generate.min.score affect only which pages are fetched, pages fetched
> before may have a lower score.
> The score may also change when a page is processed (parsed, etc.) or even
> afterwards
> (by links pointing to it).
>
> In short: generate.min.score determines what is crawled, not what is
> indexed.
>
> Best,
> Sebastian
>


Re: Why "generate.min.score" does not work?

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

the scores in the index are not relevant for generating, only the scores in CrawlDb.
The ScoringFilter interface defines a method indexerScore(...), and some scoring filters
return a modified (normalized) indexer score (cf. indexer.score.power). Also, changes to
generate.min.score affect only which pages are fetched; pages fetched before may have a lower score.
The score may also change when a page is processed (parsed, etc.) or even afterwards
(by links pointing to it).

In short: generate.min.score determines what is crawled, not what is indexed.
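
To make the difference between the two scores concrete, here is a minimal sketch of a normalized indexer score. The power-law form and the names IndexerScoreSketch, scorePower and initScore are assumptions made for this example; the exact formula used by any particular scoring filter is not spelled out in this thread.

```java
// Hypothetical helper, not part of Nutch: shows how a normalized indexer
// score can differ from the score stored in the CrawlDb.
class IndexerScoreSketch {

  /**
   * Assumed normalization: raise the CrawlDb score to scorePower and
   * scale the result by the initial score handed in by the indexer.
   */
  static float indexerScore(float crawlDbScore, float scorePower, float initScore) {
    return (float) Math.pow(crawlDbScore, scorePower) * initScore;
  }

  public static void main(String[] args) {
    float crawlDbScore = 0.25f; // the value a generate-time check would look at
    float indexed = indexerScore(crawlDbScore, 0.5f, 1.0f);
    System.out.println("CrawlDb score: " + crawlDbScore + ", indexed score: " + indexed);
  }
}
```

Under this assumption the value that ends up in the index is not the value the generator compared against the threshold, so the two cannot be checked against each other directly.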

Best,
Sebastian

On 04/18/2017 12:31 AM, Yongyao Jiang wrote:
> Hi,
> 
> I am using scoring-similarity plugin. After setting the generate.min.score
> to 0.05, and indexing all the pages (with its score) into Elastic, I can
> still observe many web pages whose scores are below 0.05.
> 
> <property>
>   <name>generate.min.score</name>
>   <value>0.05</value>
>   <description>Select only entries with a score larger than
>   generate.min.score.</description>
> </property>
> 
> Below is the result of a simple aggregation of "score" in ES,
>         {
>                "key": "20170417215917",
>                "doc_count": 200,
>                "Stats": {
>                   "count": 200,
>                   "min": 0,
>                   "max": 0.019184709,
>                   "avg": 0.0012828724450000002,
>                   "sum": 0.256574489
>                }
>             }
> 
> Thanks,
> Yongyao
>