Posted to user@nutch.apache.org by Patrick Mézard <pa...@mezard.eu> on 2020/07/23 09:09:09 UTC

Reconfiguring scoring plugin

Hello,

I have crawled a first document set using a combination of the depth and OPIC scoring plugins. I would like to add the similarity scoring plugin, but obviously the crawldb scores need to be updated for it and for the following "generate" phases to be effective. Is there a recommended approach to achieve this?

My current understanding is that, since the similarity plugin operates in the parse phase, I would have to remove all parsed data from the segments, re-parse them, and run updatedb. Would that work? Is there anything smarter?
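
In case it helps, here is a minimal sketch of that workflow using the local runtime (bin/nutch); the crawl/crawldb and crawl/segments paths are placeholders for the actual layout, and I am assuming a local filesystem (on HDFS the rm would be hadoop fs -rm -r):

    # Step 0: add scoring-similarity to the plugin.includes regex in
    # conf/nutch-site.xml, next to scoring-depth and scoring-opic.

    # Remove the parse output from every segment so it can be re-parsed.
    for seg in crawl/segments/*; do
      rm -r "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
    done

    # Re-parse each segment; the similarity scores are written to crawl_parse.
    for seg in crawl/segments/*; do
      bin/nutch parse "$seg"
    done

    # Fold the new scores back into the crawldb.
    bin/nutch updatedb crawl/crawldb -dir crawl/segments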

Thanks,
--
Patrick Mézard

Re: Reconfiguring scoring plugin

Posted by Patrick Mézard <pa...@mezard.eu>.
Thanks for the clarification.
--
Patrick Mézard

On 23/07/2020 19:23, Shashanka Balakuntala wrote:
> Hi Patrick,
> Yes, I meant that it will not affect previous fetch lists. Sorry for the confusion.
> 
> Thanks,
> Shashanka Balakuntala


Re: Reconfiguring scoring plugin

Posted by Shashanka Balakuntala <sh...@gmail.com>.
Hi Patrick,
Yes, I meant that it will not affect previous fetch lists.
Sorry for the confusion.

Thanks,
Shashanka Balakuntala


Re: Reconfiguring scoring plugin

Posted by Patrick Mézard <pa...@mezard.eu>.
Hello,

On 23/07/2020 14:37, Shashanka Balakuntala wrote:
> Hi Patrick,
> 
> Yes, the idea you have suggested would work, but I do have to mention
> that it will only affect the next iteration. So you can just clean the
> last parsed segment, re-parse it, and run updatedb with the plugins
> activated, and that would do.

I do not follow you. How could the similarity scores of all documents be collected and used by updatedb without re-parsing all content? From what I can see, the similarity scorer operates during the parse phase, and the score is recorded in crawl_parse.
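
One way to verify this would be to dump only the crawl_parse part of a segment and inspect the scores on the resulting CrawlDatum entries; a sketch, with the segment name as a placeholder:

    # Dump crawl_parse only, skipping the other segment parts.
    bin/nutch readseg -dump crawl/segments/20200723000000 dump_out \
      -nocontent -nofetch -nogenerate -noparsedata -noparsetext
    less dump_out/dump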
  
> Deleting all the parsed segments might not work, because a URL with a
> score below the threshold will not be generated or fetched, so none of
> its outlinks will be fetched either. So if you just delete the parse
> data and redo the process, the already-fetched segments will not be
> impacted. It will still update the scoring, so if you just need the
> score for something else, please go ahead with this.

Again, I am confused. My mental model is:

- Delete and re-parse everything. This means the similarity scores are taken into account and written to every segment's crawl_parse.
- Run updatedb on all segments. CrawlDatum entries will be grouped by URL and a final score will be computed in the reduce phase, probably favoring the most recent one.

Now, maybe the existing crawldb might interfere during the final merge and I should clear it somehow, but otherwise, once the similarity scores are reflected in the updated crawldb, the next generate phase will take them into account.
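
If the old crawldb did interfere, a heavy-handed option (a sketch, assuming the original seed list still lives in urls/) would be to rebuild it from scratch and replay all segments into it:

    # Keep the old db around in case something goes wrong.
    mv crawl/crawldb crawl/crawldb.bak

    # Re-inject the original seeds into a fresh crawldb...
    bin/nutch inject crawl/crawldb urls/

    # ...then replay every segment so the freshly parsed scores win.
    bin/nutch updatedb crawl/crawldb -dir crawl/segments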

Obviously, they will not retroactively affect the previous fetch lists. Is that what you were trying to tell me?

Thanks for your comments,
--
Patrick Mézard


Re: Reconfiguring scoring plugin

Posted by Shashanka Balakuntala <sh...@gmail.com>.
Hi Patrick,

Yes, the idea you have suggested would work, but I do have to mention
that it will only affect the next iteration. So you can just clean the
last parsed segment, re-parse it, and run updatedb with the plugins
activated, and that would do.
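
For concreteness, a rough sketch of that, assuming the standard crawl/
layout and a local filesystem:

    # Pick the most recent segment (names sort chronologically).
    seg=$(ls -d crawl/segments/* | sort | tail -n 1)

    # Drop its parse output and re-parse with scoring-similarity enabled.
    rm -r "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
    bin/nutch parse "$seg"

    # Merge the new scores into the crawldb.
    bin/nutch updatedb crawl/crawldb "$seg"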

Deleting all the parsed segments might not work, because a URL with a
score below the threshold will not be generated or fetched, so none of
its outlinks will be fetched either. So if you just delete the parse
data and redo the process, the already-fetched segments will not be
impacted. It will still update the scoring, so if you just need the
score for something else, please go ahead with this.
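
To illustrate the threshold point: whether a URL makes it into the next
fetch list is decided at generate time, for example (the -topN value is
a placeholder, and I believe generate.min.score in nutch-site.xml can
impose a hard score cutoff):

    # Only the top-scoring URLs end up in the new fetch list.
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000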

Let's see if anyone has anything else to add or clarify here.

*Regards*
  Shashanka Balakuntala Srinivasa


