Posted to dev@nutch.apache.org by Lorenzo <de...@ieee.org> on 2007/04/19 19:55:02 UTC

Re: [Nutch-dev] Creating a new scoring filter

Hi,
sorry to re-open this thread, but I am facing the same problem as Nicolás.
I like both your (Doğacan's) and Nicolás' ideas, yours more, as I think
abstract classes are not good extension points.
Anyway, is either of these implemented? I really need it!
Also, I can't understand from the docs what it means that the adjust datum
will update the score of the original datum in updatedb.
Updated or adjusted in which way? I obtain strange values.

Thanks!

Lorenzo



> Hi,
> On 2/27/07, Nicolás Lichtmaier <nick@relo...> wrote:
> [snip]
> >
> > It doesn't seem a good way to do it. What if there are no outlinks? This
> > method won't be called at all. And anyway, it would be called once for
> > each outlink, which would multiply the work.
> Multiplication is easy to solve but you are right that it won't work
> if there are no outlinks.
> Maybe the scoring filter API should change? A distributeScoreToOutlinks
> method (which would be called even if there are no outlinks) may be more
> useful than the current one:
>
>       CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList,
>           List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)
>
> This method gives more control to the plugin since, knowing all the
> outlinks, the plugin can make more informed decisions. For example, right now
> there is no way a scoring filter can be sure that it has distributed
> all its cash (e.g. if db.score.internal.link is 0.5 and
> db.score.external.link is 1.0, the filter will almost always distribute
> less than its cash).
> This will also work for your case, since you will just ignore the
> outlinks and return the adjust datum based on information in parse
> metadata.
> What do you (and others) think?
> >
> > Thanks!
> >
> >
> -- 
> Doğacan Güney
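
The "distributes less than its cash" remark quoted above can be checked with
a little arithmetic. The sketch below is only an illustration in plain Java:
it assumes the page's cash is split evenly over all outlinks and each share
is then scaled by the internal or external link factor. The class and method
names are hypothetical and are not part of the Nutch API.

```java
final class CashDistribution {

    /**
     * Total score handed out when `cash` is split evenly over all outlinks
     * and each share is scaled by a per-link factor (assumed model).
     */
    static float distributed(float cash, int internalLinks, int externalLinks,
                             float internalFactor, float externalFactor) {
        int total = internalLinks + externalLinks;
        if (total == 0) {
            return 0f; // no outlinks: nothing can be handed out at all
        }
        float share = cash / total;
        return share * internalFactor * internalLinks
             + share * externalFactor * externalLinks;
    }

    public static void main(String[] args) {
        // Cash 1.0, three internal links (factor 0.5), one external link (factor 1.0).
        System.out.println(distributed(1.0f, 3, 1, 0.5f, 1.0f)); // prints 0.625
    }
}
```

Under these assumptions the page keeps 0.375 of its cash undistributed, and a
page without outlinks distributes nothing, which are the two cases raised in
the quoted discussion.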

Re: [Nutch-dev] Creating a new scoring filter

Posted by Lorenzo <de...@ieee.org>.

Sorry, I misunderstood your intentions.
Now I can see the advantages of your approach: a developer has to 
implement the whole interface only if he/she needs to have more control 
over some features.
This sounds great to me!

Lorenzo


Nicolás Lichtmaier wrote:
>
>> sorry to re-open this thread, but I am facing the same problem as
>> Nicolás.
>> I like both your (Doğacan's) and Nicolás' ideas, yours more, as I think
>> abstract classes are not good extension points.
>
> That wasn't what I had proposed. My suggestion was to use an
> interface, as always, but make this API really clean, expressing the
> minimum the rest of the code needs from a scoring plugin and removing
> assumptions about its implementation. Then I proposed to have an
> abstract class, implementing this interface, with a skeleton for any
> class which works by "distributing score to outlinks". So we would have
> the best of both worlds: people creating new "PageRank" algorithms
> wouldn't need to reimplement anything, they would just subclass the
> abstract class. And people like you and me would directly implement
> the interface (or use a different abstract class if there's common
> logic to share). My boss put all of this on hold, but I'd like to
> implement this idea in the near future and try to have it included in
> Nutch.
>
>
>

Re: [Nutch-dev] Creating a new scoring filter

Posted by Nicolás Lichtmaier <ni...@reloco.com.ar>.

> sorry to re-open this thread, but I am facing the same problem as
> Nicolás.
> I like both your (Doğacan's) and Nicolás' ideas, yours more, as I think
> abstract classes are not good extension points.

That wasn't what I had proposed. My suggestion was to use an interface,
as always, but make this API really clean, expressing the minimum the rest
of the code needs from a scoring plugin and removing assumptions about its
implementation. Then I proposed to have an abstract class,
implementing this interface, with a skeleton for any class which works by
"distributing score to outlinks". So we would have the best of both
worlds: people creating new "PageRank" algorithms wouldn't need to
reimplement anything, they would just subclass the abstract class. And
people like you and me would directly implement the interface (or use a
different abstract class if there's common logic to share). My boss put
all of this on hold, but I'd like to implement this idea in the near
future and try to have it included in Nutch.

Re: [Nutch-dev] Creating a new scoring filter

Posted by Lorenzo <de...@ieee.org>.

Very briefly, with an HtmlParseFilter and a list of weighted words.
This filter examines the parse text and adds a boost value if it finds
one of the words in the list.
This boost value is added to the ParseData metadata.
Then a scoring plugin reads this metadata (passScoreAfterParsing) and
updates the CrawlDatum, both of the outlinked pages (to focus the search
more) and of the current page (the difficult part, as explained on the
mailing list; however, with NUTCH-468 it should be easier now).
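
As a rough illustration of the first step described above (weighted word list,
boost stored in the parse metadata), here is a minimal sketch in plain Java.
It does not use the real HtmlParseFilter or ParseData classes: a Map stands in
for the metadata, the key name "focus.boost" is made up, and counting each
listed word at most once per page is an assumption, since the message does not
say how repeated matches are handled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeywordBoost {

    // Hypothetical metadata key, not a Nutch constant.
    static final String BOOST_KEY = "focus.boost";

    /** Sums the weights of the listed words found in the text (each word once). */
    static float boostFor(String text, Map<String, Float> weights) {
        float boost = 0f;
        Set<String> seen = new HashSet<>();
        // Split on anything that is not a letter or digit; keys in `weights`
        // are expected to be lowercase.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            Float weight = weights.get(token);
            if (weight != null && seen.add(token)) {
                boost += weight;
            }
        }
        return boost;
    }

    /** Stores the boost in a stand-in for the parse metadata. */
    static void storeBoost(Map<String, String> parseMeta, float boost) {
        parseMeta.put(BOOST_KEY, Float.toString(boost));
    }

    public static void main(String[] args) {
        Map<String, Float> weights = new HashMap<>();
        weights.put("nutch", 1.0f);
        weights.put("crawler", 0.5f);

        Map<String, String> parseMeta = new HashMap<>();
        storeBoost(parseMeta, boostFor("Nutch is a crawler. Nutch rocks.", weights));
        System.out.println(parseMeta.get(BOOST_KEY)); // prints 1.5
    }
}
```

A scoring plugin would then read the stored value back from the metadata in
its own step; that part depends on the actual Nutch API and is not shown here.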

If you need more information, please ask!

Lorenzo


Briggs wrote:
> Yes.  I too need to alter the score based on attributes and such of
> the particular url passed.
> May I ask what you have done?

Re: [Nutch-dev] Creating a new scoring filter

Posted by Briggs <ac...@gmail.com>.

Yes.  I too need to alter the score based on attributes and such of
the particular url passed.
May I ask what you have done?


On 4/22/07, Lorenzo <de...@ieee.org> wrote:
> Perfect! Now I have it working, and it performs quite well for a focused
> search engine like ours!
> Do you think it could be an interesting plug-in to add to Nutch?
>
> Lorenzo
>


-- 
"Conscious decisions by concious minds are what make reality real"

Re: [Nutch-dev] Creating a new scoring filter

Posted by Lorenzo <de...@ieee.org>.

Perfect! Now I have it working, and it performs quite well for a focused
search engine like ours!
Do you think it could be an interesting plug-in to add to Nutch?

Lorenzo



Re: [Nutch-dev] Creating a new scoring filter

Posted by Doğacan Güney <do...@gmail.com>.

On 4/21/07, Lorenzo <de...@ieee.org> wrote:
>
> Uhmm... so, suppose I decided, from its content, that the current page
> http://foo/bar.htm is really desirable.
> I have put a flag in ParseData's metadata to mark it.
> In distributeScoreToOutlink(s) I read it from the ParseData param, and
> put it in the adjust CrawlDatum's metadata:
>
>       MapWritable adjustMap = adjust.getMetaData();
>       adjustMap.put(key, new FloatWritable(boostValue));
>       return adjust;
>
> So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked)
> the adjust CrawlDatum will be among the inlinked list. Is that right? How
> do I distinguish it?
> I can put the URL in the metadata too, and scan through the list, but
> maybe there is a better method?



The best approach is yours: put a flag in the adjust datum's metadata to
mark it, then process it in updateDbScore.
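
A minimal sketch of that flag idea, in plain Java so it runs on its own. The
Datum class is only a stand-in for CrawlDatum, the key "focus.adjust" is
hypothetical, and adding the ordinary inlink scores to the page score is an
assumption made for the example, not a statement of what any particular Nutch
scoring filter does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AdjustFlagSketch {

    // Hypothetical metadata key used to recognise the adjust datum.
    static final String FLAG_KEY = "focus.adjust";

    /** Stand-in for CrawlDatum: a score plus a metadata map. */
    static final class Datum {
        float score;
        final Map<String, Float> meta = new HashMap<>();

        Datum(float score) {
            this.score = score;
        }
    }

    /** What distributeScoreToOutlink(s) would do: flag the adjust datum. */
    static Datum markAdjust(Datum adjust, float boost) {
        adjust.meta.put(FLAG_KEY, boost);
        return adjust;
    }

    /** What updateDbScore would do: treat flagged datums separately. */
    static void updateDbScore(Datum datum, List<Datum> inlinked) {
        float fromLinks = 0f;
        float boost = 0f;
        for (Datum in : inlinked) {
            Float flagged = in.meta.get(FLAG_KEY);
            if (flagged != null) {
                boost += flagged;      // the adjust datum carries the boost
            } else {
                fromLinks += in.score; // ordinary inlink contribution (assumed)
            }
        }
        datum.score += fromLinks + boost;
    }

    public static void main(String[] args) {
        Datum page = new Datum(1.0f);
        List<Datum> inlinked = new ArrayList<>();
        inlinked.add(new Datum(0.25f));
        inlinked.add(markAdjust(new Datum(0f), 2.0f));
        updateDbScore(page, inlinked);
        System.out.println(page.score); // prints 3.25
    }
}
```

With the flag in the metadata there is no need to store the URL or compare
datums: the adjust datum identifies itself while the list is scanned once.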

> Also, will this CrawlDatum be the same one that is passed to indexerScore?


You get two CrawlDatums in indexerScore. The first is fetchDatum, the one
in crawl_fetch that contains the fetching status. The second is dbDatum, which
comes from crawldb. This dbDatum is the one that you set in
updateDbScore (the 'datum' argument of updateDbScore).


> Thanks a lot!
>
> Lorenzo
>
>


-- 
Doğacan Güney

Re: [Nutch-dev] Creating a new scoring filter

Posted by Lorenzo <de...@ieee.org>.

Doğacan Güney wrote:
> On 4/19/07, Lorenzo <de...@ieee.org> wrote:
>>
>> Hi,
>> sorry to re-open this thread, but I am facing the same problem as
>> Nicolás.
>> I like both your (Doğacan's) and Nicolás' ideas, yours more, as I think
>> abstract classes are not good extension points.
>> Anyway, is any of these implemented? I really need it!
>
>
> Well, I have implemented a subset of what we discussed in
> NUTCH-468 <https://issues.apache.org/jira/browse/NUTCH-468>. There is
> a lot more to be done but IMHO, NUTCH-468 may be a good starting point.
>
>> Also, I can't understand from the docs what it means that the adjust datum
>> will update the score of the original datum in updatedb.
>> Updated or adjusted in which way? I obtain strange values.
>
>
> In ScoringFilter.updateDbScore you get a list of inlinked datums that you
> can use to change the score. Now, if in distributeScoreToOutlink(s) you
> return a datum with a status of STATUS_LINKED, you will get this datum as
> one of the inlinked datums in updateDbScore.
>
> I hope, this clears it up a bit.
>
Uhmm... so, suppose I decided, from its content, that the current page
http://foo/bar.htm is really desirable.
I have put a flag in ParseData's metadata to mark it.
In distributeScoreToOutlink(s) I read it from the ParseData param, and
put it in the adjust CrawlDatum's metadata:

      MapWritable adjustMap = adjust.getMetaData();
      adjustMap.put(key, new FloatWritable(boostValue));
      return adjust;

So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked)
the adjust CrawlDatum will be among the inlinked list. Is that right? How
do I distinguish it?
I can put the URL in the metadata too, and scan through the list, but
maybe there is a better method?

Also, will this CrawlDatum be the same one that is passed to indexerScore?
Thanks a lot!

Lorenzo

Re: [Nutch-dev] Creating a new scoring filter

Posted by Doğacan Güney <do...@gmail.com>.

On 4/19/07, Lorenzo <de...@ieee.org> wrote:
>
> Hi,
> sorry to re-open this thread, but I am facing the same problem of Nicolás.
> I like both your (Doğacan's) and Nicolás' ideas, yours more, as I think
> abstract classes are not good extension points.
> Anyway, is any of these implemented? I really need it!


Well, I have implemented a subset of what we discussed in
NUTCH-468 <https://issues.apache.org/jira/browse/NUTCH-468>. There is a lot
more to be done but IMHO, NUTCH-468 may be a good starting point.

> Also, I can't understand from the docs what it means that the adjust datum
> will update the score of the original datum in updatedb.
> Updated or adjusted in which way? I obtain strange values.


In ScoringFilter.updateDbScore you get a list of inlinked datums that you
can use to change the score. Now, if in distributeScoreToOutlink(s) you return a
datum with a status of STATUS_LINKED, you will get this datum as one of the
inlinked datums in updateDbScore.

I hope this clears it up a bit.




-- 
Doğacan Güney