Posted to user@nutch.apache.org by Michael Erickson <er...@gmail.com> on 2012/05/09 20:07:11 UTC

Focused Crawling with Nutch (IndexingFilter:filter)

Hello all,

I'd like to try to do a focused crawl [1][2] using Nutch.  I have a classifier trained on a large corpus of hand-curated data.  My goal is to have Nutch run a crawl, but for each page it finds, run the contents of the page through my classifier to see if that page is interesting to me.  If it is, I'll have Nutch proceed as normal.  However, if the page is not interesting to me, I want to avoid indexing the page and prevent its outbound links from being added to the frontier.
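To make the intended behaviour concrete, here is a minimal sketch of the focused-crawl loop described above, written in plain Java with no Nutch dependency. Everything in it is an assumption for illustration: `Page`, `FocusedCrawlSketch`, the in-memory `web` map, and the keyword test standing in for the trained classifier are all hypothetical, not Nutch classes or APIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for a fetched and parsed page; not a Nutch class.
final class Page {
    final String url;
    final String text;
    final List<String> outlinks;

    Page(String url, String text, List<String> outlinks) {
        this.url = url;
        this.text = text;
        this.outlinks = outlinks;
    }
}

final class FocusedCrawlSketch {

    // Crawls the in-memory "web" from the seed and returns the URLs that would be indexed.
    static List<String> crawl(String seed, Map<String, Page> web, Predicate<String> isInteresting) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> indexed = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);

        while (!frontier.isEmpty()) {
            Page page = web.get(frontier.poll());
            if (page == null) {
                continue; // unknown URL: nothing to fetch
            }
            if (!isInteresting.test(page.text)) {
                continue; // rejected: not indexed, and its outlinks never reach the frontier
            }
            indexed.add(page.url);
            for (String link : page.outlinks) {
                if (seen.add(link)) { // add() returns false for URLs already seen
                    frontier.add(link);
                }
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, Page> web = Map.of(
            "a", new Page("a", "nutch crawler notes", List.of("b", "c")),
            "b", new Page("b", "cooking recipes", List.of("d")),
            "c", new Page("c", "nutch plugin guide", List.of()),
            "d", new Page("d", "nutch tips", List.of()));
        // Toy classifier: a keyword test standing in for the trained model.
        System.out.println(crawl("a", web, text -> text.contains("nutch"))); // prints [a, c]
    }
}
```

Page "d" is on topic but is only reachable through the rejected page "b", so it is never visited. That is the trade-off I am accepting with this kind of crawl.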

After reviewing the documentation [3], it appears that writing an `IndexingFilter` plugin might help: specifically, having its `filter` method return null for pages I'm not interested in.  What I can't tell is whether returning null from `filter` will only keep that page out of the index, or whether it will also prevent that page's outbound links from being added to the frontier.  Can anyone clarify this for me?

Best regards,
--mike

Michael Erickson
erickson.michael@gmail.com


[1] http://www8.org/w8-papers/5a-search-query/crawling/ 
[2] http://www.cse.iitb.ac.in/~soumen/focus/ 
[3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html

Re: Focused Crawling with Nutch (IndexingFilter:filter)

Posted by Michael Erickson <er...@gmail.com>.

On May 9, 2012, at 1:18 PM, Markus Jelsma wrote:

> Hi,
> 
> On Wed, 9 May 2012 13:07:11 -0500, Michael Erickson <er...@gmail.com> wrote:
>> Hello all,
>> 
>> I'd like to try to do a focused crawl [1][2] using Nutch.  I have a
>> classifier trained on a large corpus of hand-curated data.  My goal is
>> to have Nutch run a crawl, but for each page it finds, run the
>> contents of the page through my classifier to see if that page is
>> interesting to me.  If it is, I'll have Nutch proceed as normal.
>> However, if the page is not interesting to me, I want to avoid
>> indexing the page and prevent its outbound links from being added to
>> the frontier.
>> 
>> After reviewing the documentation, it appears that writing an
>> `IndexingFilter` plugin might help.  Specifically, using the `filter`
>> method to return NULL if I'm not interested in this page.  What I
>> can't tell is if returning NULL from the `filter` method will just
>> stop that page from being inserted into the index, or if it will also
>> prevent that page's outbound links from being added to the frontier.
>> Can anyone clarify this for me?
> 
> An indexing filter is one step too late. Implement a parse filter instead and you're good to go.
> 

Thanks Markus!

> cheers
> 
>> 
>> Best regards,
>> --mike
>> 
>> Michael Erickson
>> erickson.michael@gmail.com
>> 
>> 
>> [1] http://www8.org/w8-papers/5a-search-query/crawling/
>> [2] http://www.cse.iitb.ac.in/~soumen/focus/
>> [3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html
> 
> -- 
> Markus Jelsma - CTO - Openindex

Michael Erickson
erickson.michael@gmail.com

Re: Focused Crawling with Nutch (IndexingFilter:filter)

Posted by Markus Jelsma <ma...@openindex.io>.

 Hi,

 On Wed, 9 May 2012 13:07:11 -0500, Michael Erickson 
 <er...@gmail.com> wrote:
> Hello all,
>
> I'd like to try to do a focused crawl [1][2] using Nutch.  I have a
> classifier trained on a large corpus of hand-curated data.  My goal is
> to have Nutch run a crawl, but for each page it finds, run the
> contents of the page through my classifier to see if that page is
> interesting to me.  If it is, I'll have Nutch proceed as normal.
> However, if the page is not interesting to me, I want to avoid
> indexing the page and prevent its outbound links from being added to
> the frontier.
>
> After reviewing the documentation, it appears that writing an
> `IndexingFilter` plugin might help.  Specifically, using the `filter`
> method to return NULL if I'm not interested in this page.  What I
> can't tell is if returning NULL from the `filter` method will just
> stop that page from being inserted into the index, or if it will also
> prevent that page's outbound links from being added to the frontier.
> Can anyone clarify this for me?

 An indexing filter is one step too late. Implement a parse filter 
 instead and you're good to go.
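
 A rough model of why the stage matters, assuming that outlinks are taken from the parse output before indexing runs (which is what "one step too late" implies). This is a simplified sketch in plain Java, not Nutch code: `StageOrderSketch`, `Parsed`, `CycleResult` and `runCycle` are hypothetical names, and the two predicates only stand in for a parse-stage filter and an indexing filter. In Nutch 1.x the parse-stage extension point is likely `HtmlParseFilter`, but check the API docs for your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of one crawl cycle; these are not Nutch classes.
final class StageOrderSketch {

    // Output of the parse step: extracted text plus discovered outlinks.
    static final class Parsed {
        final String url;
        final String text;
        final List<String> outlinks;

        Parsed(String url, String text, List<String> outlinks) {
            this.url = url;
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    // What one cycle leaves behind: new frontier URLs and indexed URLs.
    static final class CycleResult {
        final List<String> frontier = new ArrayList<>();
        final List<String> index = new ArrayList<>();
    }

    static CycleResult runCycle(List<Parsed> pages,
                                Predicate<Parsed> parseFilter,
                                Predicate<Parsed> indexFilter) {
        CycleResult result = new CycleResult();
        for (Parsed page : pages) {
            // Parse stage: a rejected page is dropped before anything downstream sees it.
            if (!parseFilter.test(page)) {
                continue;
            }
            // Link update stage: outlinks from the parse output join the frontier.
            result.frontier.addAll(page.outlinks);
            // Index stage: runs last, so rejecting here cannot undo the link update.
            if (indexFilter.test(page)) {
                result.index.add(page.url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Parsed> pages = List.of(new Parsed("x", "cooking recipes", List.of("y", "z")));
        Predicate<Parsed> onTopic = p -> p.text.contains("nutch");

        CycleResult late = runCycle(pages, p -> true, onTopic);
        System.out.println(late.frontier + " " + late.index);   // prints [y, z] []

        CycleResult early = runCycle(pages, onTopic, p -> true);
        System.out.println(early.frontier + " " + early.index); // prints [] []
    }
}
```

 In the first run the off-topic page is kept out of the index but its links still reach the frontier. In the second run, rejecting at the parse stage drops both.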

 cheers

>
> Best regards,
> --mike
>
> Michael Erickson
> erickson.michael@gmail.com
>
>
> [1] http://www8.org/w8-papers/5a-search-query/crawling/
> [2] http://www.cse.iitb.ac.in/~soumen/focus/
> [3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html

-- 
 Markus Jelsma - CTO - Openindex