Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2010/06/08 07:37:11 UTC

[jira] Created: (NUTCH-828) Fetch Filter

Fetch Filter
------------

                 Key: NUTCH-828
                 URL: https://issues.apache.org/jira/browse/NUTCH-828
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
         Environment: All
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.1
         Attachments: NUTCH-828-1-20100608.patch

Adds a Nutch extension point for a fetch filter.  The fetch filter allows filtering content and parse data/text after it is fetched but before it is written to segments.  The filter returns true if the content is to be written or false if it is not.

Some use cases for this filter would be topical search engines that only want to fetch/index certain types of content, for example a news-only or sports-only search engine.  In these situations the only way to determine whether content belongs to a particular set is to fetch the page and then analyze the content.  If the content passes, meaning it belongs to the set of, say, sports pages, then we want to include it.  If it doesn't, we want to ignore it, never fetch that same page in the future, and ignore any URLs on that page.  If content is rejected by a fetch filter, its status is written to the CrawlDb as gone and its content is ignored and not written to segments.  This effectively stops crawling along the crawl path of that page and the URLs from that page.  An example filter, fetch-safe, is provided that accepts content that does not contain any of a configured list of bad words.
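As a rough sketch of the idea (illustrative names only; the interface in the attached patch may differ, and plain strings stand in for Nutch's Content/ParseText types), a fetch filter boils down to a boolean accept/reject hook, and fetch-safe to a bad-word check against the parsed text:

    import java.util.Set;

    // Illustrative stand-in for the proposed fetch filter extension point.
    interface FetchFilter {
      /** Return true if the fetched page should be written to segments. */
      boolean accept(String url, String content, String parseText);
    }

    // Rough sketch of a fetch-safe style filter: reject pages containing bad words.
    class SafeContentFilter implements FetchFilter {
      private final Set<String> badWords;

      SafeContentFilter(Set<String> badWords) {
        this.badWords = badWords;
      }

      @Override
      public boolean accept(String url, String content, String parseText) {
        String text = (parseText == null ? "" : parseText).toLowerCase();
        for (String word : badWords) {
          if (text.contains(word)) {
            return false;  // reject: nothing is written to segments, URL marked gone
          }
        }
        return true;       // accept: content and parse data/text are written as usual
      }
    }

The fetcher would call accept(...) once per fetched page and, on false, skip the segment write and hand the CrawlDb a gone-style status for that URL.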

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-828) Fetch Filter

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-828:
------------------------------------

    Fix Version/s: 1.2
                       (was: 1.1)

- bumpity to 1.2 since 1.1 is out the door


[jira] Updated: (NUTCH-828) Fetch Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-828:
-------------------------------

    Attachment: NUTCH-828-1-20100608.txt


[jira] Commented: (NUTCH-828) Fetch Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876669#action_12876669 ] 

Dennis Kubes commented on NUTCH-828:
------------------------------------

Yeah, I am not proposing to get this into 1.1.  Oh wait, I did with the fix-version selection.  No, this can/should wait until after the 1.1 release.  Anybody who wants it before then can apply the patch :)


[jira] Commented: (NUTCH-828) Fetch Filter

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876576#action_12876576 ] 

Julien Nioche commented on NUTCH-828:
-------------------------------------

Shall we postpone this until after the release of 1.1?  This is new functionality, and at this stage we probably just want to iron out bugs in what we currently have.  Does that make sense?


Re: [jira] Commented: (NUTCH-828) Fetch Filter

Posted by Dennis Kubes <ku...@apache.org>.
The current patch actually wouldn't allow the crawling of the outlinks.
If a page is filtered, the content and parse objects aren't saved and
the URL is poisoned in the CrawlDb so it won't be crawled in the future.
Maybe we can add the ability to still crawl outlinks.  I think this
would need to be a global setting applied to all filtered pages rather
than decided on a page-by-page basis.
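If that global setting were added, it would presumably be a single boolean property consulted wherever a filtered page is written out; something like the following sketch, where the property name is invented for illustration and does not exist in nutch-default.xml:

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical global switch: when a fetch filter rejects a page, should
    // its outlinks still be followed?  The property name below is illustrative
    // only and is not an existing Nutch configuration key.
    class FilteredOutlinkPolicy {
      static boolean followOutlinksOfFilteredPages(Configuration conf) {
        return conf.getBoolean("fetcher.filter.follow.outlinks", false);
      }
    }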

It is a new extension point so other plugins can be written, such as one 
interacting with Mahout algorithms.

Dennis

On 06/09/2010 01:14 AM, David Stuart wrote:
> I haven't reviewed this patch, but I was about to start work on something
> similar, so a big +1 on the ability to filter out the page but allow
> crawling of the outlinks. Also, if the filter were able to be plugged
> into external inputs (like Mahout), definitely +1 on that too.
>
> David Stuart
>
> On 9 Jun 2010, at 06:53, "Andrzej Bialecki  (JIRA)" <ji...@apache.org> 
> wrote:
>
>>
>>    [ 
>> https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876964#action_12876964 ] 
>>
>>
>> Andrzej Bialecki  commented on NUTCH-828:
>> -----------------------------------------
>>
>> First, as you point out, we cannot ignore the page because the 
>> problem will repeat itself as we keep re-discovering it, so we have 
>> to "poison" it with GONE - and I think it's ok to add another status 
>> here to express that we never ever want to collect this page, because 
>> GONE gets reset periodically.
>>
>> If we run Fetcher in parsing mode then we can change this status 
>> immediately, so no problem here. If we run ParseSegment then we can 
>> also update this status in a similar way as we implement the 
>> signature update, i.e. in ParseOutputFormat emit a 
>> <pageUrl,CrawlDatum> that will switch the status of this page when 
>> collected later on in CrawlDbReducer.
>>

Re: [jira] Commented: (NUTCH-828) Fetch Filter

Posted by David Stuart <da...@progressivealliance.co.uk>.
I haven't reviewed this patch, but I was about to start work on something
similar, so a big +1 on the ability to filter out the page but allow
crawling of the outlinks. Also, if the filter were able to be plugged
into external inputs (like Mahout), definitely +1 on that too.

David Stuart

On 9 Jun 2010, at 06:53, "Andrzej Bialecki  (JIRA)" <ji...@apache.org>  
wrote:

>
>    [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876964#action_12876964 
>  ]
>
> Andrzej Bialecki  commented on NUTCH-828:
> -----------------------------------------
>
> First, as you point out, we cannot ignore the page because the  
> problem will repeat itself as we keep re-discovering it, so we have  
> to "poison" it with GONE - and I think it's ok to add another status  
> here to express that we never ever want to collect this page,  
> because GONE gets reset periodically.
>
> If we run Fetcher in parsing mode then we can change this status  
> immediately, so no problem here. If we run ParseSegment then we can  
> also update this status in a similar way as we implement the  
> signature update, i.e. in ParseOutputFormat emit a  
> <pageUrl,CrawlDatum> that will switch the status of this page when  
> collected later on in CrawlDbReducer.
>

[jira] Commented: (NUTCH-828) Fetch Filter

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876964#action_12876964 ] 

Andrzej Bialecki  commented on NUTCH-828:
-----------------------------------------

First, as you point out, we cannot ignore the page because the problem will repeat itself as we keep re-discovering it, so we have to "poison" it with GONE - and I think it's ok to add another status here to express that we never ever want to collect this page, because GONE gets reset periodically.

If we run Fetcher in parsing mode then we can change this status immediately, so no problem here. If we run ParseSegment then we can also update this status in a similar way as we implement the signature update, i.e. in ParseOutputFormat emit a <pageUrl,CrawlDatum> that will switch the status of this page when collected later on in CrawlDbReducer.
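Concretely, the ParseSegment-side mechanism would mirror the signature update: alongside the normal parse output, emit one extra <pageUrl, CrawlDatum> record for a rejected page, and let CrawlDbReducer prefer that status when it merges.  A simplified sketch with stand-in types (not the real CrawlDatum/ParseOutputFormat classes):

    import java.util.List;

    // Simplified stand-in for CrawlDatum: only the status byte matters here.
    class Datum {
      static final byte DB_FETCHED = 2;
      static final byte DB_GONE    = 3;  // stand-in for a "never refetch" status
      byte status;
      Datum(byte status) { this.status = status; }
    }

    class CrawlDbMergeSketch {
      // ParseOutputFormat side: for a rejected page, emit an extra
      // <pageUrl, Datum(DB_GONE)> keyed by the page URL, just as the
      // signature update emits an extra datum for the fetched page.
      static Datum rejectionMarker() {
        return new Datum(Datum.DB_GONE);
      }

      // CrawlDbReducer side: when merging all datums collected for one URL,
      // a single rejection marker pins the page to the gone-style status.
      static Datum merge(List<Datum> datums) {
        Datum result = new Datum(Datum.DB_FETCHED);
        for (Datum d : datums) {
          if (d.status == Datum.DB_GONE) {
            return d;
          }
          result = d;
        }
        return result;
      }
    }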


[jira] Updated: (NUTCH-828) Fetch Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-828:
-------------------------------

    Attachment: NUTCH-828-2-20100608.patch

Forgot to add the nutch-default.xml changes to the old patch.  Here is a new one.


[jira] Updated: (NUTCH-828) Fetch Filter

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-828:
------------------------------------

    Fix Version/s: 2.0
                       (was: 1.2)


[jira] Updated: (NUTCH-828) Fetch Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-828:
-------------------------------

    Attachment:     (was: NUTCH-828-1-20100608.txt)


[jira] Commented: (NUTCH-828) Fetch Filter

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876748#action_12876748 ] 

Andrzej Bialecki  commented on NUTCH-828:
-----------------------------------------

I generally like the idea of a decision point, but I think the place where this decision is taken in this patch (Fetcher) is not right. Since you rely on the presence of ParseResult (understandably so) it seems to me that a much better place to run the filters would be inside ParseUtils.parse(content), and you could return null (or a special ParseResult) to indicate that the content is to be discarded.

This way you can both run this filtering as a part of a Fetcher in parsing mode, and as a part of ParseSegment, without duplicating the same logic. Consequently, I propose to change the name from FetchFilter to ParseFilter.
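In that arrangement the decision lives in one place; a rough sketch with simplified placeholder types (Content, ParseResult and the filter interface here are stand-ins, not the actual Nutch signatures):

    import java.util.List;

    // Simplified placeholders for Nutch's Content and ParseResult.
    class Content { String url; String text; }
    class ParseResult {
      final String text;
      ParseResult(String text) { this.text = text; }
    }

    interface ParseFilter {
      /** Return true if the parsed page should be kept. */
      boolean accept(Content content, ParseResult parse);
    }

    class ParseWithFilters {
      // Analogue of ParseUtil.parse(content): parse, then run the filters.
      // Returning null signals "discard this page" to both the parsing
      // Fetcher and ParseSegment, so the logic is not duplicated.
      static ParseResult parse(Content content, List<ParseFilter> filters) {
        ParseResult parse = new ParseResult(content.text);  // stand-in for real parsing
        for (ParseFilter f : filters) {
          if (!f.accept(content, parse)) {
            return null;  // or a special "empty" ParseResult
          }
        }
        return parse;
      }
    }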


[jira] Commented: (NUTCH-828) Fetch Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876923#action_12876923 ] 

Dennis Kubes commented on NUTCH-828:
------------------------------------

I agree about wanting the decision not just in the Fetcher while parsing but also in ParseSegment.  Here is the problem as I see it with returning null content.

Say we want to create a topical search engine about sports.  We fetch pages and run them through a fetch filter for a yes/no decision on whether each page is about sports, based on its content.  If we null out the content, and with it the ParseText and ParseData, we still have the CrawlDatum to deal with.  If we leave it as is, the CrawlDatum will get updated into the CrawlDb as successfully fetched.  Content and Parse won't get collected because they are null.  We won't have the problem of outlinks on that page getting queued in the CrawlDb, but the original URL will still be there and will be queued after an interval for repeated crawling.  Over time we end up with a large number of URLs that we know to be filtered being repeatedly crawled.

The decision point isn't just whether to keep the content.  It is whether we should keep the URL and its content/parse and continue crawling down the path of the URL's outlinks, or ignore this URL and not crawl anything it points to, breaking the crawl graph at this point.  Hence FetchFilter.  My solution was to null out content/parse and add a different CrawlDatum that essentially said the page was gone.  Ideally we should have a separate status, but gone worked as a first pass.  This gets updated back into the CrawlDb and the URL won't get recrawled at a later date.  This was only possible in the Fetcher, though.

Thoughts on how we might approach this?
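As a sketch of that fetcher-side branch (stand-in types; the real code would set a status on the page's CrawlDatum and skip the Content/ParseResult output):

    import java.util.Optional;

    // Stand-in for the fetcher's per-page output after running the filters.
    class PageOutput {
      final byte datumStatus;                  // always fed back toward the CrawlDb
      final Optional<String> contentAndParse;  // only written to segments if accepted
      PageOutput(byte datumStatus, Optional<String> contentAndParse) {
        this.datumStatus = datumStatus;
        this.contentAndParse = contentAndParse;
      }
    }

    class FetcherOutputSketch {
      static final byte STATUS_FETCH_SUCCESS = 1;
      static final byte STATUS_GONE          = 2;  // "never refetch" marker

      static PageOutput output(boolean acceptedByFilters, String contentAndParse) {
        if (!acceptedByFilters) {
          // Rejected: only the gone-style datum survives.  Content and parse are
          // dropped, so no outlinks enter the CrawlDb and the crawl graph is cut here.
          return new PageOutput(STATUS_GONE, Optional.empty());
        }
        return new PageOutput(STATUS_FETCH_SUCCESS, Optional.of(contentAndParse));
      }
    }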




[jira] Updated: (NUTCH-828) Fetch Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-828:
-------------------------------

    Attachment: NUTCH-828-1-20100608.patch


[jira] Commented: (NUTCH-828) Fetch Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876970#action_12876970 ] 

Dennis Kubes commented on NUTCH-828:
------------------------------------

Nice.  I didn't realize the signature update would do that.  I am assuming that, since ParseUtil doesn't interact with the CrawlDatum, we are going to have to call the FetchFilters (I am OK with renaming this, by the way) twice, once in the Fetcher and once in ParseSegment, each dealing with its respective CrawlDatum needs?
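That shared call could be a thin wrapper over the configured filters, invoked from both places, with each caller handling its own CrawlDatum; roughly (reusing the same illustrative FetchFilter interface sketched earlier in the thread):

    import java.util.List;

    // Same illustrative interface as in the earlier sketch.
    interface FetchFilter {
      boolean accept(String url, String content, String parseText);
    }

    // One wrapper that runs every configured filter, called once from the
    // parsing Fetcher and once from ParseSegment.
    class FetchFilters {
      private final List<FetchFilter> filters;
      FetchFilters(List<FetchFilter> filters) { this.filters = filters; }

      boolean accept(String url, String content, String parseText) {
        for (FetchFilter f : filters) {
          if (!f.accept(url, content, parseText)) {
            return false;  // the caller marks its CrawlDatum gone and skips output
          }
        }
        return true;
      }
    }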
