Posted to user@nutch.apache.org by Amna Waqar <am...@gmail.com> on 2011/02/09 11:31:05 UTC

Urgent help: Deleting the fetched pages in segment

Hi all,
I want to delete fetched pages from the segment based on their content: if a
page contains Unicode characters in the range 0x600 to 0x6FF, it should not be
stored in the segment. How can I do this in Fetcher.java, at the point where
Content content = output.getContent(); has run (that is, the content of the
page has been fetched)?
I need some command to delete that page before it is stored in the segment.

Please help me
Thanks
Amna Waqar

Re: Urgent help: Deleting the fetched pages in segment

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi Amna,

 You could write a filter implementing the HtmlParseFilter interface for this:
get the content, check for the presence of those Unicode characters, and
then return the ParseResult accordingly.
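
 The character check itself could look roughly like the sketch below. It is
plain Java with no Nutch dependency; the class name ArabicRangeCheck and the
method containsRange are made up for illustration, and how the text is obtained
from the parse and how the page is then dropped inside the filter are not shown.

```java
// Hypothetical helper, not part of Nutch: tests whether a string contains
// any code point in the range mentioned in the question (0x600 to 0x6FF).
final class ArabicRangeCheck {

    // Bounds taken from the original question (the Unicode Arabic block).
    static final int RANGE_START = 0x0600;
    static final int RANGE_END = 0x06FF;

    /** Returns true if any code point of text lies in [RANGE_START, RANGE_END]. */
    static boolean containsRange(String text) {
        if (text == null) {
            return false; // nothing to inspect
        }
        // codePoints() walks full Unicode code points, not UTF-16 halves.
        return text.codePoints()
                   .anyMatch(cp -> cp >= RANGE_START && cp <= RANGE_END);
    }

    public static void main(String[] args) {
        System.out.println(containsRange("plain ASCII text"));
        System.out.println(containsRange("mixed \u0633\u0644\u0627\u0645 text"));
    }
}
```

 A filter would call something like containsRange on the page text and decide
from the result whether to keep the page.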

 I am not sure if there is a command for it as such. Experts here, please
correct me if I am wrong.

./Abi
