Posted to user@nutch.apache.org by 刘 <lc...@gmail.com> on 2012/08/03 14:59:14 UTC

Can I only add url in a specified div to the fetch list with nutch?

As the title says, I want to crawl a page with many URLs, but only the ones in
a specified div are meaningful to me. So I want to write a plugin to filter
them, but I don't know which extension point I should choose.

The HTML parse filter can get the HTML content, but it seems to run
after the "add to fetch list" operation. The URL filter can control the
fetch list, but I can't get the HTML content in it.

I look forward to any helpful replies, thanks.

RE: Can I only add url in a specified div to the fetch list with nutch?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Outlinks are added to the ParseData object before it is passed to an HtmlParseFilter. In an HtmlParseFilter plugin you can obtain the outlinks and remove those you don't want.

   Outlink[] outlinks = parseResult.get(content.getUrl()).getData().getOutlinks();

Use the setOutlinks() method to write your processed list to the ParseData.
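
For the "only links inside one div" part, a rough sketch of the DOM side might look like the following. It uses only the JDK and has not been tested against Nutch itself. The class name DivLinkCollector and its methods are made up for illustration, and it assumes the target div is identified by an id attribute. Tag names are compared case-insensitively because the HTML parser may hand you upper-case element names.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch.
final class DivLinkCollector {

  /** Collects the resolved targets of all <a href> elements below the div with the given id. */
  static Set<String> collectHrefs(Node root, String divId, String baseUrl) {
    Set<String> hrefs = new HashSet<>();
    walk(root, divId, URI.create(baseUrl), false, hrefs);
    return hrefs;
  }

  private static void walk(Node node, String divId, URI base, boolean inside, Set<String> out) {
    if (node instanceof Element) {
      Element el = (Element) node;
      String name = el.getTagName();
      // Entering the wanted div switches collection on for the whole subtree.
      if ("div".equalsIgnoreCase(name) && divId.equals(el.getAttribute("id"))) {
        inside = true;
      }
      if (inside && "a".equalsIgnoreCase(name) && el.hasAttribute("href")) {
        try {
          // Resolve relative links against the page URL so they can be compared with outlink URLs.
          out.add(base.resolve(el.getAttribute("href").trim()).toString());
        } catch (IllegalArgumentException e) {
          // Malformed href: skip it.
        }
      }
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      walk(child, divId, base, inside, out);
    }
  }

  /** Returns the outlink URLs that are in the allowed set, preserving their order. */
  static String[] keepAllowed(String[] outlinkUrls, Set<String> allowed) {
    List<String> kept = new ArrayList<>();
    for (String url : outlinkUrls) {
      if (allowed.contains(url)) {
        kept.add(url);
      }
    }
    return kept.toArray(new String[0]);
  }
}
```

Inside the filter you would presumably call collectHrefs() with the DocumentFragment that the filter receives and content.getUrl() as the base, compare each outlink's getToUrl() against the returned set, and pass the surviving Outlink objects to setOutlinks(). Note that Nutch may normalize outlink URLs differently from URI.resolve(), so the string comparison may need adjusting.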

Cheers,
 
 