Posted to user@nutch.apache.org by 刘 <lc...@gmail.com> on 2012/08/03 14:59:14 UTC
Can I only add url in a specified div to the fetch list with nutch?
As the title says, I want to crawl a page with many URLs, but only the ones in
a specified div are meaningful to me. So I want to write a plugin to filter
them, but I don't know which extension point I should choose.
The HTML parse filter can get the HTML content, but it seems to run
after the "add to fetch list" operation. The URL filter can control the
fetch list, but I can't get the HTML content in it.
I look forward to any helpful replies, thanks.
RE: Can I only add url in a specified div to the fetch list with nutch?
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
Outlinks are added to the ParseData object before it is passed to an HtmlParseFilter. In an HtmlParseFilter plugin you can obtain the outlinks and remove those you don't want.
Outlink[] outlinks = parseResult.get(content.getUrl()).getData().getOutlinks();
Use the setOutlinks() method to write your processed list to the ParseData.
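The filtering step itself could look roughly like the sketch below. It uses only the JDK so it can be run on its own: plain URL strings stand in for Nutch's Outlink objects, and the class name DivOutlinkFilter, the helper names, the div id "content" and the example.com URLs are all made up for illustration. It assumes the hrefs have to be resolved against the page URL before they can be compared with the outlinks; Nutch may also normalize outlink URLs, which this sketch does not try to reproduce. In a real plugin the DOM would presumably be the one handed to the filter rather than one parsed here.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: keeps only links found inside one div.
class DivOutlinkFilter {

    /** Collects the absolute URLs of all <a href> elements below the div with the given id. */
    static Set<String> hrefsInDiv(Node root, String divId, String baseUrl) {
        Set<String> hrefs = new HashSet<>();
        collect(root, divId, URI.create(baseUrl), false, hrefs);
        return hrefs;
    }

    // Depth-first walk; 'inside' becomes true once we are below the target div.
    private static void collect(Node node, String divId, URI base,
                                boolean inside, Set<String> hrefs) {
        if (node instanceof Element) {
            Element el = (Element) node;
            String name = el.getTagName();
            if ("div".equalsIgnoreCase(name) && divId.equals(el.getAttribute("id"))) {
                inside = true;
            }
            if (inside && "a".equalsIgnoreCase(name) && el.hasAttribute("href")) {
                // Assumption: outlinks are absolute, so resolve relative hrefs first.
                hrefs.add(base.resolve(el.getAttribute("href")).toString());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), divId, base, inside, hrefs);
        }
    }

    /** Returns the outlink URLs that are in the allowed set, in their original order. */
    static String[] keepOnly(String[] outlinkUrls, Set<String> allowed) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinkUrls) {
            if (allowed.contains(url)) {
                kept.add(url);
            }
        }
        return kept.toArray(new String[0]);
    }

    /** Parses a well-formed XHTML string; only needed to make the sketch runnable. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"nav\"><a href=\"/about.html\">About</a></div>"
                + "<div id=\"content\"><a href=\"/a/1.html\">One</a>"
                + "<p><a href=\"http://example.com/a/2.html\">Two</a></p></div>"
                + "</body></html>";
        String[] outlinks = {
            "http://example.com/about.html",
            "http://example.com/a/1.html",
            "http://example.com/a/2.html"
        };
        Set<String> allowed = hrefsInDiv(parse(page), "content", "http://example.com/list.html");
        for (String url : keepOnly(outlinks, allowed)) {
            System.out.println(url);
        }
    }
}
```

In the plugin, the same comparison would be done on each Outlink's target URL, and the surviving array passed to setOutlinks() as described above.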
Cheers,