Posted to user@nutch.apache.org by Žygimantas Medelis <zy...@medelis.lt> on 2011/01/14 23:19:00 UTC

URLFilter based on anchor text

Hi,

URLFilters make it possible to filter links based on the content of the URL.
Is it possible to extend filters so that links can also be filtered based on
their anchor text? A URLFilter takes only the URL as its parameter.

One way to do this is to modify the parse-html plugin. Out-links are
collected there, and the Outlink class provides a getAnchor method.
Out-links that do not have the required anchor text are then left out when
ParseData is created, which prevents Nutch from crawling them.
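
To make the idea above concrete, here is a minimal sketch of the filtering
step on its own. It does not touch Nutch: Link is a hypothetical stand-in for
Nutch's Outlink (only getToUrl and getAnchor are mirrored), and
AnchorOutlinkFilter and keepMatching are names invented for illustration.
Where exactly this would hook into parse-html is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: keep only out-links whose anchor text matches a required pattern. */
public class AnchorOutlinkFilter {

    /** Hypothetical stand-in for Nutch's Outlink (target URL plus anchor text). */
    public static final class Link {
        private final String toUrl;
        private final String anchor;

        public Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            // Treat a missing anchor as empty text so matching never sees null.
            this.anchor = anchor == null ? "" : anchor;
        }

        public String getToUrl() { return toUrl; }
        public String getAnchor() { return anchor; }
    }

    private final Pattern required;

    /** anchorRegex is matched case-insensitively anywhere in the anchor text. */
    public AnchorOutlinkFilter(String anchorRegex) {
        this.required = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
    }

    /** Returns a new list containing only the links with a matching anchor. */
    public List<Link> keepMatching(List<Link> outlinks) {
        List<Link> kept = new ArrayList<>();
        for (Link link : outlinks) {
            if (required.matcher(link.getAnchor()).find()) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

With a pattern such as "download|release", a link anchored "Download Nutch"
would be kept, while one anchored "About us" would be dropped before the parse
data is built.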

Yet this does not seem like a good solution: parser plugins should not do
URL filtering. Is there a better way? What about extending this even further
and creating a filter based on the whole sentence an anchor is located in?

regards
zm

Re: URLFilter based on anchor text

Posted by Sourabh Kasliwal <so...@mojostation.com>.
Creating a plugin that implements HtmlParseFilter could be another option.
Since it is a plugin interface, no modification of the Nutch build is needed.
regards
Sourabh

On Mon, Jan 17, 2011 at 8:59 PM, Nobin Mathew <no...@gmail.com> wrote:

> On Sat, Jan 15, 2011 at 3:49 AM, Žygimantas Medelis
> <zy...@medelis.lt> wrote:
> > Hi,
> >
> > URLFilters allow to filter links based on content of the URL. Is it possible
> > to extend filters so as to filter links based on their anchor text?
> > URLFilter takes only url as its parameter.
> >
> > One way to do this is to modify parse-html plugin. There out-links are
> > collected and Outlink class provides getAnchor method. Then
> > those out-links which do not have required anchor text are not included when
> > ParseData is created thus preventing Nutch from crawling them.
>
> What about the write() function in ParseOutputFormat.java, where you get
> the outlink and anchor text?
> You could create something like URLFilter that also takes the anchor
> text as input.
> I don't know how to get the sentence that contains the anchor text.
>
> >
> > Yet this does not seem like a good solution, parser plugins should not do
> > URL filtering. Is there a better way? What about extending this even further
> > and creating a filter based on the whole sentence an anchor is located in?
> >
> > regards
> > zm
> >
>

Re: URLFilter based on anchor text

Posted by Nobin Mathew <no...@gmail.com>.
On Sat, Jan 15, 2011 at 3:49 AM, Žygimantas Medelis
<zy...@medelis.lt> wrote:
> Hi,
>
> URLFilters allow to filter links based on content of the URL. Is it possible
> to extend filters so as to filter links based on their anchor text?
> URLFilter takes only url as its parameter.
>
> One way to do this is to modify parse-html plugin. There out-links are
> collected and Outlink class provides getAnchor method. Then
> those out-links which do not have required anchor text are not included when
> ParseData is created thus preventing Nutch from crawling them.

What about the write() function in ParseOutputFormat.java, where you get
the outlink and anchor text?
You could create something like URLFilter that also takes the anchor
text as input.
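
As a rough sketch of what such an anchor-aware filter could look like: the
AnchorUrlFilter interface below is hypothetical and not part of Nutch. It only
borrows the URLFilter convention of returning the URL to accept a link and
null to reject it. The helper names anchorContains and applyAll are likewise
invented for illustration.

```java
import java.util.List;
import java.util.Locale;

/** Sketch of a URLFilter-like contract that also sees the anchor text. */
public class AnchorFilters {

    /**
     * Hypothetical extension point (not a Nutch interface).
     * Returns the URL to accept the link, or null to reject it.
     */
    public interface AnchorUrlFilter {
        String filter(String url, String anchor);
    }

    /** Accepts a link only if its anchor contains the keyword, ignoring case. */
    public static AnchorUrlFilter anchorContains(String keyword) {
        final String needle = keyword.toLowerCase(Locale.ROOT);
        return (url, anchor) -> {
            String text = anchor == null ? "" : anchor.toLowerCase(Locale.ROOT);
            return text.contains(needle) ? url : null;
        };
    }

    /** Runs the filters in order and stops at the first rejection. */
    public static String applyAll(List<AnchorUrlFilter> filters, String url, String anchor) {
        String current = url;
        for (AnchorUrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current, anchor);
        }
        return current;
    }
}
```

A caller in the spirit of the write() suggestion would pass each outlink's
target URL and anchor to applyAll and skip the outlink when the result is null.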
I don't know how to get the sentence that contains the anchor text.
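
One possible starting point for the sentence question, assuming the page text
is already available as a plain string: the JDK's java.text.BreakIterator can
locate sentence boundaries around a given offset. The class and method names
below (AnchorSentence, sentenceAround) are made up for this sketch, and it
simply uses the first occurrence of the anchor text, which a real parser
working on the DOM would need to refine.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Sketch: find the sentence that surrounds an anchor's text in plain text. */
public class AnchorSentence {

    /**
     * Returns the sentence containing the first occurrence of anchor in text,
     * or an empty string if the anchor is empty or does not occur.
     */
    public static String sentenceAround(String text, String anchor) {
        if (anchor.isEmpty()) {
            return "";
        }
        int offset = text.indexOf(anchor);
        if (offset < 0) {
            return "";
        }
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        // First sentence boundary strictly after the anchor's start.
        int end = it.following(offset);
        if (end == BreakIterator.DONE) {
            end = it.last();
        }
        // Boundary just before it, i.e. the start of the same sentence.
        int start = it.previous();
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        return text.substring(start, end).trim();
    }
}
```

The returned sentence could then be handed to a filter in the same way as the
anchor text itself.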

>
> Yet this does not seem like a good solution, parser plugins should not do
> URL filtering. Is there a better way? What about extending this even further
> and creating a filter based on the whole sentence an anchor is located in?
>
> regards
> zm
>