Posted to dev@nutch.apache.org by Marcin Okraszewski <ok...@o2.pl> on 2007/09/06 23:09:33 UTC

Limiting outlink tags.

Hi,
I have noticed that Nutch considers img/@src an outlink. I suppose in many cases people do not want to treat an image as an outlink. At least I don't. The same applies to script/@src. However, there seems to be no way to limit outlink tags. DOMContentUtils.getOutlinks() takes links from all of a, area, form, frame, iframe, script, link, img. Only the "form" element can be turned off, via the "parser.html.form.use_action" parameter.

I would suggest introducing a new configuration parameter that could be used to turn certain elements on or off. It could be done simply with a single parameter containing a comma-separated list of tags to be turned off.
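
A minimal sketch of how such a parameter might be applied, not the actual Nutch code or the NUTCH-488 patch. The class name OutlinkTagFilter and its methods are hypothetical; only the list of tags comes from the description of DOMContentUtils.getOutlinks() above. It assumes the parameter value arrives as a plain string and that tag names are compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Nutch.
public class OutlinkTagFilter {

    // Tags named in this message as outlink sources.
    public static final List<String> DEFAULT_TAGS = Arrays.asList(
        "a", "area", "form", "frame", "iframe", "script", "link", "img");

    // Turns a comma-separated value such as "img, script" into a set of
    // lower-case tag names, skipping blank entries. A null value yields
    // an empty set, i.e. nothing is ignored.
    public static Set<String> parseIgnoredTags(String csv) {
        Set<String> ignored = new LinkedHashSet<>();
        if (csv == null) {
            return ignored;
        }
        for (String part : csv.split(",")) {
            String tag = part.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                ignored.add(tag);
            }
        }
        return ignored;
    }

    // Returns the default tags minus the ignored ones, in the original order.
    public static List<String> activeTags(String csv) {
        Set<String> ignored = parseIgnoredTags(csv);
        List<String> active = new ArrayList<>();
        for (String tag : DEFAULT_TAGS) {
            if (!ignored.contains(tag)) {
                active.add(tag);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // prints [a, area, form, frame, iframe, link]
        System.out.println(activeTags("img, script"));
    }
}
```

With an empty or missing value the full tag list stays active, so the current behaviour would remain the default.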

What is your opinion? If you think it is a valid issue I can make a patch for this.

Regards,
Marcin


Re: Limiting outlink tags.

Posted by Marcin Okraszewski <ok...@o2.pl>.
I finally found some time to do it. I have added a patch to
NUTCH-488 with the comma-separated list of tags to ignore.


On 9/7/07, Doğacan Güney <do...@gmail.com> wrote:
> Hi Marcin,
>
> On 9/7/07, Marcin Okraszewski <ok...@o2.pl> wrote:
> > Hi,
> > I have noticed that Nutch considers img/@src an outlink. I suppose in many cases people do not want to treat an image as an outlink. At least I don't. The same applies to script/@src. However, there seems to be no way to limit outlink tags. DOMContentUtils.getOutlinks() takes links from all of a, area, form, frame, iframe, script, link, img. Only the "form" element can be turned off, via the "parser.html.form.use_action" parameter.
> >
> > I would suggest introducing a new configuration parameter that could be used to turn certain elements on or off. It could be done simply with a single parameter containing a comma-separated list of tags to be turned off.
> >
> > What is your opinion? If you think it is a valid issue I can make a patch for this.
>
> There is already NUTCH-488 open for this (with a patch). Feel free to
> add comments/patches/etc. there. Btw, I agree that using a CSV is
> better than using a new configuration parameter for every tag.
>
> >
> > Regards,
> > Marcin
> >
> >
>
>
> --
> Doğacan Güney
>

Re: Limiting outlink tags.

Posted by Doğacan Güney <do...@gmail.com>.
Hi Marcin,

On 9/7/07, Marcin Okraszewski <ok...@o2.pl> wrote:
> Hi,
> I have noticed that Nutch considers img/@src an outlink. I suppose in many cases people do not want to treat an image as an outlink. At least I don't. The same applies to script/@src. However, there seems to be no way to limit outlink tags. DOMContentUtils.getOutlinks() takes links from all of a, area, form, frame, iframe, script, link, img. Only the "form" element can be turned off, via the "parser.html.form.use_action" parameter.
>
> I would suggest introducing a new configuration parameter that could be used to turn certain elements on or off. It could be done simply with a single parameter containing a comma-separated list of tags to be turned off.
>
> What is your opinion? If you think it is a valid issue I can make a patch for this.

There is already NUTCH-488 open for this (with a patch). Feel free to
add comments/patches/etc. there. Btw, I agree that using a CSV is
better than using a new configuration parameter for every tag.

>
> Regards,
> Marcin
>
>


-- 
Doğacan Güney