Posted to user@nutch.apache.org by "Chaushu, Shani" <sh...@intel.com> on 2016/03/22 16:27:21 UTC

don't crawl links in header

Hi,
Sometimes the header of a page contains <link> tags that point to pages that are not interesting, such as source code, for example http://......../somexmlsettingsdata?type=xml
This link does not end in an xml suffix, so I can't filter it out, but I want Nutch to take links only from the body and not from the header.
Is this possible? (I'm using nutch 1.9)

Thanks,
Shani

---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Re: don't crawl links in header

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Shani,

> Sometimes the header of a page contains <link> tags that point
> to pages that are not interesting, such as source code ...

Yes, that's often true.

> ... I want Nutch to take links only from the body and not from the header.
> Is this possible? (I'm using nutch 1.9)

Have a look at the following property:

<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value></value>
  <description>Comma separated list of HTML tags, from which outlinks
  shouldn't be extracted. Nutch takes links from: a, area, form, frame,
  iframe, script, link, img. If you add any of those tags here, it
  won't be taken. Default is empty list. Probably reasonable value
  for most people would be "img,script,link".</description>
</property>

This makes it easy to exclude all "link" links.
Afaik, there is no way to follow only links from
the body. Also, be aware that some "link" links, e.g.,
  <link rel="canonical" href="..." />
are worth following. Of course, well-maintained sites
will always make these pages reachable by ordinary "a" links.
So, normally, that's not a problem.
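
As a minimal sketch of the property in use, the override below drops only "link" outlinks and leaves the other tags alone. It assumes the usual Nutch setup where conf/nutch-site.xml overrides the defaults in nutch-default.xml; the value "link" is just one possible choice, and the description's suggested "img,script,link" would work the same way.

```xml
<!-- conf/nutch-site.xml (assumed location of local overrides) -->
<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <!-- Assumption for illustration: ignore only <link> tags -->
  <value>link</value>
</property>
```

The setting should take effect only for pages parsed after the change, so segments that were already parsed would probably need to be parsed again.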

Best,
Sebastian
