Posted to user@nutch.apache.org by "Chaushu, Shani" <sh...@intel.com> on 2017/03/28 09:29:26 UTC

Nutch 1.12 with custom metadata

Hi,
I'm trying to run a crawl with Nutch 1.12, and the seed file contains URLs in this form (like the example in the code comments):
http://www.nutch.org/ \t key=value

When I try to crawl, the log shows an invalid-URL error for http://www.nutch.org/%20\t%20key=value - the tab and the key=value custom metadata are treated as part of the URL, so the injector didn't parse the metadata.
I tried adding urlmeta to the plugin.includes property and adding the key to urlmeta.tags.

Am I missing something? Is there something else I need to do to make it work?

Thanks,
Shani

---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Re: Nutch 1.12 with custom metadata

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

The example is meant as a tab-separated seed file, i.e. with a real tab character between the fields:
 <url> <tab> <key>=<value>

The invalid URL looks as if the seed list contained these literal characters instead of a tab:
  url  space  backslash  letter t  space  key=value

Best Sebastian
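
For illustration only, here is a minimal Python sketch of the difference described above. It is not part of Nutch; the helper names seed_line and has_literal_backslash_t are hypothetical, and the only assumption about the seed format is the one stated in the reply: fields separated by a real tab character.

```python
def seed_line(url, metadata):
    """Return one seed-list line: the URL, then tab-separated key=value pairs."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    # "\t" in a normal Python string literal is a single tab character.
    return "\t".join(fields)


def has_literal_backslash_t(line):
    """True if the line contains the two characters backslash and 't'."""
    # "\\t" is a backslash followed by the letter t, not a tab.
    return "\\t" in line


# A correctly formed line, and the line as it appears to have been typed.
good = seed_line("http://www.nutch.org/", {"key": "value"})
bad = r"http://www.nutch.org/ \t key=value"

print(repr(good))
print(has_literal_backslash_t(good), has_literal_backslash_t(bad))
```

Writing the lines produced by seed_line to the seed file (for example with open(path, "w")) should give the injector a real tab to split on, while a check like has_literal_backslash_t can flag lines where the escape sequence was typed out literally.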
