Posted to dev@nutch.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2010/06/25 06:39:31 UTC

Re: How to unsubscribe from this group?

Send an email to dev-unsubscribe@nutch.apache.org and follow the instructions from there...


On 6/24/10 9:36 PM, "Vimal Varghese" <vi...@tcs.com> wrote:




Vimal Varghese

-----Claus Schröter (JIRA) wrote: -----
To: dev@nutch.apache.org
From: Claus Schröter (JIRA) <ji...@apache.org>
Date: 06/25/2010 01:59AM
Subject: [jira] Commented: (NUTCH-655) Injecting Crawl metadata

    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882313#action_12882313 ]

Claus Schröter commented on NUTCH-655:
--------------------------------------

Hi Julien, thanks for this patch...
Is there any way for sub-URLs to inherit the metadata, or parts of it, while crawling?
I fiddled around with a scoring filter, but with no success.

Cheers
Claus

> Injecting Crawl metadata
> ------------------------
>
>                 Key: NUTCH-655
>                 URL: https://issues.apache.org/jira/browse/NUTCH-655
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: Injector.patch, NUTCH-655.v2
>
>
> The attached patch allows metadata to be injected into the crawlDB. The input file has to contain fields separated by tabs, with the URL in the first column. The metadata names and values are separated by '='. An input line might look like this:
> http://www.myurl.com \t categ=value1 \t categ2=value2
> This functionality can be useful for storing external knowledge and indexing it with a custom plugin.
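
For readers unfamiliar with the input format described in the issue, here is a minimal sketch of how one such line could be split into a URL and its metadata. It is not the code from Injector.patch: the function name parse_seed_line is hypothetical, and trimming whitespace around fields and skipping fields without '=' are assumptions made for illustration.

```python
def parse_seed_line(line):
    """Split one tab-separated seed line into (url, metadata).

    Hypothetical helper for illustration only; it does not reproduce
    the behaviour of the Nutch Injector patch.
    """
    # Fields are separated by tabs; the URL is in the first column.
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    url = fields[0]

    metadata = {}
    for field in fields[1:]:
        # Metadata names and values are separated by '='.
        name, sep, value = field.partition("=")
        if not sep or not name:
            # Assumption: silently skip empty or malformed fields.
            continue
        metadata[name.strip()] = value.strip()
    return url, metadata


# Example call using the sample line from the issue description.
url, metadata = parse_seed_line(
    "http://www.myurl.com \t categ=value1 \t categ2=value2"
)
```

Only the first '=' in a field is treated as the separator here, so a value that itself contains '=' is kept intact; whether the patch behaves the same way is not stated in the issue.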


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++