Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2013/08/08 17:07:00 UTC

[jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

     [ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1622:
---------------------------------

    Attachment: NUTCH-1622.patch
    
> Create Outlinks with metadata
> -----------------------------
>
>                 Key: NUTCH-1622
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1622
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.7, 2.2.1
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1622.patch
>
>
> Being able to specify metadata when creating an outlink is extremely useful, as it allows information to be passed from a source page to the pages it links to. We use that routinely within our custom parsers in combination with the url-meta plugin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Posted by Richard Bergmann <RB...@colsa.com>.
Julien,

For what it's worth (and for anyone out there who may be interested in the code), I created a custom parse-feed plugin based on the feed plugin (i.e., I didn't directly "hack" the feed plugin itself), because I needed extra information from the feed item XML: specifically geo data, which I got by including the Rome module that handles it.

So the parse-feed parser:

  o  Captures the relevant XML elements via Rome,

  o  Places those element values into a Metadata object, and

  o  Places that Metadata object into the Outlink for each item.
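
The parser steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: FeedItem and SketchOutlink are hypothetical stand-ins written for this example, not the Nutch Outlink or Metadata classes and not the Rome API, and the "geo.lat" / "geo.long" keys and coordinate values are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one parsed feed item (NOT a Rome class).
final class FeedItem {
    final String link;
    final String title;
    final Map<String, String> geo; // values captured from the item XML

    FeedItem(String link, String title, Map<String, String> geo) {
        this.link = link;
        this.title = title;
        this.geo = geo;
    }
}

// Hypothetical stand-in for an outlink that can carry metadata (NOT the Nutch class).
final class SketchOutlink {
    final String toUrl;
    final String anchor;
    final Map<String, String> metadata = new LinkedHashMap<>();

    SketchOutlink(String toUrl, String anchor) {
        this.toUrl = toUrl;
        this.anchor = anchor;
    }
}

final class FeedOutlinkSketch {
    // Builds one outlink per feed item and copies the item's values into it.
    static List<SketchOutlink> toOutlinks(List<FeedItem> items) {
        List<SketchOutlink> outlinks = new ArrayList<>();
        for (FeedItem item : items) {
            SketchOutlink link = new SketchOutlink(item.link, item.title);
            link.metadata.putAll(item.geo);
            outlinks.add(link);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        Map<String, String> geo = new LinkedHashMap<>();
        geo.put("geo.lat", "34.73");   // made-up example values
        geo.put("geo.long", "-86.59");
        FeedItem item = new FeedItem("http://example.com/story", "Story", geo);
        for (SketchOutlink o : toOutlinks(Collections.singletonList(item))) {
            System.out.println(o.toUrl + " " + o.metadata);
        }
    }
}
```

The point of the design is that the metadata travels with the link rather than with the feed page, so it is still available when the linked page itself is later fetched and indexed.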


The parse-feed indexer:

  o  Attempts to locate the Metadata in the CrawlDatum and, if found,

  o  Populates the NutchDocument with fields that correspond to the Metadata entries.
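
The indexer steps above might look something like the sketch below. Again this is an assumption-laden illustration, not the plugin's code: plain maps stand in for the CrawlDatum metadata and for the NutchDocument, and the KEYS list is a hypothetical set of field names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IndexFromDatumSketch {
    // Hypothetical metadata keys the indexer is interested in.
    static final List<String> KEYS = Arrays.asList("geo.lat", "geo.long");

    // datumMeta stands in for the per-URL metadata, doc for the index document.
    static void copyFields(Map<String, String> datumMeta, Map<String, List<String>> doc) {
        if (datumMeta == null) {
            return; // nothing was attached to this URL, leave the document alone
        }
        for (String key : KEYS) {
            String value = datumMeta.get(key);
            if (value != null) {
                // One document field per metadata entry that was found.
                doc.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("geo.lat", "34.73"); // made-up example value
        Map<String, List<String>> doc = new LinkedHashMap<>();
        copyFields(meta, doc);
        System.out.println(doc);
    }
}
```

Pages that were not reached through a feed item simply carry no such metadata, so the "if found" check keeps them from getting empty fields.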


Thanks again.

Rich

From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Friday, August 09, 2013 4:14 AM
To: dev@nutch.apache.org
Subject: Re: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Hi Rich,

Glad you got it to work. The metadata does indeed end up in the CrawlDatum, as if it had been passed in at injection time. From there you can use the urlmeta + index-metadata plugins.

It would be worth checking whether Tika passes on the metadata, in which case you could use an HtmlParseFilter to pull the values out with XPath and then add the metadata to the outlinks. It would be a bit neater, as you wouldn't need to hack the feed plugin at all.

Thanks for sharing your experience

Julien




Re: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Posted by Julien Nioche <li...@gmail.com>.
Hi Rich,

Glad you got it to work. The metadata does indeed end up in the CrawlDatum,
as if it had been passed in at injection time. From there you can use the
urlmeta + index-metadata plugins.

It would be worth checking whether Tika passes on the metadata, in which
case you could use an HtmlParseFilter to pull the values out with XPath and
then add the metadata to the outlinks. It would be a bit neater, as you
wouldn't need to hack the feed plugin at all.
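
A minimal sketch of the XPath idea above, using only the JDK's javax.xml classes. It does not implement the Nutch HtmlParseFilter interface, and the element names (item, link, lat, long) are hypothetical; a real feed would likely use namespaced geo elements and need a NamespaceContext.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XPathOutlinkMetaSketch {
    // Returns, for each item link, the metadata that would be attached to that outlink.
    static Map<String, Map<String, String>> extract(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate("//item", doc, XPathConstants.NODESET);
            Map<String, Map<String, String>> byUrl = new LinkedHashMap<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                String link = xpath.evaluate("link", item).trim();
                if (link.isEmpty()) {
                    continue; // no target URL, so there is no outlink to annotate
                }
                Map<String, String> meta = new LinkedHashMap<>();
                meta.put("lat", xpath.evaluate("lat", item).trim());
                meta.put("long", xpath.evaluate("long", item).trim());
                byUrl.put(link, meta);
            }
            return byUrl;
        } catch (Exception e) {
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    // Parses an XML string into a DOM document (not namespace-aware).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><item><link>http://example.com/a</link>"
            + "<lat>1.5</lat><long>2.5</long></item></channel></rss>";
        System.out.println(extract(parse(xml)));
    }
}
```

In a real filter the resulting per-URL map would be matched against the outlinks the parser already produced, so each outlink picks up the values of its own item.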

Thanks for sharing your experience

Julien




On 8 August 2013 22:33, Richard Bergmann <RB...@colsa.com> wrote:

> Julien,
>
> No need to reply -- I "guessed" properly.  The metadata that I am stuffing
> into the outlinks is, indeed, coming back to me in the CrawlDatum, so I am
> now successfully building my index with the crawled/linked page content and
> the RSS feed item info (from metadata).
>
> Of course this required your patch (NUTCH-1622).  Thank you!
>
> Rich Bergmann
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Posted by Richard Bergmann <RB...@colsa.com>.
Julien,

No need to reply -- I "guessed" properly.  The metadata that I am stuffing into the outlinks is, indeed, coming back to me in the CrawlDatum, so I am now successfully building my index with the crawled/linked page content and the RSS feed item info (from metadata).

Of course this required your patch (NUTCH-1622).  Thank you!

Rich Bergmann

-----Original Message-----
From: Richard Bergmann [mailto:RBERGMANN@colsa.com] 
Sent: Thursday, August 08, 2013 12:58 PM
To: dev@nutch.apache.org
Subject: RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Julien,

I am trying to save myself a bit of time here by asking you this question (and making all subscribers listen!) before digging into the code:

Based on this patch (which I have applied), where will the metadata show up when it gets to my IndexingFilter extension? CrawlDatum.getMetaData()? Somewhere else? Do I have to modify an HTML parser to ensure the metadata gets to my IndexingFilter?

With the current "feed" Parser and IndexingFilter the metadata I am interested in is stuffed into the parse metadata: Parse.getData().getParseMeta().

Thank you!

Rich Bergmann



RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Posted by Richard Bergmann <RB...@colsa.com>.
Julien,

I am trying to save myself a bit of time here by asking you this question (and making all subscribers listen!) before digging into the code:

Based on this patch (which I have applied), where will the metadata show up when it gets to my IndexingFilter extension? CrawlDatum.getMetaData()? Somewhere else? Do I have to modify an HTML parser to ensure the metadata gets to my IndexingFilter?

With the current "feed" Parser and IndexingFilter the metadata I am interested in is stuffed into the parse metadata: Parse.getData().getParseMeta().

Thank you!

Rich Bergmann
