Posted to dev@nutch.apache.org by "julien nioche (JIRA)" <ji...@apache.org> on 2008/10/01 14:47:44 UTC

[jira] Created: (NUTCH-655) Injecting Crawl metadata

Injecting Crawl metadata
------------------------

                 Key: NUTCH-655
                 URL: https://issues.apache.org/jira/browse/NUTCH-655
             Project: Nutch
          Issue Type: Improvement
          Components: injector
            Reporter: julien nioche
            Priority: Minor
         Attachments: Injector.patch

The attached patch allows metadata to be injected into the crawlDB. The input file must contain tab-separated fields, with the URL in the first column. Metadata names and values are separated by '='. An input line might look like this:
http://www.myurl.com  \t  categ=value1 \t categ2=value2

This functionality is useful for storing external knowledge and indexing it with a custom plugin.
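
As a rough illustration of the input format described above, here is a minimal standalone sketch in Python. It is not the code from Injector.patch: the function name parse_seed_line is hypothetical, and trimming whitespace around fields and skipping fields without '=' are assumptions of this sketch, not documented behaviour of the patch.

```python
def parse_seed_line(line):
    """Split one tab-separated seed line into (url, metadata dict)."""
    # Assumption: whitespace around each field is not significant.
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    url = fields[0]
    metadata = {}
    for field in fields[1:]:
        name, sep, value = field.partition("=")
        if not sep or not name.strip():
            # Not a name=value pair; this sketch skips it silently.
            continue
        metadata[name.strip()] = value.strip()
    return url, metadata


# Example call using the line from the issue description.
url, metadata = parse_seed_line(
    "http://www.myurl.com\tcateg=value1\tcateg2=value2"
)
print(url, metadata)
```

Only the first '=' in a field is treated as the separator here, so a value may itself contain '='.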

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-655) Injecting Crawl metadata

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-655:
--------------------------------

    Attachment: NUTCH-655.v2

Improved version of the patch, which allows custom scores to be specified for the URLs. A score is specified by adding a field that holds a plain float value rather than a name=value pair, e.g.
http://www.lemonde.fr/    label=newspaper  10.0
http://www.lequipe.fr/    label=sports  2.0
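
The bare-float convention can be sketched as follows, assuming the fields are tab-separated as in the original description. This is a standalone Python illustration, not the NUTCH-655.v2 code: the name parse_seed_line_with_score and the default_score parameter are hypothetical, and ignoring a field that is neither name=value nor a valid float is a choice made for this sketch only.

```python
def parse_seed_line_with_score(line, default_score=1.0):
    """Return (url, metadata, score) for one tab-separated seed line."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    url = fields[0]
    metadata = {}
    score = default_score  # hypothetical default, used when no score is given
    for field in fields[1:]:
        if not field:
            continue
        name, sep, value = field.partition("=")
        if sep:
            # A name=value field is ordinary metadata.
            metadata[name.strip()] = value.strip()
            continue
        try:
            # A field without '=' is read as the custom score.
            score = float(field)
        except ValueError:
            # Neither metadata nor a number; ignored in this sketch.
            pass
    return url, metadata, score


print(parse_seed_line_with_score("http://www.lemonde.fr/\tlabel=newspaper\t10.0"))
```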



[jira] Issue Comment Edited: (NUTCH-655) Injecting Crawl metadata

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641047#action_12641047 ] 

otis edited comment on NUTCH-655 at 10/20/08 9:29 AM:
------------------------------------------------------------------

I think we need a generic way of keeping metadata about hosts ... I think I started that somewhere in JIRA a while back ... aha: NUTCH-628

I'm mentioning this simply because we can probably use the same, or a very similar, mechanism for keeping metadata about hosts and individual URLs.

But it looks like NUTCH-650 may be the way of the future.




[jira] Updated: (NUTCH-655) Injecting Crawl metadata

Posted by "julien nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated NUTCH-655:
--------------------------------

    Attachment: Injector.patch

Patch for injecting metadata into a crawlDB



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797013#action_12797013 ] 

Andrzej Bialecki  commented on NUTCH-655:
-----------------------------------------

I'm not sure about the latest addition (the score option). If we go this route, then I suggest taking the last small step and recognizing reserved metadata keys, so that other useful things such as setting the fetch interval can be done too. That is, define and recognize "nutch.score" and "nutch.fetchInterval", and document them properly somewhere (wiki? javadoc? command-line synopsis?).



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641047#action_12641047 ] 

Otis Gospodnetic commented on NUTCH-655:
----------------------------------------

I think we need a generic way of keeping metadata about hosts ... I think I started that somewhere in JIRA a while back ... aha: https://issues.apache.org/jira/browse/NUTCH-628

I'm mentioning this simply because we can probably use the same, or a very similar, mechanism for keeping metadata about hosts and individual URLs.




[jira] Resolved: (NUTCH-655) Injecting Crawl metadata

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-655.
---------------------------------

    Resolution: Fixed

Committed revision 896539



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666283#action_12666283 ] 

Otis Gospodnetic commented on NUTCH-655:
----------------------------------------

1.1 sounds good to me.




[jira] Assigned: (NUTCH-655) Injecting Crawl metadata

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-655:
-----------------------------------

    Assignee: Julien Nioche



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636016#action_12636016 ] 

Doğacan Güney commented on NUTCH-655:
-------------------------------------

We may discuss if tab-separation is the best way to go, but +1 for the idea from me.



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "julien nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645471#action_12645471 ] 

julien nioche commented on NUTCH-655:
-------------------------------------

I agree that https://issues.apache.org/jira/browse/NUTCH-650 would provide a cleaner way of doing this, but since it is a substantial change it might take some time before it is committed.

Regarding https://issues.apache.org/jira/browse/NUTCH-628, we could also have a similar injector for hostDBs that could be used to store or update statistics, or any other information about hosts, without necessarily getting it from the crawlDB.



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797176#action_12797176 ] 

Julien Nioche commented on NUTCH-655:
-------------------------------------

Good idea. I've made the modification and documented it in the javadoc:

The URL files contain one URL per line, optionally followed by custom metadata separated by tabs, with each metadata key separated from its value by '='.
Note that some metadata keys are reserved:
- <i>nutch.score</i> : sets a custom score for a specific URL <br>
- <i>nutch.fetchInterval</i> : sets a custom fetch interval for a specific URL <br>
e.g. http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
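
To make the reserved-key behaviour concrete, here is a small standalone Python sketch. It is an illustration only and not the committed Injector code: the function name parse_seed_line_reserved, the default_score and default_interval parameters, and the decision to keep the default when a reserved value does not parse as a number are all assumptions of this sketch.

```python
SCORE_KEY = "nutch.score"
FETCH_INTERVAL_KEY = "nutch.fetchInterval"


def parse_seed_line_reserved(line, default_score=1.0, default_interval=None):
    """Return (url, score, fetch_interval, metadata) for one seed line."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    url = fields[0]
    score = default_score        # hypothetical default
    interval = default_interval  # hypothetical default (None = not set)
    metadata = {}
    for field in fields[1:]:
        name, sep, value = field.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or not name:
            continue  # not a key=value field; skipped in this sketch
        try:
            if name == SCORE_KEY:
                score = float(value)
            elif name == FETCH_INTERVAL_KEY:
                interval = int(value)
            else:
                # Any other key is kept as custom metadata.
                metadata[name] = value
        except ValueError:
            # Unparsable reserved value: keep the default (an assumption).
            pass
    return url, score, interval, metadata


print(parse_seed_line_reserved(
    "http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000"
    "\tuserType=open_source"
))
```

In this sketch the reserved keys are consumed by the parser and do not end up in the returned metadata; whether the committed code also stores them as metadata is not stated in this thread.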
 



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796815#action_12796815 ] 

Julien Nioche commented on NUTCH-655:
-------------------------------------

Any objections to committing this patch?  



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665848#action_12665848 ] 

Doğacan Güney commented on NUTCH-655:
-------------------------------------

Is everyone OK with moving this issue to target 1.1 release?



[jira] Closed: (NUTCH-655) Injecting Crawl metadata

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-655.
-------------------------------




[jira] Updated: (NUTCH-655) Injecting Crawl metadata

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-655:
--------------------------------

    Fix Version/s: 1.1

Moved to 1.1.
