Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/07/25 19:51:51 UTC

[jira] Resolved: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-855.
-------------------------------------

    Fix Version/s:     (was: 2.0)
       Resolution: Fixed

- Applied to 1.2-branch in r979079. Cleaned up comments, removed author tags (Nutch decided a long time ago that the project would move away from author tags), cleaned up formatting. Patch doesn't apply to trunk or Nutchbase branch because LuceneWriter doesn't exist anymore for Nutch 2.0. If someone wants to port this to Nutchbase-ville, by all means, but if so, please open a new issue for it. Thanks very much, Scott!

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
> www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
> or:
> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
> To activate this plugin, you must modify two properties in your nutch-site.xml:
> 1. plugin.includes
>    add: urlmeta
>    to:   <value>...</value>
>    e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
> 2. urlmeta.tags
>    Insert a comma-delimited list of metatags. Using the above example:
>    <value>corp_owner, will_it_blend, genre</value>
>    Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
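
The seed-file format and the urlmeta.tags rule described above can be sketched roughly as follows. This is a minimal illustration in Python, not the plugin's actual Java code: the function names (parse_seed_line, parse_tag_list, select_tags) are hypothetical, and trimming whitespace around the comma-delimited tag names is an assumption rather than confirmed plugin behaviour.

```python
def parse_seed_line(line):
    """Split one tab-delimited seed line into (url, metadata dict)."""
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    meta = {}
    for field in fields[1:]:
        if "=" not in field:
            continue  # skip fields that are not key=value pairs
        key, value = field.split("=", 1)
        meta[key.strip()] = value.strip()
    return url, meta


def parse_tag_list(value):
    """Parse a comma-delimited urlmeta.tags value (whitespace trimming is assumed)."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def select_tags(meta, allowed_tags):
    """Keep only the metadata whose key is listed in urlmeta.tags."""
    return {key: value for key, value in meta.items() if key in allowed_tags}


# Seed lines and tag list taken from the example in the issue description.
seed_lines = [
    "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably",
    "http://engadget.com/\tcorp_owner=Weblogs\tgenre=geeksquad_thriller",
]
allowed = parse_tag_list("corp_owner, will_it_blend, genre")

for line in seed_lines:
    url, meta = parse_seed_line(line)
    print(url, select_tags(meta, allowed))
```

A URL does not need to carry every tag, so each dictionary holds only the keys present on that line; a key that is missing from the tag list would simply be dropped by select_tags.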

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Resolved: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Scott,

> Aww you removed my sarcasm.

Yep.

> Also, I think you committed bits with references
> to "index-urlmeta". That might have been my bad for leaving it in.

I'm guessing you meant the single sentence in javadoc that referenced
activating your plugins via the index-urlmeta plugin, right? Fixed that, in
r979128.

> 
> I changed it to just "urlmeta" as it's both an indexing and a scoring filter.
> I think the comments need to be adjusted to reflect that, else I may be the
> target of a hit-and-run.
> 

Done!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: [jira] Resolved: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by Scott Gonyea <sc...@aitrus.org>.
Aww you removed my sarcasm. Also, I think you committed bits with references to "index-urlmeta". That might have been my bad for leaving it in.

I changed it to just "urlmeta" as it's both an indexing and a scoring filter. I think the comments need to be adjusted to reflect that, else I may be the target of a hit-and-run.

