Posted to dev@nutch.apache.org by "Scott Gonyea (JIRA)" <ji...@apache.org> on 2010/07/15 03:48:52 UTC

[jira] Created: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
-------------------------------------------------------------------------------------------------------------

                 Key: NUTCH-855
                 URL: https://issues.apache.org/jira/browse/NUTCH-855
             Project: Nutch
          Issue Type: New Feature
          Components: generator, indexer
    Affects Versions: 1.1
            Reporter: Scott Gonyea
             Fix For: 1.2


This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.

The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
[www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
or:
http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
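
As a rough illustration of the seed-file format above, the following sketch splits one line into its URL and its key=value pairs. It is not the Injector code from NUTCH-655; the class and method names (SeedLineParser, parseUrl, parseMetadata) are made up for this example, and it assumes that a field without an "=" can simply be skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: splits one seed-file line into a URL and its key=value metadata. */
final class SeedLineParser {

    /** Returns the URL, i.e. the first tab-delimited field of the line. */
    static String parseUrl(String line) {
        return line.split("\t")[0].trim();
    }

    /** Returns the key=value fields that follow the URL, in file order. */
    static Map<String, String> parseMetadata(String line) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String[] fields = line.split("\t");
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // assumption: skip fields that are not of the form key=value
            }
            metadata.put(fields[i].substring(0, eq).trim(),
                         fields[i].substring(eq + 1).trim());
        }
        return metadata;
    }

    public static void main(String[] args) {
        String line = "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably";
        System.out.println(parseUrl(line));      // http://slashdot.org/
        System.out.println(parseMetadata(line)); // {corp_owner=Geeknet, will_it_blend=indubitably}
    }
}
```

A line that carries only a URL yields an empty map, which matches the point that tags are optional per URL.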

To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes
   from: index-(basic|anchor)
   to:   index-(basic|anchor|urlmeta)
2. urlmeta.tags
   Insert a comma-delimited list of metatags, with no whitespace around the commas (padding breaks tag matching on Hadoop versions earlier than 0.21). Using the above example:
   <value>corp_owner,will_it_blend,genre</value>
   Note that you do not need to include every tag with every URL. However, you must list each tag here if you want it to be propagated and later indexed.
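
To make the rule about urlmeta.tags concrete, here is a minimal sketch of the selection step: only tags named in the configured list are carried from a page to its outlinks, and a URL that lacks a tag is simply left without it. This is a conceptual model, not the plugin's code; UrlMetaPropagationSketch and tagsToPropagate are hypothetical names, and plain Java maps stand in for Nutch's own metadata types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: models which injected tags would be carried to outlinks. */
final class UrlMetaPropagationSketch {

    /** Keeps only the metadata entries whose key is in the configured tag list. */
    static Map<String, String> tagsToPropagate(List<String> configuredTags,
                                               Map<String, String> parentMetadata) {
        Map<String, String> propagated = new LinkedHashMap<>();
        for (String tag : configuredTags) {
            String value = parentMetadata.get(tag);
            if (value != null) { // a URL need not carry every configured tag
                propagated.put(tag, value);
            }
        }
        return propagated;
    }

    public static void main(String[] args) {
        List<String> configured = Arrays.asList("corp_owner", "will_it_blend", "genre");
        Map<String, String> slashdot = new LinkedHashMap<>();
        slashdot.put("corp_owner", "Geeknet");
        slashdot.put("will_it_blend", "indubitably");
        slashdot.put("editor", "unlisted"); // hypothetical tag, not in urlmeta.tags, so dropped
        System.out.println(tagsToPropagate(configured, slashdot));
        // {corp_owner=Geeknet, will_it_blend=indubitably}
    }
}
```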


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by Scott Gonyea <sc...@aitrus.org>.
Sorry about the spam, everyone.  I hope my patch didn't suck too much :).

On Wed, Jul 14, 2010 at 6:53 PM, Scott Gonyea (JIRA) <ji...@apache.org> wrote:

>     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Scott Gonyea updated NUTCH-855:
> -------------------------------
>
>     Attachment: nutch-855.txt

[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Attachment: nutch-855.txt

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>             Fix For: 1.2
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h


[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Attachment:     (was: nutch-855.txt)


[jira] Assigned: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-855:
---------------------------------------

    Assignee: Chris A. Mattmann


[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Attachment:     (was: nutch-855.txt)


[jira] Commented: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896271#action_12896271 ] 

Chris A. Mattmann commented on NUTCH-855:
-----------------------------------------

Updated the docs with your new comments, Scott, in r983257. Thanks!


[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Attachment: nutch-855.txt

This is my revised patch, with some small bug fixes.


[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Fix Version/s: 2.0
      Description: 
This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.

The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
or:
http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller

To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes
   add: urlmeta
   to:   <value>...</value>
   e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
2. urlmeta.tags
   Insert a comma-delimited list of metatags, with no whitespace around the commas (padding breaks tag matching on Hadoop versions earlier than 0.21). Using the above example:
   <value>corp_owner,will_it_blend,genre</value>
   Note that you do not need to include every tag with every URL. However, you must list each tag here if you want it to be propagated and later indexed.


  was:
This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.

The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
[www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
or:
http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller

To activate this plugin, you must modify two properties in your nutch-sites.xml:
1. plugin.includes
   from: index-(basic|anchor)
   to:   index-(basic|anchor|urlmeta)
2. urlmeta.tags
   Insert a comma-delimited list of metatags. Using the above example:
   <value>corp_owner, will_it_blend, genre</value>
   Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.



Updated the comments; a revised patch is now available. It's more robust against the nefarious "null" and his NullPointerException cabal.


[jira] Commented: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896223#action_12896223 ] 

Scott Gonyea commented on NUTCH-855:
------------------------------------

Let it be known, to anyone who uses this:

The "urlmeta.tags" must be comma-delimited, with no white-space to pad the boundaries.  Why?

http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/conf/Configuration.java?annotate=394984&pathrev=394984

I have no damn clue.  After wasting 1.5 days trying to figure out why my metatags were not working...  I finally found the answer.  I originally tested against a hadoop-0.21 build, not thinking that white-space would trash days' worth of my time.

I seriously spent the last 2 hours wondering who it was that I hated so much, for not thinking ahead... Then I saw that commit: the code used to split on commas and the white-space around them... and that behavior vanished for no stated reason.  So odd.  This was my pedantic deed for the month.

Grrrr, Doug Cutting!
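
For anyone who wants to see the failure mode in isolation, here is a small sketch of it. It does not call Hadoop's Configuration class; it only models the behavior described above with plain string handling, and TagListSplitSketch, splitUntrimmed, and splitTrimmed are names invented for the example. Splitting on commas alone leaves the padding inside each token, so a lookup for "genre" misses the token " genre".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: shows why white-space padding breaks tag matching. */
final class TagListSplitSketch {

    /** Splits on commas only, keeping any surrounding white-space in each token. */
    static List<String> splitUntrimmed(String value) {
        List<String> tags = new ArrayList<>();
        for (String token : value.split(",")) {
            tags.add(token);
        }
        return tags;
    }

    /** Splits on commas and trims each token, dropping empty ones. */
    static List<String> splitTrimmed(String value) {
        List<String> tags = new ArrayList<>();
        for (String token : value.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                tags.add(trimmed);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String padded = "corp_owner, will_it_blend, genre";
        System.out.println(splitUntrimmed(padded).contains("genre")); // false: the token is " genre"
        System.out.println(splitTrimmed(padded).contains("genre"));   // true
    }
}
```

Writing the value without padding ("corp_owner,will_it_blend,genre") gives the same tokens under either split, which is why it is the safe form.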


[jira] Commented: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896224#action_12896224 ] 

Scott Gonyea commented on NUTCH-855:
------------------------------------

If it wasn't clear from my prior comment, the urlmeta property in nutch-site.xml should look like:

<property>
  <name>urlmeta.tags</name>
  <value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a page's outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3")
  </description>
</property>


[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Comment: was deleted

(was: If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:

<property>
  <name>urlmeta.tags</name>
  <value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a page's outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3")  Do not pad the tags with
    white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  </description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21...  Then it's a wash.  I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated godawful bug.  I'm looking out for you, long lost not-twin.)

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
> www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
> or:
> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
> To activate this plugin, you must modify two properties in your nutch-site.xml:
> 1. plugin.includes
>    add: urlmeta
>    to:   <value>...</value>
>    e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
> 2. urlmeta.tags
>    Insert a comma-delimited list of metatags, with no white-space around the commas. Using the above example:
>    <value>corp_owner,will_it_blend,genre</value>
>    Note that you do not need to include every tag with every URL. However, you must list each tag here if you want it to be propagated and later indexed.
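
For anyone following along, here is a minimal sketch of how one line of the tab-delimited seed file described above breaks down into a URL plus key=value tags. It is not Nutch's Injector code; the class and method names (SeedLineSketch, url, metadata) are made up for illustration, and skipping fields without an "=" is an assumption about how malformed fields might be handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for one seed-file line (hypothetical, not Nutch code). */
final class SeedLineSketch {

    /** Returns the URL, i.e. the first tab-separated field. */
    static String url(String line) {
        return line.split("\t")[0].trim();
    }

    /** Returns the key=value fields that follow the URL, in file order. */
    static Map<String, String> metadata(String line) {
        Map<String, String> meta = new LinkedHashMap<>();
        String[] fields = line.split("\t");
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields that are not key=value pairs
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return meta;
    }

    public static void main(String[] args) {
        String line = "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably";
        System.out.println(url(line));
        System.out.println(metadata(line));
    }
}
```

Only the keys that are also listed in urlmeta.tags would be propagated and indexed; the sketch covers the file format alone.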



[jira] Work started: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-855 started by Chris A. Mattmann.

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2, 2.0
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>


[jira] Issue Comment Edited: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896223#action_12896223 ] 

Scott Gonyea edited comment on NUTCH-855 at 8/7/10 4:52 AM:
------------------------------------------------------------

FYI for anyone who might use this:

The "urlmeta.tags" must be comma-delimited, with no white-space to pad the boundaries.
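
A rough sketch of why the padding matters, assuming the configured value is split on commas without trimming (which is how the comment above describes Hadoop releases before 0.21). This is not Hadoop's Configuration code; TagListSketch and its methods are hypothetical names used only to show the effect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Shows how an untrimmed comma split leaves padded tag names (hypothetical). */
final class TagListSketch {

    /** Splits on commas only, keeping any surrounding white-space. */
    static List<String> splitRaw(String value) {
        return Arrays.asList(value.split(","));
    }

    /** Splits on commas and trims each tag, dropping empty entries. */
    static List<String> splitTrimmed(String value) {
        List<String> tags = new ArrayList<>();
        for (String tag : value.split(",")) {
            String trimmed = tag.trim();
            if (!trimmed.isEmpty()) {
                tags.add(trimmed);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String padded = "corp_owner, will_it_blend, genre";
        // The raw split yields " will_it_blend", which does not equal the injected key.
        System.out.println(splitRaw(padded).contains("will_it_blend"));
        // Trimming each entry recovers the key as it was injected.
        System.out.println(splitTrimmed(padded).contains("will_it_blend"));
    }
}
```

With an untrimmed split, a padded tag never matches the key stored at injection time, so it is silently dropped rather than reported as an error.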

      was (Author: sgonyea):
    Let it be known, to anyone who uses this:

The "urlmeta.tags" must be comma-delimited, with no white-space to pad the boundaries.  Why?

http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/conf/Configuration.java?annotate=394984&pathrev=394984

I have no damn clue.  After having 1.5 days wasted, trying to figure out why my metatags were not working...  I finally found the answer.  I originally tested against a hadoop-0.21 build, not thinking that white-space would trash my numerous days worth of time.

I seriously spent the last 2-hours wondering who it was that I hated so much, for not thinking ahead... Then I saw that commit, where it used to split based upon commas and bounded white-space... and vanished for no stated reason.  So odd.  This was my pedantic deed for the month.

Grrrr, Doug Cutting!
  


[jira] Issue Comment Edited: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896228#action_12896228 ] 

Scott Gonyea edited comment on NUTCH-855 at 8/7/10 4:56 AM:
------------------------------------------------------------

If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:
<property>
<name>urlmeta.tags</name>
<value>damn,you,lord,cuddlebums</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
<name>urlmeta.tags</name>
<value></value>
<description>
To be used in conjunction with features introduced in NUTCH-655, which allows
for custom metatags to be injected alongside your crawl URLs. Specifying those
custom tags here will allow for their propagation into a page's outlinks, as
well as allow for them to be included as part of an index.
Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
white-space at their boundaries, if you are using Hadoop releases prior to 0.21.
</description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21... Then it's a wash. I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated timewaster.  I'm looking out for you, long lost not-twin.


      was (Author: sgonyea):
    If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:
<property>
<name>urlmeta.tags</name>
<value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
<name>urlmeta.tags</name>
<value></value>
<description>
To be used in conjunction with features introduced in NUTCH-655, which allows
for custom metatags to be injected alongside your crawl URLs. Specifying those
custom tags here will allow for their propagation into a page's outlinks, as
well as allow for them to be included as part of an index.
Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
</description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21... Then it's a wash. I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated godawful bug. I'm looking out for you, long lost not-twin.

  


[jira] Reopened: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea reopened NUTCH-855:
--------------------------------


If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:
<property>
<name>urlmeta.tags</name>
<value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
<name>urlmeta.tags</name>
<value></value>
<description>
To be used in conjunction with features introduced in NUTCH-655, which allows
for custom metatags to be injected alongside your crawl URLs. Specifying those
custom tags here will allow for their propagation into a page's outlinks, as
well as allow for them to be included as part of an index.
Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
</description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21... Then it's a wash. I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated godawful bug. I'm looking out for you, long lost not-twin.




[jira] Issue Comment Edited: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896224#action_12896224 ] 

Scott Gonyea edited comment on NUTCH-855 at 8/7/10 1:59 AM:
------------------------------------------------------------

If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:

<property>
  <name>urlmeta.tags</name>
  <value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a page's outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3")  Do not pad the tags with
    white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  </description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21...  Then it's a wash.  I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated godawful bug.  I'm looking out for you, long lost not-twin.

      was (Author: sgonyea):
    If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:

<property>
  <name>urlmeta.tags</name>
  <value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a page's outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3")  Do not pad the tags with
    white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  </description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21...  Then it's a wash.
  


[jira] Issue Comment Edited: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896224#action_12896224 ] 

Scott Gonyea edited comment on NUTCH-855 at 8/7/10 1:57 AM:
------------------------------------------------------------

If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:

<property>
  <name>urlmeta.tags</name>
  <value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a page's outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3")  Do not pad the tags with
    white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  </description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21...  Then it's a wash.

      was (Author: sgonyea):
    If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:

<property>
  <name>urlmeta.tags</name>
  <value>damn,you,doug,cutting</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a page's outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3")
  </description>
</property>
  


[jira] Issue Comment Edited: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896228#action_12896228 ] 

Scott Gonyea edited comment on NUTCH-855 at 8/7/10 5:11 AM:
------------------------------------------------------------

If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:
<property>
<name>urlmeta.tags</name>
<value>tags,are,sooo,web2.0,man</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
<name>urlmeta.tags</name>
<value></value>
<description>
To be used in conjunction with features introduced in NUTCH-655, which allows
for custom metatags to be injected alongside your crawl URLs. Specifying those
custom tags here will allow for their propagation into a page's outlinks, as
well as allow for them to be included as part of an index.
Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
white-space at their boundaries, if you are using Hadoop releases prior to 0.21.
</description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21... Then it's a wash. I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated timewaster.  I'm looking out for you, long lost not-twin.


      was (Author: sgonyea):
    If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:
<property>
<name>urlmeta.tags</name>
<value>damn,you,lord,cuddlebums</value>
</property>

It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

<property>
<name>urlmeta.tags</name>
<value></value>
<description>
To be used in conjunction with features introduced in NUTCH-655, which allows
for custom metatags to be injected alongside your crawl URLs. Specifying those
custom tags here will allow for their propagation into a page's outlinks, as
well as allow for them to be included as part of an index.
Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
white-space at their boundaries, if you are using Hadoop releases prior to 0.21.
</description>
</property>

Unless, of course, Nutch-1.2 ships with Hadoop-0.21... Then it's a wash. I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated timewaster.  I'm looking out for you, long lost not-twin.

  


Re: [jira] Resolved: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Scott,

> Aww you removed my sarcasm.

Yep.

> Also, I think you committed bits with references
> to "index-urlmeta". That might have been my bad for leaving it in.

I'm guessing you meant the single sentence in javadoc that referenced
activating your plugins via the index-urlmeta plugin, right? Fixed that, in
r979128.

> 
> I changed it to just "urlmeta" as it's both an indexing and a scoring filter.
> I think the comments need to be adjusted to reflect that, else I may be the
> target of a hit-and-run.
> 

Done!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: [jira] Resolved: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by Scott Gonyea <sc...@aitrus.org>.
Aww you removed my sarcasm. Also, I think you committed bits with references to "index-urlmeta". That might have been my bad for leaving it in.

I changed it to just "urlmeta" as it's both an indexing and a scoring filter. I think the comments need to be adjusted to reflect that, else I may be the target of a hit-and-run.

Sent from my iPhone

On Jul 25, 2010, at 10:51 AM, "Chris A. Mattmann (JIRA)" <ji...@apache.org> wrote:

> 
>     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Chris A. Mattmann resolved NUTCH-855.
> -------------------------------------
> 
>    Fix Version/s:     (was: 2.0)
>       Resolution: Fixed
> 
> - Applied to 1.2-branch in r979079. Cleaned up comments, removed author tags (Nutch decided a long time ago that the project would move away from author tags), cleaned up formatting. Patch doesn't apply to trunk or Nutchbase branch because LuceneWriter doesn't exist anymore for Nutch 2.0. If someone wants to port this to Nutchbase-ville, by all means, but if so, please open a new issue for it. Thanks very much, Scott!
> 
>> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
>> -------------------------------------------------------------------------------------------------------------
>> 
>>                Key: NUTCH-855
>>                URL: https://issues.apache.org/jira/browse/NUTCH-855
>>            Project: Nutch
>>         Issue Type: New Feature
>>         Components: generator, indexer
>>   Affects Versions: 1.1
>>           Reporter: Scott Gonyea
>>           Assignee: Chris A. Mattmann
>>            Fix For: 1.2
>> 
>>        Attachments: nutch-855.txt
>> 
>>  Original Estimate: 168h
>> Remaining Estimate: 168h
>> 
>> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
>> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
>> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
>> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
>> www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
>> or:
>> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
>> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
>> To activate this plugin, you must modify two properties in your nutch-site.xml:
>> 1. plugin.includes
>>   add: urlmeta
>>   to:   <value>...</value>
>>   e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
>> 2. urlmeta.tags
>>   Insert a comma-delimited list of metatags. Using the above example:
>>   <value>corp_owner, will_it_blend, genre</value>
>>   Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 

[jira] Resolved: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-855.
-------------------------------------

    Fix Version/s:     (was: 2.0)
       Resolution: Fixed

- Applied to 1.2-branch in r979079. Cleaned up comments, removed author tags (Nutch decided a long time ago that the project would move away from author tags), cleaned up formatting. Patch doesn't apply to trunk or Nutchbase branch because LuceneWriter doesn't exist anymore for Nutch 2.0. If someone wants to port this to Nutchbase-ville, by all means, but if so, please open a new issue for it. Thanks very much, Scott!

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
> www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
> or:
> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
> To activate this plugin, you must modify two properties in your nutch-site.xml:
> 1. plugin.includes
>    add: urlmeta
>    to:   <value>...</value>
>    e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
> 2. urlmeta.tags
>    Insert a comma-delimited list of metatags. Using the above example:
>    <value>corp_owner, will_it_blend, genre</value>
>    Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
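
The seed-file format described above can be sketched as follows. This is only an illustration of the tab-delimited key=value layout, not the Injector's actual code. SeedLine and parse() are hypothetical names, and skipping fields that lack a key is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: splits one seed line into a URL
// and its key=value metadata.
final class SeedLine {
    final String url;
    final Map<String, String> metadata;

    private SeedLine(String url, Map<String, String> metadata) {
        this.url = url;
        this.metadata = metadata;
    }

    static SeedLine parse(String line) {
        // The first tab-separated field is the URL; the rest are key=value pairs.
        String[] fields = line.split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields with no key before '='
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return new SeedLine(fields[0].trim(), meta);
    }

    public static void main(String[] args) {
        SeedLine seed = SeedLine.parse(
            "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably");
        System.out.println(seed.url + " -> " + seed.metadata);
    }
}
```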



[jira] Resolved: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-855.
-------------------------------------

    Resolution: Fixed

My preference is that, rather than reopening issues (which is a real pain for JIRA and CHANGES.txt once they have been marked resolved), you just open a new issue and link it to this one.

I see that you reopened it, I'm guessing because you'd like the description updated in nutch-default.xml. I'll do that now.

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
> www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
> or:
> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
> To activate this plugin, you must modify two properties in your nutch-site.xml:
> 1. plugin.includes
>    add: urlmeta
>    to:   <value>...</value>
>    e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
> 2. urlmeta.tags
>    Insert a comma-delimited list of metatags. Using the above example:
>    <value>corp_owner, will_it_blend, genre</value>
>    Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
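
The note about urlmeta.tags can be illustrated with a short sketch: only tags named in the comma-delimited list are carried from a parent URL to its outlinks, and a tag missing from a given URL is simply skipped. UrlMetaTags, parseTags() and propagate() are hypothetical names. This is not the plugin's actual code, only the behaviour as described in the issue.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the described behaviour; not the urlmeta plugin itself.
final class UrlMetaTags {

    // Parses a comma-delimited value such as "corp_owner, will_it_blend, genre".
    static Set<String> parseTags(String configured) {
        Set<String> tags = new LinkedHashSet<>();
        for (String raw : configured.split(",")) {
            String tag = raw.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    // Copies only the configured tags that the parent URL actually carries.
    static Map<String, String> propagate(Map<String, String> parentMeta, Set<String> tags) {
        Map<String, String> outlinkMeta = new LinkedHashMap<>();
        for (String tag : tags) {
            String value = parentMeta.get(tag);
            if (value != null) {
                outlinkMeta.put(tag, value);
            }
        }
        return outlinkMeta;
    }

    public static void main(String[] args) {
        Set<String> tags = parseTags("corp_owner, will_it_blend, genre");
        Map<String, String> parent = new LinkedHashMap<>();
        parent.put("corp_owner", "Geeknet");
        parent.put("will_it_blend", "indubitably");
        parent.put("not_configured", "dropped"); // not listed in urlmeta.tags
        System.out.println(propagate(parent, tags));
    }
}
```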



[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Attachment: nutch-855

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>             Fix For: 1.2
>
>         Attachments: nutch-855
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
> [www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
> or:
> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
> To activate this plugin, you must modify two properties in your nutch-site.xml:
> 1. plugin.includes
>    from: index-(basic|anchor)
>    to:   index-(basic|anchor|urlmeta)
> 2. urlmeta.tags
>    Insert a comma-delimited list of metatags. Using the above example:
>    <value>corp_owner, will_it_blend, genre</value>
>    Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.



[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Attachment: nutch-855.txt

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>             Fix For: 1.2
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
> [www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
> or:
> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
> To activate this plugin, you must modify two properties in your nutch-site.xml:
> 1. plugin.includes
>    from: index-(basic|anchor)
>    to:   index-(basic|anchor|urlmeta)
> 2. urlmeta.tags
>    Insert a comma-delimited list of metatags. Using the above example:
>    <value>corp_owner, will_it_blend, genre</value>
>    Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.



[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Posted by "Scott Gonyea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------

    Attachment:     (was: nutch-855)

> ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-855
>                 URL: https://issues.apache.org/jira/browse/NUTCH-855
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, indexer
>    Affects Versions: 1.1
>            Reporter: Scott Gonyea
>             Fix For: 1.2
>
>         Attachments: nutch-855.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
> 1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
> 2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.
> The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
> [www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
> or:
> http://slashdot.org/	corp_owner=Geeknet	will_it_blend=indubitably
> http://engadget.com/	corp_owner=Weblogs	genre=geeksquad_thriller
> To activate this plugin, you must modify two properties in your nutch-site.xml:
> 1. plugin.includes
>    from: index-(basic|anchor)
>    to:   index-(basic|anchor|urlmeta)
> 2. urlmeta.tags
>    Insert a comma-delimited list of metatags. Using the above example:
>    <value>corp_owner, will_it_blend, genre</value>
>    Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
