Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2010/09/08 12:42:33 UTC

[jira] Created: (NUTCH-901) Make index-more plug-in configurable

Make index-more plug-in configurable

--------------------------------------

                 Key: NUTCH-901
                 URL: https://issues.apache.org/jira/browse/NUTCH-901
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
            Reporter: Markus Jelsma
             Fix For: 1.2


In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (NUTCH-901) Make index-more plug-in configurable

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-901:
---------------------------------------

    Assignee: Chris A. Mattmann

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Chris A. Mattmann
>             Fix For: 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch, NUTCH-901-trunk.998961.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Updated: (NUTCH-901) Make index-more plug-in configurable

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-901:
------------------------------------

    Fix Version/s: 1.2

- fix for 1.2 as well (sigh, this means *another* RC). Oh well, for the greater good!

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2, 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch, NUTCH-901-trunk.998961.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Work started: (NUTCH-901) Make index-more plug-in configurable

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-901 started by Chris A. Mattmann.

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Chris A. Mattmann
>             Fix For: 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch, NUTCH-901-trunk.998961.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Updated: (NUTCH-901) Make index-more plug-in configurable

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-901:
--------------------------------

    Attachment: NUTCH-901-MarkusJelsma.998958.patch

Here's a patch for version 1.2. It includes a backward-compatible setting in nutch-default.xml and handles the setting in MoreIndexingFilter.java. It's tested and behaves as expected on my up-to-date 1.2 checkout.
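
For readers who don't want to open the patch, here is a minimal sketch of the kind of switch described above. It is not the patch itself: the class name, the use of java.util.Properties in place of Nutch's configuration object, and the property name moreIndexingFilter.indexMimeTypeParts are all assumptions made for illustration. The default of true keeps the existing splitting behaviour, so current users are unaffected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative stand-in for the content-type handling in the index-more plug-in.
final class ContentTypeFieldSketch {

    // Hypothetical property name; treat it as an assumption, not as the patch's key.
    static final String SPLIT_KEY = "moreIndexingFilter.indexMimeTypeParts";

    // Returns the values that would be added to the "type" field.
    static List<String> typeValues(String contentType, Properties conf) {
        // Default to "true" so the old splitting behaviour is preserved.
        boolean split = Boolean.parseBoolean(conf.getProperty(SPLIT_KEY, "true"));

        List<String> values = new ArrayList<>();
        values.add(contentType); // the full content-type is always indexed

        // Only add the sub parts when splitting is enabled and there is a slash.
        if (split && contentType.indexOf('/') >= 0) {
            for (String part : contentType.split("/")) {
                if (!part.isEmpty()) {
                    values.add(part);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(typeValues("text/html", conf)); // [text/html, text, html]

        conf.setProperty(SPLIT_KEY, "false");
        System.out.println(typeValues("text/html", conf)); // [text/html]
    }
}
```

With the setting switched off, only the unsplit content-type reaches the index, and any further tokenization is left to Solr.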

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Resolved: (NUTCH-901) Make index-more plug-in configurable

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-901.
-------------------------------------

    Resolution: Fixed

- patch applied to trunk in r999181 and to branch-1.2 in r999200. Thanks so much Markus!

One nit: no unit tests. I've created one in the trunk (in r999203 and in r999204), and one in the branch-1.2 (in r999208).

I won't be applying *any more* patches to the Nutch 1.2 RC. Let's get this thing VOTEd into release-dom with RC #4.

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2, 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch, NUTCH-901-trunk.998961.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Updated: (NUTCH-901) Make index-more plug-in configurable

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-901:
--------------------------------

    Attachment: NUTCH-901-trunk.998961.patch

Here's also a patch for 2.0 trunk. I could not test the code because I haven't managed to compile trunk yet.

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch, NUTCH-901-trunk.998961.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Updated: (NUTCH-901) Make index-more plug-in configurable

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-901:
--------------------------------

              Summary: Make index-more plug-in configurable  (was: Make index-more plug-in configurable, with a trailing line break)
        Fix Version/s: 2.0
    Affects Version/s: 1.2
                       2.0

Needs fixing in the trunk as well (v2.0)

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>             Fix For: 1.2, 2.0
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Commented: (NUTCH-901) Make index-more plug-in configurable

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925318#action_12925318 ] 

Markus Jelsma commented on NUTCH-901:
-------------------------------------

Applied the patch and added Mattmann's test to branch-1.3.

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2, 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch, NUTCH-901-trunk.998961.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Updated: (NUTCH-901) Make index-more plug-in configurable

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-901:
------------------------------------

    Fix Version/s:     (was: 1.2)

Hi Guys: I don't have time to put together a patch for this, and I haven't seen anything produced yet. Let's push this off to 2.0. If someone gets me a patch in the next day or so, I'll try and squeeze it in, but for now, I'm pushing to 2.0.

> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.



[jira] Issue Comment Edited: (NUTCH-901) Make index-more plug-in configurable

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912547#action_12912547 ] 

Markus Jelsma edited comment on NUTCH-901 at 9/20/10 11:53 AM:
---------------------------------------------------------------

Here's a patch for version 1.2 (that's the NUTCH-901-MarkusJelsma.998958.patch file). It includes a backward-compatible setting in nutch-default.xml and handles the setting in MoreIndexingFilter.java. It's tested and behaves as expected on my up-to-date 1.2 checkout.

      was (Author: markus17):
    Here's a patch for version 1.2. It includes a backward-compatible setting in nutch-default.xml and handles the setting in MoreIndexingFilter.java. It's tested and behaves as expected on my up-to-date 1.2 checkout.
  
> Make index-more plug-in configurable
> ------------------------------------
>
>                 Key: NUTCH-901
>                 URL: https://issues.apache.org/jira/browse/NUTCH-901
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2, 2.0
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: NUTCH-901-MarkusJelsma.998958.patch, NUTCH-901-trunk.998961.patch
>
>
> In my case, I don't want the index-more plug-in to split content-types on slash. Tokenization is something a Solr instance should take care of. Instead of removing the code (which would break compatibility for users who rely on it), we need a way to configure the plug-in not to split the content-type.
