Posted to dev@nutch.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2006/03/23 12:38:16 UTC

[jira] Created: (NUTCH-237) Carrot2 clustering plugin upgrade.

Carrot2 clustering plugin upgrade.
----------------------------------

         Key: NUTCH-237
         URL: http://issues.apache.org/jira/browse/NUTCH-237
     Project: Nutch
        Type: Improvement
    Reporter: Dawid Weiss
    Priority: Trivial


This is an upgrade of the clustering plugin to the newest release (1.0.2).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
------------------------------

    Attachment: NUTCH-237.DWEISS.patch.zip

Hi Andrzej. The ZIP file contains a patch and svn stat with the improved code:

- The primary language for hits without explicit langid and a list of enabled languages in the clustering component can be specified in the configuration file (readme.txt gives the details).

- By default, all languages in Carrot2 (except for Polish) are enabled. English is the default.

- I removed the dependency on Neko in favor of the simpler routine we have in the Carrot2 codebase anyway. The change shouldn't affect the results (I checked on my local installation and it seems to be fine).
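
To illustrate the configuration behaviour described in the list above, here is a minimal sketch in plain Java. It is not the code from the patch: the class name, the two property keys, and the ISO code "pl" used for Polish are assumptions made for this example (readme.txt in the ZIP gives the real setting names), and the set of available languages is simply passed in by the caller.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

final class ClusteringLanguageConfig {
    // Hypothetical property names, chosen for this sketch only.
    static final String ENABLED_KEY = "clustering.languages";
    static final String DEFAULT_KEY = "clustering.default.language";

    /** Returns the configured primary language code, or "en" when none is set. */
    static String defaultLanguage(Properties conf) {
        String value = conf.getProperty(DEFAULT_KEY, "en").trim().toLowerCase(Locale.ROOT);
        return value.isEmpty() ? "en" : value;
    }

    /**
     * Returns the enabled language codes. With no explicit list, every
     * available language except Polish ("pl") is enabled. Codes in
     * 'available' are expected to be lower-case.
     */
    static Set<String> enabledLanguages(Properties conf, Set<String> available) {
        Set<String> enabled = new LinkedHashSet<>();
        String raw = conf.getProperty(ENABLED_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            enabled.addAll(available);
            enabled.remove("pl");
            return enabled;
        }
        // An explicit list is a comma-separated sequence of language codes;
        // codes that are not available are silently skipped.
        for (String code : raw.split(",")) {
            String normalized = code.trim().toLowerCase(Locale.ROOT);
            if (available.contains(normalized)) {
                enabled.add(normalized);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        Set<String> available = new LinkedHashSet<>(Arrays.asList("en", "de", "pl"));
        Properties conf = new Properties();
        System.out.println(defaultLanguage(conf) + " " + enabledLanguages(conf, available));
    }
}
```

In this sketch an explicit list can still enable Polish; only the no-configuration default leaves it out.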

I haven't played with the language identifier yet because I don't have a crawl with documents containing langid codes. The code should work without problems though -- details.getValue("lang") is converted to Carrot2's property RawDocument.PROPERTY_LANGUAGE and this is taken into account when clustering.
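
A rough sketch of that per-hit mapping, assuming nothing about the real Nutch or Carrot2 classes: plain maps stand in for the hit details and for the document properties, the class name is invented, and the constant below is only a placeholder for RawDocument.PROPERTY_LANGUAGE (its real value is not shown in this thread). Falling back to the primary language when a code is not enabled is also an assumption of this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class HitLanguageResolver {
    // Placeholder for Carrot2's RawDocument.PROPERTY_LANGUAGE (value assumed).
    static final String PROPERTY_LANGUAGE = "language";

    /**
     * Picks the language for one hit: the "lang" detail when it is present
     * and enabled, otherwise the configured primary language.
     */
    static String resolve(Map<String, String> details, Set<String> enabled, String primary) {
        String lang = details.get("lang");
        if (lang == null) {
            return primary;
        }
        String normalized = lang.trim().toLowerCase(Locale.ROOT);
        return enabled.contains(normalized) ? normalized : primary;
    }

    /** Builds the property map that would accompany the document into clustering. */
    static Map<String, Object> documentProperties(Map<String, String> details,
                                                  Set<String> enabled, String primary) {
        Map<String, Object> props = new HashMap<>();
        props.put(PROPERTY_LANGUAGE, resolve(details, enabled, primary));
        return props;
    }

    public static void main(String[] args) {
        Set<String> enabled = new HashSet<>(Arrays.asList("en", "de"));
        Map<String, String> details = new HashMap<>();
        details.put("lang", "de");
        System.out.println(documentProperties(details, enabled, "en"));
    }
}
```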

I couldn't delete the previously attached files. This ZIP file contains only the patch and the svn stat output. You'll have to remove a few JARs manually and replace others with their new counterparts from the ZIP file I attached to this issue earlier (they haven't changed). Let me know if you need anything.


> Carrot2 clustering plugin upgrade.
> ----------------------------------
>
>          Key: NUTCH-237
>          URL: http://issues.apache.org/jira/browse/NUTCH-237
>      Project: Nutch
>         Type: Improvement
>     Reporter: Dawid Weiss
>     Priority: Trivial
>  Attachments: NUTCH-237.DWEISS.patch.zip, c2.patch, libs.zip, svn-stat.txt
>
> This is an upgrade of the clustering plugin to the newest release (1.0.2).



[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
------------------------------

    Attachment: libs.zip

Libraries that need to be replaced.

> Carrot2 clustering plugin upgrade.
> ----------------------------------
>
>          Key: NUTCH-237
>          URL: http://issues.apache.org/jira/browse/NUTCH-237
>      Project: Nutch
>         Type: Improvement
>     Reporter: Dawid Weiss
>     Priority: Trivial
>  Attachments: c2.patch, libs.zip, svn-stat.txt
>
> This is an upgrade of the clustering plugin to the newest release (1.0.2).



[jira] Commented: (NUTCH-237) Carrot2 clustering plugin upgrade.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-237?page=comments#action_12371606 ] 

Andrzej Bialecki  commented on NUTCH-237:
-----------------------------------------

Hmm, I'm not sure I like this patch. It removes support for languages other than English. While I can agree with the argument that language detection in Carrot is relatively simplistic and we should use other mechanisms if available, this patch also removed the stemmers and stopword lists for other languages, without the ability to provide them so that Lingo uses them when extracting text features... or perhaps I'm missing something?

> Carrot2 clustering plugin upgrade.
> ----------------------------------
>
>          Key: NUTCH-237
>          URL: http://issues.apache.org/jira/browse/NUTCH-237
>      Project: Nutch
>         Type: Improvement
>     Reporter: Dawid Weiss
>     Priority: Trivial
>  Attachments: c2.patch, libs.zip, svn-stat.txt
>
> This is an upgrade of the clustering plugin to the newest release (1.0.2).



[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
------------------------------

    Attachment: c2.patch
                svn-stat.txt

Note the two deleted files (I attached the result of svn stat). I didn't know how to include this info in the diff file, and I don't think it's possible with plain svn.

> Carrot2 clustering plugin upgrade.
> ----------------------------------
>
>          Key: NUTCH-237
>          URL: http://issues.apache.org/jira/browse/NUTCH-237
>      Project: Nutch
>         Type: Improvement
>     Reporter: Dawid Weiss
>     Priority: Trivial
>  Attachments: c2.patch, svn-stat.txt
>
> This is an upgrade of the clustering plugin to the newest release (1.0.2).



[jira] Commented: (NUTCH-237) Carrot2 clustering plugin upgrade.

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-237?page=comments#action_12371687 ] 

Dawid Weiss commented on NUTCH-237:
-----------------------------------

Yes and no. I removed the "support" for foreign languages from the constructor code:

        // We initialize Lingo with English stemming and stopwords. Lingo has 
        // a simple language detection filter, but you'll be better off hardcoding
        // the language according to your needs. If you have bilingual indices, 
        // then there is a possibility of creating a more complex process that assigns
        // a language tag before the clustering is actually started.
        return new LingoLocalFilterComponent(
          new Language[] { new English() },
          defaults);
      }

Language detection is not really brilliant in the open source Lingo so I thought it wouldn't make sense to give people false hopes. Now, all the stemmers and stopword lists are still included in the release (look inside carrot2-util-tokenizer.jar$/com/dawidweiss/carrot/util/tokenizer/languages/...) so you can freely switch to another language by changing the instantiated language. 

I have a better idea though: how about you apply this patch (because I've tested it and know it works) and I'll make the language configurable via ISO codes set in the Nutch configuration? The default would be English and you could set your own language in there if you wanted to. All right?

> Carrot2 clustering plugin upgrade.
> ----------------------------------
>
>          Key: NUTCH-237
>          URL: http://issues.apache.org/jira/browse/NUTCH-237
>      Project: Nutch
>         Type: Improvement
>     Reporter: Dawid Weiss
>     Priority: Trivial
>  Attachments: c2.patch, libs.zip, svn-stat.txt
>
> This is an upgrade of the clustering plugin to the newest release (1.0.2).
