Posted to solr-dev@lucene.apache.org by "Pieter Berkel (JIRA)" <ji...@apache.org> on 2007/10/15 09:22:51 UTC

[jira] Created: (SOLR-379) KStem Token Filter

KStem Token Filter
------------------

                 Key: SOLR-379
                 URL: https://issues.apache.org/jira/browse/SOLR-379
             Project: Solr
          Issue Type: New Feature
          Components: search
            Reporter: Pieter Berkel
            Priority: Minor


A Lucene / Solr implementation of the KStem stemmer.  Full credit goes to Harry Wagner for adapting the Lucene version found here:
http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi

Background discussion of this stemmer (including licensing issues) can be found in this thread:
http://www.nabble.com/Embedded-about-50--faster-for-indexing-tf4325720.html#a12376295

I've made some minor changes to KStemFilterFactory so that it compiles cleanly against trunk:
1) removed some unnecessary imports
2) changed the init() method parameters introduced by SOLR-215
3) moved KStemFilterFactory into package org.apache.solr.analysis

Once compiled and included in your Solr war (or as a jar in your lib directory), the KStem filter can be used in your schema very easily:

      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KStemFilterFactory" cacheSize="20000"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
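
For matching to work, a stemmer normally has to run at query time as well as at index time. The following is only a sketch of how the analyzer above might sit inside a complete field type with a matching query analyzer. The field type name "text_kstem" and the positionIncrementGap value are assumptions for illustration; the filter chain is the one shown above.

```xml
<!-- Sketch only: "text_kstem" is a hypothetical field type name. -->
<fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
  <!-- Index-time chain, as in the example above -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory" cacheSize="20000"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- Query-time chain mirrors the index chain so that query terms
       are reduced to the same stems as the indexed terms -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory" cacheSize="20000"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

A field would then reference it with type="text_kstem" (again, a hypothetical name).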


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-379) KStem Token Filter

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597185#action_12597185 ] 

Otis Gospodnetic commented on SOLR-379:
---------------------------------------

It would be great to have this available in Solr.  Because of KStem's incompatible license, I'm not sure how best to handle this.  An incompatible license really just means we cannot distribute the KStem code (and cannot have it in the Lucene/Solr svn repository).  Usually when incompatible licensing is a problem we say "modify the build script to download the needed library on demand if it's not present locally".  This is what some of the Lucene contrib components do, for example.

However, looking at your ZIP file I see:

  -rw-r--r--      2836  15-Oct-2007  17:16:46  src/java/org/apache/solr/analysis/KStemFilterFactory.java
  -rw-r--r--     42222  15-Oct-2007  16:28:08  src/java/org/apache/lucene/analysis/KStemmer.java
  -rw-r--r--      4501  15-Oct-2007  17:08:38  src/java/org/apache/lucene/analysis/KStemFilter.java
  -rw-r--r--     34259  15-Oct-2007  16:28:24  src/java/org/apache/lucene/analysis/KStemData8.java
  -rw-r--r--     39918  15-Oct-2007  16:28:28  src/java/org/apache/lucene/analysis/KStemData7.java
  -rw-r--r--     41412  15-Oct-2007  16:28:34  src/java/org/apache/lucene/analysis/KStemData6.java
  -rw-r--r--     40457  15-Oct-2007  16:28:40  src/java/org/apache/lucene/analysis/KStemData5.java
  -rw-r--r--     40823  15-Oct-2007  16:28:44  src/java/org/apache/lucene/analysis/KStemData4.java
  -rw-r--r--     39808  15-Oct-2007  16:28:50  src/java/org/apache/lucene/analysis/KStemData3.java
  -rw-r--r--     42696  15-Oct-2007  16:29:00  src/java/org/apache/lucene/analysis/KStemData2.java
  -rw-r--r--     40020  15-Oct-2007  16:29:14  src/java/org/apache/lucene/analysis/KStemData1.java

But this is really just a duplicate of what's in http://ciir.cs.umass.edu/downloads/files/KStem.jar, plus the Solr-specific KStemFilterFactory.java.

So, could we simply download KStem.jar on demand?  And is KStemFilterFactory.java really copyright CIIR?  If we can change that to ASL, then we can include it in the repo, and with a modified build that downloads KStem.jar before compiling, this class would compile.
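
To make the download-on-demand idea concrete, here is a rough Ant sketch. It is not taken from the Solr or Lucene build files; the project name, target names, property names, and the lib/ and build/ paths are all assumptions. Only the KStem.jar URL comes from this comment. It assumes that only KStemFilterFactory.java lives under src/java and that the Solr and Lucene jars are already in lib/.

```xml
<!-- Sketch only: names and paths below are hypothetical. -->
<project name="kstem-on-demand" default="compile-kstem" basedir=".">

  <property name="kstem.jar.url"
            value="http://ciir.cs.umass.edu/downloads/files/KStem.jar"/>
  <property name="kstem.jar" location="lib/KStem.jar"/>

  <!-- Sets kstem.present only if the jar already exists locally -->
  <target name="check-kstem">
    <available file="${kstem.jar}" property="kstem.present"/>
  </target>

  <!-- Skipped when kstem.present is set, so the download happens
       only if the jar is missing -->
  <target name="get-kstem" depends="check-kstem" unless="kstem.present">
    <mkdir dir="lib"/>
    <get src="${kstem.jar.url}" dest="${kstem.jar}"/>
  </target>

  <!-- Compiles the factory against KStem.jar plus whatever other
       jars (Solr, Lucene) are assumed to be in lib/ -->
  <target name="compile-kstem" depends="get-kstem">
    <mkdir dir="build/classes"/>
    <javac srcdir="src/java" destdir="build/classes">
      <classpath>
        <fileset dir="lib" includes="*.jar"/>
      </classpath>
    </javac>
  </target>

</project>
```

With this layout the KStem code itself never enters the repository; only the build file and the ASL-licensed factory class would.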




[jira] Updated: (SOLR-379) KStem Token Filter

Posted by "Pieter Berkel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pieter Berkel updated SOLR-379:
-------------------------------

    Attachment: KStemSolr.zip

I've attached a zip file containing the KStem source rather than a patch, as I'm not sure how this code will eventually be integrated with Solr.

Since I did not write this code and am unsure of its legal status, I have not granted the ASF license, although recent discussion suggests the license included with KStem is compatible with the Apache license.

Hopefully we'll be able to resolve the above issues fairly quickly.




[jira] Commented: (SOLR-379) KStem Token Filter

Posted by "Pieter Berkel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605169#action_12605169 ] 

Pieter Berkel commented on SOLR-379:
------------------------------------

As far as I'm aware, KStemFilterFactory.java was written by Harry Wagner, so if he's happy to grant ASL it should be possible to include that in the repo.  Everything in "/src/java/org/apache/lucene/analysis" has been copied from KStem.jar, which was originally downloaded from CIIR, so if that jar can be downloaded on demand, it should be fairly straightforward to include support for this stemmer in Solr.

