Posted to dev@lucene.apache.org by "Paul Curren (JIRA)" <ji...@apache.org> on 2007/06/07 06:55:25 UTC

[jira] Created: (LUCENE-915) PorterStemmer is incorrectly truncating words ending in e

PorterStemmer is incorrectly truncating words ending in e
---------------------------------------------------------

                 Key: LUCENE-915
                 URL: https://issues.apache.org/jira/browse/LUCENE-915
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index, QueryParser, Search
    Affects Versions: 1.9
         Environment: Java 1.5 on Mac OS X 10.4.
            Reporter: Paul Curren


Searching for the word 'orange' incorrectly returns matches for 'orang'.

Likewise, searching for 'apple' incorrectly matches 'appl'.

The problem is in step6() of the PorterStemmer class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-915) PorterStemmer is incorrectly truncating words ending in e

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502221 ] 

Hoss Man commented on LUCENE-915:
---------------------------------

Can you elaborate on why you think this is a bug?

This is a fairly basic function of the Porter Stemming Algorithm, and it exists in the official Java version of the algorithm published by Martin Porter...

http://www.tartarus.org/~martin/PorterStemmer/java.txt

(You may not agree with Porter's decision to strip trailing Es, but it's in the algorithm, and the class implements the algorithm.)
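
For anyone wondering why 'orange' and 'apple' in particular lose their final e, here is a minimal Java sketch of just the trailing-e rule as Porter's published algorithm describes it: the e is dropped when the measure m of the remaining stem is greater than 1, or when m is 1 and the stem does not end consonant-vowel-consonant (with the last consonant not w, x or y). This is not Lucene's PorterStemmer code. The class and method names (FinalEStep, stripFinalE, measure, endsCvc) are invented for illustration, the earlier stemming steps are left out, and lowercase input is assumed.

```java
final class FinalEStep {
    private FinalEStep() {}

    // A letter is a consonant unless it is a, e, i, o, u,
    // or a 'y' that directly follows a consonant.
    public static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !isConsonant(w, i - 1);
            default:
                return true;
        }
    }

    // Porter's measure m: how many times a run of vowels is
    // followed by a run of consonants in the stem.
    public static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean consonant = isConsonant(stem, i);
            if (consonant && prevVowel) {
                m++;
            }
            prevVowel = !consonant;
        }
        return m;
    }

    // The "*o" condition: stem ends consonant-vowel-consonant,
    // where the final consonant is not w, x or y.
    public static boolean endsCvc(String stem) {
        int n = stem.length();
        if (n < 3) {
            return false;
        }
        if (!isConsonant(stem, n - 1) || isConsonant(stem, n - 2) || !isConsonant(stem, n - 3)) {
            return false;
        }
        char last = stem.charAt(n - 1);
        return last != 'w' && last != 'x' && last != 'y';
    }

    // Applies only the trailing-e rule, in isolation from the other steps.
    public static String stripFinalE(String word) {
        if (!word.endsWith("e")) {
            return word;
        }
        String stem = word.substring(0, word.length() - 1);
        int m = measure(stem);
        if (m > 1 || (m == 1 && !endsCvc(stem))) {
            return stem;
        }
        return word;
    }
}
```

Under this sketch, 'orang' has m = 2 and 'appl' has m = 1 without a consonant-vowel-consonant ending, so both words lose the e, while 'rate' keeps it because 'rat' does end consonant-vowel-consonant.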



[jira] Commented: (LUCENE-915) PorterStemmer is incorrectly truncating words ending in e

Posted by "Paul Curren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502223 ] 

Paul Curren commented on LUCENE-915:
------------------------------------

I believe, as you say, that the algorithm is correctly implemented; however, the algorithm itself has the above bug.

The end result is false positives in the search results.
It's not a big issue, but the search engine has a behaviour that leads to false hits being returned, and therefore this is definitely a defect in application functionality.

I'd imagine you aren't going to fix it, since it would require explicit 'exception word' checking being added to the algorithm. So feel free to close the issue. The main reason I raised it is so that others will hopefully spend less time than I did being confused by this.
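
As a rough illustration of what 'exception word' checking could look like if an application layered it outside the stemmer (rather than changing PorterStemmer itself), here is a small Java sketch. Everything in it is hypothetical: ProtectedWordStemmer and dropFinalE are invented names, and dropFinalE is a toy stand-in for a real stemmer, not the Porter algorithm.

```java
import java.util.Set;
import java.util.function.UnaryOperator;

final class ProtectedWordStemmer {
    private final Set<String> protectedWords;
    private final UnaryOperator<String> stemmer;

    // protectedWords: words that must be passed through unchanged.
    // stemmer: whatever stemming function the application normally uses.
    public ProtectedWordStemmer(Set<String> protectedWords, UnaryOperator<String> stemmer) {
        this.protectedWords = protectedWords;
        this.stemmer = stemmer;
    }

    // Returns the word untouched if it is on the exception list,
    // otherwise delegates to the wrapped stemmer.
    public String stem(String word) {
        return protectedWords.contains(word) ? word : stemmer.apply(word);
    }

    // Toy stand-in stemmer for demonstration only: drops one trailing 'e'.
    public static String dropFinalE(String word) {
        return word.endsWith("e") ? word.substring(0, word.length() - 1) : word;
    }
}
```

With 'orange' and 'apple' on the exception list, stem("orange") would return 'orange' while other words still go through the wrapped stemmer. The same wrapper would need to be applied at both index time and query time for the terms to line up.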



[jira] Commented: (LUCENE-915) PorterStemmer is incorrectly truncating words ending in e

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502251 ] 

Hoss Man commented on LUCENE-915:
---------------------------------

NOTE: There are *lots* of stemmers in the world, not just Porter ... Martin Porter himself recommends using the Snowball stemmer (which also exists as an optional Lucene Filter).

Further discussion about stemmers, and the choice of stemmers when building Lucene applications, should be directed to the java-user mailing list rather than Jira.



[jira] Commented: (LUCENE-915) PorterStemmer is incorrectly truncating words ending in e

Posted by "Paul Curren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502233 ] 

Paul Curren commented on LUCENE-915:
------------------------------------

I understand, and thanks for your help.

Incidentally, I'm comparing against a commercial search engine which doesn't exhibit this stemming behaviour.
They must be using a different or modified algorithm - I don't know which, I'm afraid.

This is the reason why I raised this as a bug - at the application level (forget about low-level components and stemmers for now) there is unexpected behaviour, and it relates to an implementation detail within Lucene (the choice of algorithm).

That's all, thanks again.





[jira] Resolved: (LUCENE-915) PorterStemmer is incorrectly truncating words ending in e

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved LUCENE-915.
-----------------------------

    Resolution: Invalid

> I'd imagine you aren't going to fix it since it would require explicit 'exception word' 
> checking being added to the algorithm. 

...well, my point actually is that there is no bug to fix -- the algorithm is what it is, and the code implements the algorithm.

Changing the code wouldn't be fixing a bug; it would be breaking the PorterStemmer class so that it no longer does what it says: "implementing the Porter Stemming Algorithm".

I'm sure there are *lots* of other use cases unrelated to the ones you outlined where people could argue that the Porter algorithm does something they don't want -- but that's just the nature of algorithmic stemmers. As outlined on the Porter Stemmer homepage...

"The most frequently asked question is why word X should be stemmed to x1, when one would have expected it to be stemmed to x2. It is important to remember that the stemming algorithm cannot achieve perfection. On balance it will (or may) improve IR performance, but in individual cases it may sometimes make what are, or what seem to be, errors."
