Posted to dev@lucene.apache.org by "Tanguy Moal (JIRA)" <ji...@apache.org> on 2012/05/16 17:37:02 UTC

[jira] [Created] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Tanguy Moal created SOLR-3463:
---------------------------------

             Summary: FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
                 Key: SOLR-3463
                 URL: https://issues.apache.org/jira/browse/SOLR-3463
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 3.4
            Reporter: Tanguy Moal
            Priority: Minor


FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
This might be unexpected during full text search.
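
To make the reported behaviour concrete, here is a minimal sketch of a repeated-character ("undoubling") pass. It is not the FrenchLightStemmer source: the class name, the method name, the `lettersOnly` switch and the "longer than four characters" threshold are assumptions chosen for illustration. With the switch off, digits are collapsed like any other character; with it on, only letters are.

```java
public class UndoubleSketch {
    // Collapse runs of identical adjacent characters in "long" tokens.
    // ASSUMPTION: "long" means more than four characters; the thread only says "long tokens".
    // When lettersOnly is true, repeated digits and other non-letters are kept.
    static String undouble(String token, boolean lettersOnly) {
        if (token.length() <= 4) {
            return token;
        }
        StringBuilder out = new StringBuilder();
        char prev = token.charAt(0);
        out.append(prev);
        for (int i = 1; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == prev && (!lettersOnly || Character.isLetter(c))) {
                continue; // drop the repeated character
            }
            out.append(c);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(undouble("1234555", false)); // prints 12345
        System.out.println(undouble("1234555", true));  // prints 1234555
        System.out.println(undouble("22h00", false));   // prints 2h0
        System.out.println(undouble("abbbbc", true));   // prints abc
    }
}
```

The unguarded variant turns two different numbers such as 1234555 and 12345 into the same indexed term, which is the kind of surprise described above.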

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Comment Edited] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276832#comment-13276832 ] 

Tanguy Moal edited comment on SOLR-3463 at 5/16/12 4:00 PM:
------------------------------------------------------------

This patch implements the solution suggested by Robert Muir on the ML.

Patch for lucene/solr trunk, generated from root directory.
                
      was (Author: tanguy):
    This patch implements the solution suggested by Robert Muir on the ML.

Patch for lucene/solr trunk, generated from root.
                  
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276856#comment-13276856 ] 

Steven Rowe commented on SOLR-3463:
-----------------------------------

+1

Tanguy, can you add a couple more tests?  You should demonstrate that the deletion of repeated characters still works (with letter chars).  Also, since there are two repetition removal operations in the code, a test specific to each would be useful.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Updated] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanguy Moal updated SOLR-3463:
------------------------------

    Attachment: SOLR-3463.patch

I didn't drink anything yet but maybe it's time to begin :-)
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277051#comment-13277051 ] 

Steven Rowe commented on LUCENE-4063:
-------------------------------------

Tanguy, since this is entirely a Lucene change, I've moved the issue's project from Solr to Lucene.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277066#comment-13277066 ] 

Steven Rowe commented on LUCENE-4063:
-------------------------------------

Committed to trunk.  Thanks Tanguy!

I'm not sure if this should be committed on the 3.6 branch, since that branch is bug-fix only, and this issue is marked as an improvement.  Thoughts?
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Comment Edited] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287408#comment-13287408 ] 

Tanguy Moal edited comment on LUCENE-4063 at 6/1/12 1:55 PM:
-------------------------------------------------------------

I agree with both of you, it sounds like a design change.

I think Jacques Savoy's algorithm was intended to be used on words. Not on numbers, or mixes of both (like in 22h00).

Which is true for any stemmer, I think. That's why on the mailing list I also suggested we could have each stemmer share a common interface that would filter non-stemmable literals out of the way. That could prevent the same issue from arising in a different stemming implementation.

I'm just saying this as I think about it.
                
      was (Author: tanguy):
    I agree with both of you, it sounds like a design change.

I think Jacques Savoy's algorithm was intended to be used on words. Not on numbers, or mixes of both (like in 22h00).

Which is true for any stemmer, I think. That's why on the mailing I also suggested we could have each stemmer share a common interface that would filter non-stemmable literals out of the way. That could prevent the same issue to raise from a different stemming implementation.

I'm just saying this as I think about it.
                  
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Assigned] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe reassigned LUCENE-4063:
-----------------------------------

    Assignee: Steven Rowe
    
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277728#comment-13277728 ] 

Robert Muir commented on LUCENE-4063:
-------------------------------------

As far as this being a bug, the original code implements the algorithm it claims to implement, and undoubling anything was its heuristic: see http://members.unine.ch/jacques.savoy/clef/frenchStemmerPlus.txt

                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277635#comment-13277635 ] 

Tanguy Moal commented on LUCENE-4063:
-------------------------------------

I'd be glad to see this on 3.x (x >= 4), since that's the version I used to spot the issue. Maybe I should have marked this issue as a bug rather than an improvement? :-)

I have a custom filter factory marking numbers as keywords anyway, as I needed a quick fix.
So from my point of view it doesn't really matter... I could just drop that filter from my analysis if the patch finds its way to 3.x.

Thank you very much for your quick responses about this issue.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276875#comment-13276875 ] 

Steven Rowe commented on SOLR-3463:
-----------------------------------

bq. Added some tests related to this issue.

The new patch looks identical to the old patch - I don't see any new tests?
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287447#comment-13287447 ] 

Robert Muir commented on LUCENE-4063:
-------------------------------------

{quote}
That's why on the mailing list I also suggested we could have each stemmer share a common interface that would filter non-stemmable literals out of the way
{quote}

We actually have this in place, but it's too limited. It's called KeywordAttribute. When this is set, the stemmer will not touch the word.

Currently the only way to set this out of the box is to use KeywordMarkerFilter, which takes a Set of protected words.

But to make your idea more flexible, I could imagine a couple more filters:
* one that marks as Keyword based on a set of types. In this case you would just add NUM to that set, and no stemmers would touch any numbers. Of course
  for French this is solved already, but imagine if you are using the URLEmail tokenizer: I think a set like { URL, EMAIL } would be very useful,
  otherwise stemmers will probably muck with them.
* one that marks as Keyword based on a regular expression. This could be good for fine-tuning stemmers for a lot of general-purpose needs: e.g. on the
  mailing list a while ago, someone was unhappy about how Russian stemmers treated Russian place names, and they had a certain set of suffixes they didn't
  want stemmed.

Anyway, I would really like to see these filters, I think they would be pretty simple to implement as well. 
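
The two proposed filters can be sketched in plain Java as follows. This is an illustration of the idea only and deliberately avoids the Lucene API: `Token`, `byType`, `byPattern`, `mark` and the toy `stem` method are hypothetical names, and the `ALPHANUM` type label and the e-mail regular expression are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class KeywordMarkingSketch {
    // Hypothetical stand-in for a token that carries a type and a keyword flag.
    static final class Token {
        final String text;
        final String type;
        boolean keyword;
        Token(String text, String type) { this.text = text; this.type = type; }
    }

    // Rule 1: protect every token whose type is in the given set (e.g. NUM, URL, EMAIL).
    static Predicate<Token> byType(Set<String> types) {
        return t -> types.contains(t.type);
    }

    // Rule 2: protect every token whose whole text matches the regular expression.
    static Predicate<Token> byPattern(Pattern pattern) {
        return t -> pattern.matcher(t.text).matches();
    }

    // Set the keyword flag on every token the rule accepts.
    static void mark(List<Token> tokens, Predicate<Token> rule) {
        for (Token t : tokens) {
            if (rule.test(t)) {
                t.keyword = true;
            }
        }
    }

    // Toy stemmer: drops the last character of long tokens unless they are protected.
    static String stem(Token t) {
        if (t.keyword || t.text.length() <= 4) {
            return t.text;
        }
        return t.text.substring(0, t.text.length() - 1);
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("1234555", "NUM"),
            new Token("maisons", "ALPHANUM"),
            new Token("dev@example.org", "ALPHANUM"));
        mark(tokens, byType(new HashSet<>(Arrays.asList("NUM"))));
        mark(tokens, byPattern(Pattern.compile("[^@]+@[^@]+")));
        for (Token t : tokens) {
            System.out.println(t.text + " -> " + stem(t));
        }
    }
}
```

Keeping the marking rule separate from the stemmer means the same rules would protect tokens from any stemmer that honours the flag, which is the point of the suggestion above.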
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Updated] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-4063:
--------------------------------

    Attachment: LUCENE-4063.patch

Patch with a couple more tests, and a CHANGES.txt entry.

Committing to trunk shortly.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Resolved] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved LUCENE-4063.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0

I agree with Robert, this is a design change, not a bug fix, so I won't backport to the 3.6 branch.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287408#comment-13287408 ] 

Tanguy Moal commented on LUCENE-4063:
-------------------------------------

I agree with both of you, it sounds like a design change.

I think Jacques Savoy's algorithm was intended to be used on words. Not on numbers, or mixes of both (like in 22h00).

Which is true for any stemmer, I think. That's why on the mailing list I also suggested we could have each stemmer share a common interface that would filter non-stemmable literals out of the way. That could prevent the same issue from arising in a different stemming implementation.

I'm just saying this as I think about it.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Updated] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanguy Moal updated SOLR-3463:
------------------------------

    Attachment: SOLR-3463.patch

This patch implements the solution suggested by Robert Muir on the ML.

Patch for lucene/solr trunk, generated from root.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Updated] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanguy Moal updated SOLR-3463:
------------------------------

    Attachment: SOLR-3463.patch

Updated patch to cover a corner case: the code also deletes the last character when it equals the one before it (the character at index last - 1).

Also added a very minimal unit test, which exposed this previously uncovered corner case.
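
A rough sketch of the two places that need guarding, assuming the fix is to compress only letters. The attached patch is not reproduced in this thread, so the `Character.isLetter` condition, the length threshold of 4 and every name below are assumptions, not the patch itself.

```java
public class LetterOnlyCompressionSketch {

    // First spot: collapse runs of identical adjacent characters,
    // but only when the repeated character is a letter (assumed guard).
    static String collapseLetterRuns(String token) {
        if (token.length() <= 4) { // assumed threshold for "long" tokens
            return token;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean repeatsPrevious =
                out.length() > 0 && out.charAt(out.length() - 1) == c;
            if (repeatsPrevious && Character.isLetter(c)) {
                continue; // drop the repeated letter
            }
            out.append(c); // digits and other characters are always kept
        }
        return out.toString();
    }

    // Second spot (the corner case): a separate final step that drops the
    // last character when it equals the one before it. It needs the same
    // letter-only guard, otherwise trailing digits are still removed.
    static String dropTrailingDuplicate(String token) {
        int len = token.length();
        if (len > 4) {
            char last = token.charAt(len - 1);
            if (last == token.charAt(len - 2) && Character.isLetter(last)) {
                return token.substring(0, len - 1);
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(collapseLetterRuns("abbb12233")); // prints ab12233
        System.out.println(dropTrailingDuplicate("12344"));  // prints 12344
        System.out.println(dropTrailingDuplicate("abcdd"));  // prints abcd
    }
}
```

Guarding only the first step would leave the second one free to strip a trailing digit, which is the kind of gap a minimal unit test on a numeric token can reveal.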
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Moved] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe moved SOLR-3463 to LUCENE-4063:
-------------------------------------------

          Component/s:     (was: Schema and Analysis)
                       modules/analysis
        Lucene Fields: New,Patch Available
    Affects Version/s:     (was: 3.4)
                       4.0
                       3.4
                  Key: LUCENE-4063  (was: SOLR-3463)
              Project: Lucene - Java  (was: Solr)
    
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Updated] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanguy Moal updated SOLR-3463:
------------------------------

    Attachment:     (was: SOLR-3463.patch)
    
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.



[jira] [Updated] (SOLR-3463) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

Posted by "Tanguy Moal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanguy Moal updated SOLR-3463:
------------------------------

    Attachment: SOLR-3463.patch

Added some tests related to this issue.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3463
>                 URL: https://issues.apache.org/jira/browse/SOLR-3463
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4
>            Reporter: Tanguy Moal
>            Priority: Minor
>         Attachments: SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.
