Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/11/16 18:41:39 UTC

[jira] Created: (LUCENE-2070) document LengthFilter wrt Unicode 4.0

document LengthFilter wrt Unicode 4.0
-------------------------------------

                 Key: LUCENE-2070
                 URL: https://issues.apache.org/jira/browse/LUCENE-2070
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Robert Muir
            Priority: Trivial
             Fix For: 3.1


LengthFilter calculates its min/max length from TermAttribute.termLength().
This counts UTF-16 code units, not characters: a supplementary character (outside the BMP) is stored as a surrogate pair and counts as 2.

In my opinion this should not be changed, merely documented.
If we changed it, it would have an adverse performance impact because we would have to actually calculate Character.codePointCount() on the text.

If you feel strongly otherwise, fixing it to count code points would be a trivial patch, but I'd rather not hurt performance.
I admit I don't fully understand all the use cases for this filter.
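
To illustrate the difference, here is a minimal sketch in plain Java. It does not use Lucene at all: the class LengthDemo and its helper methods are hypothetical names made up for this example, and accept() is only an assumed stand-in for a min/max length check, not the filter's actual code. The char[] plus length arguments mimic a term buffer and its term length.

```java
public class LengthDemo {
    // Length in UTF-16 code units: just the number of chars in use, no scan needed.
    static int codeUnitLength(char[] buffer, int length) {
        return length;
    }

    // Length in Unicode code points: has to walk the text to pair up surrogates.
    static int codePointLength(char[] buffer, int length) {
        return Character.codePointCount(buffer, 0, length);
    }

    // Hypothetical stand-in for a min/max length check (inclusive bounds assumed).
    static boolean accept(int length, int min, int max) {
        return length >= min && length <= max;
    }

    public static void main(String[] args) {
        char[] bmp = "abc".toCharArray();
        // U+10400 lies outside the BMP, so it is encoded as a surrogate pair.
        char[] supplementary = Character.toChars(0x10400);

        System.out.println(codeUnitLength(bmp, bmp.length));                       // 3
        System.out.println(codePointLength(bmp, bmp.length));                      // 3
        System.out.println(codeUnitLength(supplementary, supplementary.length));   // 2
        System.out.println(codePointLength(supplementary, supplementary.length));  // 1

        // With min = max = 1, the single supplementary character is rejected
        // when measured in code units but accepted when measured in code points.
        System.out.println(accept(codeUnitLength(supplementary, supplementary.length), 1, 1));   // false
        System.out.println(accept(codePointLength(supplementary, supplementary.length), 1, 1));  // true
    }
}
```

For text that stays within the BMP the two counts agree, so the distinction only matters for terms containing supplementary characters.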


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2070) document LengthFilter wrt Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2070:
--------------------------------

    Attachment: LUCENE-2070.patch
