Posted to dev@lucene.apache.org by "David Goldfarb (JIRA)" <ji...@apache.org> on 2014/03/12 15:03:44 UTC

[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

     [ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Goldfarb updated LUCENE-4072:
-----------------------------------

    Attachment: 4072.patch

Attaching a new patch. All tests pass. 

I'm using Normalizer2.isInert to check whether we need to keep reading into the input buffer. It doesn't return false positives, even though it's not as fast as .hasBoundaryBefore().
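
To illustrate why the filter has to keep reading until it reaches a safe boundary, here is a minimal sketch. It does not use ICU4J or the patch's code: it substitutes the JDK's java.text.Normalizer as a stand-in, and the class and method names (ChunkBoundaryDemo, normalizeChunks) are made up for this example. It only shows that normalizing a buffer that ends between a base character and its combining mark gives a different result from normalizing the whole input.

```java
import java.text.Normalizer;

public class ChunkBoundaryDemo {
    // Hypothetical helper: NFC-normalize each chunk on its own and concatenate,
    // mimicking a filter that normalizes whatever happens to be in its buffer.
    static String normalizeChunks(String... chunks) {
        StringBuilder out = new StringBuilder();
        for (String chunk : chunks) {
            out.append(Normalizer.normalize(chunk, Normalizer.Form.NFC));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent) composes under NFC.
        String input = "cafe\u0301 au lait";
        String whole = Normalizer.normalize(input, Normalizer.Form.NFC);

        // Buffer ends between the base letter and its combining mark.
        String unsafeSplit = normalizeChunks("cafe", "\u0301 au lait");
        // Buffer ends after the combining mark, before the space.
        String safeSplit = normalizeChunks("cafe\u0301", " au lait");

        System.out.println("unsafe split matches whole: " + whole.equals(unsafeSplit));
        System.out.println("safe split matches whole:   " + whole.equals(safeSplit));
    }
}
```

A check such as isInert plays the role of deciding where the "safe split" is; how the patch applies it to the buffer is not shown here.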

> CharFilter that Unicode-normalizes input
> ----------------------------------------
>
>                 Key: LUCENE-4072
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4072
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Ippei UKAI
>         Attachments: 4072.patch, 4072.patch, DebugCode.txt, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
>
>
> I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.
> The benefit of doing this in a CharFilter is that the tokenizer can work on normalised text, while offset correction ensures that the fast vector highlighter and other offset-dependent features do not break.
> The implementation is available in the following repository:
> https://github.com/ippeiukai/ICUNormalizer2CharFilter
> Unfortunately, this is my unpaid side project, and I cannot spend much time merging my work into Lucene to make an appropriate patch. I'd appreciate it if anyone could give it a go. I'm happy to relicense it under whatever terms meet your needs.
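
The offset correction mentioned in the description can be pictured with a toy sketch. This is not the contributed implementation and not Lucene's API: it uses the JDK's java.text.Normalizer, the names OffsetMapDemo and normalizedToOriginal are hypothetical, and it makes the simplifying assumption that every code point can be normalized independently (not true in general, for example for combining sequences). It records, for each position in the normalized text, the original position it came from, so that a token's offsets can be mapped back for highlighting.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class OffsetMapDemo {
    // Appends the NFKC form of `original` to `normalizedOut` and returns a map
    // from each normalized char offset (plus the end offset) to an original offset.
    // Toy assumption: each code point is normalized on its own.
    static int[] normalizedToOriginal(String original, StringBuilder normalizedOut) {
        List<Integer> map = new ArrayList<>();
        int i = 0;
        while (i < original.length()) {
            int cp = original.codePointAt(i);
            String piece = Normalizer.normalize(
                new String(Character.toChars(cp)), Normalizer.Form.NFKC);
            for (int k = 0; k < piece.length(); k++) {
                map.add(i); // every output char points back to its source char
            }
            normalizedOut.append(piece);
            i += Character.charCount(cp);
        }
        map.add(original.length()); // mapping for the end-of-text offset
        return map.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String original = "\uFB01ne"; // U+FB01 is the "fi" ligature
        StringBuilder normalized = new StringBuilder();
        int[] map = normalizedToOriginal(original, normalized);

        // A tokenizer sees the normalized text and reports offsets [0, length).
        int start = 0;
        int end = normalized.length();
        System.out.println("normalized text: " + normalized);
        System.out.println("token offsets in normalized text: " + start + ".." + end);
        System.out.println("corrected offsets in original text: " + map[start] + ".." + map[end]);
    }
}
```

The normalized token is one char longer than its source here, so without the map a highlighter would mark a span that runs past the end of the original text.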



--
This message was sent by Atlassian JIRA
(v6.2#6252)
