Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2016/02/18 23:08:18 UTC

[jira] [Comment Edited] (LUCENE-6993) Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0

    [ https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153209#comment-15153209 ] 

Steve Rowe edited comment on LUCENE-6993 at 2/18/16 10:07 PM:
--------------------------------------------------------------

[~mdrob], I haven't looked at your patch yet, but there is a non-rote Unicode upgrade item that needs to be dealt with, taken from LUCENE-5357's TODO list:

* Upgrade the UAX#29-based grammars to the Unicode -6.3- _8.0_ word break rules, in StandardTokenizerImpl.jflex and UAX29URLEmailTokenizer.jflex.

UAX#29 word break rules can (and usually do) change with each Unicode release, so we'll need to review the changes between 6.3 and 8.0 and see what, if anything, needs changing in the tokenizer grammars.  Another item from the LUCENE-5357 TODO list will confirm that this has been done correctly:

* Test the new scanners against the Unicode -6.3- _8.0_ word break test data
** \[...]
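
To illustrate what testing against the word break test data involves, here is a minimal sketch in Java. It assumes the layout of the Unicode Character Database's WordBreakTest.txt: hexadecimal code points separated by U+00F7 (break allowed) or U+00D7 (no break), with a trailing # comment. The class and method names are hypothetical, and this is not the test generator Lucene actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class WordBreakTestLine {

    /** Returns the segments that one WordBreakTest.txt-style line expects. */
    static List<String> expectedSegments(String line) {
        // Drop the trailing comment, if any.
        int hash = line.indexOf('#');
        String rule = (hash >= 0 ? line.substring(0, hash) : line).trim();

        List<String> segments = new ArrayList<>();
        if (rule.isEmpty()) {
            return segments;
        }

        StringBuilder current = new StringBuilder();
        for (String token : rule.split("\\s+")) {
            if (token.equals("\u00F7")) {
                // DIVISION SIGN: a break is allowed here, so close the segment.
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
            } else if (token.equals("\u00D7")) {
                // MULTIPLICATION SIGN: no break, keep building the segment.
            } else {
                // Anything else is assumed to be a hexadecimal code point.
                current.appendCodePoint(Integer.parseInt(token, 16));
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        // Sample line in the assumed format: "a" + combining diaeresis, space, "b".
        String line = "\u00F7 0061 \u00D7 0308 \u00F7 0020 \u00F7 0062 \u00F7\t# sample";
        for (String segment : expectedSegments(line)) {
            System.out.println(segment.codePoints().count() + " code point(s)");
        }
    }
}
```

A scanner test would then presumably compare the tokenizer's output with these segments, after dropping segments the tokenizer is not expected to emit (for example, whitespace-only ones).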


was (Author: steve_rowe):
[~mdrob], I haven't looked at your patch yet but there is a non-rote Unicode upgrade item that needs to be dealt with - from LUCENE-5357's TODO list:

* Upgrade the UAX#29-based grammars to the Unicode -6.3- _8.0_ word break rules, in StandardTokenizerImpl.jflex and UAX29URLEmailTokenizer.jflex.

UAX#29 word break rules can (and usually do) change with each Unicode release, so we'll need to review the changes between 6.3 and 8.0 and see what, if anything, needs changing in the tokenizer grammars.  Another item from the LUCENE-5357 TODO list will confirm that this has been done correctly:

* Test the new scanners against the Unicode 6.3 word break test data
** \[...]

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the list of TLDs again. Comparing our old list with a new list indicates 800+ new domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.


