Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2015/11/02 09:23:27 UTC

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984873#comment-14984873 ] 

Dawid Weiss commented on LUCENE-6874:
-------------------------------------

It depends on what you consider a trap.

A non-breaking space can be a legitimate way to keep two words together when they need to end up in a single token. Related examples that come to mind are the special "zero-width" space and the soft hyphen (the hyphenation marker), which poses a problem even on its own [1].

Ultimately it is probably a question of whether we want to tokenize on "whitespace as in formatted text" or "whitespace as in logical codepoint units", and it applies not only to WhitespaceTokenizer but to any tokenizer in general.

bq. I think WhitespaceTokenizer should tokenize on this.

It seems the majority of people would want it to be tokenized, I agree. But if you change this, there is no way to go back to the previous behavior. Currently it's relatively easy to wrap your input in a reader that replaces those problematic codepoints on the fly before they're fed to the tokenizer.
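
To make the workaround concrete, here is a minimal sketch of such a wrapping reader. The class name NbspReplacingReader and its replace() helper are hypothetical and are not part of Lucene; the sketch only assumes the three non-breaking codepoints listed in the issue description. Because each replacement is one char for one char, character offsets are left unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper (not a Lucene class): maps non-breaking spaces to U+0020. */
class NbspReplacingReader extends FilterReader {

  NbspReplacingReader(Reader in) {
    super(in);
  }

  /** The three codepoints that Character.isWhitespace explicitly excludes. */
  private static boolean isNonBreakingSpace(int c) {
    return c == '\u00A0' || c == '\u2007' || c == '\u202F';
  }

  @Override
  public int read() throws IOException {
    int c = super.read();
    return isNonBreakingSpace(c) ? ' ' : c;
  }

  @Override
  public int read(char[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    // At end of stream n is -1, so the loop body never runs.
    for (int i = off; i < off + n; i++) {
      if (isNonBreakingSpace(buf[i])) {
        buf[i] = ' ';
      }
    }
    return n;
  }

  /** Demo convenience: runs a whole string through the reader. */
  static String replace(String text) {
    try (Reader r = new NbspReplacingReader(new StringReader(text))) {
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[256];
      int n;
      while ((n = r.read(buf, 0, buf.length)) != -1) {
        sb.append(buf, 0, n);
      }
      return sb.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With something like this, the input would be wrapped before it reaches the tokenizer, for example tokenizer.setReader(new NbspReplacingReader(reader)). A char filter in the analysis chain would presumably be the more conventional place for the same mapping.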

[1] https://www.cs.tut.fi/~jkorpela/shy.html

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to work around but why leave this trap in by default?
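
The behavior described in the excerpt above can be checked directly against the JDK. This is only an illustrative sketch: the class name NbspWhitespaceCheck is made up, and Character.isSpaceChar is shown purely for contrast (it is not what WhitespaceTokenizer uses, according to the description).

```java
/** Hypothetical demo class: compares two JDK whitespace predicates. */
class NbspWhitespaceCheck {

  /** The non-breaking spaces named in the Character.isWhitespace javadoc. */
  static final int[] NON_BREAKING_SPACES = {0x00A0, 0x2007, 0x202F};

  static String describe(int cp) {
    return String.format(
        "U+%04X isWhitespace=%b isSpaceChar=%b",
        cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
  }

  public static void main(String[] args) {
    for (int cp : NON_BREAKING_SPACES) {
      System.out.println(describe(cp)); // isWhitespace=false, isSpaceChar=true
    }
    System.out.println(describe(' ')); // both true for an ordinary space
  }
}
```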


