Posted to dev@uima.apache.org by "Marshall Schor (JIRA)" <ui...@incubator.apache.org> on 2009/07/21 23:05:14 UTC
[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space
[ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733823#action_12733823 ]
Marshall Schor commented on UIMA-1447:
--------------------------------------
I took a look at the code. It appears to treat \t as a "special character". The whitespace classification it uses is just the Java Character.SPACE_SEPARATOR character class, which excludes \t.
Because \t is not classified as whitespace, it is annotated as a 1-character token. Running it in the DocumentAnalyzer shows the \t as a 1-character token, as expected.
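To illustrate the classification point, here is a minimal sketch using only standard java.lang.Character calls. The class and method names (TabClassification, isSpaceSeparator) are made up for this example and are not taken from the tokenizer source; the sketch only assumes, as described above, that the check is a comparison against Character.SPACE_SEPARATOR.

```java
public class TabClassification {

    // Hypothetical helper: true only for characters in the Unicode
    // "space separator" category (Zs), e.g. the ordinary space U+0020.
    static boolean isSpaceSeparator(char c) {
        return Character.getType(c) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        // An ordinary space is a space separator.
        System.out.println("space is SPACE_SEPARATOR: " + isSpaceSeparator(' '));

        // A tab is a control character, so it falls outside that class.
        System.out.println("tab is SPACE_SEPARATOR:   " + isSpaceSeparator('\t'));
        System.out.println("tab is CONTROL:           "
                + (Character.getType('\t') == Character.CONTROL));

        // Character.isWhitespace uses a broader definition that includes tab.
        System.out.println("tab isWhitespace:         " + Character.isWhitespace('\t'));
    }
}
```

A check based on Character.isWhitespace would treat the tab as whitespace; whether that is the right behaviour for the tokenizer is a separate design question.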
getCoveredText returns text.substring(getBegin(), getEnd()). When I ran this in the DocumentAnalyzer, the GUI displayed that 1 character as something that looked like a blank, but that's probably just an artifact of the GUI.
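The difference between "looks blank" and "is empty" can be checked with a small sketch. The coveredText helper below is a hypothetical stand-in that just performs the substring call described above; it is not the UIMA API, and the sample text and offsets are assumptions chosen for illustration.

```java
public class CoveredTextCheck {

    // Hypothetical stand-in for the substring call described above.
    static String coveredText(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        // Sample text: 'a', a space, a tab, 'b'. The tab sits at offsets [2, 3).
        String text = "a \tb";
        String covered = coveredText(text, 2, 3);

        // The covered text has length 1, so it is not the empty string.
        System.out.println("length:            " + covered.length());
        System.out.println("is empty:          " + covered.isEmpty());

        // It only looks blank when displayed; trimming it leaves nothing.
        System.out.println("empty after trim:  " + covered.trim().isEmpty());

        // Printing the code point makes the invisible character visible.
        System.out.println("code point:        " + (int) covered.charAt(0));
    }
}
```

Printing the length or code point rather than the string itself is a quick way to tell a real zero-length covered text from a tab that merely renders as blank.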
If you have a test case where getCoveredText actually returns a 0-length string, please post it.
> Tabulations are annotated as tokens after a space
> -------------------------------------------------
>
> Key: UIMA-1447
> URL: https://issues.apache.org/jira/browse/UIMA-1447
> Project: UIMA
> Issue Type: Bug
> Components: Sandbox-WhitespaceTokenizer
> Affects Versions: 2.3S
> Environment: Unix (ubuntu 8.04), Eclipse Galileo 3.5
> Reporter: Jérôme Rocheteau
>
> This is a test-text for the Whitespace Tokenizer in the UIMA Sandbox.
> It behaves as follows: i.e. a '\t' character after a space is
> annotated as a token and its covered text is set to the empty string ""!
> I suppose it shouldn't be the case, am I wrong?