Posted to dev@uima.apache.org by "Marshall Schor (JIRA)" <ui...@incubator.apache.org> on 2009/07/21 23:05:14 UTC

[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733823#action_12733823 ] 

Marshall Schor commented on UIMA-1447:
--------------------------------------

I took a look at the code.  It seems to consider \t a "special character".  The whitespace classification it uses is just the Java Character.SPACE_SEPARATOR category, which excludes \t.

Instead, it treats the tab as a "special character" and annotates it as a 1-character token.  Running it in the DocumentAnalyzer shows the \t as a 1-character token, as expected.
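
To illustrate the classification point, here is a minimal sketch using only standard java.lang.Character calls.  The class and helper names (TabClassification, isSpaceSeparator) are made up for illustration and are not taken from the tokenizer source; the sketch only assumes the tokenizer's whitespace test is equivalent to a SPACE_SEPARATOR check, as described above.

```java
// Illustrative only: not the WhitespaceTokenizer code.
class TabClassification {

    // Hypothetical helper: true only for characters in the Unicode
    // "Zs" (space separator) general category.
    static boolean isSpaceSeparator(char c) {
        return Character.getType(c) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        // A plain space is a space separator.
        System.out.println(isSpaceSeparator(' '));                         // true
        // A tab is not: its general category is CONTROL.
        System.out.println(isSpaceSeparator('\t'));                        // false
        System.out.println(Character.getType('\t') == Character.CONTROL);  // true
        // The broader Java whitespace test does accept the tab.
        System.out.println(Character.isWhitespace('\t'));                  // true
    }
}
```

A check based on Character.isWhitespace would treat the tab as whitespace, whereas a SPACE_SEPARATOR check leaves it to be handled as some other kind of character.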

getCoveredText returns text.substring(getBegin(), getEnd()).  When I ran this in the DocumentAnalyzer, the GUI display of the 1 character looked like a blank, but that's probably just an artifact of the GUI.
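
A rough sketch of why a 1-character tab token can look empty without being empty.  The coveredText helper below is hypothetical and simply mirrors the substring call mentioned above; it is not the UIMA API, and the sample text and offsets are invented for the example.

```java
// Illustrative only: mimics substring(begin, end) on a document text.
class CoveredTextSketch {

    // Hypothetical stand-in for what getCoveredText is described as doing.
    static String coveredText(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "as follows: \ti.e. a tab";
        int begin = text.indexOf('\t');          // offset of the tab
        String covered = coveredText(text, begin, begin + 1);

        System.out.println(covered.length());          // 1
        System.out.println(covered.isEmpty());         // false
        // Anything that trims or merely renders the text shows nothing visible.
        System.out.println(covered.trim().isEmpty());  // true
    }
}
```

Checking the length of the covered text, rather than how it prints, distinguishes a tab token from a genuinely empty string.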

If you have a test case where getCoveredText actually returns a zero-length string, please post it.



> Tabulations are annotated as tokens after a space
> -------------------------------------------------
>
>                 Key: UIMA-1447
>                 URL: https://issues.apache.org/jira/browse/UIMA-1447
>             Project: UIMA
>          Issue Type: Bug
>          Components: Sandbox-WhitespaceTokenizer
>    Affects Versions: 2.3S
>         Environment: Unix (ubuntu 8.04), Eclipse Galileo 3.5
>            Reporter: Jérôme Rocheteau
>
> This is a test-text for the Whitespace Tokenizer in the UIMA Sandbox. 
> It behaves as follows: 	i.e. a '\t' character after a space is 
> annotated as a token and its covered text is set to the empty string ""! 
> I suppose it shouldn't be the case, am I wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.