Posted to dev@uima.apache.org by "Jérôme Rocheteau (JIRA)" <ui...@incubator.apache.org> on 2009/07/21 17:05:14 UTC

[jira] Created: (UIMA-1447) Tabulations are annotated as tokens after a space

Tabulations are annotated as tokens after a space
-------------------------------------------------

                 Key: UIMA-1447
                 URL: https://issues.apache.org/jira/browse/UIMA-1447
             Project: UIMA
          Issue Type: Bug
          Components: Sandbox-WhitespaceTokenizer
    Affects Versions: 2.3S
         Environment: Unix (ubuntu 8.04), Eclipse Galileo 3.5
            Reporter: Jérôme Rocheteau


This is a test-text for the Whitespace Tokenizer in the UIMA Sandbox. 
It behaves as follows: 	i.e. a '\t' character after a space is 
annotated as a token and its covered text is set to the empty string ""! 
I suppose it shouldn't be the case, am I wrong?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Marshall Schor (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735158#action_12735158 ] 

Marshall Schor commented on UIMA-1447:
--------------------------------------

Based on Thilo's remark 2 above, I'm also +1 for Jörn's solution, without any options etc. for the old behavior.  



[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Marshall Schor (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733823#action_12733823 ] 

Marshall Schor commented on UIMA-1447:
--------------------------------------

I took a look at the code.  It seems it considers \t to be a "special character".  The whitespace classification it uses is just the Java Character.SPACE_SEPARATOR character class (Unicode category Zs), which excludes \t.

It instead treats this character as a "special character" - and annotates it as a 1 character token.  Running it in the DocumentAnalyzer shows the \t as a 1 char token, as expected.  

getCoveredText returns text.substring(getBegin(), getEnd()).  When I ran this in the DocumentAnalyzer, the GUI displayed the 1-character token as a blank, but that's probably just an artifact of the GUI.

If you have a test case where getCoveredText is actually returning a 0 length string, please post it.
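
The classification described above can be checked in isolation. The following is a minimal sketch in plain Java, not the tokenizer's actual code; the class name TabClassification, the helper coveredText, and the sample string are made up for illustration. It shows that '\t' is not in Character.SPACE_SEPARATOR, and that a substring covering only the tab has length 1 even though it looks blank when displayed.

```java
class TabClassification {
    // True only for characters in Unicode category Zs (SPACE_SEPARATOR).
    static boolean isSpaceSeparator(char c) {
        return Character.getType(c) == Character.SPACE_SEPARATOR;
    }

    // Hypothetical stand-in for getCoveredText: a plain substring over offsets.
    static String coveredText(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "follows: \ti.e.";   // a space followed by a tab
        int tab = text.indexOf('\t');
        String covered = coveredText(text, tab, tab + 1);
        System.out.println("space is Zs: " + isSpaceSeparator(' '));
        System.out.println("tab is Zs:   " + isSpaceSeparator('\t'));
        System.out.println("covered length: " + covered.length());
    }
}
```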





[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Thilo Goetz (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734918#action_12734918 ] 

Thilo Goetz commented on UIMA-1447:
-----------------------------------

Marshall, I don't think this is something we need to worry about.  If people have code working around these issues, that code will simply no longer be called.  For example, people might have code to check and skip tokens that just contain whitespace.  I have written such code myself in the past, other tokenizers have similar issues.  I'm +1 for Joern's solution, and that should be the default as well.  I wouldn't even support the old behavior, not even with an option.  It'll just make the code more complicated for no good reason.  If somebody really desperately needs the old behavior, they can use an old version of the tokenizer.
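
As an illustration of the kind of workaround code mentioned above, here is a minimal sketch in plain Java. It is not taken from any UIMA component; the class WhitespaceTokenFilter and its methods are hypothetical, and tokens are represented simply by their covered text.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceTokenFilter {
    // True if the string is empty or consists only of whitespace characters.
    static boolean isWhitespaceOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns a new list without the whitespace-only entries.
    static List<String> dropWhitespaceOnly(List<String> coveredTexts) {
        List<String> kept = new ArrayList<>();
        for (String t : coveredTexts) {
            if (!isWhitespaceOnly(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        tokens.add("follows");
        tokens.add(":");
        tokens.add("\t");
        tokens.add("i");
        System.out.println(dropWhitespaceOnly(tokens).size());
    }
}
```

Once the tokenizer stops emitting such tokens, a filter like this simply never finds anything to drop.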



[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Marshall Schor (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734745#action_12734745 ] 

Marshall Schor commented on UIMA-1447:
--------------------------------------

A couple of comments:

1) If we change the behavior of this annotator, other uses of it may now fail, because they were built with the previous behavior in mind.

2) If we solve (1), and want to have a version of this annotator which defines whitespace differently, then I would prefer Jörn's fix because it puts all the programming logic involved in determining the meaning of Whitespace in one spot.

A possible fix for (1) would be to add an optional parameter (defaulting, if not specified, to the current mode of operation) that, when set, causes this alternate view of Whitespace to be used.  

Of course, another fix is simply for users who want other definitions of character classes to copy this annotator, rename it, and change the code to their liking :-) .

In this particular case, I agree with Jérôme that the definition of Whitespace is not what I would think is normally expected, so I'm in favor of finding some way to correct this (without breaking backward compatibility).
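
The optional-parameter idea mentioned above could look roughly like the following. This is only a sketch under assumptions: the class ConfigurableWhitespace and the flag name extendedWhitespace are invented here, and how the flag would be read from the annotator's configuration is left out entirely.

```java
class ConfigurableWhitespace {
    // Hypothetical switch: false keeps the current (Zs-only) behavior.
    private final boolean extendedWhitespace;

    ConfigurableWhitespace() {
        this(false); // default when the parameter is not specified
    }

    ConfigurableWhitespace(boolean extendedWhitespace) {
        this.extendedWhitespace = extendedWhitespace;
    }

    boolean isWhitespace(char c) {
        // Current mode: only Unicode category Zs counts as whitespace.
        if (Character.getType(c) == Character.SPACE_SEPARATOR) {
            return true;
        }
        // Alternate mode: also accept what Java calls whitespace (tab, newline, ...).
        return extendedWhitespace && Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        System.out.println(new ConfigurableWhitespace().isWhitespace('\t'));
        System.out.println(new ConfigurableWhitespace(true).isWhitespace('\t'));
    }
}
```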



[jira] Closed: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Marshall Schor (JIRA)" <ui...@incubator.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall Schor closed UIMA-1447.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 2.3S



[jira] Issue Comment Edited: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Jérôme Rocheteau (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734606#action_12734606 ] 

Jérôme Rocheteau edited comment on UIMA-1447 at 7/23/09 7:46 AM:
-----------------------------------------------------------------

I suggest this patch: it merely checks that the current character isn't whitespace when creating a token annotation for a special character.

      was (Author: jerome.rocheteau):
    I suggest this patch: it merely checks if the current character isn't a whitespace while creating a token annotation is created for a special character.
  


[jira] Assigned: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Marshall Schor (JIRA)" <ui...@incubator.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall Schor reassigned UIMA-1447:
------------------------------------

    Assignee: Marshall Schor



[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Jörn Kottmann (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735031#action_12735031 ] 

Jörn Kottmann commented on UIMA-1447:
-------------------------------------

I never really understood how isWhitespace should be called. There is one overload that takes a char and one that takes an int as its parameter.
The int overload was added in Java 1.5, and the javadoc says it must be used to also support supplementary characters.

Do we have to support supplementary characters in our text processing code?

If so, we must first find out whether the 16-bit char is a high surrogate code unit and, depending
on that, pass either one or two code units (as a 32-bit int), right?
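
For reference, a minimal sketch of what code-point-aware iteration looks like with the standard Java API. The class name CodePointWhitespace and the sample string are made up for illustration. String.codePointAt already performs the high-surrogate check and combines the two code units, so the int overload of isWhitespace can be called directly on its result.

```java
class CodePointWhitespace {
    // Counts whitespace by code point rather than by 16-bit code unit.
    static int countWhitespaceCodePoints(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            // Returns the full supplementary code point if a surrogate pair starts at i.
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                count++;
            }
            // Advance by 1 code unit for BMP characters, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D538 is written as a surrogate pair, followed by a space and a tab.
        String text = "a\uD835\uDD38 \tb";
        System.out.println("code units:  " + text.length());
        System.out.println("code points: " + text.codePointCount(0, text.length()));
        System.out.println("whitespace:  " + countWhitespaceCodePoints(text));
    }
}
```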



[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Jörn Kottmann (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734520#action_12734520 ] 

Jörn Kottmann commented on UIMA-1447:
-------------------------------------

A tab can be considered white space. To fix it we can use Character.isWhitespace for space detection and additionally check
whether the char is part of the Unicode Zs category (Character.SPACE_SEPARATOR), because isWhitespace excludes no-break spaces.
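
The combined check described above can be sketched as a single helper. This is an illustration in plain Java, not the committed fix; the class and method names are hypothetical.

```java
class WhitespaceCheck {
    // Whitespace if Java considers it whitespace (tab, newline, space, ...)
    // or if it belongs to Unicode category Zs (which includes no-break spaces).
    static boolean isWhitespace(char c) {
        return Character.isWhitespace(c)
                || Character.getType(c) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        char[] samples = {' ', '\t', '\u00A0', 'a'};
        for (char c : samples) {
            System.out.println((int) c + " -> " + isWhitespace(c));
        }
    }
}
```

Keeping both conditions in one method means the definition of whitespace lives in a single place.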



[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Thilo Goetz (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735089#action_12735089 ] 

Thilo Goetz commented on UIMA-1447:
-----------------------------------

That would probably be the only place in the UIMA code where we handle surrogates correctly.  I wouldn't bother.  All our processing (like the "character" offsets) is done in terms of 16 bit code units, not code points (i.e., characters).  If Java ever switches to 32 bit code units, we'll have to make that move, too, and that should automatically make things work more correctly.  I don't think that's in the cards for the mid-term future, though.  Too many things are riding on 16 bit code units.




[jira] Commented: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Jérôme Rocheteau (JIRA)" <ui...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734083#action_12734083 ] 

Jérôme Rocheteau commented on UIMA-1447:
----------------------------------------

I would just like to know whether the Whitespace Tokenizer behaves as expected when it annotates a '\t' character following a ' ' character as a TokenAnnotation. If not, could a patch be applied?

I don't have examples where the getCoveredText method returns a 0 length string. Actually, it returns a string of length 1, even though it is displayed as an empty string.
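
One way to tell an empty string apart from a string holding a single tab when inspecting output is to escape control characters before printing. A small sketch follows; the class VisibleText and its escape method are hypothetical helpers, not part of UIMA.

```java
class VisibleText {
    // Replaces common invisible characters with their Java escape notation.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\t': sb.append("\\t"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String covered = "\t"; // what a 1-character tab token would cover
        System.out.println("[" + escape(covered) + "] length=" + covered.length());
    }
}
```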



[jira] Updated: (UIMA-1447) Tabulations are annotated as tokens after a space

Posted by "Jérôme Rocheteau (JIRA)" <ui...@incubator.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jérôme Rocheteau updated UIMA-1447:
-----------------------------------

    Attachment: patch-an-wst.txt

I suggest this patch: it merely checks that the current character isn't whitespace when creating a token annotation for a special character.
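
The idea of the patch can be shown with a toy tokenizer. This is a hypothetical reconstruction for illustration only, not the contents of patch-an-wst.txt or the real WhitespaceTokenizer code: the class SimpleTokenizerSketch, the patched flag, and the use of Character.isLetterOrDigit to delimit words are all assumptions. Tokens are returned as their covered text.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleTokenizerSketch {
    static List<String> tokenize(String text, boolean patched) {
        List<String> tokens = new ArrayList<>();
        int wordStart = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                if (wordStart < 0) {
                    wordStart = i; // a new word begins here
                }
                continue;
            }
            if (wordStart >= 0) {
                tokens.add(text.substring(wordStart, i)); // close the pending word
                wordStart = -1;
            }
            // Original separator test: only Unicode category Zs.
            if (Character.getType(c) == Character.SPACE_SEPARATOR) {
                continue;
            }
            // Extra check in the spirit of the patch: no token for other whitespace.
            if (patched && Character.isWhitespace(c)) {
                continue;
            }
            tokens.add(text.substring(i, i + 1)); // 1-character "special" token
        }
        if (wordStart >= 0) {
            tokens.add(text.substring(wordStart));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "as follows: \ti.e.";
        System.out.println(tokenize(text, false).size());
        System.out.println(tokenize(text, true).size());
    }
}
```

With patched set to false the tab shows up as its own token; with patched set to true it is skipped like any other separator.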
