Posted to user@uima.apache.org by Jens Grivolla <j+...@grivolla.net> on 2011/04/20 10:58:04 UTC

CR+LF = 1 character?

Hi,

while working on the integration between UIMA and a different text 
annotation system we ran into problems with differing offsets between 
the two systems.

As it turns out, the other system considers CR+LF (Windows style line 
endings) to be two characters, while UIMA sees it as one.  Clearly, 
CR+LF are two bytes in one-byte-per-character encodings (ASCII, Latin-1, 
...) so all systems based on those encodings will see it as two 
characters, and I believe it is also represented as two Unicode characters.

In a way it makes sense to consider a "newline" as one character, 
independently of how it is represented, so I think the UIMA way is fine. 
But is there an overview somewhere of how different systems and 
programming languages handle this, e.g. when extracting substrings?

Given the mess this can be, it's probably best to normalize all text 
at the beginning, so that we only deal with Unicode strings with LF line 
endings, encoded as UTF-8 when writing to disk or otherwise serializing 
the data.
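
The normalization step described above could look roughly like the following Java sketch. The class and method names (LineEndingNormalizer, normalize) are made up for illustration and are not part of UIMA; the sketch assumes that a lone CR should also be treated as a line ending.

```java
class LineEndingNormalizer {

    // Hypothetical helper: turn CR+LF and lone CR into a single LF.
    // CR+LF must be replaced first, otherwise it would become two LFs.
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windows = "one\r\ntwo\r\n";
        String normalized = normalize(windows);

        // Java strings count CR and LF as separate chars.
        System.out.println(windows.length());    // 10
        System.out.println(normalized.length()); // 8
    }
}
```

Note that offsets computed on the normalized text only match offsets in other systems if those systems work on the same normalized text.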

It would still be interesting to know how painful this can get when not 
normalizing, and e.g. passing data between UIMA (Java), NLTK (Python), 
our own C#-based system, etc.

Thanks,
Jens

Re: CR+LF = 1 character?

Posted by Thilo Götz <tw...@gmx.de>.

On 4/20/2011 14:31, Steven Bethard wrote:
> On Wed, Apr 20, 2011 at 10:58 AM, Jens Grivolla <j+...@grivolla.net> wrote:
>> As it turns out, the other system considers CR+LF (Windows style line
>> endings) to be two characters, while UIMA sees it as one.
> 
> As Jörn suggested, this is probably a bug in the code somewhere where
> you read in the text. Perhaps you're using
> org.apache.uima.pear.util.FileUtil.loadTextFile? That's definitely
> broken in terms of line endings and I know that gave us trouble
> before. We found that org.apache.uima.util.FileUtils.file2String
> actually does the right thing, so you could use that instead. Having
> been bitten by this though, I tend to avoid the UIMA classes for
> handling files, and use com.google.common.io.Files.toString from the
> guava libraries instead, which I trust more.

This is getting slightly off-topic, but you can also use
Apache Commons IO for this sort of thing.

Although I resent having the UIMA core file utils lumped
in with the pear stuff, I can't blame you for your conclusion ;-)

--Thilo

> 
> Steve
> 
> P.S. Yes, I know I should have filed a bug report. Sorry for not
> getting around to it...

Re: CR+LF = 1 character?

Posted by Steven Bethard <st...@gmail.com>.

On Wed, Apr 20, 2011 at 10:58 AM, Jens Grivolla <j+...@grivolla.net> wrote:
> As it turns out, the other system considers CR+LF (Windows style line
> endings) to be two characters, while UIMA sees it as one.

As Jörn suggested, this is probably a bug in the code somewhere where
you read in the text. Perhaps you're using
org.apache.uima.pear.util.FileUtil.loadTextFile? That's definitely
broken in terms of line endings and I know that gave us trouble
before. We found that org.apache.uima.util.FileUtils.file2String
actually does the right thing, so you could use that instead. Having
been bitten by this though, I tend to avoid the UIMA classes for
handling files, and use com.google.common.io.Files.toString from the
guava libraries instead, which I trust more.
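
To illustrate how file-reading code can silently change offsets, here is a small Java sketch. It is only an assumption about how this kind of bug typically arises (reading line by line and re-joining with a single LF); it is not a description of what FileUtil.loadTextFile actually does. The class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineEndingReadDemo {

    // Reads line by line. readLine() strips the line terminator, so
    // re-joining with '\n' turns every CR+LF into a single character.
    static String readLineByLine(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Copies characters verbatim, so CR and LF both survive.
    static String readVerbatim(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = in) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\r\nb\r\n";
        System.out.println(readVerbatim(new StringReader(text)).length());   // 6
        System.out.println(readLineByLine(new StringReader(text)).length()); // 4
    }
}
```

With the line-by-line variant, every annotation after the first line break ends up shifted by one position per preceding CR+LF relative to a system that keeps both characters.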

Steve

P.S. Yes, I know I should have filed a bug report. Sorry for not
getting around to it...
-- 
Where did you get that preposterous hypothesis?
Did Steve tell you that?
        --- The Hiphopopotamus

Re: CR+LF = 1 character?

Posted by Jörn Kottmann <ko...@gmail.com>.

On 4/20/11 10:58 AM, Jens Grivolla wrote:
> Hi,
>
> while working on the integration between UIMA and a different text 
> annotation system we ran into problems with differing offsets between 
> the two systems.
>
> As it turns out, the other system considers CR+LF (Windows style line 
> endings) to be two characters, while UIMA sees it as one. 

The string sofa inside a CAS contains 16-bit Unicode characters (UTF-16 
code units), and CR+LF is two of them. So I believe you are mistaken, 
or there is a bug somewhere that turns CR+LF into one char. All offsets 
are 16-bit Unicode offsets, even though one character might need 
two 16-bit slots. So it is possible to have an annotation over a single 
character that has a length of two.
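
The point about 16-bit offsets can be checked with plain Java strings, which use the same UTF-16 representation. This is a small sketch using only java.lang.String; the class name is made up, and it does not touch the CAS API itself.

```java
class Utf16OffsetDemo {

    // Number of 16-bit code units, which is what String offsets count.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (what a reader would call characters).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String crlf = "\r\n";
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual
        // Plane, so UTF-16 needs a surrogate pair to represent it.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(crlf));  // 2
        System.out.println(codeUnits(clef));  // 2
        System.out.println(codePoints(clef)); // 1
    }
}
```

So an annotation covering just the G clef would have begin and end offsets that differ by two, while CR+LF always occupies two offsets because it really is two characters.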

Jörn