Posted to user@uima.apache.org by Marshall Schor <ms...@schor.com> on 2008/06/04 03:10:53 UTC

Some offset issues with Open Calais

I fiddled around a bit more with this, trying various things that 
actually connected to the service.

I finally figured out that if you send the string "xxx & yyy" to the 
service, it actually processes the string

<Document><Title>1212537108630-85FDAB4B-292518</Title><Date>2008-06-03</Date><Body>xxx 
&amp; yyy</Body></Document>

or something like that, and the offsets it returns are relative to this 
string.

Correcting the returned offsets so that they correspond to what you sent 
looks like it has 2 parts.  The first part, the prefix "<Document ...  
<Body>", is pretty easily accounted for.  The second part, undoing the 
expansion of & to &amp;, requires more work.  Other characters are also 
converted, some strangely.  I've seen the usual:

<  converted to &lt;,   > converted to &gt;

The character " seemed to be converted to &amp;quot; (which looks like 
&quot; with its & escaped a second time).
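
Here is a minimal sketch of the two-part correction, in case it helps 
anyone else.  It assumes the service applies only the four replacements 
listed above (which is just what I observed, and may change when they fix 
the bug), and that the caller already knows the length of the 
"<Document ... <Body>" prefix.  The class name, method names and the 
prefixLength parameter are made up for illustration; none of this is part 
of any Calais or UIMA API.

```java
import java.util.Arrays;

class CalaisOffsets {

    // Replacement the service appeared to apply to a single character.
    // These four cases are observations only, not a documented contract.
    static String escape(char c) {
        switch (c) {
            case '&': return "&amp;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '"': return "&amp;quot;";
            default:  return String.valueOf(c);
        }
    }

    // starts[i] is the offset, within the escaped body, at which original
    // character i begins; the final entry is the escaped body's length.
    static int[] escapedStarts(String original) {
        int[] starts = new int[original.length() + 1];
        int pos = 0;
        for (int i = 0; i < original.length(); i++) {
            starts[i] = pos;
            pos += escape(original.charAt(i)).length();
        }
        starts[original.length()] = pos;
        return starts;
    }

    // Maps an offset returned by the service back to an offset in the
    // text that was sent. prefixLength (hypothetical parameter) is the
    // number of characters before the body text in the wrapped document.
    static int toOriginalOffset(String original, int returnedOffset, int prefixLength) {
        int bodyOffset = returnedOffset - prefixLength;
        int[] starts = escapedStarts(original);
        if (bodyOffset < 0 || bodyOffset > starts[starts.length - 1]) {
            throw new IllegalArgumentException("offset outside body: " + returnedOffset);
        }
        int idx = Arrays.binarySearch(starts, bodyOffset);
        // An exact hit is the start of a character; otherwise the offset
        // falls inside an expanded entity, so use the character that
        // produced that entity.
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

For "xxx & yyy" the first "y" sits at offset 10 in the escaped body 
("xxx &amp; yyy"), and toOriginalOffset("xxx & yyy", 10, 0) maps it back 
to 6.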

All this is apparently a "bug" - their forum includes a post saying the 
problem with the "&" will be fixed in the next release.

I've posted a reply to their forum asking about other characters besides 
the "&".

One final note: their API says that content sent using the POST method 
needs to be escaped.  I think that means the kind of escaping that is 
done for encoding strings in URLs; I used the Java library method 
URLEncoder.encode(string, "UTF-8") to do this and it seemed to do the 
trick.
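
For reference, a small sketch of that escaping step.  URLEncoder.encode 
is the standard java.net method; the helper class and the "content" field 
name used below are placeholders of mine, not names taken from the Calais 
API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class CalaisPostBody {

    // Builds one name=value pair for a form-style POST body, escaping
    // both sides the way strings are escaped for URLs.
    static String formField(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With this, formField("content", "xxx & yyy") gives content=xxx+%26+yyy, 
so the "&" no longer looks like a parameter separator.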

-Marshall