Posted to user@uima.apache.org by Marshall Schor <ms...@schor.com> on 2008/06/04 03:10:53 UTC
Some offset issues with Open Calais
I fiddled around a bit more with this, trying various things that
actually connected to the service.
I finally figured out that if you send the string "xxx & yyy" to the
service, it actually processes the string
<Document><Title>1212537108630-85FDAB4B-292518</Title><Date>2008-06-03</Date><Body>xxx
&amp; yyy</Body></Document>
or something like that, and the returned offsets are relative to this
string.
Correcting the returned offsets so that they correspond to what you sent
looks like it has 2 parts. The first part, the prefix "<Document ...
<Body>", is pretty easily accounted for. The second part, expanding & to
&amp;, requires more work. Other characters are also converted, some
strangely. I've seen the usual:
< converted to &lt;, > converted to &gt;
The character " seemed to be converted to &quot;
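The two-part correction described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name OffsetMapper, the method toOriginal, and the prefixLength parameter are hypothetical, and the escaping table contains only the conversions observed so far. The service may convert other characters as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps an offset reported by the service back to an
// offset in the text that was originally sent.
class OffsetMapper {
    private final int[] escapedToOriginal;
    private final int prefixLength;

    // Assumed escaping, based only on the conversions observed so far.
    static String escape(char c) {
        switch (c) {
            case '&': return "&amp;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '"': return "&quot;";
            default:  return String.valueOf(c);
        }
    }

    // prefixLength is the length of everything the service puts in front of
    // the text, up to and including "<Body>" (an assumption about its format).
    OffsetMapper(String original, int prefixLength) {
        this.prefixLength = prefixLength;
        List<Integer> map = new ArrayList<>();
        for (int i = 0; i < original.length(); i++) {
            // Every character of an expanded entity maps back to the single
            // original character it came from.
            int expandedLength = escape(original.charAt(i)).length();
            for (int k = 0; k < expandedLength; k++) {
                map.add(i);
            }
        }
        map.add(original.length()); // lets an end offset map to the end of the text
        escapedToOriginal = new int[map.size()];
        for (int i = 0; i < escapedToOriginal.length; i++) {
            escapedToOriginal[i] = map.get(i);
        }
    }

    // Part 1: subtract the prefix. Part 2: undo the entity expansion.
    int toOriginal(int serviceOffset) {
        int index = serviceOffset - prefixLength;
        if (index < 0 || index >= escapedToOriginal.length) {
            throw new IllegalArgumentException("offset outside the sent text: " + serviceOffset);
        }
        return escapedToOriginal[index];
    }
}
```

For "xxx & yyy", the escaped body is "xxx &amp; yyy", so a service offset pointing at the first "y" (prefix length plus 10) maps back to offset 6 in the original string.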
All this is apparently a "bug" - their forum includes a post saying the
problem with the "&" will be fixed in the next release.
I've posted a reply to their forum asking about other characters besides
the "&".
One final note: their API says that content sent using the POST method
needs to be escaped. I think that means the kind of escaping that is
done for encoding strings in URLs; I used the Java library method
URLEncoder.encode(string, "UTF-8") to do this and it seemed to do the
trick.
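A minimal sketch of that escaping step, assuming URL-style encoding really is what the API expects. The class name PostBodyEncoder is hypothetical; URLEncoder.encode(String, String) is the standard Java library call, and it declares a checked UnsupportedEncodingException that has to be handled.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper around the standard URLEncoder call.
class PostBodyEncoder {
    static String encode(String content) {
        try {
            // Spaces become '+', and reserved characters such as '&'
            // become percent-escapes (%26).
            return URLEncoder.encode(content, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }
}
```

With this, PostBodyEncoder.encode("xxx & yyy") yields "xxx+%26+yyy".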
-Marshall