Posted to user@uima.apache.org by "Julio C." <ju...@gmail.com> on 2009/10/14 12:39:45 UTC

getting original indexes from JCas Text (Stanford NER and UIMA)

Hi everybody,

I'm working with the Stanford NER and UIMA, and I was wondering if there's an
easy and clean way to get the begin/end position of each word in the original
JCas document text (from .getDocumentText()) after it has been converted into
List<List<CoreLabel>> and processed by the NER.

I would very much appreciate it if someone who has experience with this issue
could help me!

Thanks in advance!

-- 
JC Rodrigues

Re: getting original indexes from JCas Text (Stanford NER and UIMA)

Posted by Christopher Manning <ma...@cs.stanford.edu>.

Jörn Kottmann <ko...@...> writes:
> 
> Julio C. wrote:
> > Hi everybody,
> >
> > I'm working with the Stanford NER and UIMA, and I was wondering if there's an
> > easy and clean way to get the begin/end position of each word in the original
> > JCas document text (from .getDocumentText()) after it has been converted into
> > List<List<CoreLabel>> and processed by the NER.
> >   
> Maybe you can keep an array of your word annotations and
> then use the absolute index of a CoreLabel to map back to
> the word annotation, which can then be used to retrieve its
> offset and length.
> 
> Otherwise you could use a map from each CoreLabel to its
> word annotation.
> 
> Jörn

Hi Julio,

I'm not sure about the UIMA end of things (whose UIMA wrapper of Stanford NER
are you using? Florian Laws'?).

But CoreLabel objects can store begin and end character offsets, since each
one is just a map. So if the wrapper doesn't already do this, it could be
adapted to store the character offsets (under a key such as
CharacterOffsetStart), and then you can read them back from the output.
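
To make the idea concrete, here is a minimal sketch in plain Java with no
Stanford or UIMA classes. Each token is an ordinary Map standing in for a
CoreLabel, and the key names (Text, CharacterOffsetBegin, CharacterOffsetEnd)
as well as the whitespace tokenizer are assumptions made up for illustration.
As far as I know, recent CoreNLP releases expose the real keys as
CoreAnnotations.CharacterOffsetBeginAnnotation and
CoreAnnotations.CharacterOffsetEndAnnotation, but check that against the
version your wrapper uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OffsetTokens {
    // Hypothetical key names; a real CoreLabel uses its own annotation keys.
    static final String TEXT = "Text";
    static final String BEGIN = "CharacterOffsetBegin";
    static final String END = "CharacterOffsetEnd";

    // Splits on whitespace and records where each token sits in the original text.
    static List<Map<String, Object>> tokenize(String text) {
        List<Map<String, Object>> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int begin = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            Map<String, Object> token = new HashMap<>();
            token.put(TEXT, text.substring(begin, i));
            token.put(BEGIN, begin); // inclusive start offset
            token.put(END, i);       // exclusive end offset
            tokens.add(token);
        }
        return tokens;
    }
}
```

Because the offsets travel with each token, whatever the NER adds to the
token later can be turned into a UIMA annotation by reading BEGIN and END
back, without searching the document text again.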

Chris.

Re: getting original indexes from JCas Text (Stanford NER and UIMA)

Posted by Jörn Kottmann <ko...@gmail.com>.

Julio C. wrote:
> Hi everybody,
>
> I'm working with the Stanford NER and UIMA, and I was wondering if there's an
> easy and clean way to get the begin/end position of each word in the original
> JCas document text (from .getDocumentText()) after it has been converted into
> List<List<CoreLabel>> and processed by the NER.
>   
Maybe you can keep an array of your word annotations and
then use the absolute index of a CoreLabel to map back to
the word annotation, which can then be used to retrieve its
offset and length.

Otherwise you could use a map from each CoreLabel to its
word annotation.
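
A rough sketch of both options, in plain Java so it runs without UIMA or
Stanford NER on the classpath. Word and Label are hypothetical stand-ins for
a UIMA word annotation and a CoreLabel, and the method names are made up. The
sketch assumes the labels were created in the same order as the word
annotations, one label per word, and that the NER hands back the same label
objects it was given.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class OffsetLookup {
    // Hypothetical stand-in for a UIMA word annotation (offsets into the document text).
    static final class Word {
        final int begin;
        final int end;
        Word(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // Hypothetical stand-in for a CoreLabel; only the token text is kept here.
    static final class Label {
        final String word;
        Label(String word) { this.word = word; }
    }

    // Option 1: absolute index = labels in earlier sentences + position in this sentence.
    static Word byAbsoluteIndex(List<Word> words, List<List<Label>> sentences,
                                int sentence, int position) {
        int absolute = position;
        for (int s = 0; s < sentence; s++) {
            absolute += sentences.get(s).size();
        }
        return words.get(absolute);
    }

    // Option 2: map each label object to its word annotation.
    // An identity map keeps two tokens with the same text from colliding.
    static Map<Label, Word> byIdentity(List<Word> words, List<List<Label>> sentences) {
        Map<Label, Word> map = new IdentityHashMap<>();
        int absolute = 0;
        for (List<Label> sentence : sentences) {
            for (Label label : sentence) {
                map.put(label, words.get(absolute++));
            }
        }
        return map;
    }
}
```

The index variant needs no extra memory beyond the array but depends on the
sentence and token order staying unchanged; the map variant tolerates
reordering but only works if the label objects themselves survive the NER
step.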

Jörn