Posted to user@uima.apache.org by Julien Nioche <J....@dcs.shef.ac.uk> on 2007/07/11 23:13:55 UTC

get all the annotations located between two positions

Hi,

Sorry if someone has already asked this question.
Is there a direct way to obtain from a CAS all the annotations of a
given type located between two positions in the text? Something like
getContained(String type, int start, int end)?
I am trying to get all the Tokens contained within a specific Sentence.
I have used iterators for this, comparing each offset with those of the
Sentence, but it is a bit tedious. Have I missed something obvious?

Thanks

Julien

Re: Iterators: problem when using standard methods in combination with moveTo*

Posted by Thilo Goetz <tw...@gmx.de>.
Hi Julien,

Julien Nioche wrote:
> Thilo and Marshall,
> 
> Thanks for sharing the tip. Indeed it would be a good idea to add this
> little example to the documentation.
> 
> A quick comment about the Iterator methods. I had a problem with the
> following piece of code:
> 
> while (wordFormIterator.hasNext()) {
>   WordForm wf = (WordForm) wordFormIterator.next();
>   if (wf.getBegin() == token.getBegin() && wf.getEnd() == token.getEnd()) {
>     liste.add(wf);
>   }
>   else {
>     // move back
>     wordFormIterator.moveToPrevious();
>     return liste;
>   }
> }
>
> The last element of the iterator was never accessible because
> hasNext() returned false despite the fact that there WAS an element
> left in there. moveToPrevious() had been previously called on this iterator.
> 
> Should not hasNext() return true even if the cursor has been moved
> forward or backward within the iterator? Or is the use of the legacy
> methods (hasNext(), next()) incompatible with the moveTo* methods?

Hm, I thought this was in our documentation, but I couldn't find it myself.
You should not mix next()/hasNext() with the methods defined in the
FSIterator interface (moveTo*, isValid(), get()).  They do not work well
together.  If you use the FSIterator APIs, you should use them exclusively.
Sorry about that.  I'll add a comment to the Javadocs.
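
As an illustration of sticking to one style, the loop quoted above could be
written with cursor-style calls only.  The sketch below is plain Java and does
not use UIMA: Cursor is a hypothetical stand-in that only mimics the method
names moveToFirst(), isValid(), get() and moveToNext(), and collectMatching is
a made-up helper.  The point it tries to show is that reading the current
element does not advance a cursor, so no step back is needed when a
non-matching element is met.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical cursor over a list; it only mimics the method names
// moveToFirst()/isValid()/get()/moveToNext() and is NOT the UIMA class.
final class Cursor<T> {
    private final List<T> items;
    private int pos = 0;

    Cursor(List<T> items) { this.items = items; }

    void moveToFirst() { pos = 0; }
    void moveToNext() { pos++; }
    boolean isValid() { return pos >= 0 && pos < items.size(); }
    T get() { return items.get(pos); }
}

final class CursorDemo {
    // Collects consecutive elements equal to 'wanted', starting at the
    // cursor's current position. The cursor is left ON the first element
    // that does not match, so the caller can resume from there.
    static <T> List<T> collectMatching(Cursor<T> cursor, T wanted) {
        List<T> liste = new ArrayList<>();
        while (cursor.isValid() && cursor.get().equals(wanted)) {
            liste.add(cursor.get());
            cursor.moveToNext();
        }
        return liste;
    }

    public static void main(String[] args) {
        Cursor<String> it = new Cursor<>(Arrays.asList("a", "a", "b", "c"));
        it.moveToFirst();
        System.out.println(collectMatching(it, "a"));
        System.out.println(collectMatching(it, "b"));
        System.out.println(collectMatching(it, "c")); // last element is reachable
    }
}
```

In the original loop the comparison would be on getBegin()/getEnd() rather
than equals(); the equality test here is only a simplification for the sketch.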

> 
> Thanks
> 
> Julien

Re: get all the annotations located between two positions

Posted by Thilo Goetz <tw...@gmx.de>.
To be a bit more explicit, here's some code that will determine how
many tokens the longest sentence in the document contains.  It's a
silly example, but it illustrates the concept.  Maybe this should go
in the docs.  Note: I have not actually run this code, so it may not
work immediately ;-)

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.cas.text.AnnotationIndex;

    CAS cas = ...;
    Type sentenceType = cas.getTypeSystem().getType("yourSentenceTypeName");
    Type tokenType = cas.getTypeSystem().getType("yourTokenTypeName");
    FSIterator sentenceIt = cas.getAnnotationIndex(sentenceType).iterator();
    AnnotationIndex tokenIndex = cas.getAnnotationIndex(tokenType);
    FSIterator tokenIt;
    int maxLen = 0;
    int currentLen;
    for (sentenceIt.moveToFirst(); sentenceIt.isValid(); sentenceIt.moveToNext()) {
      // Iterate only over the tokens inside the current sentence.
      tokenIt = tokenIndex.subiterator((AnnotationFS) sentenceIt.get());
      currentLen = 0;
      for (tokenIt.moveToFirst(); tokenIt.isValid(); tokenIt.moveToNext()) {
        ++currentLen;
      }
      maxLen = Math.max(maxLen, currentLen);
    }
    System.out.println("Longest sentence contains " + maxLen + " tokens.");
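
To make concrete what "contained in the span" of a sentence means, here is a
minimal plain-Java sketch of the offset comparison that the subiterator call
above saves you from writing by hand.  It deliberately does not use the UIMA
API: Span, SpanDemo and contained are made-up names for illustration only, the
input list is assumed to be sorted by begin offset, and the real subiterator
applies further rules that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an annotation: just a pair of character offsets.
final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    @Override
    public String toString() {
        return "[" + begin + "," + end + ")";
    }
}

final class SpanDemo {
    // Returns the spans lying entirely within the offsets start..end.
    // Assumes 'sorted' is ordered by begin offset (an assumption of this sketch).
    static List<Span> contained(List<Span> sorted, int start, int end) {
        List<Span> result = new ArrayList<>();
        for (Span s : sorted) {
            if (s.begin > end) {
                break; // sorted input: nothing further can be inside
            }
            if (s.begin >= start && s.end <= end) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens of two sentences; the first sentence covers offsets 0..11.
        List<Span> tokens = Arrays.asList(
                new Span(0, 5), new Span(6, 11), new Span(13, 17), new Span(18, 22));
        System.out.println(contained(tokens, 0, 11));
    }
}
```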

--Thilo

Re: get all the annotations located between two positions

Posted by Marshall Schor <ms...@schor.com>.
Did you consider using subiterators?  These are (briefly) described in
section 4.7.4 of the Apache UIMA Reference book, and may be exactly
what you're trying to get at: an iterator over elements that are
"contained" in the span of other elements.

-Marshall
