You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2015/03/20 14:50:12 UTC

Some experimental work speeding up UIMA iterators over Sorted indexes (like the annotation index)

I recently helped a UIMA user speed up a particular component that had n-squared
CAS instance retrieval operations on Annotation subtype T (and its subtypes
Tsub) where there were lots of subtypes.

UIMA indexes allow retrieving items from the CAS, and are supposed to trade off
space (for indexes) for time (speed of finding items in the CAS, speed of
iterating). 

Profiling showed a lot of time being spent in the iterators "heapify up and
down" methods - these are used when iterating over a type which has subtypes, to
select the "next" item in the sorted order which may be of a different type than
the current item.

I was able to speed the annotator up by about 10x (because of the n-squared
algorithm - they iterate over all instances of T, and then, for each instance,
they iterated over the "subannotations" bounded by that instance, looking for a
relationship).  The speedup was done by first extracting the instances (in
sorted order) into a plain Java array of Java-cover-type instances (they're
using the JCas, so these were JCas classes) and then writing an iterator over
that array; this iterator of course no longer had to do the heapify up and down
operations.

This got me thinking that it would not be too hard to this kind of thing
automatically, in core UIMA (extract into a flattened 1-level array) an index
for all instances of a type T (and its subtypes).  The main issue is to figure
out the right time to do this, because an update to the index for T or any of
its subtypes Tsub makes it necessary to re-compute the flattened array.  Given
that I've seen lots of cases of UIMA flows where some annotators will create and
index a type (and its subtypes), and once that's been done, the indexes are not
subsequently updated for these types, but downstream annotators iterate over
them, it seems that a lazy creation for this kind of flattened index would work
well in many cases.

I'm experimenting with this, and will probably add something like this to UIMA
in a way that should be backwards compatible, except for having iterators over
sorted indexes sometimes **no longer** be "fast fail"; this is already allowed
by the specification.  The iterator would check to see if a flattened version
existed, and if so, verify it was valid at the start, but would not be
guaranteed to notice if the index was updated (things added/ removed/ reordered)
while the iteration was in progress.

I'm planning to trigger the creation of this flattened index when the number of
heapify up and down calls > index size (with the counters reset if the indexes
were updated); the creation will be done in an executor on a separate thread. 

At a high level, this would be a bit like JIT for this part of UIMA - improving
"indexes" operation.

-Marshall

Re: Some experimental work speeding up UIMA iterators over Sorted indexes (like the annotation index)

Posted by Marshall Schor <ms...@schor.com>.

I haven't finished the work yet, but soon :-).

-Marshall

On 3/20/2015 11:20 AM, Peter Klügl wrote:
> Very interesting. From the ruta perspective, one can see ruta rules as some
> kind of cascaded FSTs / atomic AEs, which iterate over specific annotations
> that have been created previously but are hardly updated/added in later
> phases. I am already very curious how your changes influence the speed of the
> ruta rule execution.
>
> Best,
>
> Peter
>
> Am 20.03.2015 um 14:50 schrieb Marshall Schor:
>> I recently helped a UIMA user speed up a particular component that had n-squared
>> CAS instance retrieval operations on Annotation subtype T (and its subtypes
>> Tsub) where there were lots of subtypes.
>>
>> UIMA indexes allow retrieving items from the CAS, and are supposed to trade off
>> space (for indexes) for time (speed of finding items in the CAS, speed of
>> iterating).
>>
>> Profiling showed a lot of time being spent in the iterators "heapify up and
>> down" methods - these are used when iterating over a type which has subtypes, to
>> select the "next" item in the sorted order which may be of a different type than
>> the current item.
>>
>> I was able to speed the annotator up by about 10x (because of the n-squared
>> algorithm - they iterate over all instances of T, and then, for each instance,
>> they iterated over the "subannotations" bounded by that instance, looking for a
>> relationship).  The speedup was done by first extracting the instances (in
>> sorted order) into a plain Java array of Java-cover-type instances (they're
>> using the JCas, so these were JCas classes) and then writing an iterator over
>> that array; this iterator of course no longer had to do the heapify up and down
>> operations.
>>
>> This got me thinking that it would not be too hard to this kind of thing
>> automatically, in core UIMA (extract into a flattened 1-level array) an index
>> for all instances of a type T (and its subtypes).  The main issue is to figure
>> out the right time to do this, because an update to the index for T or any of
>> its subtypes Tsub makes it necessary to re-compute the flattened array.  Given
>> that I've seen lots of cases of UIMA flows where some annotators will create and
>> index a type (and its subtypes), and once that's been done, the indexes are not
>> subsequently updated for these types, but downstream annotators iterate over
>> them, it seems that a lazy creation for this kind of flattened index would work
>> well in many cases.
>>
>> I'm experimenting with this, and will probably add something like this to UIMA
>> in a way that should be backwards compatible, except for having iterators over
>> sorted indexes sometimes **no longer** be "fast fail"; this is already allowed
>> by the specification.  The iterator would check to see if a flattened version
>> existed, and if so, verify it was valid at the start, but would not be
>> guaranteed to notice if the index was updated (things added/ removed/ reordered)
>> while the iteration was in progress.
>>
>> I'm planning to trigger the creation of this flattened index when the number of
>> heapify up and down calls > index size (with the counters reset if the indexes
>> were updated); the creation will be done in an executor on a separate thread.
>>
>> At a high level, this would be a bit like JIT for this part of UIMA - improving
>> "indexes" operation.
>>
>> -Marshall
>>   
>
>
>

Re: Some experimental work speeding up UIMA iterators over Sorted indexes (like the annotation index)

Posted by Peter Klügl <pe...@averbis.com>.

Very interesting. From the ruta perspective, one can see ruta rules as 
some kind of cascaded FSTs / atomic AEs, which iterate over specific 
annotations that have been created previously but are hardly 
updated/added in later phases. I am already very curious how your 
changes influence the speed of the ruta rule execution.

Best,

Peter

Am 20.03.2015 um 14:50 schrieb Marshall Schor:
> I recently helped a UIMA user speed up a particular component that had n-squared
> CAS instance retrieval operations on Annotation subtype T (and its subtypes
> Tsub) where there were lots of subtypes.
>
> UIMA indexes allow retrieving items from the CAS, and are supposed to trade off
> space (for indexes) for time (speed of finding items in the CAS, speed of
> iterating).
>
> Profiling showed a lot of time being spent in the iterators "heapify up and
> down" methods - these are used when iterating over a type which has subtypes, to
> select the "next" item in the sorted order which may be of a different type than
> the current item.
>
> I was able to speed the annotator up by about 10x (because of the n-squared
> algorithm - they iterate over all instances of T, and then, for each instance,
> they iterated over the "subannotations" bounded by that instance, looking for a
> relationship).  The speedup was done by first extracting the instances (in
> sorted order) into a plain Java array of Java-cover-type instances (they're
> using the JCas, so these were JCas classes) and then writing an iterator over
> that array; this iterator of course no longer had to do the heapify up and down
> operations.
>
> This got me thinking that it would not be too hard to this kind of thing
> automatically, in core UIMA (extract into a flattened 1-level array) an index
> for all instances of a type T (and its subtypes).  The main issue is to figure
> out the right time to do this, because an update to the index for T or any of
> its subtypes Tsub makes it necessary to re-compute the flattened array.  Given
> that I've seen lots of cases of UIMA flows where some annotators will create and
> index a type (and its subtypes), and once that's been done, the indexes are not
> subsequently updated for these types, but downstream annotators iterate over
> them, it seems that a lazy creation for this kind of flattened index would work
> well in many cases.
>
> I'm experimenting with this, and will probably add something like this to UIMA
> in a way that should be backwards compatible, except for having iterators over
> sorted indexes sometimes **no longer** be "fast fail"; this is already allowed
> by the specification.  The iterator would check to see if a flattened version
> existed, and if so, verify it was valid at the start, but would not be
> guaranteed to notice if the index was updated (things added/ removed/ reordered)
> while the iteration was in progress.
>
> I'm planning to trigger the creation of this flattened index when the number of
> heapify up and down calls > index size (with the counters reset if the indexes
> were updated); the creation will be done in an executor on a separate thread.
>
> At a high level, this would be a bit like JIT for this part of UIMA - improving
> "indexes" operation.
>
> -Marshall
>