You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@uima.apache.org by "Kline, Larry" <La...@mckesson.com> on 2014/04/01 22:01:49 UTC

RE: FilteredIterator is very slow

Thomas,

Thanks for the suggestions.  Before I read your email I happened on this method, FSUtil.getAnnotationsIteratorInSpan(aJCas, annotationType, begin, end).  I pass it the typeId of the type I want to check and the begin and end of the dictTerm's span.  I need to run some tests to make sure it is doing what I assume it is, but if it is correct, it has much better performance than my previous code.  Perhaps this is using the method you suggest.  Yes, my dictTerms are much more numerous than the filter types but currently iterate over them one at a time in the caller, ask if any of the filter types overlaps, and if so discard it.  If it's not discarded I go on to do other things with it.  I could probably preprocess the dictTerms first by checking them for overlap against the filter types and remove the matches.  Why do you think this would be faster?  Isn't it MxN versus NxM?

Thanks again,
Larry

-----Original Message-----
From: Thomas Ginter [mailto:thomas.ginter@utah.edu] 
Sent: Monday, March 31, 2014 12:56 PM
To: user@uima.apache.org
Subject: Re: FilteredIterator is very slow

Larry,

A faster way to get the list of types that you will skip would be to do the following:

FSIndex<TitlePersonHonorificAnnotation> titlePersonHAIndex = aJCas.getAnnotationIndex(TitlePersonHonorificAnnotation.type);

Doing this for each type will yield an index that points to just the annotations in the CAS of each type you are interested in.  From there you can get an iterator reference ( titlePersonHAIndex.iterator() ) and either traverse each one separately or else add them to a common Collection such as an ArrayList and iterate through that.  You could also take advantage of the fact that the default index in UIMA sorts on ascending order on the begin index and descending order on the ending index to stop once you have traversed the list past the ending index of the dictTerm.  

An important design decision though would be to consider whether the dictTerm annotations are much more numerous than the TitlePersonHonorificAnnotation, MeasurementAnnotation, and ProgFactorTerm filtering annotation types.  Generally if the filter types are much more plentiful and the dictTerm type was more rare then looking for overlapping filter types will yield fewer iterations of your algorithm, however if there are a lot of dictTerm occurrences and only a few of the filter types then it may be more efficient to iterate through the filter types and eliminate dictTerms that overlap or are covered.  

Thanks,

Thomas Ginter
801-448-7676
thomas.ginter@utah.edu

On Mar 31, 2014, at 11:47 AM, Kline, Larry <La...@mckesson.com> wrote:

> When I use a filtered FSIterator it's an order of magnitude slower than a non-filtered iterator.  Here's my code:
> 
> Create the iterator:
>       private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws CASException {
>              FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
>              FSTypeConstraint constraint = aJCas.getConstraintFactory().createTypeConstraint();
>              constraint.add((new TitlePersonHonorificAnnotation(aJCas)).getType());
>              constraint.add((new MeasurementAnnotation(aJCas)).getType());
>              constraint.add((new ProgFactorTerm(aJCas)).getType());
>              it = aJCas.createFilteredIterator(it, constraint);
>              return it;
>       }
> Use the iterator:
>       public void process(JCas aJCas) throws AnalysisEngineProcessException {
>              ...
> // The following is done in a loop
>                           if (shouldSkip(dictTerm, skipIter))
>                                  continue;
>              ...
>       }
> Here's the method called:
>       private boolean shouldSkip(G2DictTerm dictTerm, FSIterator<Annotation> skipIter) throws CASException {
>              boolean shouldSkip = false;
>              skipIter.moveToFirst();
>              while (skipIter.hasNext()) {
>                     Annotation annotation = skipIter.next();
>                     if (UIMAUtils.annotationsOverlap(dictTerm, annotation)) {
>                           shouldSkip = true;
>                           break;
>                     }
>              }
>              return shouldSkip;
>       }
> 
> If I change the method, createConstrainedIterator(), to this (that is, no constraints):
>       private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws CASException {
>              FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
>              return it;
>       }
> 
> It runs literally 10 times faster.  Doing some profiling I see that all of the time is spent in the skipIter.moveToFirst() call.  I also tried creating the filtered iterator each time anew in the shouldSkip() method instead of passing it in, but that has even slightly worse performance.
> 
> Given this performance I suppose I should probably use a non-filtered iterator and just check for the types I'm interested in inside the loop.
> 
> Any other suggestions welcome.
> 
> Thanks,
> Larry Kline
> 
>

Re: FilteredIterator is very slow

Posted by Thomas Ginter <th...@utah.edu>.

Larry,

The FSUtil method looks like a good option.  

The reason why one method can be faster than another is because if you have a type 1 with N instances and a type 2 with M instances when checking for overlap you don’t have to check all M instances of type 2 but merely a fraction of M because of the way the index is sorted.  This can reduce the problem to checking Nxm wherein m represents an average scalar subset, the percentage of which it represents being dependent on the size of N.  The smaller the N the smaller the factor of M is required to check.  If you choose a type with a much smaller N then the constant factor becomes smaller than choosing a type with a much larger N for which m approaches the NxM upper bound.

Thanks,

Thomas Ginter
801-448-7676
thomas.ginter@utah.edu




On Apr 1, 2014, at 2:01 PM, Kline, Larry <La...@mckesson.com> wrote:

> Thomas,
> 
> Thanks for the suggestions.  Before I read your email I happened on this method, FSUtil.getAnnotationsIteratorInSpan(aJCas, annotationType, begin, end).  I pass it the typeId of the type I want to check and the begin and end of the dictTerm's span.  I need to run some tests to make sure it is doing what I assume it is, but if it is correct, it has much better performance than my previous code.  Perhaps this is using the method you suggest.  Yes, my dictTerms are much more numerous than the filter types but currently iterate over them one at a time in the caller, ask if any of the filter types overlaps, and if so discard it.  If it's not discarded I go on to do other things with it.  I could probably preprocess the dictTerms first by checking them for overlap against the filter types and remove the matches.  Why do you think this would be faster?  Isn't it MxN versus NxM?
> 
> Thanks again,
> Larry
> 
> -----Original Message-----
> From: Thomas Ginter [mailto:thomas.ginter@utah.edu] 
> Sent: Monday, March 31, 2014 12:56 PM
> To: user@uima.apache.org
> Subject: Re: FilteredIterator is very slow
> 
> Larry,
> 
> A faster way to get the list of types that you will skip would be to do the following:
> 
> FSIndex<TitlePersonHonorificAnnotation> titlePersonHAIndex = aJCas.getAnnotationIndex(TitlePersonHonorificAnnotation.type);
> 
> Doing this for each type will yield an index that points to just the annotations in the CAS of each type you are interested in.  From there you can get an iterator reference ( titlePersonHAIndex.iterator() ) and either traverse each one separately or else add them to a common Collection such as an ArrayList and iterate through that.  You could also take advantage of the fact that the default index in UIMA sorts on ascending order on the begin index and descending order on the ending index to stop once you have traversed the list past the ending index of the dictTerm.  
> 
> An important design decision though would be to consider whether the dictTerm annotations are much more numerous than the TitlePersonHonorificAnnotation, MeasurementAnnotation, and ProgFactorTerm filtering annotation types.  Generally if the filter types are much more plentiful and the dictTerm type was more rare then looking for overlapping filter types will yield fewer iterations of your algorithm, however if there are a lot of dictTerm occurrences and only a few of the filter types then it may be more efficient to iterate through the filter types and eliminate dictTerms that overlap or are covered.  
> 
> Thanks,
> 
> Thomas Ginter
> 801-448-7676
> thomas.ginter@utah.edu
> 
> 
> 
> 
> On Mar 31, 2014, at 11:47 AM, Kline, Larry <La...@mckesson.com> wrote:
> 
>> When I use a filtered FSIterator it's an order of magnitude slower than a non-filtered iterator.  Here's my code:
>> 
>> Create the iterator:
>>      private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws CASException {
>>             FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
>>             FSTypeConstraint constraint = aJCas.getConstraintFactory().createTypeConstraint();
>>             constraint.add((new TitlePersonHonorificAnnotation(aJCas)).getType());
>>             constraint.add((new MeasurementAnnotation(aJCas)).getType());
>>             constraint.add((new ProgFactorTerm(aJCas)).getType());
>>             it = aJCas.createFilteredIterator(it, constraint);
>>             return it;
>>      }
>> Use the iterator:
>>      public void process(JCas aJCas) throws AnalysisEngineProcessException {
>>             ...
>> // The following is done in a loop
>>                          if (shouldSkip(dictTerm, skipIter))
>>                                 continue;
>>             ...
>>      }
>> Here's the method called:
>>      private boolean shouldSkip(G2DictTerm dictTerm, FSIterator<Annotation> skipIter) throws CASException {
>>             boolean shouldSkip = false;
>>             skipIter.moveToFirst();
>>             while (skipIter.hasNext()) {
>>                    Annotation annotation = skipIter.next();
>>                    if (UIMAUtils.annotationsOverlap(dictTerm, annotation)) {
>>                          shouldSkip = true;
>>                          break;
>>                    }
>>             }
>>             return shouldSkip;
>>      }
>> 
>> If I change the method, createConstrainedIterator(), to this (that is, no constraints):
>>      private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws CASException {
>>             FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
>>             return it;
>>      }
>> 
>> It runs literally 10 times faster.  Doing some profiling I see that all of the time is spent in the skipIter.moveToFirst() call.  I also tried creating the filtered iterator each time anew in the shouldSkip() method instead of passing it in, but that has even slightly worse performance.
>> 
>> Given this performance I suppose I should probably use a non-filtered iterator and just check for the types I'm interested in inside the loop.
>> 
>> Any other suggestions welcome.
>> 
>> Thanks,
>> Larry Kline
>> 
>> 
>