You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@uima.apache.org by Erik Fäßler <er...@uni-jena.de> on 2015/10/21 17:07:17 UTC

Performance of UIMAfit JCasUtil.selectCovered() and variants

Hi all,

I’m wondering about the performance differences between

1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS),
2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and
3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>)

It is clear that 3) iterates once through the CAS and just returns a map. Once this is done, map access is swift.

The Javadoc of 2) states that it is slower than 1).
3) states that it is preferable to 2).

Questions:
Is 3) also preferable over 2) when there is only one covering annotation or is the performance of 2) and 3) roughly equal then?
Main question: Is 3) also quicker than 1) if there are many covering annotations?

Use case: I want to iterate through all sentences in paragraphs. Normally, I would use subiterators(), but the known type priority issue could be a problem for me. Should I just use 1)? Or would I still benefit from 3) if I have more than one paragraph?

Thank you very much!

Best,

Erik

Re: Performance of UIMAfit JCasUtil.selectCovered() and variants

Posted by Erik Fäßler <er...@uni-jena.de>.

Very nice, thank you so much :-)

> On 21 Oct 2015, at 18:26, Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> Hi,
> 
> 1 uses the UIMA indexes which I believe use a binary search, so it should
> be something like O(log n).
> 
> 2 is in principle O(n) but since it does a linear scan from the beginning
> and stops when no further annotations may be found, it practice O(n) 
> should be the upper bound when called for annotations towards the end of
> the document.
> 
> 3 is fastest for repeated use. It should be O(n) for creating
> the index and then uses hashmap lookups.
> 
> So 1 and 3 are better than two.
> 
> If you need speed and need coverage information a lot, 3 should be the best.
> 
> 1 and 2 are more convenient for coding.
> 
> If you use plain UIMA and have type priorities set up, then using an
> iterator over sentences and a subiterator over tokens is likely to
> be better than 3 because it doesn't need the initial scan that 3 does.
> 
> I'm not aware that anybody did extensive performance comparisons here.
> Some are being done in org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom()
> which compares 1 and 2. Here a few lines from the test output (mind to increase the
> ITERATIONS variable if you try):
> 
> ...
> Speed up factor 5.50 [naive:11 optimized:2 diff:9]
> Speed up factor 6.67 [naive:20 optimized:3 diff:17]
> Speed up factor 4.00 [naive:16 optimized:4 diff:12]
> Speed up factor 2.50 [naive:30 optimized:12 diff:18]
> Speed up factor 7.00 [naive:35 optimized:5 diff:30]
> Speed up factor 5.63 [naive:45 optimized:8 diff:37]
> Speed up factor 7.78 [naive:70 optimized:9 diff:61]
> Speed up factor 8.09 [naive:89 optimized:11 diff:78]
> ...
> 
> Cheers,
> 
> -- Richard
> 
>> On 21.10.2015, at 17:07, Erik Fäßler <er...@uni-jena.de> wrote:
>> 
>> Hi all,
>> 
>> I’m wondering about the performance differences between
>> 
>> 1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS),
>> 2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and
>> 3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>)
>> 
>> It is clear that 3) iterates once through the CAS and just returns a map. Once this is done, map access is swift.
>> 
>> The Javadoc of 2) states that it is slower than 1).
>> 3) states that it is preferable to 2).
>> 
>> Questions:
>> Is 3) also preferable over 2) when there is only one covering annotation or is the performance of 2) and 3) roughly equal then?
>> Main question: Is 3) also quicker than 1) if there are many covering annotations?
>> 
>> Use case: I want to iterate through all sentences in paragraphs. Normally, I would use subiterators(), but the known type priority issue could be a problem for me. Should I just use 1)? Or would I still benefit from 3) if I have more than one paragraph?
>> 
>> Thank you very much!
>> 
>> Best,
>> 
>> Erik
>

Re: Performance of UIMAfit JCasUtil.selectCovered() and variants

Posted by Erik Fäßler <er...@uni-jena.de>.

Small follow-up, just stumbled across by chance, did not even search for it:

http://searchivarius.org/blog/selectcovered-substantially-better-version-uima-subiterator <http://searchivarius.org/blog/selectcovered-substantially-better-version-uima-subiterator>

Someone has done a performance comparison. It does not necessarily strike me to be the most sophisticated approach, but the code is available and one could use this as a hint.

Best,

Erik

> On 21 Oct 2015, at 18:26, Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> Hi,
> 
> 1 uses the UIMA indexes which I believe use a binary search, so it should
> be something like O(log n).
> 
> 2 is in principle O(n) but since it does a linear scan from the beginning
> and stops when no further annotations may be found, it practice O(n) 
> should be the upper bound when called for annotations towards the end of
> the document.
> 
> 3 is fastest for repeated use. It should be O(n) for creating
> the index and then uses hashmap lookups.
> 
> So 1 and 3 are better than two.
> 
> If you need speed and need coverage information a lot, 3 should be the best.
> 
> 1 and 2 are more convenient for coding.
> 
> If you use plain UIMA and have type priorities set up, then using an
> iterator over sentences and a subiterator over tokens is likely to
> be better than 3 because it doesn't need the initial scan that 3 does.
> 
> I'm not aware that anybody did extensive performance comparisons here.
> Some are being done in org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom()
> which compares 1 and 2. Here a few lines from the test output (mind to increase the
> ITERATIONS variable if you try):
> 
> ...
> Speed up factor 5.50 [naive:11 optimized:2 diff:9]
> Speed up factor 6.67 [naive:20 optimized:3 diff:17]
> Speed up factor 4.00 [naive:16 optimized:4 diff:12]
> Speed up factor 2.50 [naive:30 optimized:12 diff:18]
> Speed up factor 7.00 [naive:35 optimized:5 diff:30]
> Speed up factor 5.63 [naive:45 optimized:8 diff:37]
> Speed up factor 7.78 [naive:70 optimized:9 diff:61]
> Speed up factor 8.09 [naive:89 optimized:11 diff:78]
> ...
> 
> Cheers,
> 
> -- Richard
> 
>> On 21.10.2015, at 17:07, Erik Fäßler <er...@uni-jena.de> wrote:
>> 
>> Hi all,
>> 
>> I’m wondering about the performance differences between
>> 
>> 1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS),
>> 2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and
>> 3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>)
>> 
>> It is clear that 3) iterates once through the CAS and just returns a map. Once this is done, map access is swift.
>> 
>> The Javadoc of 2) states that it is slower than 1).
>> 3) states that it is preferable to 2).
>> 
>> Questions:
>> Is 3) also preferable over 2) when there is only one covering annotation or is the performance of 2) and 3) roughly equal then?
>> Main question: Is 3) also quicker than 1) if there are many covering annotations?
>> 
>> Use case: I want to iterate through all sentences in paragraphs. Normally, I would use subiterators(), but the known type priority issue could be a problem for me. Should I just use 1)? Or would I still benefit from 3) if I have more than one paragraph?
>> 
>> Thank you very much!
>> 
>> Best,
>> 
>> Erik
>

Re: Performance of UIMAfit JCasUtil.selectCovered() and variants

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi,

1 uses the UIMA indexes which I believe use a binary search, so it should
be something like O(log n).

2 is in principle O(n) but since it does a linear scan from the beginning
and stops when no further annotations may be found, it practice O(n) 
should be the upper bound when called for annotations towards the end of
the document.

3 is fastest for repeated use. It should be O(n) for creating
the index and then uses hashmap lookups.

So 1 and 3 are better than two.

If you need speed and need coverage information a lot, 3 should be the best.

1 and 2 are more convenient for coding.

If you use plain UIMA and have type priorities set up, then using an
iterator over sentences and a subiterator over tokens is likely to
be better than 3 because it doesn't need the initial scan that 3 does.

I'm not aware that anybody did extensive performance comparisons here.
Some are being done in org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom()
which compares 1 and 2. Here a few lines from the test output (mind to increase the
ITERATIONS variable if you try):

...
Speed up factor 5.50 [naive:11 optimized:2 diff:9]
Speed up factor 6.67 [naive:20 optimized:3 diff:17]
Speed up factor 4.00 [naive:16 optimized:4 diff:12]
Speed up factor 2.50 [naive:30 optimized:12 diff:18]
Speed up factor 7.00 [naive:35 optimized:5 diff:30]
Speed up factor 5.63 [naive:45 optimized:8 diff:37]
Speed up factor 7.78 [naive:70 optimized:9 diff:61]
Speed up factor 8.09 [naive:89 optimized:11 diff:78]
...

Cheers,

-- Richard

> On 21.10.2015, at 17:07, Erik Fäßler <er...@uni-jena.de> wrote:
> 
> Hi all,
> 
> I’m wondering about the performance differences between
> 
> 1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS),
> 2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and
> 3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>)
> 
> It is clear that 3) iterates once through the CAS and just returns a map. Once this is done, map access is swift.
> 
> The Javadoc of 2) states that it is slower than 1).
> 3) states that it is preferable to 2).
> 
> Questions:
> Is 3) also preferable over 2) when there is only one covering annotation or is the performance of 2) and 3) roughly equal then?
> Main question: Is 3) also quicker than 1) if there are many covering annotations?
> 
> Use case: I want to iterate through all sentences in paragraphs. Normally, I would use subiterators(), but the known type priority issue could be a problem for me. Should I just use 1)? Or would I still benefit from 3) if I have more than one paragraph?
> 
> Thank you very much!
> 
> Best,
> 
> Erik