Posted to dev@uima.apache.org by Erwan Moreau <Er...@lipn.univ-paris13.fr> on 2010/06/15 19:35:01 UTC

Concurrent access to CAS index

Hello,

I am running into problems when several threads read annotations from
the same (default) CAS index inside a single call to the process
method. Since I'm new to UIMA, I'm not sure how to interpret this: is it
normal behaviour caused by wrong usage, or a bug? The exception stack is:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 3
       at java.util.ArrayList.RangeCheck(ArrayList.java:547)
       at java.util.ArrayList.get(ArrayList.java:322)
       at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.initPointerIterator(FSIndexRepositoryImpl.java:628)
       at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:636)
       at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:612)
       at org.apache.uima.cas.impl.FSIndexRepositoryImpl.createPointerIterator(FSIndexRepositoryImpl.java:158)
       at org.apache.uima.cas.impl.FSIndexRepositoryImpl$IndexImpl.iterator(FSIndexRepositoryImpl.java:792)
       at org.apache.uima.cas.impl.AnnotationIndexImpl.iterator(AnnotationIndexImpl.java:97)
       at fr.lipn.uima.testing.TestConcurrentCASAccesAE.getFSIterator(TestConcurrentCASAccesAE.java:59)


I managed to isolate the problem and wrote a simple AE to explain/show
it (attached).

Thanks for your help (and sorry if I missed something in the doc!)

Erwan



Re: Concurrent access to CAS index

Posted by Erwan Moreau <Er...@lipn.univ-paris13.fr>.
Hi,

> On 6/16/2010 19:22, Erwan Moreau wrote:
>>
>>> Seems that an easy work-around would be to have your reader and writer
>>> threads synchronize on their access to the CAS.  If we implemented
>>> concurrent access, this is what we would have to do, inside the CAS
>>> itself.
>>>
>>> When new data are added to the CAS, indexes are often updated.  If 
>>> these
>>> are concurrently being accessed, *bad things* can happen, which is
>>> probably what's happening in your case.
>>>
>>>
>> Well, not exactly because I do not *write* any data in the CAS: threads
>> only read the annotations contained in the CAS, and in my real
>> annotators data is written in the CAS after all threads have terminated.
>> I'm not expert in thread-safety so I might miss something, but at first
>> sight I don't understand how concurrent read access can fail? (though I
>> must admit I did not try to study the source code in the
>> FSIndexRepositoryImpl class)
>
> I agree, this should be possible.  I'll take a look sometime
> when our build has stabilized.
>
> It may have to do with the way our internal iterator cache
> works.  What you could try to do is this: create one iterator
> of every type you're interested in, in a sequential manner.
> You don't need to use them.  Then try your concurrent access
> again.  No guarantees though, I didn't even look at the code.
>
> --Thilo


I was curious, so I investigated more deeply the problems that arise
when several threads read from the CAS simultaneously. My conclusions
are below, in the hope that they can be useful for future
implementations. I'm sorry that I have only run tests using the sources
from release 2.3.0. Please tell me if I should provide anything else
(more details, my testing environment, etc.).

There are actually two places where things can go wrong:

1) Creating iterators simultaneously can either corrupt the data (the 
annotations read do not correspond to the real ones) or sometimes cause 
the following exception:
java.lang.IndexOutOfBoundsException: Index: 0, Size: 6
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.initPointerIterator(FSIndexRepositoryImpl.java:628)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:636)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:612)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl.createPointerIterator(FSIndexRepositoryImpl.java:158)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$IndexImpl.iterator(FSIndexRepositoryImpl.java:792)
    at org.apache.uima.cas.impl.AnnotationIndexImpl.iterator(AnnotationIndexImpl.java:97)
    at erwan.TestConcurrentCASAccesAE.readSomeAnnotations(TestConcurrentCASAccesAE.java:205)
    at erwan.TestConcurrentCASAccesAE$CASReaderThread.run(TestConcurrentCASAccesAE.java:290)
    at java.lang.Thread.run(Thread.java:619)

I managed to solve this problem in two different ways:
- (user side) by creating the iterators sequentially before starting the 
threads, or by synchronizing the calls that create them.
- (UIMA side) in the org.apache.uima.cas.impl.FSIndexRepositoryImpl 
class. The problem seems to come from the createPointerIterator 
methods: (a) the call to iicp.createIndexIteratorCache() creates some 
data (I don't know the internals well!) which is stored in the iicp 
object; (b) the initPointerIterator method, called by new 
[Leaf]PointerIterator(iicp), then reads this data, which another thread 
may have modified in the meantime. So I tested passing this object 
explicitly instead: iicp.createIndexIteratorCache() returns an 
ArrayList<FSLeafIndexImpl> object and the other methods receive it as a 
parameter. That works fine: the error no longer appears (tested with my 
test case over more than 200000 runs).
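
The difference between the two patterns can be pictured with a small
stand-alone sketch. This is not UIMA code: CachingIndex, its methods and
IteratorCacheDemo are hypothetical names, and the shared list only
stands in for whatever data the iicp object holds. It assumes the racy
variant refills a shared list in one step and reads it in the next,
while the safer variant hands a freshly built list to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the iicp object; not UIMA code. */
class CachingIndex {
    private final List<String> source;
    // Shared scratch list used only by the racy variant below.
    private final List<String> sharedCache = new ArrayList<>();

    CachingIndex(List<String> source) {
        this.source = source;
    }

    // Racy pattern, step 1: refill the shared list in place.
    void refillSharedCache() {
        sharedCache.clear();
        sharedCache.addAll(source);
    }

    // Racy pattern, step 2: read the shared list. Without synchronization,
    // another thread may be refilling it at the same moment.
    String firstFromSharedCache() {
        return sharedCache.get(0);
    }

    // Safer pattern, step 1: build a new list and return it to the caller.
    List<String> buildCache() {
        return new ArrayList<>(source);
    }

    // Safer pattern, step 2: read only the list that was passed in.
    static String firstFrom(List<String> localCache) {
        return localCache.get(0);
    }
}

class IteratorCacheDemo {
    /** Runs the parameter-passing variant from several threads and counts failures. */
    static int countFailures(int threadCount, int rounds) {
        CachingIndex index = new CachingIndex(Arrays.asList("a", "b", "c"));
        AtomicInteger failures = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            Thread thread = new Thread(() -> {
                for (int i = 0; i < rounds; i++) {
                    try {
                        List<String> local = index.buildCache();
                        if (!"a".equals(CachingIndex.firstFrom(local))) {
                            failures.incrementAndGet();
                        }
                    } catch (RuntimeException e) {
                        failures.incrementAndGet();
                    }
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return failures.get();
    }
}
```

In the safer variant each thread only ever touches a list it built
itself, so no locking is needed for these two steps; the price is one
extra allocation per call.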


2) Calling the next() (or hasNext()) method simultaneously (on two 
different FSIterator objects, of course) causes exceptions like the 
following:
java.lang.ArrayIndexOutOfBoundsException: 1381
    at org.apache.uima.jcas.impl.JCasHashMap.get(JCasHashMap.java:117)
    at org.apache.uima.jcas.impl.JCasImpl.getJfsFromCaddr(JCasImpl.java:1044)
    at org.apache.uima.jcas.impl.JCasImpl$JCasFsGenerator.createFS(JCasImpl.java:830)
    at org.apache.uima.cas.impl.CASImpl.ll_getFSForRef(CASImpl.java:3106)
    at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:1762)
    at org.apache.uima.cas.impl.FSIteratorWrapper.get(FSIteratorWrapper.java:48)
    at org.apache.uima.cas.impl.FSIteratorImplBase.next(FSIteratorImplBase.java:67)
    at org.apache.uima.cas.impl.FSIteratorImplBase.next(FSIteratorImplBase.java:33)
    at erwan.TestConcurrentCASAccesAE.getNextAnnotation(TestConcurrentCASAccesAE.java:130)
    at erwan.TestConcurrentCASAccesAE.readSomeAnnotations(TestConcurrentCASAccesAE.java:213)
    at erwan.TestConcurrentCASAccesAE$CASReaderThread.run(TestConcurrentCASAccesAE.java:290)
    at java.lang.Thread.run(Thread.java:619)

These errors happen even if the iterators were created before 
starting the threads.
As you told me, this one is certainly due to the caching strategy. The 
comments in the org.apache.uima.jcas.impl.JCasImpl class are clear about 
the fact that the implementation is intended to be single-threaded 
(although I think that point is not documented in the API). Once again 
I ran some tests:
- on the user side, the problem can indeed be solved by synchronizing 
each call to next().
- I have also tested a simple modification in the 
org.apache.uima.jcas.impl.JCasImpl class: in the 
JCasFsGenerator.createFS method, removing the call to 
jcasView.putJfsFromCaddr(addr, fs) solves the problem (also tested over 
more than 200000 runs without error). I guess that roughly corresponds 
to disabling the caching strategy, since nothing is written to the 
cache anymore (?). I don't know what the performance consequences of 
such a modification are, but maybe an option could be offered to 
disable the cache? IMHO it should at least be documented that this 
class is not thread-safe, because it seems quite unusual to me to have 
to synchronize the calls to next().
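
For the user-side workaround, one possible shape is a thin wrapper that
takes a shared lock around every iterator call. This is only a sketch
under assumptions: LockedIterator is a hypothetical helper, not part of
UIMA, and it assumes the iterator can be handled as a plain
java.util.Iterator. Each thread keeps its own iterator, but all wrappers
must share one lock object, because the cache behind them is shared.

```java
import java.util.Iterator;

/**
 * Hypothetical helper: wraps an Iterator so that every call runs while
 * holding a lock shared by all wrappers built on the same data.
 */
final class LockedIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Object lock;

    LockedIterator(Iterator<T> delegate, Object lock) {
        this.delegate = delegate;
        this.lock = lock;
    }

    @Override
    public boolean hasNext() {
        synchronized (lock) {
            return delegate.hasNext();
        }
    }

    @Override
    public T next() {
        synchronized (lock) {
            return delegate.next();
        }
    }

    @Override
    public void remove() {
        synchronized (lock) {
            delegate.remove();
        }
    }
}
```

The wrapper only guards hasNext(), next() and remove(); any other call
that touches the same shared state would need to take the same lock.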

thanks for your work!
Erwan


Re: Concurrent access to CAS index

Posted by Thilo Goetz <tw...@gmx.de>.
On 6/16/2010 19:22, Erwan Moreau wrote:
>
>> Seems that an easy work-around would be to have your reader and writer
>> threads synchronize on their access to the CAS.  If we implemented
>> concurrent access, this is what we would have to do, inside the CAS
>> itself.
>>
>> When new data are added to the CAS, indexes are often updated.  If these
>> are concurrently being accessed, *bad things* can happen, which is
>> probably what's happening in your case.
>>
>>
> Well, not exactly because I do not *write* any data in the CAS: threads
> only read the annotations contained in the CAS, and in my real
> annotators data is written in the CAS after all threads have terminated.
> I'm not expert in thread-safety so I might miss something, but at first
> sight I don't understand how concurrent read access can fail? (though I
> must admit I did not try to study the source code in the
> FSIndexRepositoryImpl class)

I agree, this should be possible.  I'll take a look sometime
when our build has stabilized.

It may have to do with the way our internal iterator cache
works.  What you could try to do is this: create one iterator
of every type you're interested in, in a sequential manner.
You don't need to use them.  Then try your concurrent access
again.  No guarantees though, I didn't even look at the code.

--Thilo

>
>
>> The CAS is used as a "unit-of-work" in many places in UIMA, as well.  If
>> you used it for this purpose, then a workflow might be:
>>
>> Have the Writer write to the process, so the process gets all its
>> inputs, then have the reader read from the process the results.
>>
>> For scale-out, have multiple CASes.
>>
>> Would this work in your use case?  -Marshall
>>
> Yes, indeed. The only quite negative point in this solution is that it
> requires to totally duplicate the data at each input or output step,
> thus needing a bit more time and memory. I guess this solution is more
> "UIMA standard" than synchronizing every CAS access in my threads?
>
> Thanks again!
> Erwan


Re: Concurrent access to CAS index

Posted by Erwan Moreau <Er...@lipn.univ-paris13.fr>.
> Seems that an easy work-around would be to have your reader and writer
> threads synchronize on their access to the CAS.  If we implemented
> concurrent access, this is what we would have to do, inside the CAS
> itself. 
>
> When new data are added to the CAS, indexes are often updated.  If these
> are concurrently being accessed, *bad things* can happen, which is
> probably what's happening in your case.  
>
>   
Well, not exactly, because I do not *write* any data to the CAS: the
threads only read the annotations contained in the CAS, and in my real
annotators data is written to the CAS only after all threads have
terminated. I'm no expert in thread safety, so I might be missing
something, but at first sight I don't understand how concurrent read
access can fail. (Though I must admit I did not try to study the source
code of the FSIndexRepositoryImpl class.)


> The CAS is used as a "unit-of-work" in many places in UIMA, as well.  If
> you used it for this purpose, then a workflow might be:
>
> Have the Writer write to the process, so the process gets all its
> inputs, then have the reader read from the process the results.
>
> For scale-out, have multiple CASes.
>
> Would this work in your use case?  -Marshall
>   
Yes, indeed. The only real drawback of this solution is that it
requires duplicating all of the data at each input or output step,
and so needs a bit more time and memory. I guess this solution is more
"UIMA standard" than synchronizing every CAS access in my threads?

Thanks again!
Erwan

Re: Concurrent access to CAS index

Posted by Marshall Schor <ms...@schor.com>.

On 6/16/2010 10:22 AM, Erwan Moreau wrote:
> Hi,
>
> Thanks for the answer.
>
>   
>> Hi,
>>
>> The CAS is not designed for concurrent access, to my knowledge, but
>> perhaps others can comment more on this.
>>   
>>     
> I'd like to know more about that, because imho this is a quite strong
> limitation: maybe naively, I used to think that using concurrent access
> only for reading was safe, since most concurrency problems occur when
> threads can also write the shared object?
>   

This design, as I recall, was a performance trade-off, where we decided
to have fast CAS access at higher priority than allowing multi-thread
access, especially since the known and imagined use-cases had
multiple-threads using separate CAS objects.

Another factor here was the design of annotators - these are typically
user-written code, done by algorithm experts, not necessarily software
engineers experienced in the nuances of multi-threaded applications. So,
we run annotator instances on just one thread; again, scale-out is done
by instantiating multiple instances of annotators.  So the annotators
don't have to be "thread-safe" (except for static data, which is shared
among the threads).
>> Most scale-out use-cases are designs which also scale out the CASes.  We
>> would be interested in hearing about a use case which motivates
>> multi-threaded access to a single CAS.
>>   
>>     
> Indeed, my use-case probably does not correspond to what UIMA is
> intended for. I must explain a bit the context: we are actually building
> wrapper annotators for external programs called through a ProcessBuilder
> object (yes, the dirty "exec" call). We are aware of the problems that
> this implies, and ideally we would have re-coded our tools from scratch
> as UIMA annotators or used C++ framework. Nevertheless we decided that
> was the best choice, because our team owns a few complex NLP tools which
> are the core of our work and would be very costly to migrate; so we want
> to provide quite quickly a way to use them in a UIMA environment so that
> people start using UIMA when creating higher level components (and maybe
> these core components will be migrated later).
>   

OK.  UIMA has a C++ framework as well, if and when you get around to
migrating your components.
> In this context, we try to provide an "as safe and efficient as
> possible" framework in which these programs are called inside an
> annotator. That is why we use threads to provide the input stream and
> read the output stream. In order to avoid wasting time and space, our
> threads use Reader and Writer objects so that data is transmitted on the
> fly to/from the process (inside the process method). Thus concurrent
> access to the CAS is required when the Writer object that provides the
> stdin stream is still reading annotations, while the Reader object has
> already started to re-align the program output with the CAS content. Of
> course no concurrency problem occurs if the input/output are transmitted
> as simple String objets or as files, but that is clearly less efficient
> (and not safer, as far as i know).
>   

Seems that an easy work-around would be to have your reader and writer
threads synchronize on their access to the CAS.  If we implemented
concurrent access, this is what we would have to do, inside the CAS
itself. 

When new data are added to the CAS, indexes are often updated.  If these
are concurrently being accessed, *bad things* can happen, which is
probably what's happening in your case.  

The CAS is used as a "unit-of-work" in many places in UIMA, as well.  If
you used it for this purpose, then a workflow might be:

Have the Writer write to the process, so the process gets all its
inputs, then have the reader read from the process the results.

For scale-out, have multiple CASes.

Would this work in your use case?  -Marshall
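
A rough sketch of that phased workflow, with all names hypothetical: the
List<String> stands in for the data held in the CAS and the Function
stands in for the external tool. The only point it illustrates is that
the shared data is read in one step, copied, and that everything else
works on the copy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of a read-all, process, write-back workflow. */
class PhasedWorkflow {
    static List<String> run(List<String> sharedAnnotations, Function<String, String> tool) {
        // Phase 1: take a snapshot. This is the only place where the
        // shared data is read, and it happens on the calling thread.
        List<String> input =
            Collections.unmodifiableList(new ArrayList<>(sharedAnnotations));

        // Phase 2: feed the snapshot to the black-box tool. Any helper
        // threads used here would see only the copy.
        List<String> output = new ArrayList<>();
        for (String item : input) {
            output.add(tool.apply(item));
        }

        // Phase 3: hand the results back so that the caller can write
        // them to the shared data, again on a single thread.
        return output;
    }
}
```

The cost is the extra copy of the input and output data.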
> I don't know whether there can be more standard use-cases using threads.
> Nevertheless the problem would be the same if the black box was not an
> external program but any piece of code that can not be modified and
> behaves like a pipe.
>   


> Erwan

Re: Concurrent access to CAS index

Posted by Erwan Moreau <Er...@lipn.univ-paris13.fr>.
Hi,

Thanks for the answer.

> Hi,
>
> The CAS is not designed for concurrent access, to my knowledge, but
> perhaps others can comment more on this.
>   
I'd like to know more about that, because IMHO this is quite a strong
limitation: maybe naively, I used to think that concurrent access was
safe when it is read-only, since most concurrency problems occur when
threads can also write to the shared object?
> Most scale-out use-cases are designs which also scale out the CASes.  We
> would be interested in hearing about a use case which motivates
> multi-threaded access to a single CAS.
>   
Indeed, my use-case probably does not correspond to what UIMA is
intended for. I should explain the context a bit: we are actually
building wrapper annotators for external programs called through a
ProcessBuilder object (yes, the dirty "exec" call). We are aware of the
problems that this implies, and ideally we would have re-coded our tools
from scratch as UIMA annotators or used the C++ framework. Nevertheless
we decided that wrapping was the best choice, because our team owns a
few complex NLP tools which are the core of our work and would be very
costly to migrate; so we want to provide quite quickly a way to use them
in a UIMA environment, so that people start using UIMA when creating
higher level components (and maybe these core components will be
migrated later).

In this context, we try to provide an "as safe and efficient as
possible" framework in which these programs are called inside an
annotator. That is why we use threads to provide the input stream and
read the output stream. In order to avoid wasting time and space, our
threads use Reader and Writer objects so that data is transmitted on the
fly to/from the process (inside the process method). Thus concurrent
access to the CAS is required when the Writer object that provides the
stdin stream is still reading annotations, while the Reader object has
already started to re-align the program output with the CAS content. Of
course no concurrency problem occurs if the input/output are transmitted
as simple String objects or as files, but that is clearly less efficient
(and not safer, as far as I know).

I don't know whether there are more standard use-cases involving
threads. Nevertheless, the problem would be the same if the black box
were not an external program but any piece of code that cannot be
modified and behaves like a pipe.

Erwan




Re: Concurrent access to CAS index

Posted by Marshall Schor <ms...@schor.com>.
Hi,

The CAS is not designed for concurrent access, to my knowledge, but
perhaps others can comment more on this.

Most scale-out use-cases are designs which also scale out the CASes.  We
would be interested in hearing about a use case which motivates
multi-threaded access to a single CAS.

-Marshall
