Posted to dev@ctakes.apache.org by "Finan, Sean" <Se...@childrens.harvard.edu> on 2019/03/28 20:19:58 UTC

Re: Threading and cTAKES (on Spark) [EXTERNAL]

Hi Jeff,

> 1) do you think it might not crash yet produce unreliable results when
using the components in the DefaultClinicalPipeline?

-- I am pretty certain that you would get unreliable results.  I seem to recall attempts with the default pipeline crashing, but with a small corpus one could get lucky.

> 2) Do you have any more information about [Spark]

-- No, not really.  I don't work with it, I am just regurgitating from memory things read or heard.

> 3) In the TS pipelines, what does the "threads" keyword ...

-- "threads" specifies how many threads share a single pipeline.
-- All annotators in this pipeline must be thread-safe.
-- It is up to that single instance of a pipeline to be thread-safe.  "threads" does not enforce anything.
-- "threads n" will attempt to process a maximum of n documents simultaneously on a pipeline.
-- "threads n" works by running the single pipeline on n threads and running a single document through the pipeline on each thread.
-- It is entirely up to the pipeline to determine the concurrency of processing documents.
-- The more thread-safe annotators that don't require locking, the more utilized the threads will be.
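
As a rough illustration of the model described in the list above (one pipeline object, n worker threads, one document per thread at a time), here is a minimal Java sketch. It is not cTAKES code: the pipeline is modeled as a plain java.util.function.Function, and the names SharedPipelineSketch and runShared are hypothetical. Note that nothing in the sketch makes the pipeline safe to share; as stated above, that responsibility stays with the pipeline and its annotators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: the pipeline is a plain Function, not a cTAKES class.
final class SharedPipelineSketch {

    // Runs every document through ONE shared pipeline object on at most
    // `threads` threads. Thread safety is entirely the pipeline's own job.
    static List<String> runShared(Function<String, String> pipeline,
                                  List<String> documents,
                                  int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String document : documents) {
                // Each task pushes a single document through the shared pipeline.
                Callable<String> task = () -> pipeline.apply(document);
                pending.add(executor.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // collected in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for documents", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("pipeline failed on a document", e.getCause());
        } finally {
            executor.shutdown();
        }
    }
}
```

With a stateless function the sketch behaves the same for any thread count; with a pipeline that keeps mutable shared state, it would show exactly the kind of unreliable results discussed in point 1.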

I hope that makes sense.



________________________________________
From: Jeffrey Miller <je...@gmail.com>
Sent: Thursday, March 28, 2019 3:51 PM
To: dev@ctakes.apache.org
Subject: Threading and cTAKES (on Spark) [EXTERNAL]

Hi,

I am following up on an earlier discussion in the "re: ctakes web
service" thread from this month. Apologies if I summarize anyone's comments
incorrectly. Sean had commented that it would not be advisable to create a
pool of pipelines and dispatch 1 per thread in the same process because the
individual AEs have static variables and resources that would be shared
across instances. I can comment that anecdotally, we have not seen crashes
when doing this (but we have seen crashes when we are trying to share 1
pipeline across > 1 thread). Nevertheless, I cannot guarantee that the
annotations are happening correctly all the time or that we might not
occasionally get unlucky and enter into a race condition. It also sounds
like from Peter's comment in the previous thread,
https://lists.apache.org/thread.html/93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc@%3Cdev.ctakes.apache.org%3E
that a pipeline pool across multiple threads has been stable for his work.
I have a couple of questions:

1) Does anyone else have experience with this? Sean, from your comments
before, do you think it might not crash yet produce unreliable results when
using the components in the DefaultClinicalPipeline?

2) Sean, you commented before

> That being said, supposedly you can configure Spark to handle this by
keeping everything contained in a unique copy per thread.  Sort of like
ThreadLocal (I think), but more effective on a full-pipeline level.

Do you have any more information about this? We are currently looking into
it, and it looks like it should be possible to limit each executor (JVM) to
a single thread, but I was wondering if you had any references to the
ThreadLocal-style setup or knew anyone else who had tried it.
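
For reference, a ThreadLocal-style setup in plain Java could look roughly like the sketch below. It is only an assumption about what such a setup might mean: the pipeline is modeled as a java.util.function.Function rather than a cTAKES class, and PerThreadPipelineSketch is a hypothetical name. Each thread lazily builds and then reuses its own pipeline instance. This does not isolate static fields, which are shared by every thread in the JVM, so it does not by itself address the shared-static concern raised earlier in the thread. Limiting an executor to one concurrent task is a separate approach, normally controlled in Spark through the spark.executor.cores setting.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: one pipeline instance per thread via ThreadLocal.
final class PerThreadPipelineSketch {

    // Each thread that calls get() receives its own lazily built pipeline.
    private final ThreadLocal<Function<String, String>> localPipeline;

    PerThreadPipelineSketch(Supplier<Function<String, String>> pipelineFactory) {
        this.localPipeline = ThreadLocal.withInitial(pipelineFactory);
    }

    // Processes a document with the calling thread's private pipeline.
    String process(String document) {
        return localPipeline.get().apply(document);
    }
}
```

The factory runs once per thread on first use, so the cost of building a pipeline is paid once for each worker thread rather than once per document.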

3) In the TS pipelines, what does the "threads" keyword in the piper file
actually enforce? Is it the number of threads it will allow you to share
the pipeline with or does it automatically create a threaded pipeline for
you?

Thanks!
Jeff

Re: Threading and cTAKES (on Spark) [EXTERNAL]

Posted by Jeffrey Miller <je...@gmail.com>.

Thanks again Sean, that is all very helpful.


Re: Threading and cTAKES (on Spark) [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.

Actually, my implementation does not share a single pipeline across
threads; it creates a set of separate pipelines.  I found that once the
code is in memory, it does not take long to instantiate many pipelines.
Each one is attached to a thread-safe pool object that also hosts a
re-settable jCas.  When a request arrives on a thread, one of these
pipeline-jCas pairs is activated and assigned to a document.  Typically
each pool object needs about 1.7G of memory.  On a multi-core machine we
can run as many parallel threads as we have memory for and bring the
processor idle time down to 10% or less.  Since it doesn't rely on the
annotators being thread-safe, I can use any of them.  Where they have
class variables, these are usually for configuration only, and by
instantiating all of them ahead of time on a single thread, they are
safely initialized.  The multithreading only happens at document
processing time.  We've run high-intensity sessions with many threads for
12-15 hours and never seen any conflicts.
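
The pooling approach described above can be sketched in Java roughly as follows. This is a simplified illustration and not the actual implementation: the pipeline is modeled as a java.util.function.Function, a StringBuilder stands in for the re-settable jCas, and the names PipelinePoolSketch and Slot are hypothetical. All pipelines are created up front on the constructing thread, and each request borrows one pipeline-workspace pair for the duration of a single document.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of a pool of independent pipelines.
final class PipelinePoolSketch {

    // One pool entry: a private pipeline plus a reusable workspace
    // (the workspace is a stand-in for the re-settable jCas).
    private static final class Slot {
        final Function<CharSequence, String> pipeline;
        final StringBuilder workspace = new StringBuilder();

        Slot(Function<CharSequence, String> pipeline) {
            this.pipeline = pipeline;
        }
    }

    private final BlockingQueue<Slot> idle;

    // Builds every pipeline ahead of time on the calling thread.
    PipelinePoolSketch(int size, Supplier<Function<CharSequence, String>> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new Slot(factory.get()));
        }
    }

    // Borrows a free slot, resets its workspace, processes one document,
    // and always returns the slot to the pool afterwards.
    String process(String document) {
        Slot slot;
        try {
            slot = idle.take(); // blocks until a pipeline is free
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a free pipeline", e);
        }
        try {
            slot.workspace.setLength(0);
            slot.workspace.append(document);
            return slot.pipeline.apply(slot.workspace);
        } finally {
            idle.add(slot);
        }
    }

    // Number of pipelines currently not assigned to a document.
    int idleCount() {
        return idle.size();
    }
}
```

Because a slot is held by only one thread at a time, the pipelines themselves never run concurrently on the same instance; any state held in static fields would still be shared across slots.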
