Posted to user@uima.apache.org by Steve Suppe <ss...@llnl.gov> on 2008/04/23 00:51:39 UTC

Server Socket Timeout Woes

Hi all,

Thanks so much for this list - I'm constantly lurking and learning things :)

I'm having trouble with our distributed cluster - our setup is as follows:

We have a 'reader' node reading from the local FS, 15 'worker' nodes each 
running identical aggregates of analysis and consumers that connect to an 
Oracle DB for final storing of data results.  On each worker I have 
multiple instances running, typically 32, so I have 15x32 connections to 
Oracle.  I have about 20,000,000 documents to process.

After a certain amount of time, I start to get Broken Pipe server socket 
exceptions, of the following:

4/21/08 5:40:24 PM - 11: 
org.apache.uima.adapter.vinci.CASTransportable.toStream(288): WARNING: 
Broken pipe
java.net.SocketException: Broken pipe
         at java.net.SocketOutputStream.socketWrite0(Native Method)
         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
         at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
         at 
org.apache.vinci.transport.XTalkTransporter.writeInt(XTalkTransporter.java:508)
         at 
org.apache.vinci.transport.XTalkTransporter.stringToBin(XTalkTransporter.java:446)
         at 
org.apache.uima.adapter.vinci.CASTransportable$XTalkSerializer.startElement(CASTransportable.java:219)
         at 
org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.startElement(XCASSerializer.java:327)
         at 
org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.encodeFS(XCASSerializer.java:466)
         at 
org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.encodeIndexed(XCASSerializer.java:347)
         at 
org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.serialize(XCASSerializer.java:271)
         at 
org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.access$600(XCASSerializer.java:62)
         at 
org.apache.uima.cas.impl.XCASSerializer.serialize(XCASSerializer.java:919)
         at 
org.apache.uima.adapter.vinci.CASTransportable.toStream(CASTransportable.java:279)
         at 
org.apache.vinci.transport.BaseServerRunnable.run(BaseServerRunnable.java:90)
         at 
org.apache.vinci.transport.BaseServer$PooledThread.run(BaseServer.java:101)

and

org.apache.uima.collection.impl.base_cpm.container.ServiceConnectionException: 
The service did not complete a call within the specified time. (Thread 
Name: [Procesing Pipeline#172 Thread]::) Host: 192.168.3.52 Port: 11000 
Exceeded Timeout Value: 600000
         at 
org.apache.uima.collection.impl.cpm.container.deployer.VinciTAP.sendAndReceive(VinciTAP.java:533)
         at 
org.apache.uima.collection.impl.cpm.container.deployer.VinciTAP.analyze(VinciTAP.java:927)
         at 
org.apache.uima.collection.impl.cpm.container.NetworkCasProcessorImpl.process(NetworkCasProcessorImpl.java:198)
         at 
org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:1071)
         at 
org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:668)
org.apache.uima.resource.ResourceProcessException
         at 
org.apache.uima.collection.impl.cpm.container.NetworkCasProcessorImpl.process(NetworkCasProcessorImpl.java:200)
         at 
org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:1071)
         at 
org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:668)



I've found that if I set the <timeout> for each casProcessor too low (and 
also maxConsecutiveRestarts), I only get about 70,000 documents in before 
the whole thing goes sour.  If I raise everything obscenely high (say, 
1,000,000,000 ms), then I get about 400,000 in.  Then the whole system 
freezes, and nothing gets into Oracle.  I keep my Vinci descriptor for the 
VNS server at unlimited, and the serverSocketTimeout for the Vinci 
descriptor for the CPE obscenely high as well.

I don't know if I'm adequately explaining my problem, but I'm trying to 
figure out the best way to set my timeouts on the CPE and the Vinci 
descriptors as well.

My next attempt is to keep the timeouts from the CPE side very high, the 
Vinci VNS descriptor unlimited, and the serverSocketTimeout at 30000ms.

I guess, overall, I would like to give an AE ample time to work, but not 
so long that it never returns.  And since I have 32x15 
processingUnitThreadCounts, I also need the timeout to be large enough to 
cover initialization.
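
For concreteness, the knobs in question live in the CPE descriptor's per-casProcessor error handling section. A hedged sketch follows (element names from the UIMA CPE descriptor schema; the values are purely illustrative, not a recommendation):

```xml
<casProcessor deployment="remote" name="MyRemoteAE">
  <!-- ... descriptor, filter, etc. ... -->
  <errorHandling>
    <!-- how many consecutive restarts to attempt before taking the action -->
    <maxConsecutiveRestarts action="terminate" value="3"/>
    <!-- act if more than 100 of every 1000 CASes fail -->
    <errorRateThreshold action="terminate" value="100/1000"/>
    <!-- per-call timeout, in milliseconds -->
    <timeout max="600000"/>
  </errorHandling>
  <checkpoint batch="1000"/>
</casProcessor>
```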

Sorry for the rambling, does anyone have any general guidelines/experiences 
for this kind of setup?

Thanks!

Steve


Re: Server Socket Timeout Woes

Posted by Steve Suppe <ss...@llnl.gov>.
Hi Adam, thanks for the reply,

You KNOW, thinking back, when we were running UIMA 2.1.0, I HAD hacked up 
the pools to be ArrayLists instead of arrays, so I didn't have to worry 
about it.  It was POC code, and I forgot about it as it became production 
for the last year.  I think I will roll us back to that version and see 
what happens.

Thanks!

Steve

At 07:38 PM 4/22/2008, you wrote:
>On Tue, Apr 22, 2008 at 6:51 PM, Steve Suppe <ss...@llnl.gov> wrote:
> > Hi all,
> >
> >  Thanks so much for this list - I'm constantly lurking and learning things
> > :)
> >
> >  I'm having trouble with our distributed cluster - our setup is as follows:
> >
> >  We have a 'reader' node reading from the local FS, 15 'worker' nodes each
> > running identical aggregates of analysis and consumers that connect to an
> > oracle DB for final storing of data results.  On each worker I have 
> multiple
> > instances running, typically 32, so I have 15x32 connections to Oracle.  I
> > have about 20,000,000 documents to process.
> >
> >  After a certain amount of time, I start to get Broken Pipe server socket
> > exceptions, of the following:
> >  <snip/>
>
>I'm not completely sure, but this might be related to this JIRA issue:
>http://issues.apache.org/jira/browse/UIMA-821.  I ran into this
>problem when I had a large number of clients trying to connect to a
>Vinci service, just like what you're doing.  The Vinci service has a
>default thread pool of size 20 - if there are more clients than that,
>things didn't work right.  The first 20 clients hogged all the threads
>and the other clients couldn't get in.
>
>UIMA 2.2.2 will allow you to configure the server thread pool size so
>this problem doesn't occur.  Hopefully this release will be out soon -
>it is currently up for an approval vote from the Apache Incubator.
>
>  -Adam


Re: Server Socket Timeout Woes

Posted by Eddie Epstein <ea...@gmail.com>.
On Wed, Apr 30, 2008 at 5:12 PM, Charles Proefrock <ch...@hotmail.com>
wrote:

> Thanks for the heads-up.  I'm new to the Apache world ... Is there a
> repository of requirements or a component functional description for
> UIMA-AS, or does an interested party need to join the dev group and download
> the project to access this material?


There was a post on uima-dev some time back with a pointer to more info:
http://cwiki.apache.org/UIMA/uimaasdoc.html


Need to provide honest assessments of the technology to users, and propose
> next steps ... so information is always helpful. In any event, always good to
> see something better come along that has the words "no code changes
> required" associated with it ;)
>

No code changes, right. We took advantage of breakage caused by the
namespace move to org.apache to make a few other changes, but otherwise UIMA
has done a good job of preserving backwards compatibility all along.

A simple way to think about UIMA-AS is that the core UIMA aggregate defines
a CPE (note that collection readers, analysis engines and CAS consumers are
all allowed in a UIMA aggregate), and UIMA-AS leverages asynchronous
middleware (JMS implementations) to provide a flexible way of deploying UIMA
aggregate components in order to achieve scale up.

Eddie

RE: Server Socket Timeout Woes

Posted by Charles Proefrock <ch...@hotmail.com>.
Eddie,
 
Thanks for the heads-up.  I'm new to the Apache world ... Is there a repository of requirements or a component functional description for UIMA-AS, or does an interested party need to join the dev group and download the project to access this material?
 
Need to provide honest assessments of the technology to users, and propose next steps ... so information is always helpful. In any event, always good to see something better come along that has the words "no code changes required" associated with it ;)
 
Thanks for your efforts,
 
Charles



> Date: Tue, 29 Apr 2008 16:11:58 -0400
> From: eaepstein@gmail.com
> To: uima-user@incubator.apache.org
> Subject: Re: Server Socket Timeout Woes
> <snip/>

Re: Server Socket Timeout Woes

Posted by Eddie Epstein <ea...@gmail.com>.
Vinci has scalability limitations for the reasons you describe below. In
case you are not aware, there is a new scalability layer, UIMA AS, which is
on the verge of being ready for release. UIMA AS uses ActiveMQ for the
communications layer, providing both proper load balancing and dynamic
addition/subtraction of service instances. Some other features of UIMA AS:

   - includes all of the error handling logic in the CPM and more
   - fully supports the UIMA flow controller
   - enables replication for individual AEs, not just entire CPE
   pipelines
   - no code changes required for Apache UIMA components
   - supports C++ annotators running as native processes, no Java wrapper
   required

Wish it were out already. Trying hard.
Eddie

On Tue, Apr 29, 2008 at 1:22 PM, Charles Proefrock <ch...@hotmail.com>
wrote:

> I'm excited to see this thread for its affirmation that someone has
> pushed Vinci scalability to the point that Steve has at LLNL.  Also, to know
> the currently released version has some limitations.  At the risk of
> diverting this thread, let me share what we've found.
>
> I'm on board with Adam's line of thinking.  We've just spent 2 weeks
> experimenting with the various options for exclusive/random allocation of
> Vinci services, finding that 'exclusive' is the most reliable way to balance
> load (random sometimes hands all of the clients the same service while other
> services go unused).  The phrase "when a service is needed" isn't clear in
> the documentation.  As Adam indicated, our finding is that "need" occurs
> only at client thread initialization time as opposed to each process(CAS)
> call.  Additionally, "exclusive" is not exactly clear, as two client threads
> can be handed the same service if the number of services available are less
> than the number of threads initializing.  This behavior is robust (better to
> get a remote than have nothing allocated), but it isn't clear from our
> relatively small setup (two threads, two remotes) what the word 'exclusive'
> means or how large a system can get before 'exclusive' pans out as the
> right/wrong approach.
>
> In the face of services starting/stopping on remote computers (e.g.,
> during multi-platform reboot), there seems to be no way to robustly take
> advantage of additional services coming on-line.  If "when needed" meant
> each process(CAS) call (as an option at least ... to trade the re-connect
> negotiation overhead for dynamic scalability), then a system that
> initializes to 5 remotes can balance out as 10,20,30 remotes come online.
>  For now, we are using the CPE 'numToProcess' parameter to exit the CPE,
> then construct a new CPE and re-enter the process() routine to seek out new
> services periodically.
> Also, we are seeing a startup sequence that sometimes results in the first
> document sent to each remote returning immediately with a connection/timeout
> exception ... so we catch those items and re-submit them at the end of the
> queue in case they really did exit due to a valid timeout exception.
>
> Any feedback/collaboration would be appreciated.
>
> - Charles
>
>

RE: Server Socket Timeout Woes

Posted by Charles Proefrock <ch...@hotmail.com>.
Yes, the overhead of connecting to a short-run-time remote may be a concern.  In the system you describe, and in my assumptions about scalability, it seems like the bulk of a single pipeline would be distributed out to worker nodes in some nested fashion.  So, the master CPE might just distribute the CAS to a first-level set of remotes, which may act as distributors to second-level remotes doing the work (or maybe that's the role of the 15*32 threads running in the master CPE ... there isn't a lot of work for them to do other than send out and wait for a response, so you may only need a single-level distributor).  The time savings would be in keeping the AEs close to each other and use the architecture to move work out to those nodes.  Am I headed in the right direction?
So, keeping the current features and adding a once-per-process re-connect for that master CPE to use would be a small step in the right direction.  (Unless this line of discussion is moot given the post about UIMA AS).
 
- Charles



> Date: Wed, 30 Apr 2008 09:48:56 -0700
> To: uima-user@incubator.apache.org
> From: ssuppe@llnl.gov
> Subject: RE: Server Socket Timeout Woes
> <snip/>

RE: Server Socket Timeout Woes

Posted by Steve Suppe <ss...@llnl.gov>.
Thanks for the kind words - it's all been out of necessity, not some grand 
scheme!  I too have thought about the balance load aspects of the 'job 
scheduler.'  Even without the ability to add/subtract additional resources 
(a nice feature), it seems that the current setup is missing some other 
niceties as well.

I find that in order for all of our nodes to be used, I have to 
'overshoot' the number of instances I'd really like to process.  This is 
because if, say, I had 10 worker nodes, and I started 10 instances, there's 
a good chance some of them will get 2 instances per worker, or more, while 
others would get 0.  So I oversaturate the lines and hope for the best.

I think, as had been said in this thread, perhaps the best bet would be to 
allow a thread to get a resource simply for the length of a single 
processCas(), then release it back to the pool.  I suppose there are some 
overhead issues with this?  But at least you wouldn't worry about wasting 
so many threads all of the time.  Maybe a few different options, such as 
the current setup, a new thread per processCas, and maybe a way to gain 
priority?  So if you're constantly "checking out" the same type of thread, 
you're allowed to hold on to a longer "lease" of that thread, and overhead 
time goes down?  Something like DHCP, but for worker threads :)  Of course, 
that might be too complicated and not worth the effort.
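
That DHCP-style lease could be sketched as a pool where a repeat caller keeps a sticky claim on its worker and only pays the checkout cost once. Everything below is a toy illustration with hypothetical names, not Vinci or UIMA API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy lease-based worker pool: repeat clients keep a sticky lease on a worker. */
public class LeasePool {
    private final Deque<Integer> free = new ArrayDeque<>();
    private final Map<String, Integer> leases = new HashMap<>();

    public LeasePool(int size) {
        for (int i = 0; i < size; i++) free.add(i);
    }

    /** Check out a worker; a client with a live lease reuses its worker for free. */
    public synchronized int acquire(String clientId) {
        Integer held = leases.get(clientId);
        if (held != null) return held;          // lease renewal: no checkout cost
        Integer w = free.poll();
        if (w == null) throw new IllegalStateException("pool exhausted");
        leases.put(clientId, w);
        return w;
    }

    /** Expire the client's lease; the worker returns to the free set. */
    public synchronized void release(String clientId) {
        Integer w = leases.remove(clientId);
        if (w != null) free.add(w);
    }
}
```

A real version would also expire idle leases on a timer, which is where the DHCP analogy earns its keep.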

It seems like taking a resource just long enough to perform one block of 
work (one processCas) is the simplest and most 'tried-and-true' 
form.  However, at least in most of our work, each processCas is really 
pretty quick, so it would look like a lot of overhead for switching threads 
around all of the time.  Of course 'pretty quick' is relative, and in 
computer-time is closer to an eternity.  But we're averaging 100s to 1000s 
of documents per second, so if we're ALWAYS setting up and tearing down, 
that could eat into our efficiency.

These are just some of my thoughts, anyone have any ideas?

Steve

At 10:22 AM 4/29/2008, you wrote:
>I'm excited to see this thread for its affirmation that someone has 
>pushed Vinci scalability to the point that Steve has at LLNL.  Also, to 
>know the currently released version has some limitations.  At the risk of 
>diverting this thread, let me share what we've found.
>
>I'm on board with Adam's line of thinking.  We've just spent 2 weeks 
>experimenting with the various options for exclusive/random allocation of 
>Vinci services, finding that 'exclusive' is the most reliable way to 
>balance load (random sometimes hands all of the clients the same service 
>while other services go unused).  The phrase "when a service is needed" 
>isn't clear in the documentation.  As Adam indicated, our finding is that 
>"need" occurs only at client thread initialization time as opposed to each 
>process(CAS) call.  Additionally, "exclusive" is not exactly clear, as two 
>client threads can be handed the same service if the number of services 
>available are less than the number of threads initializing.  This behavior 
>is robust (better to get a remote than have nothing allocated), but it 
>isn't clear from our relatively small setup (two threads, two remotes) 
>what the word 'exclusive' means or how large a system can get before 
>'exclusive' pans out as the right/wrong approach.
>
>In the face of services starting/stopping on remote computers (e.g., 
>during multi-platform reboot), there seems to be no way to robustly take 
>advantage of additional services coming on-line.  If "when needed" meant 
>each process(CAS) call (as an option at least ... to trade the re-connect 
>negotiation overhead for dynamic scalability), then a system that 
>initializes to 5 remotes can balance out as 10,20,30 remotes come 
>online.  For now, we are using the CPE 'numToProcess' parameter to exit 
>the CPE, then construct a new CPE and re-enter the process() routine to 
>seek out new services periodically.
>Also, we are seeing a startup sequence that sometimes results in the first 
>document sent to each remote returning immediately with a 
>connection/timeout exception ... so we catch those items and re-submit 
>them at the end of the queue in case they really did exit due to a valid 
>timeout exception.
>
>Any feedback/collaboration would be appreciated.
>
>- Charles
>
>
>
>
> > Date: Wed, 23 Apr 2008 17:44:50 -0400
> > From: alally@alum.rpi.edu
> > To: uima-user@incubator.apache.org
> > Subject: Re: Server Socket Timeout Woes
> > <snip/>
>


RE: Server Socket Timeout Woes

Posted by Charles Proefrock <ch...@hotmail.com>.
I'm excited to see this thread for its affirmation that someone has pushed Vinci scalability to the point that Steve has at LLNL.  Also, to know the currently released version has some limitations.  At the risk of diverting this thread, let me share what we've found.
 
I'm on board with Adam's line of thinking.  We've just spent 2 weeks experimenting with the various options for exclusive/random allocation of Vinci services, finding that 'exclusive' is the most reliable way to balance load (random sometimes hands all of the clients the same service while other services go unused).  The phrase "when a service is needed" isn't clear in the documentation.  As Adam indicated, our finding is that "need" occurs only at client thread initialization time as opposed to each process(CAS) call.  Additionally, "exclusive" is not exactly clear, as two client threads can be handed the same service if the number of services available are less than the number of threads initializing.  This behavior is robust (better to get a remote than have nothing allocated), but it isn't clear from our relatively small setup (two threads, two remotes) what the word 'exclusive' means or how large a system can get before 'exclusive' pans out as the right/wrong approach.
 
In the face of services starting/stopping on remote computers (e.g., during multi-platform reboot), there seems to be no way to robustly take advantage of additional services coming on-line.  If "when needed" meant each process(CAS) call (as an option at least ... to trade the re-connect negotiation overhead for dynamic scalability), then a system that initializes to 5 remotes can balance out as 10,20,30 remotes come online.  For now, we are using the CPE 'numToProcess' parameter to exit the CPE, then construct a new CPE and re-enter the process() routine to seek out new services periodically.
Also, we are seeing a startup sequence that sometimes results in the first document sent to each remote returning immediately with a connection/timeout exception ... so we catch those items and re-submit them at the end of the queue in case they really did exit due to a valid timeout exception.
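
The re-submit tactic described above amounts to a work queue that pushes failures to the back of the line with a retry cap. A minimal sketch (class and method names are hypothetical, not CPE API):

```java
import java.util.Deque;
import java.util.function.Predicate;

/** Toy work queue that re-enqueues failed items, up to a retry limit. */
public class RetryQueue {
    static final class Item {
        final String doc;
        int attempts;
        Item(String doc) { this.doc = doc; }
    }

    /**
     * Drain the queue; an item that fails goes to the back of the line,
     * so its retry happens after the rest of the pending work.
     */
    static int drain(Deque<Item> queue, Predicate<Item> process, int maxAttempts) {
        int done = 0;
        while (!queue.isEmpty()) {
            Item it = queue.poll();
            it.attempts++;
            if (process.test(it)) {
                done++;
            } else if (it.attempts < maxAttempts) {
                queue.addLast(it);   // retry later
            }                        // else: give up (route to an error log)
        }
        return done;
    }
}
```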
 
Any feedback/collaboration would be appreciated.
 
- Charles
 



> Date: Wed, 23 Apr 2008 17:44:50 -0400> From: alally@alum.rpi.edu> To: uima-user@incubator.apache.org> Subject: Re: Server Socket Timeout Woes> > On Wed, Apr 23, 2008 at 4:39 PM, Steve Suppe <ss...@llnl.gov> wrote:> > Hello again,> >> > I think you are 100% right here. I managed to roll back to my patched> > version of UIMA 2.1.0. In this one, I implemented the pool of threads as> > automatically expandable. This seemed to solve all of our problems, and> > things are chugging away very happily now.> >> > I know this is the user group, but is this something I should look to> > contributing somehow?> >> > Definitely - you could open a JIRA issue and attach a patch. We> should probably think a bit about how this thread pool was supposed to> work, though. My first thought is that the clients would round-robin> over the available threads, and each thread would be used for only one> request and would then be relinquished back into the pool. But> instead, it looks like the client holds onto a thread for the entire> time that client is connected, which doesn't make a whole lot of> sense. If the thread pool worked in a more sensible way, it might not> need to be expandable.> > -Adam

Re: Server Socket Timeout Woes

Posted by Adam Lally <al...@alum.rpi.edu>.
On Wed, Apr 23, 2008 at 4:39 PM, Steve Suppe <ss...@llnl.gov> wrote:
> Hello again,
>
>  I think you are 100% right here.  I managed to roll back to my patched
> version of UIMA 2.1.0.  In this one, I implemented the pool of threads as
> automatically expandable.  This seemed to solve all of our problems, and
> things are chugging away very happily now.
>
>  I know this is the user group, but is this something I should look to
> contributing somehow?
>

Definitely - you could open a JIRA issue and attach a patch.  We
should probably think a bit about how this thread pool was supposed to
work, though.  My first thought is that the clients would round-robin
over the available threads, and each thread would be used for only one
request and would then be relinquished back into the pool.  But
instead, it looks like the client holds onto a thread for the entire
time that client is connected, which doesn't make a whole lot of
sense.  If the thread pool worked in a more sensible way, it might not
need to be expandable.

  -Adam
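The per-request model Adam describes can be sketched with java.util.concurrent. This is an illustration of the idea only, not Vinci's actual BaseServer code; the class and method names are made up:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: each *request* borrows a pooled thread and returns it when
// the request completes, instead of one thread being pinned to a
// client for the life of its connection. A small pool can then serve
// many more clients than it has threads.
class PerRequestPool {
    private final ExecutorService pool;

    PerRequestPool(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    Future<String> handle(String request) {
        // The thread is held only for the duration of this one request,
        // then relinquished back into the pool.
        return pool.submit(() -> "handled:" + request);
    }

    void shutdown() { pool.shutdown(); }
}
```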

Re: Server Socket Timeout Woes

Posted by Steve Suppe <ss...@llnl.gov>.
Hello again,

I think you are 100% right here.  I managed to roll back to my patched 
version of UIMA 2.1.0.  In this one, I implemented the pool of threads as 
automatically expandable.  This seemed to solve all of our problems, and 
things are chugging away very happily now.

I know this is the user group, but is this something I should look to 
contributing somehow?

Thanks for the brainstorm, guys!

Steve
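An "automatically expandable" pool along the lines Steve describes can be approximated with a ThreadPoolExecutor whose maximum size exceeds its core size. Again, this is a sketch of the general technique, not the actual patch against UIMA 2.1.0:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: a pool that starts small but grows on demand. With a
// SynchronousQueue, each submitted task either reuses an idle thread
// or forces the pool to create a new one (up to max), so a burst of
// clients is not starved behind a fixed-size pool. Extra threads
// retire after sitting idle for the keep-alive interval.
class ExpandablePool {
    static ThreadPoolExecutor create(int core, int max) {
        return new ThreadPoolExecutor(
                core, max,
                60L, TimeUnit.SECONDS,           // idle extra threads retire
                new SynchronousQueue<Runnable>());
    }
}
```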


At 07:38 PM 4/22/2008, you wrote:
>On Tue, Apr 22, 2008 at 6:51 PM, Steve Suppe <ss...@llnl.gov> wrote:
> > Hi all,
> >
> >  Thanks so much for this list - I'm constantly lurking and learning things
> > :)
> >
> >  I'm having trouble with our distributed cluster - our setup is as follows:
> >
> >  We have a 'reader' node reading from the local FS, 15 'worker' nodes each
> > running identical aggregates of analysis and consumers that connect to an
> > oracle DB for final storing of data results.  On each worker I have 
> multiple
> > instances running, typically 32, so I have 15x32 connections to Oracle.  I
> > have about 20,000,000 documents to process.
> >
> >  After a certain amount of time, I start to get Broken Pipe server socket
> > exceptions, of the following:
> >  <snip/>
>
>I'm not completely sure, but this might be related to this JIRA issue:
>http://issues.apache.org/jira/browse/UIMA-821.  I ran into this
>problem when I had a large number of clients trying to connect to a
>Vinci service, just like what you're doing.  The Vinci service has a
>default thread pool of size 20 - if there are more clients than that,
>things didn't work right.  The first 20 clients hogged all the threads
>and the other clients couldn't get in.
>
>UIMA 2.2.2 will allow you to configure the server thread pool size so
>this problem doesn't occur.  Hopefully this release will be out soon -
>it is currently up for an approval vote from the Apache Incubator.
>
>  -Adam


Re: Server Socket Timeout Woes

Posted by Adam Lally <al...@alum.rpi.edu>.
On Tue, Apr 22, 2008 at 6:51 PM, Steve Suppe <ss...@llnl.gov> wrote:
> Hi all,
>
>  Thanks so much for this list - I'm constantly lurking and learning things
> :)
>
>  I'm having trouble with our distributed cluster - our setup is as follows:
>
>  We have a 'reader' node reading from the local FS, 15 'worker' nodes each
> running identical aggregates of analysis and consumers that connect to an
> oracle DB for final storing of data results.  On each worker I have multiple
> instances running, typically 32, so I have 15x32 connections to Oracle.  I
> have about 20,000,000 documents to process.
>
>  After a certain amount of time, I start to get Broken Pipe server socket
> exceptions, of the following:
>  <snip/>

I'm not completely sure, but this might be related to this JIRA issue:
http://issues.apache.org/jira/browse/UIMA-821.  I ran into this
problem when I had a large number of clients trying to connect to a
Vinci service, just like what you're doing.  The Vinci service has a
default thread pool of size 20 - if there are more clients than that,
things didn't work right.  The first 20 clients hogged all the threads
and the other clients couldn't get in.

UIMA 2.2.2 will allow you to configure the server thread pool size so
this problem doesn't occur.  Hopefully this release will be out soon -
it is currently up for an approval vote from the Apache Incubator.

 -Adam
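The starvation mode Adam describes (the first N clients hold all the server threads, later clients never get in) is easy to reproduce in miniature with a fixed pool. This is purely illustrative; Vinci's pool is its own implementation, and the names here are invented:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: with a fixed pool of "server threads" and more long-lived
// "clients" than threads, each client holding its thread for the whole
// connection, the surplus clients never get served.
class FixedPoolStarvation {
    // Returns how many clients managed to start within the wait window.
    static int startedWithin(int poolSize, int clients, long waitMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch started = new CountDownLatch(clients);
        CountDownLatch hangUp = new CountDownLatch(1);
        for (int i = 0; i < clients; i++) {
            pool.submit(() -> {
                started.countDown();          // this client got a thread...
                try { hangUp.await(); }       // ...and holds it indefinitely
                catch (InterruptedException ignored) { }
            });
        }
        started.await(waitMs, TimeUnit.MILLISECONDS);
        int n = clients - (int) started.getCount();
        hangUp.countDown();                   // let everyone disconnect
        pool.shutdown();
        return n;
    }
}
```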

Re: Server Socket Timeout Woes

Posted by Steve Suppe <ss...@llnl.gov>.
Marshall,

Thanks for the response.  We're running 2.2.1.  I had this nagging 
suspicion that maybe I was treating the symptoms, not the problem, but at 
this point I'm out of ideas :)  My next attempt will be to bring the 
threads and pool size down to something more manageable.  This is supposed 
to be 'rev2' of our design, hardware-wise.  I'll go back to my original CPE 
settings (lower count, etc) from rev1, and hope for the best.

Thanks again, I will keep you posted.
Steve

At 05:00 PM 4/22/2008, you wrote:
>Hi Steve -
>
>I'm no expert in these matters, but I wonder if changing the timeouts is 
>the right approach.  Have you isolated the problem to something wrong with 
>the timeouts?  Could it be something else (some rare race condition 
>causing a hang at some point, for instance)?
>What level of UIMA are you running?
>
>-Marshall
>
>Steve Suppe wrote:
>>Hi all,
>>
>>Thanks so much for this list - I'm constantly lurking and learning things :)
>>
>>I'm having trouble with our distributed cluster - our setup is as follows:
>>
>>We have a 'reader' node reading from the local FS, 15 'worker' nodes each 
>>running identical aggregates of analysis and consumers that connect to an 
>>oracle DB for final storing of data results.  On each worker I have 
>>multiple instances running, typically 32, so I have 15x32 connections to 
>>Oracle.  I have about 20,000,000 documents to process.
>>
>>After a certain amount of time, I start to get Broken Pipe server socket 
>>exceptions, of the following:
>>
>>4/21/08 5:40:24 PM - 11: 
>>org.apache.uima.adapter.vinci.CASTransportable.toStream(288): WARNING: 
>>Broken pipe
>>java.net.SocketException: Broken pipe
>>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>>         at 
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>         at 
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
>>         at 
>> org.apache.vinci.transport.XTalkTransporter.writeInt(XTalkTransporter.java:508) 
>>
>>         at 
>> org.apache.vinci.transport.XTalkTransporter.stringToBin(XTalkTransporter.java:446) 
>>
>>         at 
>> org.apache.uima.adapter.vinci.CASTransportable$XTalkSerializer.startElement(CASTransportable.java:219) 
>>
>>         at 
>> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.startElement(XCASSerializer.java:327) 
>>
>>         at 
>> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.encodeFS(XCASSerializer.java:466) 
>>
>>         at 
>> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.encodeIndexed(XCASSerializer.java:347) 
>>
>>         at 
>> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.serialize(XCASSerializer.java:271) 
>>
>>         at 
>> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.access$600(XCASSerializer.java:62) 
>>
>>         at 
>> org.apache.uima.cas.impl.XCASSerializer.serialize(XCASSerializer.java:919)
>>         at 
>> org.apache.uima.adapter.vinci.CASTransportable.toStream(CASTransportable.java:279) 
>>
>>         at 
>> org.apache.vinci.transport.BaseServerRunnable.run(BaseServerRunnable.java:90)
>>         at 
>> org.apache.vinci.transport.BaseServer$PooledThread.run(BaseServer.java:101)
>>
>>and
>>
>>org.apache.uima.collection.impl.base_cpm.container.ServiceConnectionException: 
>>The service did not complete a call within the specified time. (Thread 
>>Name: [Procesing Pipeline#172 Thread]::) Host: 192.168.3.52 Port: 11000 
>>Exceeded Timeout Value: 600000
>>         at 
>> org.apache.uima.collection.impl.cpm.container.deployer.VinciTAP.sendAndReceive(VinciTAP.java:533) 
>>
>>         at 
>> org.apache.uima.collection.impl.cpm.container.deployer.VinciTAP.analyze(VinciTAP.java:927) 
>>
>>         at 
>> org.apache.uima.collection.impl.cpm.container.NetworkCasProcessorImpl.process(NetworkCasProcessorImpl.java:198) 
>>
>>         at 
>> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:1071) 
>>
>>         at 
>> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:668) 
>>
>>org.apache.uima.resource.ResourceProcessException
>>         at 
>> org.apache.uima.collection.impl.cpm.container.NetworkCasProcessorImpl.process(NetworkCasProcessorImpl.java:200) 
>>
>>         at 
>> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:1071) 
>>
>>         at 
>> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:668) 
>>
>>
>>
>>
>>I've found that if I lower my <timeout> for each casProcessor too low, 
>>(and also my maxConsecutiveRestarts), I only get about 70000 documents in 
>>before the whole thing goes sour.  If I raise everything to obscenely 
>>high (say, 1,000,000,000 ms), then I get about 400,000 in.
>>Then the whole system freezes, and nothing gets into Oracle.  I keep my 
>>vinci descriptor for the VNS server at unlimited, and the 
>>serverSocketTimeout for the Vinci descriptor for the CPE obscenely high 
>>as well.
>>
>>I don't know if I'm adequately explaining my problem, but I'm trying to 
>>figure out the best way to set my timeouts on the CPE and the Vinci 
>>descriptors as well.
>>
>>My next attempt is to keep the timeouts from the CPE side very high, the 
>>Vinci VNS descriptor unlimited, and the serverSocketTimeout at 30000ms.
>>
>>I guess, overall, I would like to give ample time to let an AE work, but 
>>not so long it never returns.  This includes the fact that since I have 
>>32x15 processingUnitThreadCounts, I need the timeout to be large enough 
>>at initialization.
>>
>>Sorry for the rambling, does anyone have any general 
>>guidelines/experiences for this kind of setup?
>>
>>Thanks!
>>
>>Steve
>>
>>


Re: Server Socket Timeout Woes

Posted by Marshall Schor <ms...@schor.com>.
Hi Steve -

I'm no expert in these matters, but I wonder if changing the timeouts is 
the right approach.  Have you isolated the problem to something wrong 
with the timeouts?  Could it be something else (some rare race condition 
causing a hang at some point, for instance)? 

What level of UIMA are you running?

-Marshall

Steve Suppe wrote:
> Hi all,
>
> Thanks so much for this list - I'm constantly lurking and learning 
> things :)
>
> I'm having trouble with our distributed cluster - our setup is as 
> follows:
>
> We have a 'reader' node reading from the local FS, 15 'worker' nodes 
> each running identical aggregates of analysis and consumers that 
> connect to an oracle DB for final storing of data results.  On each 
> worker I have multiple instances running, typically 32, so I have 
> 15x32 connections to Oracle.  I have about 20,000,000 documents to 
> process.
>
> After a certain amount of time, I start to get Broken Pipe server 
> socket exceptions, of the following:
>
> 4/21/08 5:40:24 PM - 11: 
> org.apache.uima.adapter.vinci.CASTransportable.toStream(288): WARNING: 
> Broken pipe
> java.net.SocketException: Broken pipe
>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>         at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at 
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
>         at 
> org.apache.vinci.transport.XTalkTransporter.writeInt(XTalkTransporter.java:508) 
>
>         at 
> org.apache.vinci.transport.XTalkTransporter.stringToBin(XTalkTransporter.java:446) 
>
>         at 
> org.apache.uima.adapter.vinci.CASTransportable$XTalkSerializer.startElement(CASTransportable.java:219) 
>
>         at 
> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.startElement(XCASSerializer.java:327) 
>
>         at 
> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.encodeFS(XCASSerializer.java:466) 
>
>         at 
> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.encodeIndexed(XCASSerializer.java:347) 
>
>         at 
> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.serialize(XCASSerializer.java:271) 
>
>         at 
> org.apache.uima.cas.impl.XCASSerializer$XCASDocSerializer.access$600(XCASSerializer.java:62) 
>
>         at 
> org.apache.uima.cas.impl.XCASSerializer.serialize(XCASSerializer.java:919) 
>
>         at 
> org.apache.uima.adapter.vinci.CASTransportable.toStream(CASTransportable.java:279) 
>
>         at 
> org.apache.vinci.transport.BaseServerRunnable.run(BaseServerRunnable.java:90) 
>
>         at 
> org.apache.vinci.transport.BaseServer$PooledThread.run(BaseServer.java:101) 
>
>
> and
>
> org.apache.uima.collection.impl.base_cpm.container.ServiceConnectionException: 
> The service did not complete a call within the specified time. (Thread 
> Name: [Procesing Pipeline#172 Thread]::) Host: 192.168.3.52 Port: 
> 11000 Exceeded Timeout Value: 600000
>         at 
> org.apache.uima.collection.impl.cpm.container.deployer.VinciTAP.sendAndReceive(VinciTAP.java:533) 
>
>         at 
> org.apache.uima.collection.impl.cpm.container.deployer.VinciTAP.analyze(VinciTAP.java:927) 
>
>         at 
> org.apache.uima.collection.impl.cpm.container.NetworkCasProcessorImpl.process(NetworkCasProcessorImpl.java:198) 
>
>         at 
> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:1071) 
>
>         at 
> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:668) 
>
> org.apache.uima.resource.ResourceProcessException
>         at 
> org.apache.uima.collection.impl.cpm.container.NetworkCasProcessorImpl.process(NetworkCasProcessorImpl.java:200) 
>
>         at 
> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:1071) 
>
>         at 
> org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:668) 
>
>
>
>
> I've found that if I lower my <timeout> for each casProcessor too low, 
> (and also my maxConsecutiveRestarts), I only get about 70000 documents 
> in before the whole thing goes sour.  If I raise everything to 
> obscenely high (say, 1,000,000,000 ms), then I get about 400,000 in.  
> Then the whole system freezes, and nothing gets into Oracle.  I keep my 
> vinci descriptor for the VNS server at unlimited, and the 
> serverSocketTimeout for the Vinci descriptor for the CPE obscenely 
> high as well.
>
> I don't know if I'm adequately explaining my problem, but I'm trying 
> to figure out the best way to set my timeouts on the CPE and the Vinci 
> descriptors as well.
>
> My next attempt is to keep the timeouts from the CPE side very high, 
> the Vinci VNS descriptor unlimited, and the serverSocketTimeout at 
> 30000ms.
>
> I guess, overall, I would like to give ample time to let an AE work, 
> but not so long it never returns.  This includes the fact that since I 
> have 32x15 processingUnitThreadCounts, I need the timeout to be large 
> enough at initialization.
>
> Sorry for the rambling, does anyone have any general 
> guidelines/experiences for this kind of setup?
>
> Thanks!
>
> Steve
>
>
>