You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@uima.apache.org by Jörn Kottmann <ko...@gmail.com> on 2010/01/06 10:10:46 UTC

UIMA AS: Worker node hangs after errors occur

Hi,

our AEs call a web service and in case the web service
gets updated or is down for a short period of time
the worker nodes stop answering requests from
the input queue, but they should continue answering
requests and reporting errors back to the client.

There is no specific error configuration which I believe means
that all errors are reported back to the client and the terminate
or disable action is not taken. But I think its disconnecting itself
from the input queue.

Maybe this log output is helpful:

Exception stack traces from my AEs process method ..

1/5/10 4:25:16 PM - 15: 
org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.stopCasMultiplier: 
INFO: Controll
er: SearchEngine Stopping Remote Delegate Cas Multiplier: SearchEngine

Exception stack traces from my AEs process method ..

1/5/10 4:25:17 PM - 15: 
org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.stopCasMultiplier: 
INFO: Controller: SearchEngine Stopping Remote Delegate Cas Multiplier: 
SearchEngine

Exception stack traces from my AEs process method ..

1/5/10 4:25:19 PM - 15: 
org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.stopCasMultiplier: 
INFO: Controll
er: SearchEngine Stopping Remote Delegate Cas Multiplier: SearchEngine

Exception stack traces from my AEs process method ..

1/5/10 4:55:19 PM - 19: 
org.apache.uima.adapter.jms.activemq.JmsOutputChannel$ConnectionTimer.startTimer: 
INFO: Inactivity Timer Expired. Thread: Controller:SearchEngine:Reply 
TimerThread-:ID:dkcphlinh1master-55655-1262686967339-0:0:1:12724040314479401 
Controller: SearchEngine Timeout Value: 1,800,000 Endpoint Name: 
ID:dkcphlinh1master-55655-1262686967339-0:0:1

The following is printed to the console a few times:
 >>> Instance Hashcode:1272339693 Controller:SearchEngine Stopping 
Collocated Delegate Cas Multiplier:DocumentRetrieverCasMultiplier

Jörn

Re: UIMA AS: Worker node hangs after errors occur

Posted by Jörn Kottmann <ko...@gmail.com>.
Jaroslaw Cwiklik wrote:
> Your pasted log messages that indicate that a service is stopping. Any
> exceptions in the log? Also, can you increase log level to see if there is
> more revealing information. 

Which log level do you suggest for tracing these problems
in a production system ?

Jörn

Re: UIMA AS: Worker node hangs after errors occur

Posted by Jörn Kottmann <ko...@gmail.com>.
Jaroslaw Cwiklik wrote:
> Jorn, I've checked the code and see that a log message is confusing in the
> scenario you've described. The message says something like this:
>
> Controller:Test Aggregate TAE Stopping Collocated Delegate Cas
> Multiplier:TestMultiplier
>
> followed by something like this:
>
>   
>>>>> Cas Multiplier:Simple Text Segmenter Stopped Generating CASes from
>>>>>           
> Input CAS:-61dfae6a:1267b2ab2ab:-7fe8
>
> The first message is wrong and confusing. What is happening in your scenario
> is the following:
>
> 1) CM receives input CAS C1
> 2) The CM generates a new CAS C2
> 3) C2 is sent to Delegate Service D1
> 4) D1 throws exception on C2
> 5) Aggregate receives exception from D1 and sends STOP message to CM asking
> it to stop producing
>     new CASes from C1. This message is not sent to stop the CM!
> 6) CM returns C1 to the Aggregate
> 7) Aggregate determines that C1 has been marked as failed ( because of C2
> failure)
> 8) Aggregate returns C1 marked as failed to the client
>
> The confusion is the message that the aggregate logs: Stopping Collocated
> Delegate Cas Multiplier:
> It should say something like, Sending Request to CM to Stop Generating New
> CASes from CAS id: xxx
>
> Can you confirm that the client receives C1 with an exception.
>   
In the test it did, also think that works in our production system.

Jörn

Re: UIMA AS: Worker node hangs after errors occur

Posted by Jaroslaw Cwiklik <ui...@gmail.com>.
Jorn, I've checked the code and see that a log message is confusing in the
scenario you've described. The message says something like this:

Controller:Test Aggregate TAE Stopping Collocated Delegate Cas
Multiplier:TestMultiplier

followed by something like this:

>>>> Cas Multiplier:Simple Text Segmenter Stopped Generating CASes from
Input CAS:-61dfae6a:1267b2ab2ab:-7fe8

The first message is wrong and confusing. What is happening in your scenario
is the following:

1) CM receives input CAS C1
2) The CM generates a new CAS C2
3) C2 is sent to Delegate Service D1
4) D1 throws exception on C2
5) Aggregate receives exception from D1 and sends STOP message to CM asking
it to stop producing
    new CASes from C1. This message is not sent to stop the CM!
6) CM returns C1 to the Aggregate
7) Aggregate determines that C1 has been marked as failed ( because of C2
failure)
8) Aggregate returns C1 marked as failed to the client

The confusion is the message that the aggregate logs: Stopping Collocated
Delegate Cas Multiplier:
It should say something like, Sending Request to CM to Stop Generating New
CASes from CAS id: xxx

Can you confirm that the client receives C1 with an exception.

JC



On Fri, Jan 29, 2010 at 11:30 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> Jaroslaw Cwiklik wrote:
>
>> Jorn, I need more information to determine the cause of the problem. Can
>> you
>> clarify your deployment? I understand you have worker nodes running UIMA
>> AS
>> services. These services connect to a remote Web Service. I presume that
>> this connection is done your code in the AE. Is this correct?
>>
>> Your pasted log messages that indicate that a service is stopping. Any
>> exceptions in the log? Also, can you increase log level to see if there is
>> more revealing information. Another idea, is to attach jConsole to the
>> UIMA
>> AS service that is not processing messages and look at the JVM threads to
>> see if there is a hang somewhere. Also, check UIMA JMX MBeans. There
>> should
>> be MBeans for CM and their CasPools. Are the pools empty indicating that
>> CASes are stuck somewhere?
>>
>>
> It happened again.
>
> This time I checked the process with JMX, the CAS pools where all empty,
> and
> 11 CASes where waiting in the input queue of an AE. Does that mean that the
> AE.process
> method blocks ?
>
> Can my hang be explained by blocking AE.process methods ? Is there a
> default
> timeout for that case ?
>
> Actually since I last reported the hang issue it happened two more times,
> today
> and a few days ago. I have a core dump of the process from a few days ago.
> I checked the stack traces and it looked like the CM at the beginning of
> the pipe
> was waiting in next() -> getEmptyCAS(), but it locked because the CAS pools
> where empty (?).
> All other threads where waiting in framework code, and not in my
> implementation code.
>
> The stopping message in the log file occurs always if an exception is
> thrown from an
> AE which is behind a CM. I opened a jira issue to demonstrate that.
>
> Jörn
>

Re: UIMA AS: Worker node hangs after errors occur

Posted by Jörn Kottmann <ko...@gmail.com>.
Jaroslaw Cwiklik wrote:
> Jorn, I need more information to determine the cause of the problem. Can you
> clarify your deployment? I understand you have worker nodes running UIMA AS
> services. These services connect to a remote Web Service. I presume that
> this connection is done your code in the AE. Is this correct?
>
> Your pasted log messages that indicate that a service is stopping. Any
> exceptions in the log? Also, can you increase log level to see if there is
> more revealing information. Another idea, is to attach jConsole to the UIMA
> AS service that is not processing messages and look at the JVM threads to
> see if there is a hang somewhere. Also, check UIMA JMX MBeans. There should
> be MBeans for CM and their CasPools. Are the pools empty indicating that
> CASes are stuck somewhere?
>   
It happened again.

This time I checked the process with JMX, the CAS pools where all empty, and
11 CASes where waiting in the input queue of an AE. Does that mean that 
the AE.process
method blocks ?

Can my hang be explained by blocking AE.process methods ? Is there a default
timeout for that case ?

Actually since I last reported the hang issue it happened two more 
times, today
and a few days ago. I have a core dump of the process from a few days ago.
I checked the stack traces and it looked like the CM at the beginning of 
the pipe
was waiting in next() -> getEmptyCAS(), but it locked because the CAS 
pools where empty (?).
All other threads where waiting in framework code, and not in my 
implementation code.

The stopping message in the log file occurs always if an exception is 
thrown from an
AE which is behind a CM. I opened a jira issue to demonstrate that.

Jörn

Re: UIMA AS: Worker node hangs after errors occur

Posted by Jörn Kottmann <ko...@gmail.com>.
Jaroslaw Cwiklik wrote:
> Jorn, I need more information to determine the cause of the problem. Can you
> clarify your deployment? I understand you have worker nodes running UIMA AS
> services. These services connect to a remote Web Service. I presume that
> this connection is done your code in the AE. Is this correct?
>   
Yes the calls are done in the process method, and can fail
which should result in an exception which should be reported back
to the client for that particular CAS.
> Your pasted log messages that indicate that a service is stopping. Any
> exceptions in the log?
As far as I have seen the log only contains exception thrown
by the calls to the WS from the process method.
>  Also, can you increase log level to see if there is
> more revealing information. Another idea, is to attach jConsole to the UIMA
> AS service that is not processing messages and look at the JVM threads to
> see if there is a hang somewhere. Also, check UIMA JMX MBeans. There should
> be MBeans for CM and their CasPools. Are the pools empty indicating that
> CASes are stuck somewhere?
>   
I set up the JMX agent for the service and will check it with jconsole
the next time the problem occurs.

Thanks for your help,
Jörn

Re: UIMA AS: Worker node hangs after errors occur

Posted by Jaroslaw Cwiklik <ui...@gmail.com>.
Jorn, I need more information to determine the cause of the problem. Can you
clarify your deployment? I understand you have worker nodes running UIMA AS
services. These services connect to a remote Web Service. I presume that
this connection is done your code in the AE. Is this correct?

Your pasted log messages that indicate that a service is stopping. Any
exceptions in the log? Also, can you increase log level to see if there is
more revealing information. Another idea, is to attach jConsole to the UIMA
AS service that is not processing messages and look at the JVM threads to
see if there is a hang somewhere. Also, check UIMA JMX MBeans. There should
be MBeans for CM and their CasPools. Are the pools empty indicating that
CASes are stuck somewhere?

JC



On Wed, Jan 6, 2010 at 4:10 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> Hi,
>
> our AEs call a web service and in case the web service
> gets updated or is down for a short period of time
> the worker nodes stop answering requests from
> the input queue, but they should continue answering
> requests and reporting errors back to the client.
>
> There is no specific error configuration which I believe means
> that all errors are reported back to the client and the terminate
> or disable action is not taken. But I think its disconnecting itself
> from the input queue.
>
> Maybe this log output is helpful:
>
> Exception stack traces from my AEs process method ..
>
> 1/5/10 4:25:16 PM - 15:
> org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.stopCasMultiplier:
> INFO: Controll
> er: SearchEngine Stopping Remote Delegate Cas Multiplier: SearchEngine
>
> Exception stack traces from my AEs process method ..
>
> 1/5/10 4:25:17 PM - 15:
> org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.stopCasMultiplier:
> INFO: Controller: SearchEngine Stopping Remote Delegate Cas Multiplier:
> SearchEngine
>
> Exception stack traces from my AEs process method ..
>
> 1/5/10 4:25:19 PM - 15:
> org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.stopCasMultiplier:
> INFO: Controll
> er: SearchEngine Stopping Remote Delegate Cas Multiplier: SearchEngine
>
> Exception stack traces from my AEs process method ..
>
> 1/5/10 4:55:19 PM - 19:
> org.apache.uima.adapter.jms.activemq.JmsOutputChannel$ConnectionTimer.startTimer:
> INFO: Inactivity Timer Expired. Thread: Controller:SearchEngine:Reply
> TimerThread-:ID:dkcphlinh1master-55655-1262686967339-0:0:1:12724040314479401
> Controller: SearchEngine Timeout Value: 1,800,000 Endpoint Name:
> ID:dkcphlinh1master-55655-1262686967339-0:0:1
>
> The following is printed to the console a few times:
> >>> Instance Hashcode:1272339693 Controller:SearchEngine Stopping
> Collocated Delegate Cas Multiplier:DocumentRetrieverCasMultiplier
>
> Jörn
>

Re: UIMA AS: Worker node hangs after errors occur

Posted by Jörn Kottmann <ko...@gmail.com>.
Jörn Kottmann wrote:
> Hi,
>
> our AEs call a web service and in case the web service
> gets updated or is down for a short period of time
> the worker nodes stop answering requests from
> the input queue, but they should continue answering
> requests and reporting errors back to the client.
>

The issue is now resolved it turned out that one of the calls
to the web service from the AE.process method just blocked
forever. After a while all worker threads where blocked and UIMA AS
stopped dequeuing messages from the input queue.

Jörn