Posted to dev@manifoldcf.apache.org by ju...@francelabs.com on 2021/03/02 17:53:00 UTC

Inactive MCF agent

Hi Karl,

I recently ran into a strange case where a job in the "running" state did
nothing for several hours. The MCF agent process was up, but neither the
Simple History nor the logs showed any activity. Since we could not wait
more than 12 hours, we decided to restart the agent, and the job "went back
on rails" and continued its work normally.
To avoid such manual intervention as much as possible, I have two questions:
- Is there a way to "test" the agent process? Something like a "process
ping" that can detect whether the process is doing, or is ready to do,
something? If not, is there an easy way to implement such a thing? The idea
is to make the detection and restart automatic rather than manual.
- Given that we have enabled the debug log level, would you have any
recommendations on what to look at to find a potential cause of such an
issue?
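
One possible shape for such a "process ping", offered as an assumption rather than anything ManifoldCF provides: since the hang showed up as silence in the logs, an external watchdog could treat a log file that has not been written to for too long as a sign of inactivity. The class name, method name and threshold below are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical watchdog helper; not part of ManifoldCF. */
final class LogStalenessCheck {

    /**
     * Returns true if the file's last modification is older than maxIdle,
     * measured against the supplied reference time.
     */
    static boolean isStale(Path logFile, Duration maxIdle, Instant now) throws IOException {
        Instant lastWrite = Files.getLastModifiedTime(logFile).toInstant();
        // Idle time is the gap between the last write and the reference time
        return Duration.between(lastWrite, now).compareTo(maxIdle) > 0;
    }
}
```

A scheduler could call `isStale` periodically and trigger a restart when it returns true. A job that is legitimately idle also produces no log output, so a check like this would need to be combined with the job status before acting on it.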

Regards,
Julien Massiera

Re: Inactive MCF agent

Posted by Karl Wright <da...@gmail.com>.

If anything running in the agents process runs out of memory, it's fatal
and corrupting and MUST shut the process down.  So if your connector throws
an OutOfMemoryError, the agents process will log something to the console
and exit.  It should exit with a very specific exit code.  So all you have
to do to make the MCF agents process shut itself down is NOT catch any
OutOfMemoryError, or if you do, rethrow it.
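
A minimal sketch of that advice, assuming connector code that already has a broad catch block. `OomRethrowSketch` and `runStep` are hypothetical names for illustration, not ManifoldCF API:

```java
import java.util.concurrent.Callable;

/** Illustration only: shows a broad catch that still lets OutOfMemoryError escape. */
final class OomRethrowSketch {

    /**
     * Runs one unit of work. Ordinary failures yield the fallback value,
     * but an OutOfMemoryError is always rethrown to the caller.
     */
    static <T> T runStep(Callable<T> step, T fallback) {
        try {
            return step.call();
        } catch (OutOfMemoryError oom) {
            // Fatal condition: do not swallow it, let the process decide to exit
            throw oom;
        } catch (Throwable t) {
            // Everything else is treated as recoverable in this sketch
            return fallback;
        }
    }
}
```

The order of the catch clauses matters: the `OutOfMemoryError` clause must come before the broader `Throwable` clause. Separately from connector code, HotSpot JVMs (8u92 and later) accept `-XX:+ExitOnOutOfMemoryError`, which terminates the JVM on the first OOM; whether the MCF start scripts set it is not stated in this thread.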

Karl


On Tue, Mar 16, 2021 at 6:06 AM <ju...@francelabs.com> wrote:

> Hi Karl,
>
> Took me some time to reproduce, but I was able to dump the process after
> it happened again, and it appears that an OOM is the cause of the problem.
> After investigation, it seems that this OOM was triggered by a
> transformation connector I had developed. I increased the JVM heap size a
> little and the problem never happened again. For info, I had limited the
> number of connections of that connector to only 1, to be sure this was not
> a potential cause of the issue.
> My question is: to make sure that the agent process crashes instead of
> staying up in a similar case (an OOM in my scenario), is there something
> that can be done at the connector level or at a more global level in MCF?
>
> Regards,
> Julien

RE: Inactive MCF agent

Posted by ju...@francelabs.com.

Hi Karl,

Took me some time to reproduce, but I was able to dump the process after it happened again, and it appears that an OOM is the cause of the problem. After investigation, it seems that this OOM was triggered by a transformation connector I had developed. I increased the JVM heap size a little and the problem never happened again. For info, I had limited the number of connections of that connector to only 1, to be sure this was not a potential cause of the issue.
My question is: to make sure that the agent process crashes instead of staying up in a similar case (an OOM in my scenario), is there something that can be done at the connector level or at a more global level in MCF?

Regards,
Julien

-----Original Message-----
From: Karl Wright <da...@gmail.com>
Sent: Tuesday, March 2, 2021 19:17
To: dev <de...@manifoldcf.apache.org>
Subject: Re: Inactive MCF agent

The MCF Agents process shouldn't get hung up under normal operation.  If it encounters a problem that may call its continued activity into question, it shuts itself down.

There are two situations where the process could theoretically hang.

The first is when you are using file-based synch, and you forcibly kill another ManifoldCF process so that it doesn't clean up locks after itself.
But if you are using Zookeeper, it should not ever fail to clean up after a process is killed.

The second situation is when certain database conditions arise, and MCF decides it needs to reset all its worker threads.  When it does this, it blocks all worker threads from proceeding until it reaches a point where they are all quiescent, and then it resets all of them at the same time.
When it is waiting for all threads to shut down in this way, if that never completely happens, MCF will be paused forever.

What I'd like to do in that case is get a thread dump of the agents process.  That will tell us what the problem is.

Karl


Re: Inactive MCF agent

Posted by Karl Wright <da...@gmail.com>.

The MCF Agents process shouldn't get hung up under normal operation.  If it
encounters a problem that may call its continued activity into question, it
shuts itself down.

There are two situations where the process could theoretically hang.

The first is when you are using file-based synch, and you forcibly kill
another ManifoldCF process so that it doesn't clean up locks after itself.
But if you are using Zookeeper, it should not ever fail to clean up after a
process is killed.

The second situation is when certain database conditions arise, and MCF
decides it needs to reset all its worker threads.  When it does this, it
blocks all worker threads from proceeding until it reaches a point where
they are all quiescent, and then it resets all of them at the same time.
When it is waiting for all threads to shut down in this way, if that never
completely happens, MCF will be paused forever.
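
The second situation can be pictured with a simplified model. This is not ManifoldCF's actual implementation; the class and method names are hypothetical. Every worker must report that it is quiescent before a reset can proceed, and a worker that is lost before reporting leaves the waiter stuck:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of "wait until all workers are quiescent"; not MCF code. */
final class QuiescenceModel {

    /**
     * Starts `workers` threads, of which `lostWorkers` exit without reporting.
     * Returns true if every worker reported quiescent within the timeout.
     */
    static boolean allQuiescentWithin(int workers, int lostWorkers, long timeoutMillis)
            throws InterruptedException {
        CountDownLatch quiescent = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            final boolean lost = i < lostWorkers;
            Thread t = new Thread(() -> {
                if (lost) {
                    return; // simulates a worker that dies before checking in
                }
                quiescent.countDown(); // worker reached its safe point
            }, "model-worker-" + i);
            t.setDaemon(true);
            t.start();
        }
        // An untimed await() here would block forever if any worker is lost
        return quiescent.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

With no lost workers the call returns true almost immediately; with even one lost worker it can only return false after the timeout, which is the model's equivalent of being "paused forever" when no timeout is used.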

What I'd like to do in that case is get a thread dump of the agents
process.  That will tell us what the problem is.
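
For a running agents process, the usual route to a thread dump is a JDK tool such as `jstack <pid>` or `jcmd <pid> Thread.print`. As an illustration of what such a dump contains, here is a small sketch that produces one for the current JVM through the standard `java.lang.management` API; the class name is hypothetical:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Illustration only: builds a thread dump of the JVM this code runs in. */
final class ThreadDumpSketch {

    /** Returns the name, state and stack trace of every live thread as text. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // Passing false, false skips monitor and synchronizer details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

In a dump of a hung process, the thread states (for example many threads in `WAITING` on the same object) are typically the first thing to look at.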

Karl

On Tue, Mar 2, 2021 at 12:53 PM <ju...@francelabs.com> wrote:

> Hi Karl,
>
> I recently ran into a strange case where a job in the "running" state did
> nothing for several hours. The MCF agent process was up, but neither the
> Simple History nor the logs showed any activity. Since we could not wait
> more than 12 hours, we decided to restart the agent, and the job "went back
> on rails" and continued its work normally.
> To avoid such manual intervention as much as possible, I have two questions:
> - Is there a way to "test" the agent process? Something like a "process
> ping" that can detect whether the process is doing, or is ready to do,
> something? If not, is there an easy way to implement such a thing? The idea
> is to make the detection and restart automatic rather than manual.
> - Given that we have enabled the debug log level, would you have any
> recommendations on what to look at to find a potential cause of such an
> issue?
>
> Regards,
> Julien Massiera
>
>