You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@oozie.apache.org by Robert Kanter <rk...@cloudera.com> on 2014/02/22 00:59:38 UTC

Obscure problem with SchedulerService that causes tasks to not be rescheduled

Sorry for the long message, but its hard to explain the issue succinctly.

We recently saw a weird issue where Oozie suddenly stopped materializing
actions for a Coordinator job.  All we saw in the logs was this message
2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-]
GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor
[CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction,
rolling back

After a bunch of digging, I figured out what was happening:

The CoordMaterializeTriggerService is responsible for materializing Coord
actions via a Runnable that gets scheduled every 5min via the
SchedulerService.  The SchedulerService is backed by a
ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but the
important thing is that it executes CoordJobsToBeMaterializedJPAExecutor
via the JPAService.  In the JPAService#execute(...) method, this happens:
try {
    if (em.getTransaction().isActive()) {
        LOG.warn("JPAExecutor [{0}] ended with an active transaction,
rolling back", executor.getName());
        *em.getTransaction().rollback();*
    }
}
    catch (Exception ex) {
        LOG.warn("Could not check/rollback transaction after JPAExecutor
[{0}], {1}", executor.getName(), ex
            .getMessage(), ex);
}
I've determined that the bold line (the rollback) can throw an Error (as in
the Throwable, not a synonym for Exception).  For obvious reasons, we're
not catching this Error anywhere.  This causes the Thread to exit; however,
we don't "lose" any threads because the ScheduledThreadPoolExecutor will
simply start a new one.  However, the task that the thread was executed is
dropped from the schedule.  According to the documentation [1] it says "If
any execution of the task encounters an exception, subsequent executions
are suppressed."  I did some testing, and that's not quite true.
 Exceptions don't cause subsequent executions to be suppressed, but Errors
do.

I tried catching all Errors to figure out what was causing this and
reproduced the problem (or at least a similar problem by killing the MySQL
database) and got a stack trace like this (that I've shorted for brevity):
java.lang.ExceptionInInitializerError
        at
org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
        at
org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
        at
org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
        at
org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
        ...
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown
Source)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at
org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
        ... 28 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
Communications link failure
As we can see here [2], ConcreteClassGenerator.newInstance throws an
ExceptionInInitializerError when java.lang.reflect.Constructor.newInstance
throws any Exception; in this case, an InvocationTargetException, which it
throws if the underlying constructor in the reflection throws an Exception
[3].  And the underlying constructor is throwing a CommunicationsException
from the MySQL JDBC driver.

Long story short, in some cases scheduled tasks can throw an Error instead
of an Exception when there's a database-related problem.  And this causes
them to not be rescheduled.  It is also possible for this to happen to
other tasks being executed via the SchedulerService besides the
CoordMaterializeTriggerService, which would result in other missing
behaviors.

I also noticed that even when I caught the Error, it still didn't
reschedule the task -- not sure why...

Any thoughts or suggestions on the best way to recover from this situation?

Restarting Oozie works, but that's not ideal.
One idea I had was to add some kind of "watchdog" thread that would somehow
notice if a scheduled task hadn't executed in the proper interval + some
threshold and reschedule it.
My other, simper, idea is to try to manually reschedule a new identical
task in the Error catch block, which I'll try soon.



[1]
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
[2]
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
[3]
http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object..
.)

Re: Obscure problem with SchedulerService that causes tasks to not be rescheduled

Posted by Robert Kanter <rk...@cloudera.com>.

It looks like OOZIE-1699 fixes the problem!

thanks for pointing that out.


On Mon, Feb 24, 2014 at 10:58 AM, Robert Kanter <rk...@cloudera.com>wrote:

> Do you see that a single job materialization works fine when there is an
>> Exception but never proceeds when the runnable throws an Error?
>
> Yes; though it did seem funny that it currently works fine for Exceptions
> but not Errors -- this is because of the issue Shwetha described/fixed in
> OOZIE-1699?  I'll try checking if that patch fixes the issue we're seeing.
>
> just seen this and wanted to quickly point out something about java errors]
>> http://stackoverflow.com/questions/11017304/catching-java-errors
>> see accepted response
>
> Yup, I saw that.  I don't think we need to worry about all types of
> Errors, just the specific one (ExceptionInInitializerError) that OpenJPA is
> throwing when the MySQL driver is throwing an Exception.  In this case, we
> can recover from this Error by basically just ignoring it for now
> and "hoping" it works the next time the task is run.  See
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#149
>
>
>
> On Mon, Feb 24, 2014 at 10:49 AM, Alejandro Abdelnur <tu...@gmail.com>wrote:
>
>> [me off at the moment, just seen this and wanted to quickly point out
>> something about java errors]
>>
>> http://stackoverflow.com/questions/11017304/catching-java-errors
>>
>> see accepted response
>>
>> thx
>>
>> Alejandro
>> (phone typing)
>>
>> > On Feb 24, 2014, at 10:43, Virag Kothari <vi...@yahoo-inc.com> wrote:
>> >
>> > The coord materialization runnable always does the same task which is
>> > bringing coord jobs eligible for materialization on Oozie server.
>> > So, it shouldn't matter if the one of the task scheduled is lost due to
>> an
>> > error or exception. If only one coord job stops materializing and not
>> > others, I am also thinking that you might be hitting OOZIE-1699. Do you
>> > see that a single job materialization works fine when there is an
>> > Exception but never proceeds when the runnable throws an Error?
>> >
>> > Thanks,
>> > Virag
>> >
>> >> On 2/24/14 10:16 AM, "Robert Kanter" <rk...@cloudera.com> wrote:
>> >>
>> >> Shwetha, while that sounds similar to the issue I described, and may
>> have
>> >> a
>> >> similar solution, I don't think they're the same issue.  Your issue
>> looks
>> >> like it has to do with some logic Oozie is using in
>> CallableQueueService
>> >> where its not handling Exceptions properly in an edge case.  My issue
>> has
>> >> to do with how the Java's ThreadPool handles Runnables that throw an
>> >> Error.
>> >>
>> >>
>> >>
>> >>> On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <sh...@inmobi.com>
>> wrote:
>> >>>
>> >>> We had the same issue. We found that because of an edge case, the
>> >>> callable
>> >>> command(CoordMaterializeTransitionXCommand in this case which does the
>> >>> actual materialisation) is added to the unique callable(in
>> >>> CallableQueueService), but is never added to the queue. Hence the
>> >>> materialisation doesn't happen. Since
>> >>> CoordMaterializeTransitionXCommand is
>> >>> already in unique callable, further recovery
>> >>> from CoordMaterializeTriggerService doesn't enqueue any
>> >>> more CoordMaterializeTransitionXCommand. The issue is being fixed as
>> >>> part
>> >>> of https://issues.apache.org/jira/browse/OOZIE-1699. This issue can
>> >>> happen
>> >>> with any command(not just materialisation).
>> >>>
>> >>> To verify if this is indeed the issue,
>> >>> 1. CoordMaterializeTriggerService should pick up this coord every 5
>> >>> mins(materialisation interval)
>> >>> 2. CoordMaterializeTransitionXCommand never runs
>> >>> 3. oozie admin -queuedump lists that coord for mater in unique
>> callable
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <rk...@cloudera.com>
>> >>> wrote:
>> >>>
>> >>>> Sorry for the long message, but its hard to explain the issue
>> >>> succinctly.
>> >>>>
>> >>>> We recently saw a weird issue where Oozie suddenly stopped
>> >>> materializing
>> >>>> actions for a Coordinator job.  All we saw in the logs was this
>> >>> message
>> >>>> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService:
>> >>> USER[-]
>> >>>> GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor
>> >>>> [CoordJobsToBeMaterializedJPAExecutor] ended with an active
>> >>> transaction,
>> >>>> rolling back
>> >>>>
>> >>>> After a bunch of digging, I figured out what was happening:
>> >>>>
>> >>>> The CoordMaterializeTriggerService is responsible for materializing
>> >>> Coord
>> >>>> actions via a Runnable that gets scheduled every 5min via the
>> >>>> SchedulerService.  The SchedulerService is backed by a
>> >>>> ScheduledThreadPoolExecutor.  The Runnable does a bunch of things,
>> but
>> >>> the
>> >>>> important thing is that it executes
>> >>> CoordJobsToBeMaterializedJPAExecutor
>> >>>> via the JPAService.  In the JPAService#execute(...) method, this
>> >>> happens:
>> >>>> try {
>> >>>>    if (em.getTransaction().isActive()) {
>> >>>>        LOG.warn("JPAExecutor [{0}] ended with an active transaction,
>> >>>> rolling back", executor.getName());
>> >>>>        *em.getTransaction().rollback();*
>> >>>>    }
>> >>>> }
>> >>>>    catch (Exception ex) {
>> >>>>        LOG.warn("Could not check/rollback transaction after
>> >>> JPAExecutor
>> >>>> [{0}], {1}", executor.getName(), ex
>> >>>>            .getMessage(), ex);
>> >>>> }
>> >>>> I've determined that the bold line (the rollback) can throw an Error
>> >>> (as
>> >>> in
>> >>>> the Throwable, not a synonym for Exception).  For obvious reasons,
>> >>> we're
>> >>>> not catching this Error anywhere.  This causes the Thread to exit;
>> >>> however,
>> >>>> we don't "lose" any threads because the ScheduledThreadPoolExecutor
>> >>> will
>> >>>> simply start a new one.  However, the task that the thread was
>> >>> executed
>> >>> is
>> >>>> dropped from the schedule.  According to the documentation [1] it
>> says
>> >>> "If
>> >>>> any execution of the task encounters an exception, subsequent
>> >>> executions
>> >>>> are suppressed."  I did some testing, and that's not quite true.
>> >>>> Exceptions don't cause subsequent executions to be suppressed, but
>> >>> Errors
>> >>>> do.
>> >>>>
>> >>>> I tried catching all Errors to figure out what was causing this and
>> >>>> reproduced the problem (or at least a similar problem by killing the
>> >>> MySQL
>> >>>> database) and got a stack trace like this (that I've shorted for
>> >>> brevity):
>> >>>> java.lang.ExceptionInInitializerError
>> >>>>        at
>> >>>
>> >>>
>> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
>> >>> assGenerator.java:132)
>> >>>>        at
>> >>>
>> >>>
>> org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(Confi
>> >>> guringConnectionDecorator.java:110)
>> >>>>        at
>> >>>
>> >>>
>> org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataS
>> >>> ource.java:118)
>> >>>>        at
>> >>>
>> >>>
>> org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(Decorating
>> >>> DataSource.java:107)
>> >>>>        ...
>> >>>> Caused by: java.lang.reflect.InvocationTargetException
>> >>>>        at
>> >>> sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown
>> >>>> Source)
>> >>>>        at
>> >>>
>> >>>
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConst
>> >>> ructorAccessorImpl.java:27)
>> >>>>        at
>> >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> >>>>        at
>> >>>
>> >>>
>> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
>> >>> assGenerator.java:130)
>> >>>>        ... 28 more
>> >>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>> >>>> Communications link failure
>> >>>> As we can see here [2], ConcreteClassGenerator.newInstance throws an
>> >>>> ExceptionInInitializerError when
>> >>> java.lang.reflect.Constructor.newInstance
>> >>>> throws any Exception; in this case, an InvocationTargetException,
>> >>> which
>> >>> it
>> >>>> throws if the underlying constructor in the reflection throws an
>> >>> Exception
>> >>>> [3].  And the underlying constructor is throwing a
>> >>> CommunicationsException
>> >>>> from the MySQL JDBC driver.
>> >>>>
>> >>>> Long story short, in some cases scheduled tasks can throw an Error
>> >>> instead
>> >>>> of an Exception when there's a database-related problem.  And this
>> >>> causes
>> >>>> them to not be rescheduled.  It is also possible for this to happen
>> to
>> >>>> other tasks being executed via the SchedulerService besides the
>> >>>> CoordMaterializeTriggerService, which would result in other missing
>> >>>> behaviors.
>> >>>>
>> >>>> I also noticed that even when I caught the Error, it still didn't
>> >>>> reschedule the task -- not sure why...
>> >>>>
>> >>>> Any thoughts or suggestions on the best way to recover from this
>> >>> situation?
>> >>>>
>> >>>> Restarting Oozie works, but that's not ideal.
>> >>>> One idea I had was to add some kind of "watchdog" thread that would
>> >>> somehow
>> >>>> notice if a scheduled task hadn't executed in the proper interval +
>> >>> some
>> >>>> threshold and reschedule it.
>> >>>> My other, simper, idea is to try to manually reschedule a new
>> >>> identical
>> >>>> task in the Error catch block, which I'll try soon.
>> >>>>
>> >>>>
>> >>>>
>> >>>> [1]
>> >>>
>> >>>
>> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledTh
>> >>>
>> readPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%
>> >>> 20long,%20java.util.concurrent.TimeUnit)
>> >>>> [2]
>> >>>
>> >>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjp
>> >>>
>> a-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#
>> >>> 145
>> >>>> [3]
>> >>>
>> >>>
>> http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.ht
>> >>> ml#newInstance(java.lang.Object
>> >>>> ..
>> >>>> .)
>> >>>
>> >>> --
>> >>> _____________________________________________________________
>> >>> The information contained in this communication is intended solely for
>> >>> the
>> >>> use of the individual or entity to whom it is addressed and others
>> >>> authorized to receive it. It may contain confidential or legally
>> >>> privileged
>> >>> information. If you are not the intended recipient you are hereby
>> >>> notified
>> >>> that any disclosure, copying, distribution or taking any action in
>> >>> reliance
>> >>> on the contents of this information is strictly prohibited and may be
>> >>> unlawful. If you have received this communication in error, please
>> >>> notify
>> >>> us immediately by responding to this email and then delete it from
>> your
>> >>> system. The firm is neither liable for the proper and complete
>> >>> transmission
>> >>> of the information contained in this communication nor for any delay
>> in
>> >>> its
>> >>> receipt.
>> >
>>
>
>

Re: Obscure problem with SchedulerService that causes tasks to not be rescheduled

Posted by Robert Kanter <rk...@cloudera.com>.

>
> Do you see that a single job materialization works fine when there is an
> Exception but never proceeds when the runnable throws an Error?

Yes; though it did seem funny that it currently works fine for Exceptions
but not Errors -- this is because of the issue Shwetha described/fixed in
OOZIE-1699?  I'll try checking if that patch fixes the issue we're seeing.

just seen this and wanted to quickly point out something about java errors]
> http://stackoverflow.com/questions/11017304/catching-java-errors
> see accepted response

Yup, I saw that.  I don't think we need to worry about all types of Errors,
just the specific one (ExceptionInInitializerError) that OpenJPA is
throwing when the MySQL driver is throwing an Exception.  In this case, we
can recover from this Error by basically just ignoring it for now
and "hoping" it works the next time the task is run.  See
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#149



On Mon, Feb 24, 2014 at 10:49 AM, Alejandro Abdelnur <tu...@gmail.com>wrote:

> [me off at the moment, just seen this and wanted to quickly point out
> something about java errors]
>
> http://stackoverflow.com/questions/11017304/catching-java-errors
>
> see accepted response
>
> thx
>
> Alejandro
> (phone typing)
>
> > On Feb 24, 2014, at 10:43, Virag Kothari <vi...@yahoo-inc.com> wrote:
> >
> > The coord materialization runnable always does the same task which is
> > bringing coord jobs eligible for materialization on Oozie server.
> > So, it shouldn't matter if the one of the task scheduled is lost due to
> an
> > error or exception. If only one coord job stops materializing and not
> > others, I am also thinking that you might be hitting OOZIE-1699. Do you
> > see that a single job materialization works fine when there is an
> > Exception but never proceeds when the runnable throws an Error?
> >
> > Thanks,
> > Virag
> >
> >> On 2/24/14 10:16 AM, "Robert Kanter" <rk...@cloudera.com> wrote:
> >>
> >> Shwetha, while that sounds similar to the issue I described, and may
> have
> >> a
> >> similar solution, I don't think they're the same issue.  Your issue
> looks
> >> like it has to do with some logic Oozie is using in CallableQueueService
> >> where its not handling Exceptions properly in an edge case.  My issue
> has
> >> to do with how the Java's ThreadPool handles Runnables that throw an
> >> Error.
> >>
> >>
> >>
> >>> On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <sh...@inmobi.com>
> wrote:
> >>>
> >>> We had the same issue. We found that because of an edge case, the
> >>> callable
> >>> command(CoordMaterializeTransitionXCommand in this case which does the
> >>> actual materialisation) is added to the unique callable(in
> >>> CallableQueueService), but is never added to the queue. Hence the
> >>> materialisation doesn't happen. Since
> >>> CoordMaterializeTransitionXCommand is
> >>> already in unique callable, further recovery
> >>> from CoordMaterializeTriggerService doesn't enqueue any
> >>> more CoordMaterializeTransitionXCommand. The issue is being fixed as
> >>> part
> >>> of https://issues.apache.org/jira/browse/OOZIE-1699. This issue can
> >>> happen
> >>> with any command(not just materialisation).
> >>>
> >>> To verify if this is indeed the issue,
> >>> 1. CoordMaterializeTriggerService should pick up this coord every 5
> >>> mins(materialisation interval)
> >>> 2. CoordMaterializeTransitionXCommand never runs
> >>> 3. oozie admin -queuedump lists that coord for mater in unique callable
> >>>
> >>>
> >>>
> >>>
> >>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <rk...@cloudera.com>
> >>> wrote:
> >>>
> >>>> Sorry for the long message, but its hard to explain the issue
> >>> succinctly.
> >>>>
> >>>> We recently saw a weird issue where Oozie suddenly stopped
> >>> materializing
> >>>> actions for a Coordinator job.  All we saw in the logs was this
> >>> message
> >>>> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService:
> >>> USER[-]
> >>>> GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor
> >>>> [CoordJobsToBeMaterializedJPAExecutor] ended with an active
> >>> transaction,
> >>>> rolling back
> >>>>
> >>>> After a bunch of digging, I figured out what was happening:
> >>>>
> >>>> The CoordMaterializeTriggerService is responsible for materializing
> >>> Coord
> >>>> actions via a Runnable that gets scheduled every 5min via the
> >>>> SchedulerService.  The SchedulerService is backed by a
> >>>> ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but
> >>> the
> >>>> important thing is that it executes
> >>> CoordJobsToBeMaterializedJPAExecutor
> >>>> via the JPAService.  In the JPAService#execute(...) method, this
> >>> happens:
> >>>> try {
> >>>>    if (em.getTransaction().isActive()) {
> >>>>        LOG.warn("JPAExecutor [{0}] ended with an active transaction,
> >>>> rolling back", executor.getName());
> >>>>        *em.getTransaction().rollback();*
> >>>>    }
> >>>> }
> >>>>    catch (Exception ex) {
> >>>>        LOG.warn("Could not check/rollback transaction after
> >>> JPAExecutor
> >>>> [{0}], {1}", executor.getName(), ex
> >>>>            .getMessage(), ex);
> >>>> }
> >>>> I've determined that the bold line (the rollback) can throw an Error
> >>> (as
> >>> in
> >>>> the Throwable, not a synonym for Exception).  For obvious reasons,
> >>> we're
> >>>> not catching this Error anywhere.  This causes the Thread to exit;
> >>> however,
> >>>> we don't "lose" any threads because the ScheduledThreadPoolExecutor
> >>> will
> >>>> simply start a new one.  However, the task that the thread was
> >>> executed
> >>> is
> >>>> dropped from the schedule.  According to the documentation [1] it says
> >>> "If
> >>>> any execution of the task encounters an exception, subsequent
> >>> executions
> >>>> are suppressed."  I did some testing, and that's not quite true.
> >>>> Exceptions don't cause subsequent executions to be suppressed, but
> >>> Errors
> >>>> do.
> >>>>
> >>>> I tried catching all Errors to figure out what was causing this and
> >>>> reproduced the problem (or at least a similar problem by killing the
> >>> MySQL
> >>>> database) and got a stack trace like this (that I've shorted for
> >>> brevity):
> >>>> java.lang.ExceptionInInitializerError
> >>>>        at
> >>>
> >>>
> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
> >>> assGenerator.java:132)
> >>>>        at
> >>>
> >>>
> org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(Confi
> >>> guringConnectionDecorator.java:110)
> >>>>        at
> >>>
> >>>
> org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataS
> >>> ource.java:118)
> >>>>        at
> >>>
> >>>
> org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(Decorating
> >>> DataSource.java:107)
> >>>>        ...
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>        at
> >>> sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown
> >>>> Source)
> >>>>        at
> >>>
> >>>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConst
> >>> ructorAccessorImpl.java:27)
> >>>>        at
> >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>>>        at
> >>>
> >>>
> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
> >>> assGenerator.java:130)
> >>>>        ... 28 more
> >>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> >>>> Communications link failure
> >>>> As we can see here [2], ConcreteClassGenerator.newInstance throws an
> >>>> ExceptionInInitializerError when
> >>> java.lang.reflect.Constructor.newInstance
> >>>> throws any Exception; in this case, an InvocationTargetException,
> >>> which
> >>> it
> >>>> throws if the underlying constructor in the reflection throws an
> >>> Exception
> >>>> [3].  And the underlying constructor is throwing a
> >>> CommunicationsException
> >>>> from the MySQL JDBC driver.
> >>>>
> >>>> Long story short, in some cases scheduled tasks can throw an Error
> >>> instead
> >>>> of an Exception when there's a database-related problem.  And this
> >>> causes
> >>>> them to not be rescheduled.  It is also possible for this to happen to
> >>>> other tasks being executed via the SchedulerService besides the
> >>>> CoordMaterializeTriggerService, which would result in other missing
> >>>> behaviors.
> >>>>
> >>>> I also noticed that even when I caught the Error, it still didn't
> >>>> reschedule the task -- not sure why...
> >>>>
> >>>> Any thoughts or suggestions on the best way to recover from this
> >>> situation?
> >>>>
> >>>> Restarting Oozie works, but that's not ideal.
> >>>> One idea I had was to add some kind of "watchdog" thread that would
> >>> somehow
> >>>> notice if a scheduled task hadn't executed in the proper interval +
> >>> some
> >>>> threshold and reschedule it.
> >>>> My other, simper, idea is to try to manually reschedule a new
> >>> identical
> >>>> task in the Error catch block, which I'll try soon.
> >>>>
> >>>>
> >>>>
> >>>> [1]
> >>>
> >>>
> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledTh
> >>>
> readPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%
> >>> 20long,%20java.util.concurrent.TimeUnit)
> >>>> [2]
> >>>
> >>>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjp
> >>>
> a-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#
> >>> 145
> >>>> [3]
> >>>
> >>>
> http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.ht
> >>> ml#newInstance(java.lang.Object
> >>>> ..
> >>>> .)
> >>>
> >>> --
> >>> _____________________________________________________________
> >>> The information contained in this communication is intended solely for
> >>> the
> >>> use of the individual or entity to whom it is addressed and others
> >>> authorized to receive it. It may contain confidential or legally
> >>> privileged
> >>> information. If you are not the intended recipient you are hereby
> >>> notified
> >>> that any disclosure, copying, distribution or taking any action in
> >>> reliance
> >>> on the contents of this information is strictly prohibited and may be
> >>> unlawful. If you have received this communication in error, please
> >>> notify
> >>> us immediately by responding to this email and then delete it from your
> >>> system. The firm is neither liable for the proper and complete
> >>> transmission
> >>> of the information contained in this communication nor for any delay in
> >>> its
> >>> receipt.
> >
>

Re: Obscure problem with SchedulerService that causes tasks to not be rescheduled

Posted by Alejandro Abdelnur <tu...@gmail.com>.

[me off at the moment, just seen this and wanted to quickly point out something about java errors]

http://stackoverflow.com/questions/11017304/catching-java-errors

see accepted response

thx

Alejandro
(phone typing)

> On Feb 24, 2014, at 10:43, Virag Kothari <vi...@yahoo-inc.com> wrote:
> 
> The coord materialization runnable always does the same task which is
> bringing coord jobs eligible for materialization on Oozie server.
> So, it shouldn't matter if the one of the task scheduled is lost due to an
> error or exception. If only one coord job stops materializing and not
> others, I am also thinking that you might be hitting OOZIE-1699. Do you
> see that a single job materialization works fine when there is an
> Exception but never proceeds when the runnable throws an Error?
> 
> Thanks,
> Virag
> 
>> On 2/24/14 10:16 AM, "Robert Kanter" <rk...@cloudera.com> wrote:
>> 
>> Shwetha, while that sounds similar to the issue I described, and may have
>> a
>> similar solution, I don't think they're the same issue.  Your issue looks
>> like it has to do with some logic Oozie is using in CallableQueueService
>> where its not handling Exceptions properly in an edge case.  My issue has
>> to do with how the Java's ThreadPool handles Runnables that throw an
>> Error.
>> 
>> 
>> 
>>> On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <sh...@inmobi.com> wrote:
>>> 
>>> We had the same issue. We found that because of an edge case, the
>>> callable
>>> command(CoordMaterializeTransitionXCommand in this case which does the
>>> actual materialisation) is added to the unique callable(in
>>> CallableQueueService), but is never added to the queue. Hence the
>>> materialisation doesn't happen. Since
>>> CoordMaterializeTransitionXCommand is
>>> already in unique callable, further recovery
>>> from CoordMaterializeTriggerService doesn't enqueue any
>>> more CoordMaterializeTransitionXCommand. The issue is being fixed as
>>> part
>>> of https://issues.apache.org/jira/browse/OOZIE-1699. This issue can
>>> happen
>>> with any command(not just materialisation).
>>> 
>>> To verify if this is indeed the issue,
>>> 1. CoordMaterializeTriggerService should pick up this coord every 5
>>> mins(materialisation interval)
>>> 2. CoordMaterializeTransitionXCommand never runs
>>> 3. oozie admin -queuedump lists that coord for mater in unique callable
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <rk...@cloudera.com>
>>> wrote:
>>> 
>>>> Sorry for the long message, but its hard to explain the issue
>>> succinctly.
>>>> 
>>>> We recently saw a weird issue where Oozie suddenly stopped
>>> materializing
>>>> actions for a Coordinator job.  All we saw in the logs was this
>>> message
>>>> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService:
>>> USER[-]
>>>> GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor
>>>> [CoordJobsToBeMaterializedJPAExecutor] ended with an active
>>> transaction,
>>>> rolling back
>>>> 
>>>> After a bunch of digging, I figured out what was happening:
>>>> 
>>>> The CoordMaterializeTriggerService is responsible for materializing
>>> Coord
>>>> actions via a Runnable that gets scheduled every 5min via the
>>>> SchedulerService.  The SchedulerService is backed by a
>>>> ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but
>>> the
>>>> important thing is that it executes
>>> CoordJobsToBeMaterializedJPAExecutor
>>>> via the JPAService.  In the JPAService#execute(...) method, this
>>> happens:
>>>> try {
>>>>    if (em.getTransaction().isActive()) {
>>>>        LOG.warn("JPAExecutor [{0}] ended with an active transaction,
>>>> rolling back", executor.getName());
>>>>        *em.getTransaction().rollback();*
>>>>    }
>>>> }
>>>>    catch (Exception ex) {
>>>>        LOG.warn("Could not check/rollback transaction after
>>> JPAExecutor
>>>> [{0}], {1}", executor.getName(), ex
>>>>            .getMessage(), ex);
>>>> }
>>>> I've determined that the bold line (the rollback) can throw an Error
>>> (as
>>> in
>>>> the Throwable, not a synonym for Exception).  For obvious reasons,
>>> we're
>>>> not catching this Error anywhere.  This causes the Thread to exit;
>>> however,
>>>> we don't "lose" any threads because the ScheduledThreadPoolExecutor
>>> will
>>>> simply start a new one.  However, the task that the thread was
>>> executed
>>> is
>>>> dropped from the schedule.  According to the documentation [1] it says
>>> "If
>>>> any execution of the task encounters an exception, subsequent
>>> executions
>>>> are suppressed."  I did some testing, and that's not quite true.
>>>> Exceptions don't cause subsequent executions to be suppressed, but
>>> Errors
>>>> do.
>>>> 
>>>> I tried catching all Errors to figure out what was causing this and
>>>> reproduced the problem (or at least a similar problem by killing the
>>> MySQL
>>>> database) and got a stack trace like this (that I've shorted for
>>> brevity):
>>>> java.lang.ExceptionInInitializerError
>>>>        at
>>> 
>>> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
>>> assGenerator.java:132)
>>>>        at
>>> 
>>> org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(Confi
>>> guringConnectionDecorator.java:110)
>>>>        at
>>> 
>>> org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataS
>>> ource.java:118)
>>>>        at
>>> 
>>> org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(Decorating
>>> DataSource.java:107)
>>>>        ...
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>        at
>>> sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown
>>>> Source)
>>>>        at
>>> 
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConst
>>> ructorAccessorImpl.java:27)
>>>>        at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>        at
>>> 
>>> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
>>> assGenerator.java:130)
>>>>        ... 28 more
>>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>>> Communications link failure
>>>> As we can see here [2], ConcreteClassGenerator.newInstance throws an
>>>> ExceptionInInitializerError when
>>> java.lang.reflect.Constructor.newInstance
>>>> throws any Exception; in this case, an InvocationTargetException,
>>> which
>>> it
>>>> throws if the underlying constructor in the reflection throws an
>>> Exception
>>>> [3].  And the underlying constructor is throwing a
>>> CommunicationsException
>>>> from the MySQL JDBC driver.
>>>> 
>>>> Long story short, in some cases scheduled tasks can throw an Error
>>> instead
>>>> of an Exception when there's a database-related problem.  And this
>>> causes
>>>> them to not be rescheduled.  It is also possible for this to happen to
>>>> other tasks being executed via the SchedulerService besides the
>>>> CoordMaterializeTriggerService, which would result in other missing
>>>> behaviors.
>>>> 
>>>> I also noticed that even when I caught the Error, it still didn't
>>>> reschedule the task -- not sure why...
>>>> 
>>>> Any thoughts or suggestions on the best way to recover from this
>>> situation?
>>>> 
>>>> Restarting Oozie works, but that's not ideal.
>>>> One idea I had was to add some kind of "watchdog" thread that would
>>> somehow
>>>> notice if a scheduled task hadn't executed in the proper interval +
>>> some
>>>> threshold and reschedule it.
>>>> My other, simper, idea is to try to manually reschedule a new
>>> identical
>>>> task in the Error catch block, which I'll try soon.
>>>> 
>>>> 
>>>> 
>>>> [1]
>>> 
>>> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledTh
>>> readPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%
>>> 20long,%20java.util.concurrent.TimeUnit)
>>>> [2]
>>> 
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjp
>>> a-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#
>>> 145
>>>> [3]
>>> 
>>> http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.ht
>>> ml#newInstance(java.lang.Object
>>>> ..
>>>> .)
>>> 
>>> --
>>> _____________________________________________________________
>>> The information contained in this communication is intended solely for
>>> the
>>> use of the individual or entity to whom it is addressed and others
>>> authorized to receive it. It may contain confidential or legally
>>> privileged
>>> information. If you are not the intended recipient you are hereby
>>> notified
>>> that any disclosure, copying, distribution or taking any action in
>>> reliance
>>> on the contents of this information is strictly prohibited and may be
>>> unlawful. If you have received this communication in error, please
>>> notify
>>> us immediately by responding to this email and then delete it from your
>>> system. The firm is neither liable for the proper and complete
>>> transmission
>>> of the information contained in this communication nor for any delay in
>>> its
>>> receipt.
>

Re: Obscure problem with SchedulerService that causes tasks to not be rescheduled

Posted by Virag Kothari <vi...@yahoo-inc.com>.

The coord materialization runnable always does the same task which is
bringing coord jobs eligible for materialization on Oozie server.
So, it shouldn't matter if the one of the task scheduled is lost due to an
error or exception. If only one coord job stops materializing and not
others, I am also thinking that you might be hitting OOZIE-1699. Do you
see that a single job materialization works fine when there is an
Exception but never proceeds when the runnable throws an Error?

Thanks,
Virag

On 2/24/14 10:16 AM, "Robert Kanter" <rk...@cloudera.com> wrote:

>Shwetha, while that sounds similar to the issue I described, and may have
>a
>similar solution, I don't think they're the same issue.  Your issue looks
>like it has to do with some logic Oozie is using in CallableQueueService
>where its not handling Exceptions properly in an edge case.  My issue has
>to do with how the Java's ThreadPool handles Runnables that throw an
>Error.
>
>
>
>On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <sh...@inmobi.com> wrote:
>
>> We had the same issue. We found that because of an edge case, the
>>callable
>> command(CoordMaterializeTransitionXCommand in this case which does the
>> actual materialisation) is added to the unique callable(in
>> CallableQueueService), but is never added to the queue. Hence the
>> materialisation doesn't happen. Since
>>CoordMaterializeTransitionXCommand is
>> already in unique callable, further recovery
>> from CoordMaterializeTriggerService doesn't enqueue any
>> more CoordMaterializeTransitionXCommand. The issue is being fixed as
>>part
>> of https://issues.apache.org/jira/browse/OOZIE-1699. This issue can
>>happen
>> with any command(not just materialisation).
>>
>> To verify if this is indeed the issue,
>> 1. CoordMaterializeTriggerService should pick up this coord every 5
>> mins(materialisation interval)
>> 2. CoordMaterializeTransitionXCommand never runs
>> 3. oozie admin -queuedump lists that coord for mater in unique callable
>>
>>
>>
>>
>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <rk...@cloudera.com>
>> wrote:
>>
>> > Sorry for the long message, but its hard to explain the issue
>>succinctly.
>> >
>> > We recently saw a weird issue where Oozie suddenly stopped
>>materializing
>> > actions for a Coordinator job.  All we saw in the logs was this
>>message
>> > 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService:
>>USER[-]
>> > GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor
>> > [CoordJobsToBeMaterializedJPAExecutor] ended with an active
>>transaction,
>> > rolling back
>> >
>> > After a bunch of digging, I figured out what was happening:
>> >
>> > The CoordMaterializeTriggerService is responsible for materializing
>>Coord
>> > actions via a Runnable that gets scheduled every 5min via the
>> > SchedulerService.  The SchedulerService is backed by a
>> > ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but
>> the
>> > important thing is that it executes
>>CoordJobsToBeMaterializedJPAExecutor
>> > via the JPAService.  In the JPAService#execute(...) method, this
>>happens:
>> > try {
>> >     if (em.getTransaction().isActive()) {
>> >         LOG.warn("JPAExecutor [{0}] ended with an active transaction,
>> > rolling back", executor.getName());
>> >         *em.getTransaction().rollback();*
>> >     }
>> > }
>> >     catch (Exception ex) {
>> >         LOG.warn("Could not check/rollback transaction after
>>JPAExecutor
>> > [{0}], {1}", executor.getName(), ex
>> >             .getMessage(), ex);
>> > }
>> > I've determined that the bold line (the rollback) can throw an Error
>>(as
>> in
>> > the Throwable, not a synonym for Exception).  For obvious reasons,
>>we're
>> > not catching this Error anywhere.  This causes the Thread to exit;
>> however,
>> > we don't "lose" any threads because the ScheduledThreadPoolExecutor
>>will
>> > simply start a new one.  However, the task that the thread was
>>executed
>> is
>> > dropped from the schedule.  According to the documentation [1] it says
>> "If
>> > any execution of the task encounters an exception, subsequent
>>executions
>> > are suppressed."  I did some testing, and that's not quite true.
>> >  Exceptions don't cause subsequent executions to be suppressed, but
>> Errors
>> > do.
>> >
>> > I tried catching all Errors to figure out what was causing this and
>> > reproduced the problem (or at least a similar problem by killing the
>> MySQL
>> > database) and got a stack trace like this (that I've shorted for
>> brevity):
>> > java.lang.ExceptionInInitializerError
>> >         at
>> >
>> >
>> 
>>org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
>>assGenerator.java:132)
>> >         at
>> >
>> >
>> 
>>org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(Confi
>>guringConnectionDecorator.java:110)
>> >         at
>> >
>> >
>> 
>>org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataS
>>ource.java:118)
>> >         at
>> >
>> >
>> 
>>org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(Decorating
>>DataSource.java:107)
>> >         ...
>> > Caused by: java.lang.reflect.InvocationTargetException
>> >         at 
>>sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown
>> > Source)
>> >         at
>> >
>> >
>> 
>>sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConst
>>ructorAccessorImpl.java:27)
>> >         at
>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> >         at
>> >
>> >
>> 
>>org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteCl
>>assGenerator.java:130)
>> >         ... 28 more
>> > Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>> > Communications link failure
>> > As we can see here [2], ConcreteClassGenerator.newInstance throws an
>> > ExceptionInInitializerError when
>> java.lang.reflect.Constructor.newInstance
>> > throws any Exception; in this case, an InvocationTargetException,
>>which
>> it
>> > throws if the underlying constructor in the reflection throws an
>> Exception
>> > [3].  And the underlying constructor is throwing a
>> CommunicationsException
>> > from the MySQL JDBC driver.
>> >
>> > Long story short, in some cases scheduled tasks can throw an Error
>> instead
>> > of an Exception when there's a database-related problem.  And this
>>causes
>> > them to not be rescheduled.  It is also possible for this to happen to
>> > other tasks being executed via the SchedulerService besides the
>> > CoordMaterializeTriggerService, which would result in other missing
>> > behaviors.
>> >
>> > I also noticed that even when I caught the Error, it still didn't
>> > reschedule the task -- not sure why...
>> >
>> > Any thoughts or suggestions on the best way to recover from this
>> situation?
>> >
>> > Restarting Oozie works, but that's not ideal.
>> > One idea I had was to add some kind of "watchdog" thread that would
>> somehow
>> > notice if a scheduled task hadn't executed in the proper interval +
>>some
>> > threshold and reschedule it.
>> > My other, simper, idea is to try to manually reschedule a new
>>identical
>> > task in the Error catch block, which I'll try soon.
>> >
>> >
>> >
>> > [1]
>> >
>> >
>> 
>>http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledTh
>>readPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%
>>20long,%20java.util.concurrent.TimeUnit)
>> > [2]
>> >
>> >
>> 
>>http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjp
>>a-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#
>>145
>> > [3]
>> >
>> >
>> 
>>http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.ht
>>ml#newInstance(java.lang.Object
>> > ..
>> > .)
>> >
>>
>> --
>> _____________________________________________________________
>> The information contained in this communication is intended solely for
>>the
>> use of the individual or entity to whom it is addressed and others
>> authorized to receive it. It may contain confidential or legally
>>privileged
>> information. If you are not the intended recipient you are hereby
>>notified
>> that any disclosure, copying, distribution or taking any action in
>>reliance
>> on the contents of this information is strictly prohibited and may be
>> unlawful. If you have received this communication in error, please
>>notify
>> us immediately by responding to this email and then delete it from your
>> system. The firm is neither liable for the proper and complete
>>transmission
>> of the information contained in this communication nor for any delay in
>>its
>> receipt.
>>

Re: Obscure problem with SchedulerService that causes tasks to not be rescheduled

Posted by Robert Kanter <rk...@cloudera.com>.

Shwetha, while that sounds similar to the issue I described, and may have a
similar solution, I don't think they're the same issue.  Your issue looks
like it has to do with some logic Oozie is using in CallableQueueService
where its not handling Exceptions properly in an edge case.  My issue has
to do with how the Java's ThreadPool handles Runnables that throw an Error.



On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <sh...@inmobi.com> wrote:

> We had the same issue. We found that because of an edge case, the callable
> command(CoordMaterializeTransitionXCommand in this case which does the
> actual materialisation) is added to the unique callable(in
> CallableQueueService), but is never added to the queue. Hence the
> materialisation doesn't happen. Since CoordMaterializeTransitionXCommand is
> already in unique callable, further recovery
> from CoordMaterializeTriggerService doesn't enqueue any
> more CoordMaterializeTransitionXCommand. The issue is being fixed as part
> of https://issues.apache.org/jira/browse/OOZIE-1699. This issue can happen
> with any command(not just materialisation).
>
> To verify if this is indeed the issue,
> 1. CoordMaterializeTriggerService should pick up this coord every 5
> mins(materialisation interval)
> 2. CoordMaterializeTransitionXCommand never runs
> 3. oozie admin -queuedump lists that coord for mater in unique callable
>
>
>
>
> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <rk...@cloudera.com>
> wrote:
>
> > Sorry for the long message, but its hard to explain the issue succinctly.
> >
> > We recently saw a weird issue where Oozie suddenly stopped materializing
> > actions for a Coordinator job.  All we saw in the logs was this message
> > 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-]
> > GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor
> > [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction,
> > rolling back
> >
> > After a bunch of digging, I figured out what was happening:
> >
> > The CoordMaterializeTriggerService is responsible for materializing Coord
> > actions via a Runnable that gets scheduled every 5min via the
> > SchedulerService.  The SchedulerService is backed by a
> > ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but
> the
> > important thing is that it executes CoordJobsToBeMaterializedJPAExecutor
> > via the JPAService.  In the JPAService#execute(...) method, this happens:
> > try {
> >     if (em.getTransaction().isActive()) {
> >         LOG.warn("JPAExecutor [{0}] ended with an active transaction,
> > rolling back", executor.getName());
> >         *em.getTransaction().rollback();*
> >     }
> > }
> >     catch (Exception ex) {
> >         LOG.warn("Could not check/rollback transaction after JPAExecutor
> > [{0}], {1}", executor.getName(), ex
> >             .getMessage(), ex);
> > }
> > I've determined that the bold line (the rollback) can throw an Error (as
> in
> > the Throwable, not a synonym for Exception).  For obvious reasons, we're
> > not catching this Error anywhere.  This causes the Thread to exit;
> however,
> > we don't "lose" any threads because the ScheduledThreadPoolExecutor will
> > simply start a new one.  However, the task that the thread was executed
> is
> > dropped from the schedule.  According to the documentation [1] it says
> "If
> > any execution of the task encounters an exception, subsequent executions
> > are suppressed."  I did some testing, and that's not quite true.
> >  Exceptions don't cause subsequent executions to be suppressed, but
> Errors
> > do.
> >
> > I tried catching all Errors to figure out what was causing this and
> > reproduced the problem (or at least a similar problem by killing the
> MySQL
> > database) and got a stack trace like this (that I've shorted for
> brevity):
> > java.lang.ExceptionInInitializerError
> >         at
> >
> >
> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
> >         at
> >
> >
> org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
> >         at
> >
> >
> org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
> >         at
> >
> >
> org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
> >         ...
> > Caused by: java.lang.reflect.InvocationTargetException
> >         at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown
> > Source)
> >         at
> >
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >         at
> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >         at
> >
> >
> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
> >         ... 28 more
> > Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> > Communications link failure
> > As we can see here [2], ConcreteClassGenerator.newInstance throws an
> > ExceptionInInitializerError when
> java.lang.reflect.Constructor.newInstance
> > throws any Exception; in this case, an InvocationTargetException, which
> it
> > throws if the underlying constructor in the reflection throws an
> Exception
> > [3].  And the underlying constructor is throwing a
> CommunicationsException
> > from the MySQL JDBC driver.
> >
> > Long story short, in some cases scheduled tasks can throw an Error
> instead
> > of an Exception when there's a database-related problem.  And this causes
> > them to not be rescheduled.  It is also possible for this to happen to
> > other tasks being executed via the SchedulerService besides the
> > CoordMaterializeTriggerService, which would result in other missing
> > behaviors.
> >
> > I also noticed that even when I caught the Error, it still didn't
> > reschedule the task -- not sure why...
> >
> > Any thoughts or suggestions on the best way to recover from this
> situation?
> >
> > Restarting Oozie works, but that's not ideal.
> > One idea I had was to add some kind of "watchdog" thread that would
> somehow
> > notice if a scheduled task hadn't executed in the proper interval + some
> > threshold and reschedule it.
> > My other, simper, idea is to try to manually reschedule a new identical
> > task in the Error catch block, which I'll try soon.
> >
> >
> >
> > [1]
> >
> >
> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
> > [2]
> >
> >
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
> > [3]
> >
> >
> http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object
> > ..
> > .)
> >
>
> --
> _____________________________________________________________
> The information contained in this communication is intended solely for the
> use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify
> us immediately by responding to this email and then delete it from your
> system. The firm is neither liable for the proper and complete transmission
> of the information contained in this communication nor for any delay in its
> receipt.
>

Re: Obscure problem with SchedulerService that causes tasks to not be rescheduled

Posted by Shwetha GS <sh...@inmobi.com>.

We had the same issue. We found that because of an edge case, the callable
command(CoordMaterializeTransitionXCommand in this case which does the
actual materialisation) is added to the unique callable(in
CallableQueueService), but is never added to the queue. Hence the
materialisation doesn't happen. Since CoordMaterializeTransitionXCommand is
already in unique callable, further recovery
from CoordMaterializeTriggerService doesn't enqueue any
more CoordMaterializeTransitionXCommand. The issue is being fixed as part
of https://issues.apache.org/jira/browse/OOZIE-1699. This issue can happen
with any command(not just materialisation).

To verify if this is indeed the issue,
1. CoordMaterializeTriggerService should pick up this coord every 5
mins(materialisation interval)
2. CoordMaterializeTransitionXCommand never runs
3. oozie admin -queuedump lists that coord for mater in unique callable




On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <rk...@cloudera.com> wrote:

> Sorry for the long message, but its hard to explain the issue succinctly.
>
> We recently saw a weird issue where Oozie suddenly stopped materializing
> actions for a Coordinator job.  All we saw in the logs was this message
> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-]
> GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor
> [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction,
> rolling back
>
> After a bunch of digging, I figured out what was happening:
>
> The CoordMaterializeTriggerService is responsible for materializing Coord
> actions via a Runnable that gets scheduled every 5min via the
> SchedulerService.  The SchedulerService is backed by a
> ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but the
> important thing is that it executes CoordJobsToBeMaterializedJPAExecutor
> via the JPAService.  In the JPAService#execute(...) method, this happens:
> try {
>     if (em.getTransaction().isActive()) {
>         LOG.warn("JPAExecutor [{0}] ended with an active transaction,
> rolling back", executor.getName());
>         *em.getTransaction().rollback();*
>     }
> }
>     catch (Exception ex) {
>         LOG.warn("Could not check/rollback transaction after JPAExecutor
> [{0}], {1}", executor.getName(), ex
>             .getMessage(), ex);
> }
> I've determined that the bold line (the rollback) can throw an Error (as in
> the Throwable, not a synonym for Exception).  For obvious reasons, we're
> not catching this Error anywhere.  This causes the Thread to exit; however,
> we don't "lose" any threads because the ScheduledThreadPoolExecutor will
> simply start a new one.  However, the task that the thread was executed is
> dropped from the schedule.  According to the documentation [1] it says "If
> any execution of the task encounters an exception, subsequent executions
> are suppressed."  I did some testing, and that's not quite true.
>  Exceptions don't cause subsequent executions to be suppressed, but Errors
> do.
>
> I tried catching all Errors to figure out what was causing this and
> reproduced the problem (or at least a similar problem by killing the MySQL
> database) and got a stack trace like this (that I've shorted for brevity):
> java.lang.ExceptionInInitializerError
>         at
>
> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
>         at
>
> org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
>         at
>
> org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
>         at
>
> org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
>         ...
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown
> Source)
>         at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at
>
> org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
>         ... 28 more
> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
> As we can see here [2], ConcreteClassGenerator.newInstance throws an
> ExceptionInInitializerError when java.lang.reflect.Constructor.newInstance
> throws any Exception; in this case, an InvocationTargetException, which it
> throws if the underlying constructor in the reflection throws an Exception
> [3].  And the underlying constructor is throwing a CommunicationsException
> from the MySQL JDBC driver.
>
> Long story short, in some cases scheduled tasks can throw an Error instead
> of an Exception when there's a database-related problem.  And this causes
> them to not be rescheduled.  It is also possible for this to happen to
> other tasks being executed via the SchedulerService besides the
> CoordMaterializeTriggerService, which would result in other missing
> behaviors.
>
> I also noticed that even when I caught the Error, it still didn't
> reschedule the task -- not sure why...
>
> Any thoughts or suggestions on the best way to recover from this situation?
>
> Restarting Oozie works, but that's not ideal.
> One idea I had was to add some kind of "watchdog" thread that would somehow
> notice if a scheduled task hadn't executed in the proper interval + some
> threshold and reschedule it.
> My other, simper, idea is to try to manually reschedule a new identical
> task in the Error catch block, which I'll try soon.
>
>
>
> [1]
>
> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
> [2]
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
> [3]
>
> http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object
> ..
> .)
>

-- 
_____________________________________________________________
The information contained in this communication is intended solely for the 
use of the individual or entity to whom it is addressed and others 
authorized to receive it. It may contain confidential or legally privileged 
information. If you are not the intended recipient you are hereby notified 
that any disclosure, copying, distribution or taking any action in reliance 
on the contents of this information is strictly prohibited and may be 
unlawful. If you have received this communication in error, please notify 
us immediately by responding to this email and then delete it from your 
system. The firm is neither liable for the proper and complete transmission 
of the information contained in this communication nor for any delay in its 
receipt.