You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@manifoldcf.apache.org by Anthony Leonard <an...@york.ac.uk> on 2013/01/21 13:57:49 UTC

Job hanging on "Starting up" with never ending external query.

Hi there,

We have recently started running a nightly job 2AM in ManifoldCF to extract
data from an Oracle repository and populate a Solr index. Most nights this
works fine, but occasionally the job has been hanging at the "Starting up"
phase. We have observed this on our test setup also occasionally. A restart
of ManifoldCF usually solves this.

Using the simple history reports today I looked up all records and sorted
them by the "Time" column, largest first, and found the following:

"Start Time","Activity","Identifier","Result Code","Bytes","Time","Result
Description"
"11-12-2012 05:00:05.941","external query" "... SQL QUERY
...","ERROR","0","1926607529","Interrupted: null"
"01-21-2013 02:00:11.843","external query" "... SQL QUERY
...","ERROR","0","31644956","Interrupted: null"
"01-17-2013 02:00:03.600","external query" "... SQL QUERY
...","ERROR","0","31637594","Interrupted: null"
"12-04-2012 12:12:19.860","external query" "... SQL QUERY
...","OK","0","17511",""
... etc ...

If the Time column is in millis that means the first query was hanging for
22 days! (This was in the period before we went live when our live server
was sitting idle for a while.) The other two occasions it was hanging for
about 8 hours until we arrived to restart the job in the morning. I have
confirmed that the Oracle database we are connecting to was available
throughout these periods. These times are also too long for any network or
database timeouts, which makes me suspect that it's a problem with the
application.

We have the following logging config in properties.xml

  <property name="org.apache.manifoldcf.jobs" value="ALL"/>
  <property name="org.apache.manifoldcf.connectors" value="ALL"/>
  <property name="org.apache.manifoldcf.agents" value="ALL"/>
  <property name="org.apache.manifoldcf.misc" value="ALL"/>

The job failed again last night and when I checked at 10:40 AM this morning
the last few lines of manifoldcf.log were:

DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
1352455005553 needs to be started; it was last checked at 1358733575454,
and now it is 1358733585635
DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
within interval 1358733575454 to 1358733585635
DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
1352455005553 needs to be started; it was last checked at 1358733585636,
and now it is 1358733595662
DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
within interval 1358733585636 to 1358733595662
DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
1352455005553 needs to be started; it was last checked at 1358733595663,
and now it is 1358733605813
DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND within
interval 1358733595663 to 1358733605813
DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553' is
within run window at 1358733605813 ms. (which starts at 1358733600000 ms.)
DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job start
for job 1352455005553
DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job 1352455005553
for startup
DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect string is
'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'

After that - nothing. Once I restarted manifoldcf this morning the job
magically restarted itself and the following log messages were added where
it had left off:

DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job 1352455005553
back to 'ReadyForStartup' state
DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job 1352455005553
for startup
DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string is
'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is now
started
... etc ...

So it appears that the application is running fine with a Job start thread
logging nicely every ten seconds until it tries to start this job and hangs
entirely until the system is shutdown - even though it performed that same
tasks perfectly the night before and later the same day.

Can anyone advise on what might be happening here? We are running
ManifoldCF version 1.0.1 on Ubuntu 10.04.

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010

Re: Job hanging on "Starting up" with never ending external query.

Posted by Karl Wright <da...@gmail.com>.
You can download the current release candidate for 1.1 (RC6) from
http://people.apache.org/~kwright/apache-manifoldcf-1.1 .

Karl


On Fri, Jan 25, 2013 at 12:02 PM, Anthony Leonard
<an...@york.ac.uk> wrote:
> Hi Karl,
>
> Thank you so much for this. Sorry for the lack of response as we've been
> working on other things.
>
> One question - would we have to build ManifoldCF ourselves to get the new
> code you've checked in or would it already be part of a binary distribution
> somewhere?
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Tue, Jan 22, 2013 at 6:51 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> I've checked in code in both trunk and the release branch for this issue.
>>
>> It would be good if you could try this again in your environment.  The
>> fix simply prevents some kinds of exceptions from causing a handle
>> leak.  Please try this with only 1 JDBC Connection connection handle
>> per JVM and let me know if you see any hangs.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright <da...@gmail.com> wrote:
>> > I've looked into the code in some detail.  There is indeed a place
>> > where it is possible for a JDBC connection handle to be leaked, I
>> > believe.  However, it's not clear whether this is the circumstance you
>> > are encountering or not, since it does in involve an exception getting
>> > thrown doing something not terribly likely to cause exceptions.
>> >
>> > I've opened a ticket - CONNECTORS-620.
>> >
>> > Karl
>> >
>> > On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright <da...@gmail.com>
>> > wrote:
>> >> Hmm.
>> >>
>> >> The following threads are of interest here:
>> >>
>> >> Thread 29975: (state = BLOCKED)
>> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> >> be imprecise)
>> >>  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
>> >>  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
>> >> java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
>> >> (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
>> >> java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
>> >> org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
>> >> long, int) @bci=106, line=246 (Interpreted frame)
>> >>  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
>> >> line=179 (Interpreted frame)
>> >>
>> >> ... which is probably waiting in this one:
>> >>
>> >> Thread 24457: (state = BLOCKED)
>> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> >> be imprecise)
>> >>  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
>> >>  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
>> >> @bci=80, line=80 (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
>> >> java.lang.String, java.lang.String, java.lang.String,
>> >> java.lang.String) @bci=433, line=128 (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
>> >> @bci=36, line=1212 (Interpreted frame)
>> >>
>> >> ... which is waiting to obtain a JDBC connection, and the reason it
>> >> can't is because it thinks that the only available JDBC connection is
>> >> currently in use.
>> >>
>> >> Since you have only a single connection around, and nothing else is
>> >> active, it stands to reason that a JDBC connection handle has somehow
>> >> been leaked, which is a challenge since connections are typically
>> >> freed in a try/finally block through ManifoldCF.
>> >>
>> >> I notice that your stack frames are pretty unusual - what JDK is this
>> >> that you are using?
>> >>
>> >> Karl
>> >>
>> >>
>> >> Since
>> >> On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
>> >> <an...@york.ac.uk> wrote:
>> >>> Dear Karl,
>> >>>
>> >>> Our DBA noticed that each time our job was run 10 Oracle connections
>> >>> were
>> >>> created. So, we dropped the "Max connections" parameter on the
>> >>> repository
>> >>> connection config to 1 and re-ran the job with the DBA watching. The
>> >>> job
>> >>> worked fine but the DBA reported that 1 connection was created and
>> >>> then 10
>> >>> more briefly ...
>> >>>
>> >>> Out of curiosity we re-ran the job again with no further changes and
>> >>> this
>> >>> time got the following results:
>> >>>
>> >>> * the job hung in the "Starting Up" phase again, with the same logging
>> >>> and
>> >>> symptoms as detailed before on this thread.
>> >>> * the DBA reported seeing no connections at all this time.
>> >>> * I have attached a thread dump created by "jstack -F <pid>". This is
>> >>> reporting all threads as blocked.
>> >>>
>> >>> Any ideas? Any help with this would certainly be very gratefully
>> >>> received.
>> >>>
>> >>> Best wishes,
>> >>> Anthony.
>> >>>
>> >>> --
>> >>> Dr Anthony Leonard
>> >>> System Integrator, Information Directorate
>> >>> University of York, Heslington, York, UK, YO10 5DD
>> >>> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >>> Times Higher Education University of the Year 2010
>> >>>
>> >>>
>> >>> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <da...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> kill -QUIT should not abort the agents process, just cause a thread
>> >>>> dump.  kill -9 is a different story.
>> >>>>
>> >>>> You can also do the same thing by using jstack, in the jvm bin
>> >>>> directory.
>> >>>>
>> >>>> Karl
>> >>>>
>> >>>>
>> >>>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>> >>>> <an...@york.ac.uk> wrote:
>> >>>> > Dear Karl,
>> >>>> >
>> >>>> > Many thanks for your insights. I'll do a kill -QUIT next time we
>> >>>> > have
>> >>>> > this
>> >>>> > issue which should hopefully give me the thread dump. However we've
>> >>>> > noticed
>> >>>> > that killing processes means we have to run the "locks-clean"
>> >>>> > script so
>> >>>> > it's
>> >>>> > not our favourite way of doing it.
>> >>>> >
>> >>>> > Also I definitely think a timeout for queries would be a good
>> >>>> > thing.
>> >>>> >
>> >>>> > I guess we go back to checking that the connection to the database
>> >>>> > should
>> >>>> > have been ok last night...
>> >>>> >
>> >>>> > Best wishes,
>> >>>> > Anthony.
>> >>>> >
>> >>>> > --
>> >>>> > Dr Anthony Leonard
>> >>>> > System Integrator, Information Directorate
>> >>>> > University of York, Heslington, York, UK, YO10 5DD
>> >>>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >>>> > Times Higher Education University of the Year 2010
>> >>>> >
>> >>>> >
>> >>>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com>
>> >>>> > wrote:
>> >>>> >>
>> >>>> >> Hi Anthony,
>> >>>> >>
>> >>>> >> What happens between the framework recognizing that the job should
>> >>>> >> be
>> >>>> >> started (which it does fine in both cases), and actually achieving
>> >>>> >> a
>> >>>> >> correct job start, is the "seeding" phase, which is going to try
>> >>>> >> to
>> >>>> >> execute the seeding query against your Oracle database.  If
>> >>>> >> something
>> >>>> >> happens at that time to hang the JDBC connection's seeding query,
>> >>>> >> then
>> >>>> >> it precisely explains the behavior you are seeing.
>> >>>> >>
>> >>>> >> It is also the case that the timeout on the queries that the JDBC
>> >>>> >> connector does is effectively infinite.  This makes me suspicious
>> >>>> >> that
>> >>>> >> what is happening is an Oracle query is going out but there is no
>> >>>> >> response ever coming back.
>> >>>> >>
>> >>>> >> The other possibility is that the JDBC connector is in fact
>> >>>> >> correctly
>> >>>> >> throwing a ServiceInterruption, but that the ManifoldCF code is
>> >>>> >> either
>> >>>> >> not handling it properly, or the connector is not forming it
>> >>>> >> properly.
>> >>>> >>  In that case, when you notice a hung job, the startup thread will
>> >>>> >> be
>> >>>> >> a particular place in the code, and I can diagnose it that way.
>> >>>> >>
>> >>>> >> The first order of business is therefore to get a thread dump when
>> >>>> >> the
>> >>>> >> system is "hung".  That will help confirm the picture.  There are
>> >>>> >> a
>> >>>> >> number of additional questions here.
>> >>>> >>
>> >>>> >> (1) Why is this happening?  Is there any possibility that the
>> >>>> >> Oracle
>> >>>> >> database you are crawling is (very occasionally) not able to
>> >>>> >> properly
>> >>>> >> respond to a JDBC query?  I can imagine that, under some network
>> >>>> >> conditions, it might be possible for the Oracle JDBC driver to
>> >>>> >> wind up
>> >>>> >> waiting indefinitely for a response that never comes.
>> >>>> >>
>> >>>> >> (2) Given that we can't always control the infrastructure we're
>> >>>> >> trying
>> >>>> >> to crawl through, should we attempt to provide a reasonable
>> >>>> >> workaround?  For example, a timeout on JDBC connector queries,
>> >>>> >> where
>> >>>> >> we throw a ServiceInterruption if the timeout is exceeded?
>> >>>> >>
>> >>>> >> Karl
>> >>>> >>
>> >>>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>> >>>> >> <an...@york.ac.uk> wrote:
>> >>>> >> > Hi there,
>> >>>> >> >
>> >>>> >> > We have recently started running a nightly job 2AM in ManifoldCF
>> >>>> >> > to
>> >>>> >> > extract
>> >>>> >> > data from an Oracle repository and populate a Solr index. Most
>> >>>> >> > nights
>> >>>> >> > this
>> >>>> >> > works fine, but occasionally the job has been hanging at the
>> >>>> >> > "Starting
>> >>>> >> > up"
>> >>>> >> > phase. We have observed this on our test setup also
>> >>>> >> > occasionally. A
>> >>>> >> > restart
>> >>>> >> > of ManifoldCF usually solves this.
>> >>>> >> >
>> >>>> >> > Using the simple history reports today I looked up all records
>> >>>> >> > and
>> >>>> >> > sorted
>> >>>> >> > them by the "Time" column, largest first, and found the
>> >>>> >> > following:
>> >>>> >> >
>> >>>> >> > "Start Time","Activity","Identifier","Result
>> >>>> >> > Code","Bytes","Time","Result
>> >>>> >> > Description"
>> >>>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>> >>>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>> >>>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>> >>>> >> > ...","ERROR","0","31644956","Interrupted: null"
>> >>>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>> >>>> >> > ...","ERROR","0","31637594","Interrupted: null"
>> >>>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>> >>>> >> > ...","OK","0","17511",""
>> >>>> >> > ... etc ...
>> >>>> >> >
>> >>>> >> > If the Time column is in millis that means the first query was
>> >>>> >> > hanging
>> >>>> >> > for
>> >>>> >> > 22 days! (This was in the period before we went live when our
>> >>>> >> > live
>> >>>> >> > server
>> >>>> >> > was sitting idle for a while.) The other two occasions it was
>> >>>> >> > hanging
>> >>>> >> > for
>> >>>> >> > about 8 hours until we arrived to restart the job in the
>> >>>> >> > morning. I
>> >>>> >> > have
>> >>>> >> > confirmed that the Oracle database we are connecting to was
>> >>>> >> > available
>> >>>> >> > throughout these periods. These times are also too long for any
>> >>>> >> > network
>> >>>> >> > or
>> >>>> >> > database timeouts, which makes me suspect that it's a problem
>> >>>> >> > with
>> >>>> >> > the
>> >>>> >> > application.
>> >>>> >> >
>> >>>> >> > We have the following logging config in properties.xml
>> >>>> >> >
>> >>>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>> >>>> >> >   <property name="org.apache.manifoldcf.connectors"
>> >>>> >> > value="ALL"/>
>> >>>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>> >>>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>> >>>> >> >
>> >>>> >> > The job failed again last night and when I checked at 10:40 AM
>> >>>> >> > this
>> >>>> >> > morning
>> >>>> >> > the last few lines of manifoldcf.log were:
>> >>>> >> >
>> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if
>> >>>> >> > job
>> >>>> >> > 1352455005553 needs to be started; it was last checked at
>> >>>> >> > 1358733575454,
>> >>>> >> > and
>> >>>> >> > now it is 1358733585635
>> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time
>> >>>> >> > match
>> >>>> >> > found
>> >>>> >> > within interval 1358733575454 to 1358733585635
>> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if
>> >>>> >> > job
>> >>>> >> > 1352455005553 needs to be started; it was last checked at
>> >>>> >> > 1358733585636,
>> >>>> >> > and
>> >>>> >> > now it is 1358733595662
>> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time
>> >>>> >> > match
>> >>>> >> > found
>> >>>> >> > within interval 1358733585636 to 1358733595662
>> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if
>> >>>> >> > job
>> >>>> >> > 1352455005553 needs to be started; it was last checked at
>> >>>> >> > 1358733595663,
>> >>>> >> > and
>> >>>> >> > now it is 1358733605813
>> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match
>> >>>> >> > FOUND
>> >>>> >> > within
>> >>>> >> > interval 1358733595663 to 1358733605813
>> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>> >>>> >> > '1352455005553'
>> >>>> >> > is
>> >>>> >> > within run window at 1358733605813 ms. (which starts at
>> >>>> >> > 1358733600000
>> >>>> >> > ms.)
>> >>>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for
>> >>>> >> > job
>> >>>> >> > start
>> >>>> >> > for job 1352455005553
>> >>>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>> >>>> >> > 1352455005553
>> >>>> >> > for startup
>> >>>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The
>> >>>> >> > connect
>> >>>> >> > string
>> >>>> >> > is
>> >>>> >> >
>> >>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >>>> >> >
>> >>>> >> > After that - nothing. Once I restarted manifoldcf this morning
>> >>>> >> > the
>> >>>> >> > job
>> >>>> >> > magically restarted itself and the following log messages were
>> >>>> >> > added
>> >>>> >> > where
>> >>>> >> > it had left off:
>> >>>> >> >
>> >>>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>> >>>> >> > 1352455005553
>> >>>> >> > back to 'ReadyForStartup' state
>> >>>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to
>> >>>> >> > restart
>> >>>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>> >>>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>> >>>> >> > 1352455005553
>> >>>> >> > for startup
>> >>>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect
>> >>>> >> > string
>> >>>> >> > is
>> >>>> >> >
>> >>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >>>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
>> >>>> >> > 1352455005553 is
>> >>>> >> > now
>> >>>> >> > started
>> >>>> >> > ... etc ...
>> >>>> >> >
>> >>>> >> > So it appears that the application is running fine with a Job
>> >>>> >> > start
>> >>>> >> > thread
>> >>>> >> > logging nicely every ten seconds until it tries to start this
>> >>>> >> > job and
>> >>>> >> > hangs
>> >>>> >> > entirely until the system is shutdown - even though it performed
>> >>>> >> > that
>> >>>> >> > same
>> >>>> >> > tasks perfectly the night before and later the same day.
>> >>>> >> >
>> >>>> >> > Can anyone advise on what might be happening here? We are
>> >>>> >> > running
>> >>>> >> > ManifoldCF
>> >>>> >> > version 1.0.1 on Ubuntu 10.04.
>> >>>> >> >
>> >>>> >> > Best wishes,
>> >>>> >> > Anthony.
>> >>>> >> >
>> >>>> >> > --
>> >>>> >> > Dr Anthony Leonard
>> >>>> >> > System Integrator, Information Directorate
>> >>>> >> > University of York, Heslington, York, UK, YO10 5DD
>> >>>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >>>> >> > Times Higher Education University of the Year 2010
>> >>>> >
>> >>>> >
>> >>>
>> >>>
>
>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Anthony Leonard <an...@york.ac.uk>.
Hi Karl,

Thank you so much for this. Sorry for the lack of response as we've been
working on other things.

One question - would we have to build ManifoldCF ourselves to get the new
code you've checked in or would it already be part of a binary distribution
somewhere?

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010


On Tue, Jan 22, 2013 at 6:51 PM, Karl Wright <da...@gmail.com> wrote:

> I've checked in code in both trunk and the release branch for this issue.
>
> It would be good if you could try this again in your environment.  The
> fix simply prevents some kinds of exceptions from causing a handle
> leak.  Please try this with only 1 JDBC Connection connection handle
> per JVM and let me know if you see any hangs.
>
> Thanks,
> Karl
>
>
> On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright <da...@gmail.com> wrote:
> > I've looked into the code in some detail.  There is indeed a place
> > where it is possible for a JDBC connection handle to be leaked, I
> > believe.  However, it's not clear whether this is the circumstance you
> > are encountering or not, since it does in involve an exception getting
> > thrown doing something not terribly likely to cause exceptions.
> >
> > I've opened a ticket - CONNECTORS-620.
> >
> > Karl
> >
> > On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright <da...@gmail.com>
> wrote:
> >> Hmm.
> >>
> >> The following threads are of interest here:
> >>
> >> Thread 29975: (state = BLOCKED)
> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
> >> be imprecise)
> >>  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
> >>  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
> >> java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
> >> (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
> >> java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
> >> org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
> >> long, int) @bci=106, line=246 (Interpreted frame)
> >>  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
> >> line=179 (Interpreted frame)
> >>
> >> ... which is probably waiting in this one:
> >>
> >> Thread 24457: (state = BLOCKED)
> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
> >> be imprecise)
> >>  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
> >>  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
> >> @bci=80, line=80 (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
> >> java.lang.String, java.lang.String, java.lang.String,
> >> java.lang.String) @bci=433, line=128 (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
> >> @bci=36, line=1212 (Interpreted frame)
> >>
> >> ... which is waiting to obtain a JDBC connection, and the reason it
> >> can't is because it thinks that the only available JDBC connection is
> >> currently in use.
> >>
> >> Since you have only a single connection around, and nothing else is
> >> active, it stands to reason that a JDBC connection handle has somehow
> >> been leaked, which is a challenge since connections are typically
> >> freed in a try/finally block through ManifoldCF.
> >>
> >> I notice that your stack frames are pretty unusual - what JDK is this
> >> that you are using?
> >>
> >> Karl
> >>
> >>
> >> Since
> >> On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
> >> <an...@york.ac.uk> wrote:
> >>> Dear Karl,
> >>>
> >>> Our DBA noticed that each time our job was run 10 Oracle connections
> were
> >>> created. So, we dropped the "Max connections" parameter on the
> repository
> >>> connection config to 1 and re-ran the job with the DBA watching. The
> job
> >>> worked fine but the DBA reported that 1 connection was created and
> then 10
> >>> more briefly ...
> >>>
> >>> Out of curiosity we re-ran the job again with no further changes and
> this
> >>> time got the following results:
> >>>
> >>> * the job hung in the "Starting Up" phase again, with the same logging
> and
> >>> symptoms as detailed before on this thread.
> >>> * the DBA reported seeing no connections at all this time.
> >>> * I have attached a thread dump created by "jstack -F <pid>". This is
> >>> reporting all threads as blocked.
> >>>
> >>> Any ideas? Any help with this would certainly be very gratefully
> received.
> >>>
> >>> Best wishes,
> >>> Anthony.
> >>>
> >>> --
> >>> Dr Anthony Leonard
> >>> System Integrator, Information Directorate
> >>> University of York, Heslington, York, UK, YO10 5DD
> >>> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >>> Times Higher Education University of the Year 2010
> >>>
> >>>
> >>> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <da...@gmail.com>
> wrote:
> >>>>
> >>>> kill -QUIT should not abort the agents process, just cause a thread
> >>>> dump.  kill -9 is a different story.
> >>>>
> >>>> You can also do the same thing by using jstack, in the jvm bin
> directory.
> >>>>
> >>>> Karl
> >>>>
> >>>>
> >>>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
> >>>> <an...@york.ac.uk> wrote:
> >>>> > Dear Karl,
> >>>> >
> >>>> > Many thanks for your insights. I'll do a kill -QUIT next time we
> have
> >>>> > this
> >>>> > issue which should hopefully give me the thread dump. However we've
> >>>> > noticed
> >>>> > that killing processes means we have to run the "locks-clean"
> script so
> >>>> > it's
> >>>> > not our favourite way of doing it.
> >>>> >
> >>>> > Also I definitely think a timeout for queries would be a good thing.
> >>>> >
> >>>> > I guess we go back to checking that the connection to the database
> >>>> > should
> >>>> > have been ok last night...
> >>>> >
> >>>> > Best wishes,
> >>>> > Anthony.
> >>>> >
> >>>> > --
> >>>> > Dr Anthony Leonard
> >>>> > System Integrator, Information Directorate
> >>>> > University of York, Heslington, York, UK, YO10 5DD
> >>>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >>>> > Times Higher Education University of the Year 2010
> >>>> >
> >>>> >
> >>>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com>
> wrote:
> >>>> >>
> >>>> >> Hi Anthony,
> >>>> >>
> >>>> >> What happens between the framework recognizing that the job should
> be
> >>>> >> started (which it does fine in both cases), and actually achieving
> a
> >>>> >> correct job start, is the "seeding" phase, which is going to try to
> >>>> >> execute the seeding query against your Oracle database.  If
> something
> >>>> >> happens at that time to hang the JDBC connection's seeding query,
> then
> >>>> >> it precisely explains the behavior you are seeing.
> >>>> >>
> >>>> >> It is also the case that the timeout on the queries that the JDBC
> >>>> >> connector does is effectively infinite.  This makes me suspicious
> that
> >>>> >> what is happening is an Oracle query is going out but there is no
> >>>> >> response ever coming back.
> >>>> >>
> >>>> >> The other possibility is that the JDBC connector is in fact
> correctly
> >>>> >> throwing a ServiceInterruption, but that the ManifoldCF code is
> either
> >>>> >> not handling it properly, or the connector is not forming it
> properly.
> >>>> >>  In that case, when you notice a hung job, the startup thread will
> be
> >>>> >> a particular place in the code, and I can diagnose it that way.
> >>>> >>
> >>>> >> The first order of business is therefore to get a thread dump when
> the
> >>>> >> system is "hung".  That will help confirm the picture.  There are a
> >>>> >> number of additional questions here.
> >>>> >>
> >>>> >> (1) Why is this happening?  Is there any possibility that the
> Oracle
> >>>> >> database you are crawling is (very occasionally) not able to
> properly
> >>>> >> respond to a JDBC query?  I can imagine that, under some network
> >>>> >> conditions, it might be possible for the Oracle JDBC driver to
> wind up
> >>>> >> waiting indefinitely for a response that never comes.
> >>>> >>
> >>>> >> (2) Given that we can't always control the infrastructure we're
> trying
> >>>> >> to crawl through, should we attempt to provide a reasonable
> >>>> >> workaround?  For example, a timeout on JDBC connector queries,
> where
> >>>> >> we throw a ServiceInterruption if the timeout is exceeded?
> >>>> >>
> >>>> >> Karl
> >>>> >>
> >>>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
> >>>> >> <an...@york.ac.uk> wrote:
> >>>> >> > Hi there,
> >>>> >> >
> >>>> >> > We have recently started running a nightly job 2AM in ManifoldCF
> to
> >>>> >> > extract
> >>>> >> > data from an Oracle repository and populate a Solr index. Most
> nights
> >>>> >> > this
> >>>> >> > works fine, but occasionally the job has been hanging at the
> >>>> >> > "Starting
> >>>> >> > up"
> >>>> >> > phase. We have observed this on our test setup also
> occasionally. A
> >>>> >> > restart
> >>>> >> > of ManifoldCF usually solves this.
> >>>> >> >
> >>>> >> > Using the simple history reports today I looked up all records
> and
> >>>> >> > sorted
> >>>> >> > them by the "Time" column, largest first, and found the
> following:
> >>>> >> >
> >>>> >> > "Start Time","Activity","Identifier","Result
> >>>> >> > Code","Bytes","Time","Result
> >>>> >> > Description"
> >>>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> >>>> >> > ...","ERROR","0","1926607529","Interrupted: null"
> >>>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> >>>> >> > ...","ERROR","0","31644956","Interrupted: null"
> >>>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> >>>> >> > ...","ERROR","0","31637594","Interrupted: null"
> >>>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> >>>> >> > ...","OK","0","17511",""
> >>>> >> > ... etc ...
> >>>> >> >
> >>>> >> > If the Time column is in millis that means the first query was
> >>>> >> > hanging
> >>>> >> > for
> >>>> >> > 22 days! (This was in the period before we went live when our
> live
> >>>> >> > server
> >>>> >> > was sitting idle for a while.) The other two occasions it was
> hanging
> >>>> >> > for
> >>>> >> > about 8 hours until we arrived to restart the job in the
> morning. I
> >>>> >> > have
> >>>> >> > confirmed that the Oracle database we are connecting to was
> available
> >>>> >> > throughout these periods. These times are also too long for any
> >>>> >> > network
> >>>> >> > or
> >>>> >> > database timeouts, which makes me suspect that it's a problem
> with
> >>>> >> > the
> >>>> >> > application.
> >>>> >> >
> >>>> >> > We have the following logging config in properties.xml
> >>>> >> >
> >>>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
> >>>> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
> >>>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
> >>>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
> >>>> >> >
> >>>> >> > The job failed again last night and when I checked at 10:40 AM
> this
> >>>> >> > morning
> >>>> >> > the last few lines of manifoldcf.log were:
> >>>> >> >
> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if
> job
> >>>> >> > 1352455005553 needs to be started; it was last checked at
> >>>> >> > 1358733575454,
> >>>> >> > and
> >>>> >> > now it is 1358733585635
> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
> >>>> >> > found
> >>>> >> > within interval 1358733575454 to 1358733585635
> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if
> job
> >>>> >> > 1352455005553 needs to be started; it was last checked at
> >>>> >> > 1358733585636,
> >>>> >> > and
> >>>> >> > now it is 1358733595662
> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
> >>>> >> > found
> >>>> >> > within interval 1358733585636 to 1358733595662
> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if
> job
> >>>> >> > 1352455005553 needs to be started; it was last checked at
> >>>> >> > 1358733595663,
> >>>> >> > and
> >>>> >> > now it is 1358733605813
> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match
> FOUND
> >>>> >> > within
> >>>> >> > interval 1358733595663 to 1358733605813
> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
> >>>> >> > '1352455005553'
> >>>> >> > is
> >>>> >> > within run window at 1358733605813 ms. (which starts at
> 1358733600000
> >>>> >> > ms.)
> >>>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for
> job
> >>>> >> > start
> >>>> >> > for job 1352455005553
> >>>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
> >>>> >> > 1352455005553
> >>>> >> > for startup
> >>>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
> >>>> >> > string
> >>>> >> > is
> >>>> >> > 'jdbc:oracle:thin:@//
> oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >>>> >> >
> >>>> >> > After that - nothing. Once I restarted manifoldcf this morning
> the
> >>>> >> > job
> >>>> >> > magically restarted itself and the following log messages were
> added
> >>>> >> > where
> >>>> >> > it had left off:
> >>>> >> >
> >>>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
> >>>> >> > 1352455005553
> >>>> >> > back to 'ReadyForStartup' state
> >>>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to
> restart
> >>>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> >>>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
> >>>> >> > 1352455005553
> >>>> >> > for startup
> >>>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect
> string
> >>>> >> > is
> >>>> >> > 'jdbc:oracle:thin:@//
> oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >>>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
> 1352455005553 is
> >>>> >> > now
> >>>> >> > started
> >>>> >> > ... etc ...
> >>>> >> >
> >>>> >> > So it appears that the application is running fine with a Job
> start
> >>>> >> > thread
> >>>> >> > logging nicely every ten seconds until it tries to start this
> job and
> >>>> >> > hangs
> >>>> >> > entirely until the system is shutdown - even though it performed
> that
> >>>> >> > same
> >>>> >> > tasks perfectly the night before and later the same day.
> >>>> >> >
> >>>> >> > Can anyone advise on what might be happening here? We are running
> >>>> >> > ManifoldCF
> >>>> >> > version 1.0.1 on Ubuntu 10.04.
> >>>> >> >
> >>>> >> > Best wishes,
> >>>> >> > Anthony.
> >>>> >> >
> >>>> >> > --
> >>>> >> > Dr Anthony Leonard
> >>>> >> > System Integrator, Information Directorate
> >>>> >> > University of York, Heslington, York, UK, YO10 5DD
> >>>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >>>> >> > Times Higher Education University of the Year 2010
> >>>> >
> >>>> >
> >>>
> >>>
>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Karl Wright <da...@gmail.com>.
I've checked in code in both trunk and the release branch for this issue.

It would be good if you could try this again in your environment.  The
fix simply prevents some kinds of exceptions from causing a handle
leak.  Please try this with only 1 JDBC Connection connection handle
per JVM and let me know if you see any hangs.

Thanks,
Karl


On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright <da...@gmail.com> wrote:
> I've looked into the code in some detail.  There is indeed a place
> where it is possible for a JDBC connection handle to be leaked, I
> believe.  However, it's not clear whether this is the circumstance you
> are encountering or not, since it does in involve an exception getting
> thrown doing something not terribly likely to cause exceptions.
>
> I've opened a ticket - CONNECTORS-620.
>
> Karl
>
> On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright <da...@gmail.com> wrote:
>> Hmm.
>>
>> The following threads are of interest here:
>>
>> Thread 29975: (state = BLOCKED)
>>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> be imprecise)
>>  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
>>  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
>> java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
>> (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
>> java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
>> org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
>> long, int) @bci=106, line=246 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
>> line=179 (Interpreted frame)
>>
>> ... which is probably waiting in this one:
>>
>> Thread 24457: (state = BLOCKED)
>>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> be imprecise)
>>  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
>>  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
>> @bci=80, line=80 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
>> java.lang.String, java.lang.String, java.lang.String,
>> java.lang.String) @bci=433, line=128 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
>> @bci=36, line=1212 (Interpreted frame)
>>
>> ... which is waiting to obtain a JDBC connection, and the reason it
>> can't is because it thinks that the only available JDBC connection is
>> currently in use.
>>
>> Since you have only a single connection around, and nothing else is
>> active, it stands to reason that a JDBC connection handle has somehow
>> been leaked, which is a challenge since connections are typically
>> freed in a try/finally block through ManifoldCF.
>>
>> I notice that your stack frames are pretty unusual - what JDK is this
>> that you are using?
>>
>> Karl
>>
>>
>> Since
>> On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
>> <an...@york.ac.uk> wrote:
>>> Dear Karl,
>>>
>>> Our DBA noticed that each time our job was run 10 Oracle connections were
>>> created. So, we dropped the "Max connections" parameter on the repository
>>> connection config to 1 and re-ran the job with the DBA watching. The job
>>> worked fine but the DBA reported that 1 connection was created and then 10
>>> more briefly ...
>>>
>>> Out of curiosity we re-ran the job again with no further changes and this
>>> time got the following results:
>>>
>>> * the job hung in the "Starting Up" phase again, with the same logging and
>>> symptoms as detailed before on this thread.
>>> * the DBA reported seeing no connections at all this time.
>>> * I have attached a thread dump created by "jstack -F <pid>". This is
>>> reporting all threads as blocked.
>>>
>>> Any ideas? Any help with this would certainly be very gratefully received.
>>>
>>> Best wishes,
>>> Anthony.
>>>
>>> --
>>> Dr Anthony Leonard
>>> System Integrator, Information Directorate
>>> University of York, Heslington, York, UK, YO10 5DD
>>> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>> Times Higher Education University of the Year 2010
>>>
>>>
>>> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <da...@gmail.com> wrote:
>>>>
>>>> kill -QUIT should not abort the agents process, just cause a thread
>>>> dump.  kill -9 is a different story.
>>>>
>>>> You can also do the same thing by using jstack, in the jvm bin directory.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>>>> <an...@york.ac.uk> wrote:
>>>> > Dear Karl,
>>>> >
>>>> > Many thanks for your insights. I'll do a kill -QUIT next time we have
>>>> > this
>>>> > issue which should hopefully give me the thread dump. However we've
>>>> > noticed
>>>> > that killing processes means we have to run the "locks-clean" script so
>>>> > it's
>>>> > not our favourite way of doing it.
>>>> >
>>>> > Also I definitely think a timeout for queries would be a good thing.
>>>> >
>>>> > I guess we go back to checking that the connection to the database
>>>> > should
>>>> > have been ok last night...
>>>> >
>>>> > Best wishes,
>>>> > Anthony.
>>>> >
>>>> > --
>>>> > Dr Anthony Leonard
>>>> > System Integrator, Information Directorate
>>>> > University of York, Heslington, York, UK, YO10 5DD
>>>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>>> > Times Higher Education University of the Year 2010
>>>> >
>>>> >
>>>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com> wrote:
>>>> >>
>>>> >> Hi Anthony,
>>>> >>
>>>> >> What happens between the framework recognizing that the job should be
>>>> >> started (which it does fine in both cases), and actually achieving a
>>>> >> correct job start, is the "seeding" phase, which is going to try to
>>>> >> execute the seeding query against your Oracle database.  If something
>>>> >> happens at that time to hang the JDBC connection's seeding query, then
>>>> >> it precisely explains the behavior you are seeing.
>>>> >>
>>>> >> It is also the case that the timeout on the queries that the JDBC
>>>> >> connector does is effectively infinite.  This makes me suspicious that
>>>> >> what is happening is an Oracle query is going out but there is no
>>>> >> response ever coming back.
>>>> >>
>>>> >> The other possibility is that the JDBC connector is in fact correctly
>>>> >> throwing a ServiceInterruption, but that the ManifoldCF code is either
>>>> >> not handling it properly, or the connector is not forming it properly.
>>>> >>  In that case, when you notice a hung job, the startup thread will be
>>>> >> a particular place in the code, and I can diagnose it that way.
>>>> >>
>>>> >> The first order of business is therefore to get a thread dump when the
>>>> >> system is "hung".  That will help confirm the picture.  There are a
>>>> >> number of additional questions here.
>>>> >>
>>>> >> (1) Why is this happening?  Is there any possibility that the Oracle
>>>> >> database you are crawling is (very occasionally) not able to properly
>>>> >> respond to a JDBC query?  I can imagine that, under some network
>>>> >> conditions, it might be possible for the Oracle JDBC driver to wind up
>>>> >> waiting indefinitely for a response that never comes.
>>>> >>
>>>> >> (2) Given that we can't always control the infrastructure we're trying
>>>> >> to crawl through, should we attempt to provide a reasonable
>>>> >> workaround?  For example, a timeout on JDBC connector queries, where
>>>> >> we throw a ServiceInterruption if the timeout is exceeded?
>>>> >>
>>>> >> Karl
>>>> >>
>>>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>>>> >> <an...@york.ac.uk> wrote:
>>>> >> > Hi there,
>>>> >> >
>>>> >> > We have recently started running a nightly job 2AM in ManifoldCF to
>>>> >> > extract
>>>> >> > data from an Oracle repository and populate a Solr index. Most nights
>>>> >> > this
>>>> >> > works fine, but occasionally the job has been hanging at the
>>>> >> > "Starting
>>>> >> > up"
>>>> >> > phase. We have observed this on our test setup also occasionally. A
>>>> >> > restart
>>>> >> > of ManifoldCF usually solves this.
>>>> >> >
>>>> >> > Using the simple history reports today I looked up all records and
>>>> >> > sorted
>>>> >> > them by the "Time" column, largest first, and found the following:
>>>> >> >
>>>> >> > "Start Time","Activity","Identifier","Result
>>>> >> > Code","Bytes","Time","Result
>>>> >> > Description"
>>>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>>>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>>>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>>>> >> > ...","ERROR","0","31644956","Interrupted: null"
>>>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>>>> >> > ...","ERROR","0","31637594","Interrupted: null"
>>>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>>>> >> > ...","OK","0","17511",""
>>>> >> > ... etc ...
>>>> >> >
>>>> >> > If the Time column is in millis that means the first query was
>>>> >> > hanging
>>>> >> > for
>>>> >> > 22 days! (This was in the period before we went live when our live
>>>> >> > server
>>>> >> > was sitting idle for a while.) The other two occasions it was hanging
>>>> >> > for
>>>> >> > about 8 hours until we arrived to restart the job in the morning. I
>>>> >> > have
>>>> >> > confirmed that the Oracle database we are connecting to was available
>>>> >> > throughout these periods. These times are also too long for any
>>>> >> > network
>>>> >> > or
>>>> >> > database timeouts, which makes me suspect that it's a problem with
>>>> >> > the
>>>> >> > application.
>>>> >> >
>>>> >> > We have the following logging config in properties.xml
>>>> >> >
>>>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>>>> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>>>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>>>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>>>> >> >
>>>> >> > The job failed again last night and when I checked at 10:40 AM this
>>>> >> > morning
>>>> >> > the last few lines of manifoldcf.log were:
>>>> >> >
>>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
>>>> >> > 1352455005553 needs to be started; it was last checked at
>>>> >> > 1358733575454,
>>>> >> > and
>>>> >> > now it is 1358733585635
>>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
>>>> >> > found
>>>> >> > within interval 1358733575454 to 1358733585635
>>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
>>>> >> > 1352455005553 needs to be started; it was last checked at
>>>> >> > 1358733585636,
>>>> >> > and
>>>> >> > now it is 1358733595662
>>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
>>>> >> > found
>>>> >> > within interval 1358733585636 to 1358733595662
>>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
>>>> >> > 1352455005553 needs to be started; it was last checked at
>>>> >> > 1358733595663,
>>>> >> > and
>>>> >> > now it is 1358733605813
>>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
>>>> >> > within
>>>> >> > interval 1358733595663 to 1358733605813
>>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>>>> >> > '1352455005553'
>>>> >> > is
>>>> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
>>>> >> > ms.)
>>>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
>>>> >> > start
>>>> >> > for job 1352455005553
>>>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>>>> >> > 1352455005553
>>>> >> > for startup
>>>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
>>>> >> > string
>>>> >> > is
>>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>>>> >> >
>>>> >> > After that - nothing. Once I restarted manifoldcf this morning the
>>>> >> > job
>>>> >> > magically restarted itself and the following log messages were added
>>>> >> > where
>>>> >> > it had left off:
>>>> >> >
>>>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>>>> >> > 1352455005553
>>>> >> > back to 'ReadyForStartup' state
>>>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
>>>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>>>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>>>> >> > 1352455005553
>>>> >> > for startup
>>>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string
>>>> >> > is
>>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>>>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
>>>> >> > now
>>>> >> > started
>>>> >> > ... etc ...
>>>> >> >
>>>> >> > So it appears that the application is running fine with a Job start
>>>> >> > thread
>>>> >> > logging nicely every ten seconds until it tries to start this job and
>>>> >> > hangs
>>>> >> > entirely until the system is shutdown - even though it performed that
>>>> >> > same
>>>> >> > tasks perfectly the night before and later the same day.
>>>> >> >
>>>> >> > Can anyone advise on what might be happening here? We are running
>>>> >> > ManifoldCF
>>>> >> > version 1.0.1 on Ubuntu 10.04.
>>>> >> >
>>>> >> > Best wishes,
>>>> >> > Anthony.
>>>> >> >
>>>> >> > --
>>>> >> > Dr Anthony Leonard
>>>> >> > System Integrator, Information Directorate
>>>> >> > University of York, Heslington, York, UK, YO10 5DD
>>>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>>> >> > Times Higher Education University of the Year 2010
>>>> >
>>>> >
>>>
>>>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Karl Wright <da...@gmail.com>.
I've looked into the code in some detail.  There is indeed a place
where it is possible for a JDBC connection handle to be leaked, I
believe.  However, it's not clear whether this is the circumstance you
are encountering or not, since it does in involve an exception getting
thrown doing something not terribly likely to cause exceptions.

I've opened a ticket - CONNECTORS-620.

Karl

On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright <da...@gmail.com> wrote:
> Hmm.
>
> The following threads are of interest here:
>
> Thread 29975: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
> be imprecise)
>  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
>  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
> java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
> (Interpreted frame)
>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
> java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
> org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
> long, int) @bci=106, line=246 (Interpreted frame)
>  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
> line=179 (Interpreted frame)
>
> ... which is probably waiting in this one:
>
> Thread 24457: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
> be imprecise)
>  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
>  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
> @bci=80, line=80 (Interpreted frame)
>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
> java.lang.String, java.lang.String, java.lang.String,
> java.lang.String) @bci=433, line=128 (Interpreted frame)
>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
> @bci=36, line=1212 (Interpreted frame)
>
> ... which is waiting to obtain a JDBC connection, and the reason it
> can't is because it thinks that the only available JDBC connection is
> currently in use.
>
> Since you have only a single connection around, and nothing else is
> active, it stands to reason that a JDBC connection handle has somehow
> been leaked, which is a challenge since connections are typically
> freed in a try/finally block through ManifoldCF.
>
> I notice that your stack frames are pretty unusual - what JDK is this
> that you are using?
>
> Karl
>
>
> Since
> On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
> <an...@york.ac.uk> wrote:
>> Dear Karl,
>>
>> Our DBA noticed that each time our job was run 10 Oracle connections were
>> created. So, we dropped the "Max connections" parameter on the repository
>> connection config to 1 and re-ran the job with the DBA watching. The job
>> worked fine but the DBA reported that 1 connection was created and then 10
>> more briefly ...
>>
>> Out of curiosity we re-ran the job again with no further changes and this
>> time got the following results:
>>
>> * the job hung in the "Starting Up" phase again, with the same logging and
>> symptoms as detailed before on this thread.
>> * the DBA reported seeing no connections at all this time.
>> * I have attached a thread dump created by "jstack -F <pid>". This is
>> reporting all threads as blocked.
>>
>> Any ideas? Any help with this would certainly be very gratefully received.
>>
>> Best wishes,
>> Anthony.
>>
>> --
>> Dr Anthony Leonard
>> System Integrator, Information Directorate
>> University of York, Heslington, York, UK, YO10 5DD
>> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> Times Higher Education University of the Year 2010
>>
>>
>> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> kill -QUIT should not abort the agents process, just cause a thread
>>> dump.  kill -9 is a different story.
>>>
>>> You can also do the same thing by using jstack, in the jvm bin directory.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>>> <an...@york.ac.uk> wrote:
>>> > Dear Karl,
>>> >
>>> > Many thanks for your insights. I'll do a kill -QUIT next time we have
>>> > this
>>> > issue which should hopefully give me the thread dump. However we've
>>> > noticed
>>> > that killing processes means we have to run the "locks-clean" script so
>>> > it's
>>> > not our favourite way of doing it.
>>> >
>>> > Also I definitely think a timeout for queries would be a good thing.
>>> >
>>> > I guess we go back to checking that the connection to the database
>>> > should
>>> > have been ok last night...
>>> >
>>> > Best wishes,
>>> > Anthony.
>>> >
>>> > --
>>> > Dr Anthony Leonard
>>> > System Integrator, Information Directorate
>>> > University of York, Heslington, York, UK, YO10 5DD
>>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>> > Times Higher Education University of the Year 2010
>>> >
>>> >
>>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com> wrote:
>>> >>
>>> >> Hi Anthony,
>>> >>
>>> >> What happens between the framework recognizing that the job should be
>>> >> started (which it does fine in both cases), and actually achieving a
>>> >> correct job start, is the "seeding" phase, which is going to try to
>>> >> execute the seeding query against your Oracle database.  If something
>>> >> happens at that time to hang the JDBC connection's seeding query, then
>>> >> it precisely explains the behavior you are seeing.
>>> >>
>>> >> It is also the case that the timeout on the queries that the JDBC
>>> >> connector does is effectively infinite.  This makes me suspicious that
>>> >> what is happening is an Oracle query is going out but there is no
>>> >> response ever coming back.
>>> >>
>>> >> The other possibility is that the JDBC connector is in fact correctly
>>> >> throwing a ServiceInterruption, but that the ManifoldCF code is either
>>> >> not handling it properly, or the connector is not forming it properly.
>>> >>  In that case, when you notice a hung job, the startup thread will be
>>> >> a particular place in the code, and I can diagnose it that way.
>>> >>
>>> >> The first order of business is therefore to get a thread dump when the
>>> >> system is "hung".  That will help confirm the picture.  There are a
>>> >> number of additional questions here.
>>> >>
>>> >> (1) Why is this happening?  Is there any possibility that the Oracle
>>> >> database you are crawling is (very occasionally) not able to properly
>>> >> respond to a JDBC query?  I can imagine that, under some network
>>> >> conditions, it might be possible for the Oracle JDBC driver to wind up
>>> >> waiting indefinitely for a response that never comes.
>>> >>
>>> >> (2) Given that we can't always control the infrastructure we're trying
>>> >> to crawl through, should we attempt to provide a reasonable
>>> >> workaround?  For example, a timeout on JDBC connector queries, where
>>> >> we throw a ServiceInterruption if the timeout is exceeded?
>>> >>
>>> >> Karl
>>> >>
>>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>>> >> <an...@york.ac.uk> wrote:
>>> >> > Hi there,
>>> >> >
>>> >> > We have recently started running a nightly job 2AM in ManifoldCF to
>>> >> > extract
>>> >> > data from an Oracle repository and populate a Solr index. Most nights
>>> >> > this
>>> >> > works fine, but occasionally the job has been hanging at the
>>> >> > "Starting
>>> >> > up"
>>> >> > phase. We have observed this on our test setup also occasionally. A
>>> >> > restart
>>> >> > of ManifoldCF usually solves this.
>>> >> >
>>> >> > Using the simple history reports today I looked up all records and
>>> >> > sorted
>>> >> > them by the "Time" column, largest first, and found the following:
>>> >> >
>>> >> > "Start Time","Activity","Identifier","Result
>>> >> > Code","Bytes","Time","Result
>>> >> > Description"
>>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>>> >> > ...","ERROR","0","31644956","Interrupted: null"
>>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>>> >> > ...","ERROR","0","31637594","Interrupted: null"
>>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>>> >> > ...","OK","0","17511",""
>>> >> > ... etc ...
>>> >> >
>>> >> > If the Time column is in millis that means the first query was
>>> >> > hanging
>>> >> > for
>>> >> > 22 days! (This was in the period before we went live when our live
>>> >> > server
>>> >> > was sitting idle for a while.) The other two occasions it was hanging
>>> >> > for
>>> >> > about 8 hours until we arrived to restart the job in the morning. I
>>> >> > have
>>> >> > confirmed that the Oracle database we are connecting to was available
>>> >> > throughout these periods. These times are also too long for any
>>> >> > network
>>> >> > or
>>> >> > database timeouts, which makes me suspect that it's a problem with
>>> >> > the
>>> >> > application.
>>> >> >
>>> >> > We have the following logging config in properties.xml
>>> >> >
>>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>>> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>>> >> >
>>> >> > The job failed again last night and when I checked at 10:40 AM this
>>> >> > morning
>>> >> > the last few lines of manifoldcf.log were:
>>> >> >
>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
>>> >> > 1352455005553 needs to be started; it was last checked at
>>> >> > 1358733575454,
>>> >> > and
>>> >> > now it is 1358733585635
>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
>>> >> > found
>>> >> > within interval 1358733575454 to 1358733585635
>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
>>> >> > 1352455005553 needs to be started; it was last checked at
>>> >> > 1358733585636,
>>> >> > and
>>> >> > now it is 1358733595662
>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
>>> >> > found
>>> >> > within interval 1358733585636 to 1358733595662
>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
>>> >> > 1352455005553 needs to be started; it was last checked at
>>> >> > 1358733595663,
>>> >> > and
>>> >> > now it is 1358733605813
>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
>>> >> > within
>>> >> > interval 1358733595663 to 1358733605813
>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>>> >> > '1352455005553'
>>> >> > is
>>> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
>>> >> > ms.)
>>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
>>> >> > start
>>> >> > for job 1352455005553
>>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>>> >> > 1352455005553
>>> >> > for startup
>>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
>>> >> > string
>>> >> > is
>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>>> >> >
>>> >> > After that - nothing. Once I restarted manifoldcf this morning the
>>> >> > job
>>> >> > magically restarted itself and the following log messages were added
>>> >> > where
>>> >> > it had left off:
>>> >> >
>>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>>> >> > 1352455005553
>>> >> > back to 'ReadyForStartup' state
>>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
>>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>>> >> > 1352455005553
>>> >> > for startup
>>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string
>>> >> > is
>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
>>> >> > now
>>> >> > started
>>> >> > ... etc ...
>>> >> >
>>> >> > So it appears that the application is running fine with a Job start
>>> >> > thread
>>> >> > logging nicely every ten seconds until it tries to start this job and
>>> >> > hangs
>>> >> > entirely until the system is shutdown - even though it performed that
>>> >> > same
>>> >> > tasks perfectly the night before and later the same day.
>>> >> >
>>> >> > Can anyone advise on what might be happening here? We are running
>>> >> > ManifoldCF
>>> >> > version 1.0.1 on Ubuntu 10.04.
>>> >> >
>>> >> > Best wishes,
>>> >> > Anthony.
>>> >> >
>>> >> > --
>>> >> > Dr Anthony Leonard
>>> >> > System Integrator, Information Directorate
>>> >> > University of York, Heslington, York, UK, YO10 5DD
>>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>> >> > Times Higher Education University of the Year 2010
>>> >
>>> >
>>
>>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Karl Wright <da...@gmail.com>.
Hmm.

The following threads are of interest here:

Thread 29975: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
be imprecise)
 - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
 - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
(Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
long, int) @bci=106, line=246 (Interpreted frame)
 - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
line=179 (Interpreted frame)

... which is probably waiting in this one:

Thread 24457: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
 - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
@bci=80, line=80 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
java.lang.String, java.lang.String, java.lang.String,
java.lang.String) @bci=433, line=128 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
@bci=36, line=1212 (Interpreted frame)

... which is waiting to obtain a JDBC connection, and the reason it
can't is because it thinks that the only available JDBC connection is
currently in use.

Since you have only a single connection around, and nothing else is
active, it stands to reason that a JDBC connection handle has somehow
been leaked, which is a challenge since connections are typically
freed in a try/finally block through ManifoldCF.

I notice that your stack frames are pretty unusual - what JDK is this
that you are using?

Karl


Since
On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
<an...@york.ac.uk> wrote:
> Dear Karl,
>
> Our DBA noticed that each time our job was run 10 Oracle connections were
> created. So, we dropped the "Max connections" parameter on the repository
> connection config to 1 and re-ran the job with the DBA watching. The job
> worked fine but the DBA reported that 1 connection was created and then 10
> more briefly ...
>
> Out of curiosity we re-ran the job again with no further changes and this
> time got the following results:
>
> * the job hung in the "Starting Up" phase again, with the same logging and
> symptoms as detailed before on this thread.
> * the DBA reported seeing no connections at all this time.
> * I have attached a thread dump created by "jstack -F <pid>". This is
> reporting all threads as blocked.
>
> Any ideas? Any help with this would certainly be very gratefully received.
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> kill -QUIT should not abort the agents process, just cause a thread
>> dump.  kill -9 is a different story.
>>
>> You can also do the same thing by using jstack, in the jvm bin directory.
>>
>> Karl
>>
>>
>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>> <an...@york.ac.uk> wrote:
>> > Dear Karl,
>> >
>> > Many thanks for your insights. I'll do a kill -QUIT next time we have
>> > this
>> > issue which should hopefully give me the thread dump. However we've
>> > noticed
>> > that killing processes means we have to run the "locks-clean" script so
>> > it's
>> > not our favourite way of doing it.
>> >
>> > Also I definitely think a timeout for queries would be a good thing.
>> >
>> > I guess we go back to checking that the connection to the database
>> > should
>> > have been ok last night...
>> >
>> > Best wishes,
>> > Anthony.
>> >
>> > --
>> > Dr Anthony Leonard
>> > System Integrator, Information Directorate
>> > University of York, Heslington, York, UK, YO10 5DD
>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> > Times Higher Education University of the Year 2010
>> >
>> >
>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com> wrote:
>> >>
>> >> Hi Anthony,
>> >>
>> >> What happens between the framework recognizing that the job should be
>> >> started (which it does fine in both cases), and actually achieving a
>> >> correct job start, is the "seeding" phase, which is going to try to
>> >> execute the seeding query against your Oracle database.  If something
>> >> happens at that time to hang the JDBC connection's seeding query, then
>> >> it precisely explains the behavior you are seeing.
>> >>
>> >> It is also the case that the timeout on the queries that the JDBC
>> >> connector does is effectively infinite.  This makes me suspicious that
>> >> what is happening is an Oracle query is going out but there is no
>> >> response ever coming back.
>> >>
>> >> The other possibility is that the JDBC connector is in fact correctly
>> >> throwing a ServiceInterruption, but that the ManifoldCF code is either
>> >> not handling it properly, or the connector is not forming it properly.
>> >>  In that case, when you notice a hung job, the startup thread will be
>> >> a particular place in the code, and I can diagnose it that way.
>> >>
>> >> The first order of business is therefore to get a thread dump when the
>> >> system is "hung".  That will help confirm the picture.  There are a
>> >> number of additional questions here.
>> >>
>> >> (1) Why is this happening?  Is there any possibility that the Oracle
>> >> database you are crawling is (very occasionally) not able to properly
>> >> respond to a JDBC query?  I can imagine that, under some network
>> >> conditions, it might be possible for the Oracle JDBC driver to wind up
>> >> waiting indefinitely for a response that never comes.
>> >>
>> >> (2) Given that we can't always control the infrastructure we're trying
>> >> to crawl through, should we attempt to provide a reasonable
>> >> workaround?  For example, a timeout on JDBC connector queries, where
>> >> we throw a ServiceInterruption if the timeout is exceeded?
>> >>
>> >> Karl
>> >>
>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>> >> <an...@york.ac.uk> wrote:
>> >> > Hi there,
>> >> >
>> >> > We have recently started running a nightly job 2AM in ManifoldCF to
>> >> > extract
>> >> > data from an Oracle repository and populate a Solr index. Most nights
>> >> > this
>> >> > works fine, but occasionally the job has been hanging at the
>> >> > "Starting
>> >> > up"
>> >> > phase. We have observed this on our test setup also occasionally. A
>> >> > restart
>> >> > of ManifoldCF usually solves this.
>> >> >
>> >> > Using the simple history reports today I looked up all records and
>> >> > sorted
>> >> > them by the "Time" column, largest first, and found the following:
>> >> >
>> >> > "Start Time","Activity","Identifier","Result
>> >> > Code","Bytes","Time","Result
>> >> > Description"
>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>> >> > ...","ERROR","0","31644956","Interrupted: null"
>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>> >> > ...","ERROR","0","31637594","Interrupted: null"
>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>> >> > ...","OK","0","17511",""
>> >> > ... etc ...
>> >> >
>> >> > If the Time column is in millis that means the first query was
>> >> > hanging
>> >> > for
>> >> > 22 days! (This was in the period before we went live when our live
>> >> > server
>> >> > was sitting idle for a while.) The other two occasions it was hanging
>> >> > for
>> >> > about 8 hours until we arrived to restart the job in the morning. I
>> >> > have
>> >> > confirmed that the Oracle database we are connecting to was available
>> >> > throughout these periods. These times are also too long for any
>> >> > network
>> >> > or
>> >> > database timeouts, which makes me suspect that it's a problem with
>> >> > the
>> >> > application.
>> >> >
>> >> > We have the following logging config in properties.xml
>> >> >
>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>> >> >
>> >> > The job failed again last night and when I checked at 10:40 AM this
>> >> > morning
>> >> > the last few lines of manifoldcf.log were:
>> >> >
>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> >> > 1358733575454,
>> >> > and
>> >> > now it is 1358733585635
>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
>> >> > found
>> >> > within interval 1358733575454 to 1358733585635
>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> >> > 1358733585636,
>> >> > and
>> >> > now it is 1358733595662
>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
>> >> > found
>> >> > within interval 1358733585636 to 1358733595662
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> >> > 1358733595663,
>> >> > and
>> >> > now it is 1358733605813
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
>> >> > within
>> >> > interval 1358733595663 to 1358733605813
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>> >> > '1352455005553'
>> >> > is
>> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
>> >> > ms.)
>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
>> >> > start
>> >> > for job 1352455005553
>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>> >> > 1352455005553
>> >> > for startup
>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
>> >> > string
>> >> > is
>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >> >
>> >> > After that - nothing. Once I restarted manifoldcf this morning the
>> >> > job
>> >> > magically restarted itself and the following log messages were added
>> >> > where
>> >> > it had left off:
>> >> >
>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>> >> > 1352455005553
>> >> > back to 'ReadyForStartup' state
>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>> >> > 1352455005553
>> >> > for startup
>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string
>> >> > is
>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
>> >> > now
>> >> > started
>> >> > ... etc ...
>> >> >
>> >> > So it appears that the application is running fine with a Job start
>> >> > thread
>> >> > logging nicely every ten seconds until it tries to start this job and
>> >> > hangs
>> >> > entirely until the system is shutdown - even though it performed that
>> >> > same
>> >> > tasks perfectly the night before and later the same day.
>> >> >
>> >> > Can anyone advise on what might be happening here? We are running
>> >> > ManifoldCF
>> >> > version 1.0.1 on Ubuntu 10.04.
>> >> >
>> >> > Best wishes,
>> >> > Anthony.
>> >> >
>> >> > --
>> >> > Dr Anthony Leonard
>> >> > System Integrator, Information Directorate
>> >> > University of York, Heslington, York, UK, YO10 5DD
>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >> > Times Higher Education University of the Year 2010
>> >
>> >
>
>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Nigel Thomas <em...@gmail.com>.
Hello Anthony,

Worth taking a look at the puppet config for the service for launching
the mcf agents at
https://github.com/nigel-v-thomas/puppet-manifoldcf/blob/master/manifests/service.pp

Assuming the log errors are coming from the agent process, you will need
to add the variable "-Djava.security.egd=file:///dev/urandom" into
"multiprocess-example/processes/options.env".
The file options.env will need to be added into the puppet for
management as well.

Best,

Nigel
On 25/01/2013 17:02, Anthony Leonard wrote:
> Hi Nigel,
>
> You could be on to something here. I'll have a chat with the guys here
> as I think we've hit this issue before?
>
> Hope you're well :)  
>
> Not sure how I would configure ManifoldCF to use /dev/urandom ...
>
> Best wishes,
> Anthony.
>
> -- 
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Fri, Jan 25, 2013 at 4:21 PM, Nigel Thomas <eml.nvt@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Hello Anthony,
>
>     This issue may be related to the entropy exhaustion on Linux using
>     certain versions of JDBC driver, see http://goo.gl/8yA3a
>     have you try passing configuring JDBC to use /dev/urandom instead?
>
>     Cheers,
>
>     Nigel
>
>     On 22/01/2013 17:00, Anthony Leonard wrote:
>>     Dear Karl,
>>
>>     Our DBA noticed that each time our job was run 10 Oracle
>>     connections were created. So, we dropped the "Max connections"
>>     parameter on the repository connection config to 1 and re-ran the
>>     job with the DBA watching. The job worked fine but the DBA
>>     reported that 1 connection was created and then 10 more briefly ...
>>
>>     Out of curiosity we re-ran the job again with no further changes
>>     and this time got the following results:
>>
>>     * the job hung in the "Starting Up" phase again, with the same
>>     logging and symptoms as detailed before on this thread.
>>     * the DBA reported seeing no connections at all this time.
>>     * I have attached a thread dump created by "jstack -F <pid>".
>>     This is reporting all threads as blocked.
>>
>>     Any ideas? Any help with this would certainly be very gratefully
>>     received.
>>
>>     Best wishes,
>>     Anthony.
>>
>>     -- 
>>     Dr Anthony Leonard
>>     System Integrator, Information Directorate
>>     University of York, Heslington, York, UK, YO10 5DD
>>     Tel: +44 (0)1904 434350
>>     <tel:%2B44%20%280%291904%20434350> http://twitter.com/apbleonard
>>     Times Higher Education University of the Year 2010
>>
>>
>>     On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com
>>     <ma...@gmail.com>> wrote:
>>
>>         kill -QUIT should not abort the agents process, just cause a
>>         thread
>>         dump.  kill -9 is a different story.
>>
>>         You can also do the same thing by using jstack, in the jvm
>>         bin directory.
>>
>>         Karl
>>
>>
>>         On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>>         <anthony.leonard@york.ac.uk
>>         <ma...@york.ac.uk>> wrote:
>>         > Dear Karl,
>>         >
>>         > Many thanks for your insights. I'll do a kill -QUIT next
>>         time we have this
>>         > issue which should hopefully give me the thread dump.
>>         However we've noticed
>>         > that killing processes means we have to run the
>>         "locks-clean" script so it's
>>         > not our favourite way of doing it.
>>         >
>>         > Also I definitely think a timeout for queries would be a
>>         good thing.
>>         >
>>         > I guess we go back to checking that the connection to the
>>         database should
>>         > have been ok last night...
>>         >
>>         > Best wishes,
>>         > Anthony.
>>         >
>>         > --
>>         > Dr Anthony Leonard
>>         > System Integrator, Information Directorate
>>         > University of York, Heslington, York, UK, YO10 5DD
>>         > Tel: +44 (0)1904 434350 <tel:%2B44%20%280%291904%20434350>
>>         http://twitter.com/apbleonard
>>         > Times Higher Education University of the Year 2010
>>         >
>>         >
>>         > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright
>>         <daddywri@gmail.com <ma...@gmail.com>> wrote:
>>         >>
>>         >> Hi Anthony,
>>         >>
>>         >> What happens between the framework recognizing that the
>>         job should be
>>         >> started (which it does fine in both cases), and actually
>>         achieving a
>>         >> correct job start, is the "seeding" phase, which is going
>>         to try to
>>         >> execute the seeding query against your Oracle database.
>>          If something
>>         >> happens at that time to hang the JDBC connection's seeding
>>         query, then
>>         >> it precisely explains the behavior you are seeing.
>>         >>
>>         >> It is also the case that the timeout on the queries that
>>         the JDBC
>>         >> connector does is effectively infinite.  This makes me
>>         suspicious that
>>         >> what is happening is an Oracle query is going out but
>>         there is no
>>         >> response ever coming back.
>>         >>
>>         >> The other possibility is that the JDBC connector is in
>>         fact correctly
>>         >> throwing a ServiceInterruption, but that the ManifoldCF
>>         code is either
>>         >> not handling it properly, or the connector is not forming
>>         it properly.
>>         >>  In that case, when you notice a hung job, the startup
>>         thread will be
>>         >> a particular place in the code, and I can diagnose it that
>>         way.
>>         >>
>>         >> The first order of business is therefore to get a thread
>>         dump when the
>>         >> system is "hung".  That will help confirm the picture.
>>          There are a
>>         >> number of additional questions here.
>>         >>
>>         >> (1) Why is this happening?  Is there any possibility that
>>         the Oracle
>>         >> database you are crawling is (very occasionally) not able
>>         to properly
>>         >> respond to a JDBC query?  I can imagine that, under some
>>         network
>>         >> conditions, it might be possible for the Oracle JDBC
>>         driver to wind up
>>         >> waiting indefinitely for a response that never comes.
>>         >>
>>         >> (2) Given that we can't always control the infrastructure
>>         we're trying
>>         >> to crawl through, should we attempt to provide a reasonable
>>         >> workaround?  For example, a timeout on JDBC connector
>>         queries, where
>>         >> we throw a ServiceInterruption if the timeout is exceeded?
>>         >>
>>         >> Karl
>>         >>
>>         >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>>         >> <anthony.leonard@york.ac.uk
>>         <ma...@york.ac.uk>> wrote:
>>         >> > Hi there,
>>         >> >
>>         >> > We have recently started running a nightly job 2AM in
>>         ManifoldCF to
>>         >> > extract
>>         >> > data from an Oracle repository and populate a Solr
>>         index. Most nights
>>         >> > this
>>         >> > works fine, but occasionally the job has been hanging at
>>         the "Starting
>>         >> > up"
>>         >> > phase. We have observed this on our test setup also
>>         occasionally. A
>>         >> > restart
>>         >> > of ManifoldCF usually solves this.
>>         >> >
>>         >> > Using the simple history reports today I looked up all
>>         records and
>>         >> > sorted
>>         >> > them by the "Time" column, largest first, and found the
>>         following:
>>         >> >
>>         >> > "Start Time","Activity","Identifier","Result
>>         >> > Code","Bytes","Time","Result
>>         >> > Description"
>>         >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>>         >> > ...","ERROR","0","1926607529","Interrupted: null"
>>         >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>>         >> > ...","ERROR","0","31644956","Interrupted: null"
>>         >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>>         >> > ...","ERROR","0","31637594","Interrupted: null"
>>         >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>>         >> > ...","OK","0","17511",""
>>         >> > ... etc ...
>>         >> >
>>         >> > If the Time column is in millis that means the first
>>         query was hanging
>>         >> > for
>>         >> > 22 days! (This was in the period before we went live
>>         when our live
>>         >> > server
>>         >> > was sitting idle for a while.) The other two occasions
>>         it was hanging
>>         >> > for
>>         >> > about 8 hours until we arrived to restart the job in the
>>         morning. I have
>>         >> > confirmed that the Oracle database we are connecting to
>>         was available
>>         >> > throughout these periods. These times are also too long
>>         for any network
>>         >> > or
>>         >> > database timeouts, which makes me suspect that it's a
>>         problem with the
>>         >> > application.
>>         >> >
>>         >> > We have the following logging config in properties.xml
>>         >> >
>>         >> >   <property name="org.apache.manifoldcf.jobs
>>         <http://org.apache.manifoldcf.jobs>" value="ALL"/>
>>         >> >   <property name="org.apache.manifoldcf.connectors"
>>         value="ALL"/>
>>         >> >   <property name="org.apache.manifoldcf.agents"
>>         value="ALL"/>
>>         >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>>         >> >
>>         >> > The job failed again last night and when I checked at
>>         10:40 AM this
>>         >> > morning
>>         >> > the last few lines of manifoldcf.log were:
>>         >> >
>>         >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -
>>         Checking if job
>>         >> > 1352455005553 needs to be started; it was last checked
>>         at 1358733575454,
>>         >> > and
>>         >> > now it is 1358733585635
>>         >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No
>>         time match found
>>         >> > within interval 1358733575454 to 1358733585635
>>         >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -
>>         Checking if job
>>         >> > 1352455005553 needs to be started; it was last checked
>>         at 1358733585636,
>>         >> > and
>>         >> > now it is 1358733595662
>>         >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No
>>         time match found
>>         >> > within interval 1358733585636 to 1358733595662
>>         >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -
>>         Checking if job
>>         >> > 1352455005553 needs to be started; it was last checked
>>         at 1358733595663,
>>         >> > and
>>         >> > now it is 1358733605813
>>         >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time
>>         match FOUND
>>         >> > within
>>         >> > interval 1358733595663 to 1358733605813
>>         >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>>         '1352455005553'
>>         >> > is
>>         >> > within run window at 1358733605813 ms. (which starts at
>>         1358733600000
>>         >> > ms.)
>>         >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) -
>>         Signalled for job
>>         >> > start
>>         >> > for job 1352455005553
>>         >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>>         >> > 1352455005553
>>         >> > for startup
>>         >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC:
>>         The connect string
>>         >> > is
>>         >> >
>>         'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>>         <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>>         >> >
>>         >> > After that - nothing. Once I restarted manifoldcf this
>>         morning the job
>>         >> > magically restarted itself and the following log
>>         messages were added
>>         >> > where
>>         >> > it had left off:
>>         >> >
>>         >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>>         >> > 1352455005553
>>         >> > back to 'ReadyForStartup' state
>>         >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting
>>         due to restart
>>         >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>>         >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>>         >> > 1352455005553
>>         >> > for startup
>>         >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The
>>         connect string is
>>         >> >
>>         'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>>         <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>>         >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
>>         1352455005553 is
>>         >> > now
>>         >> > started
>>         >> > ... etc ...
>>         >> >
>>         >> > So it appears that the application is running fine with
>>         a Job start
>>         >> > thread
>>         >> > logging nicely every ten seconds until it tries to start
>>         this job and
>>         >> > hangs
>>         >> > entirely until the system is shutdown - even though it
>>         performed that
>>         >> > same
>>         >> > tasks perfectly the night before and later the same day.
>>         >> >
>>         >> > Can anyone advise on what might be happening here? We
>>         are running
>>         >> > ManifoldCF
>>         >> > version 1.0.1 on Ubuntu 10.04.
>>         >> >
>>         >> > Best wishes,
>>         >> > Anthony.
>>         >> >
>>         >> > --
>>         >> > Dr Anthony Leonard
>>         >> > System Integrator, Information Directorate
>>         >> > University of York, Heslington, York, UK, YO10 5DD
>>         >> > Tel: +44 (0)1904 434350
>>         <tel:%2B44%20%280%291904%20434350> http://twitter.com/apbleonard
>>         >> > Times Higher Education University of the Year 2010
>>         >
>>         >
>>
>>
>
>


Re: Job hanging on "Starting up" with never ending external query.

Posted by Anthony Leonard <an...@york.ac.uk>.
Hi Nigel,

You could be on to something here. I'll have a chat with the guys here as I
think we've hit this issue before?

Hope you're well :)

Not sure how I would configure ManifoldCF to use /dev/urandom ...

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010


On Fri, Jan 25, 2013 at 4:21 PM, Nigel Thomas <em...@gmail.com> wrote:

>  Hello Anthony,
>
> This issue may be related to the entropy exhaustion on Linux using certain
> versions of JDBC driver, see http://goo.gl/8yA3a
> have you try passing configuring JDBC to use /dev/urandom instead?
>
> Cheers,
>
> Nigel
>
> On 22/01/2013 17:00, Anthony Leonard wrote:
>
> Dear Karl,
>
>  Our DBA noticed that each time our job was run 10 Oracle connections
> were created. So, we dropped the "Max connections" parameter on the
> repository connection config to 1 and re-ran the job with the DBA watching.
> The job worked fine but the DBA reported that 1 connection was created and
> then 10 more briefly ...
>
>  Out of curiosity we re-ran the job again with no further changes and
> this time got the following results:
>
>  * the job hung in the "Starting Up" phase again, with the same logging
> and symptoms as detailed before on this thread.
> * the DBA reported seeing no connections at all this time.
> * I have attached a thread dump created by "jstack -F <pid>". This is
> reporting all threads as blocked.
>
>  Any ideas? Any help with this would certainly be very gratefully
> received.
>
>  Best wishes,
> Anthony.
>
>  --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <da...@gmail.com> wrote:
>
>> kill -QUIT should not abort the agents process, just cause a thread
>> dump.  kill -9 is a different story.
>>
>> You can also do the same thing by using jstack, in the jvm bin directory.
>>
>> Karl
>>
>>
>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>>  <an...@york.ac.uk> wrote:
>> > Dear Karl,
>> >
>> > Many thanks for your insights. I'll do a kill -QUIT next time we have
>> this
>> > issue which should hopefully give me the thread dump. However we've
>> noticed
>> > that killing processes means we have to run the "locks-clean" script so
>> it's
>> > not our favourite way of doing it.
>> >
>> > Also I definitely think a timeout for queries would be a good thing.
>> >
>> > I guess we go back to checking that the connection to the database
>> should
>> > have been ok last night...
>> >
>> > Best wishes,
>> > Anthony.
>> >
>> > --
>> > Dr Anthony Leonard
>> > System Integrator, Information Directorate
>> > University of York, Heslington, York, UK, YO10 5DD
>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> > Times Higher Education University of the Year 2010
>> >
>> >
>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com>
>> wrote:
>> >>
>> >> Hi Anthony,
>> >>
>> >> What happens between the framework recognizing that the job should be
>> >> started (which it does fine in both cases), and actually achieving a
>> >> correct job start, is the "seeding" phase, which is going to try to
>> >> execute the seeding query against your Oracle database.  If something
>> >> happens at that time to hang the JDBC connection's seeding query, then
>> >> it precisely explains the behavior you are seeing.
>> >>
>> >> It is also the case that the timeout on the queries that the JDBC
>> >> connector does is effectively infinite.  This makes me suspicious that
>> >> what is happening is an Oracle query is going out but there is no
>> >> response ever coming back.
>> >>
>> >> The other possibility is that the JDBC connector is in fact correctly
>> >> throwing a ServiceInterruption, but that the ManifoldCF code is either
>> >> not handling it properly, or the connector is not forming it properly.
>> >>  In that case, when you notice a hung job, the startup thread will be
>> >> a particular place in the code, and I can diagnose it that way.
>> >>
>> >> The first order of business is therefore to get a thread dump when the
>> >> system is "hung".  That will help confirm the picture.  There are a
>> >> number of additional questions here.
>> >>
>> >> (1) Why is this happening?  Is there any possibility that the Oracle
>> >> database you are crawling is (very occasionally) not able to properly
>> >> respond to a JDBC query?  I can imagine that, under some network
>> >> conditions, it might be possible for the Oracle JDBC driver to wind up
>> >> waiting indefinitely for a response that never comes.
>> >>
>> >> (2) Given that we can't always control the infrastructure we're trying
>> >> to crawl through, should we attempt to provide a reasonable
>> >> workaround?  For example, a timeout on JDBC connector queries, where
>> >> we throw a ServiceInterruption if the timeout is exceeded?
>> >>
>> >> Karl
>> >>
>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>> >> <an...@york.ac.uk> wrote:
>> >> > Hi there,
>> >> >
>> >> > We have recently started running a nightly job 2AM in ManifoldCF to
>> >> > extract
>> >> > data from an Oracle repository and populate a Solr index. Most nights
>> >> > this
>> >> > works fine, but occasionally the job has been hanging at the
>> "Starting
>> >> > up"
>> >> > phase. We have observed this on our test setup also occasionally. A
>> >> > restart
>> >> > of ManifoldCF usually solves this.
>> >> >
>> >> > Using the simple history reports today I looked up all records and
>> >> > sorted
>> >> > them by the "Time" column, largest first, and found the following:
>> >> >
>> >> > "Start Time","Activity","Identifier","Result
>> >> > Code","Bytes","Time","Result
>> >> > Description"
>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>> >> > ...","ERROR","0","31644956","Interrupted: null"
>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>> >> > ...","ERROR","0","31637594","Interrupted: null"
>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>> >> > ...","OK","0","17511",""
>> >> > ... etc ...
>> >> >
>> >> > If the Time column is in millis that means the first query was
>> hanging
>> >> > for
>> >> > 22 days! (This was in the period before we went live when our live
>> >> > server
>> >> > was sitting idle for a while.) The other two occasions it was hanging
>> >> > for
>> >> > about 8 hours until we arrived to restart the job in the morning. I
>> have
>> >> > confirmed that the Oracle database we are connecting to was available
>> >> > throughout these periods. These times are also too long for any
>> network
>> >> > or
>> >> > database timeouts, which makes me suspect that it's a problem with
>> the
>> >> > application.
>> >> >
>> >> > We have the following logging config in properties.xml
>> >> >
>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>> >> >
>> >> > The job failed again last night and when I checked at 10:40 AM this
>> >> > morning
>> >> > the last few lines of manifoldcf.log were:
>> >> >
>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> 1358733575454,
>> >> > and
>> >> > now it is 1358733585635
>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
>> found
>> >> > within interval 1358733575454 to 1358733585635
>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> 1358733585636,
>> >> > and
>> >> > now it is 1358733595662
>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
>> found
>> >> > within interval 1358733585636 to 1358733595662
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> 1358733595663,
>> >> > and
>> >> > now it is 1358733605813
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
>> >> > within
>> >> > interval 1358733595663 to 1358733605813
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>> '1352455005553'
>> >> > is
>> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
>> >> > ms.)
>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
>> >> > start
>> >> > for job 1352455005553
>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>> >> > 1352455005553
>> >> > for startup
>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
>> string
>> >> > is
>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >> >
>> >> > After that - nothing. Once I restarted manifoldcf this morning the
>> job
>> >> > magically restarted itself and the following log messages were added
>> >> > where
>> >> > it had left off:
>> >> >
>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>> >> > 1352455005553
>> >> > back to 'ReadyForStartup' state
>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>> >> > 1352455005553
>> >> > for startup
>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect
>> string is
>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
>> >> > now
>> >> > started
>> >> > ... etc ...
>> >> >
>> >> > So it appears that the application is running fine with a Job start
>> >> > thread
>> >> > logging nicely every ten seconds until it tries to start this job and
>> >> > hangs
>> >> > entirely until the system is shutdown - even though it performed that
>> >> > same
>> >> > tasks perfectly the night before and later the same day.
>> >> >
>> >> > Can anyone advise on what might be happening here? We are running
>> >> > ManifoldCF
>> >> > version 1.0.1 on Ubuntu 10.04.
>> >> >
>> >> > Best wishes,
>> >> > Anthony.
>> >> >
>> >> > --
>> >> > Dr Anthony Leonard
>> >> > System Integrator, Information Directorate
>> >> > University of York, Heslington, York, UK, YO10 5DD
>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >> > Times Higher Education University of the Year 2010
>> >
>> >
>>
>
>
>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Nigel Thomas <em...@gmail.com>.
Hello Anthony,

This issue may be related to the entropy exhaustion on Linux using
certain versions of JDBC driver, see http://goo.gl/8yA3a
have you try passing configuring JDBC to use /dev/urandom instead?

Cheers,

Nigel
On 22/01/2013 17:00, Anthony Leonard wrote:
> Dear Karl,
>
> Our DBA noticed that each time our job was run 10 Oracle connections
> were created. So, we dropped the "Max connections" parameter on the
> repository connection config to 1 and re-ran the job with the DBA
> watching. The job worked fine but the DBA reported that 1 connection
> was created and then 10 more briefly ...
>
> Out of curiosity we re-ran the job again with no further changes and
> this time got the following results:
>
> * the job hung in the "Starting Up" phase again, with the same logging
> and symptoms as detailed before on this thread.
> * the DBA reported seeing no connections at all this time.
> * I have attached a thread dump created by "jstack -F <pid>". This is
> reporting all threads as blocked.
>
> Any ideas? Any help with this would certainly be very gratefully received.
>
> Best wishes,
> Anthony.
>
> -- 
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com
> <ma...@gmail.com>> wrote:
>
>     kill -QUIT should not abort the agents process, just cause a thread
>     dump.  kill -9 is a different story.
>
>     You can also do the same thing by using jstack, in the jvm bin
>     directory.
>
>     Karl
>
>
>     On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>     <anthony.leonard@york.ac.uk <ma...@york.ac.uk>>
>     wrote:
>     > Dear Karl,
>     >
>     > Many thanks for your insights. I'll do a kill -QUIT next time we
>     have this
>     > issue which should hopefully give me the thread dump. However
>     we've noticed
>     > that killing processes means we have to run the "locks-clean"
>     script so it's
>     > not our favourite way of doing it.
>     >
>     > Also I definitely think a timeout for queries would be a good thing.
>     >
>     > I guess we go back to checking that the connection to the
>     database should
>     > have been ok last night...
>     >
>     > Best wishes,
>     > Anthony.
>     >
>     > --
>     > Dr Anthony Leonard
>     > System Integrator, Information Directorate
>     > University of York, Heslington, York, UK, YO10 5DD
>     > Tel: +44 (0)1904 434350 <tel:%2B44%20%280%291904%20434350>
>     http://twitter.com/apbleonard
>     > Times Higher Education University of the Year 2010
>     >
>     >
>     > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com
>     <ma...@gmail.com>> wrote:
>     >>
>     >> Hi Anthony,
>     >>
>     >> What happens between the framework recognizing that the job
>     should be
>     >> started (which it does fine in both cases), and actually
>     achieving a
>     >> correct job start, is the "seeding" phase, which is going to try to
>     >> execute the seeding query against your Oracle database.  If
>     something
>     >> happens at that time to hang the JDBC connection's seeding
>     query, then
>     >> it precisely explains the behavior you are seeing.
>     >>
>     >> It is also the case that the timeout on the queries that the JDBC
>     >> connector does is effectively infinite.  This makes me
>     suspicious that
>     >> what is happening is an Oracle query is going out but there is no
>     >> response ever coming back.
>     >>
>     >> The other possibility is that the JDBC connector is in fact
>     correctly
>     >> throwing a ServiceInterruption, but that the ManifoldCF code is
>     either
>     >> not handling it properly, or the connector is not forming it
>     properly.
>     >>  In that case, when you notice a hung job, the startup thread
>     will be
>     >> a particular place in the code, and I can diagnose it that way.
>     >>
>     >> The first order of business is therefore to get a thread dump
>     when the
>     >> system is "hung".  That will help confirm the picture.  There are a
>     >> number of additional questions here.
>     >>
>     >> (1) Why is this happening?  Is there any possibility that the
>     Oracle
>     >> database you are crawling is (very occasionally) not able to
>     properly
>     >> respond to a JDBC query?  I can imagine that, under some network
>     >> conditions, it might be possible for the Oracle JDBC driver to
>     wind up
>     >> waiting indefinitely for a response that never comes.
>     >>
>     >> (2) Given that we can't always control the infrastructure we're
>     trying
>     >> to crawl through, should we attempt to provide a reasonable
>     >> workaround?  For example, a timeout on JDBC connector queries,
>     where
>     >> we throw a ServiceInterruption if the timeout is exceeded?
>     >>
>     >> Karl
>     >>
>     >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>     >> <anthony.leonard@york.ac.uk
>     <ma...@york.ac.uk>> wrote:
>     >> > Hi there,
>     >> >
>     >> > We have recently started running a nightly job 2AM in
>     ManifoldCF to
>     >> > extract
>     >> > data from an Oracle repository and populate a Solr index.
>     Most nights
>     >> > this
>     >> > works fine, but occasionally the job has been hanging at the
>     "Starting
>     >> > up"
>     >> > phase. We have observed this on our test setup also
>     occasionally. A
>     >> > restart
>     >> > of ManifoldCF usually solves this.
>     >> >
>     >> > Using the simple history reports today I looked up all
>     records and
>     >> > sorted
>     >> > them by the "Time" column, largest first, and found the
>     following:
>     >> >
>     >> > "Start Time","Activity","Identifier","Result
>     >> > Code","Bytes","Time","Result
>     >> > Description"
>     >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>     >> > ...","ERROR","0","1926607529","Interrupted: null"
>     >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>     >> > ...","ERROR","0","31644956","Interrupted: null"
>     >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>     >> > ...","ERROR","0","31637594","Interrupted: null"
>     >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>     >> > ...","OK","0","17511",""
>     >> > ... etc ...
>     >> >
>     >> > If the Time column is in millis that means the first query
>     was hanging
>     >> > for
>     >> > 22 days! (This was in the period before we went live when our
>     live
>     >> > server
>     >> > was sitting idle for a while.) The other two occasions it was
>     hanging
>     >> > for
>     >> > about 8 hours until we arrived to restart the job in the
>     morning. I have
>     >> > confirmed that the Oracle database we are connecting to was
>     available
>     >> > throughout these periods. These times are also too long for
>     any network
>     >> > or
>     >> > database timeouts, which makes me suspect that it's a problem
>     with the
>     >> > application.
>     >> >
>     >> > We have the following logging config in properties.xml
>     >> >
>     >> >   <property name="org.apache.manifoldcf.jobs
>     <http://org.apache.manifoldcf.jobs>" value="ALL"/>
>     >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>     >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>     >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>     >> >
>     >> > The job failed again last night and when I checked at 10:40
>     AM this
>     >> > morning
>     >> > the last few lines of manifoldcf.log were:
>     >> >
>     >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking
>     if job
>     >> > 1352455005553 needs to be started; it was last checked at
>     1358733575454,
>     >> > and
>     >> > now it is 1358733585635
>     >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time
>     match found
>     >> > within interval 1358733575454 to 1358733585635
>     >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking
>     if job
>     >> > 1352455005553 needs to be started; it was last checked at
>     1358733585636,
>     >> > and
>     >> > now it is 1358733595662
>     >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time
>     match found
>     >> > within interval 1358733585636 to 1358733595662
>     >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking
>     if job
>     >> > 1352455005553 needs to be started; it was last checked at
>     1358733595663,
>     >> > and
>     >> > now it is 1358733605813
>     >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time
>     match FOUND
>     >> > within
>     >> > interval 1358733595663 to 1358733605813
>     >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>     '1352455005553'
>     >> > is
>     >> > within run window at 1358733605813 ms. (which starts at
>     1358733600000
>     >> > ms.)
>     >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled
>     for job
>     >> > start
>     >> > for job 1352455005553
>     >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>     >> > 1352455005553
>     >> > for startup
>     >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The
>     connect string
>     >> > is
>     >> >
>     'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>     <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>     >> >
>     >> > After that - nothing. Once I restarted manifoldcf this
>     morning the job
>     >> > magically restarted itself and the following log messages
>     were added
>     >> > where
>     >> > it had left off:
>     >> >
>     >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>     >> > 1352455005553
>     >> > back to 'ReadyForStartup' state
>     >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to
>     restart
>     >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>     >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>     >> > 1352455005553
>     >> > for startup
>     >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The
>     connect string is
>     >> >
>     'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>     <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>     >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
>     1352455005553 is
>     >> > now
>     >> > started
>     >> > ... etc ...
>     >> >
>     >> > So it appears that the application is running fine with a Job
>     start
>     >> > thread
>     >> > logging nicely every ten seconds until it tries to start this
>     job and
>     >> > hangs
>     >> > entirely until the system is shutdown - even though it
>     performed that
>     >> > same
>     >> > tasks perfectly the night before and later the same day.
>     >> >
>     >> > Can anyone advise on what might be happening here? We are running
>     >> > ManifoldCF
>     >> > version 1.0.1 on Ubuntu 10.04.
>     >> >
>     >> > Best wishes,
>     >> > Anthony.
>     >> >
>     >> > --
>     >> > Dr Anthony Leonard
>     >> > System Integrator, Information Directorate
>     >> > University of York, Heslington, York, UK, YO10 5DD
>     >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>     >> > Times Higher Education University of the Year 2010
>     >
>     >
>
>


Re: Job hanging on "Starting up" with never ending external query.

Posted by Anthony Leonard <an...@york.ac.uk>.
Dear Karl,

Our DBA noticed that each time our job was run 10 Oracle connections were
created. So, we dropped the "Max connections" parameter on the repository
connection config to 1 and re-ran the job with the DBA watching. The job
worked fine but the DBA reported that 1 connection was created and then 10
more briefly ...

Out of curiosity we re-ran the job again with no further changes and this
time got the following results:

* the job hung in the "Starting Up" phase again, with the same logging and
symptoms as detailed before on this thread.
* the DBA reported seeing no connections at all this time.
* I have attached a thread dump created by "jstack -F <pid>". This is
reporting all threads as blocked.

Any ideas? Any help with this would certainly be very gratefully received.

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010


On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <da...@gmail.com> wrote:

> kill -QUIT should not abort the agents process, just cause a thread
> dump.  kill -9 is a different story.
>
> You can also do the same thing by using jstack, in the jvm bin directory.
>
> Karl
>
>
> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
> <an...@york.ac.uk> wrote:
> > Dear Karl,
> >
> > Many thanks for your insights. I'll do a kill -QUIT next time we have
> this
> > issue which should hopefully give me the thread dump. However we've
> noticed
> > that killing processes means we have to run the "locks-clean" script so
> it's
> > not our favourite way of doing it.
> >
> > Also I definitely think a timeout for queries would be a good thing.
> >
> > I guess we go back to checking that the connection to the database should
> > have been ok last night...
> >
> > Best wishes,
> > Anthony.
> >
> > --
> > Dr Anthony Leonard
> > System Integrator, Information Directorate
> > University of York, Heslington, York, UK, YO10 5DD
> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> > Times Higher Education University of the Year 2010
> >
> >
> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com> wrote:
> >>
> >> Hi Anthony,
> >>
> >> What happens between the framework recognizing that the job should be
> >> started (which it does fine in both cases), and actually achieving a
> >> correct job start, is the "seeding" phase, which is going to try to
> >> execute the seeding query against your Oracle database.  If something
> >> happens at that time to hang the JDBC connection's seeding query, then
> >> it precisely explains the behavior you are seeing.
> >>
> >> It is also the case that the timeout on the queries that the JDBC
> >> connector does is effectively infinite.  This makes me suspicious that
> >> what is happening is an Oracle query is going out but there is no
> >> response ever coming back.
> >>
> >> The other possibility is that the JDBC connector is in fact correctly
> >> throwing a ServiceInterruption, but that the ManifoldCF code is either
> >> not handling it properly, or the connector is not forming it properly.
> >>  In that case, when you notice a hung job, the startup thread will be
> >> a particular place in the code, and I can diagnose it that way.
> >>
> >> The first order of business is therefore to get a thread dump when the
> >> system is "hung".  That will help confirm the picture.  There are a
> >> number of additional questions here.
> >>
> >> (1) Why is this happening?  Is there any possibility that the Oracle
> >> database you are crawling is (very occasionally) not able to properly
> >> respond to a JDBC query?  I can imagine that, under some network
> >> conditions, it might be possible for the Oracle JDBC driver to wind up
> >> waiting indefinitely for a response that never comes.
> >>
> >> (2) Given that we can't always control the infrastructure we're trying
> >> to crawl through, should we attempt to provide a reasonable
> >> workaround?  For example, a timeout on JDBC connector queries, where
> >> we throw a ServiceInterruption if the timeout is exceeded?
> >>
> >> Karl
> >>
> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
> >> <an...@york.ac.uk> wrote:
> >> > Hi there,
> >> >
> >> > We have recently started running a nightly job 2AM in ManifoldCF to
> >> > extract
> >> > data from an Oracle repository and populate a Solr index. Most nights
> >> > this
> >> > works fine, but occasionally the job has been hanging at the "Starting
> >> > up"
> >> > phase. We have observed this on our test setup also occasionally. A
> >> > restart
> >> > of ManifoldCF usually solves this.
> >> >
> >> > Using the simple history reports today I looked up all records and
> >> > sorted
> >> > them by the "Time" column, largest first, and found the following:
> >> >
> >> > "Start Time","Activity","Identifier","Result
> >> > Code","Bytes","Time","Result
> >> > Description"
> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> >> > ...","ERROR","0","1926607529","Interrupted: null"
> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> >> > ...","ERROR","0","31644956","Interrupted: null"
> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> >> > ...","ERROR","0","31637594","Interrupted: null"
> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> >> > ...","OK","0","17511",""
> >> > ... etc ...
> >> >
> >> > If the Time column is in millis that means the first query was hanging
> >> > for
> >> > 22 days! (This was in the period before we went live when our live
> >> > server
> >> > was sitting idle for a while.) The other two occasions it was hanging
> >> > for
> >> > about 8 hours until we arrived to restart the job in the morning. I
> have
> >> > confirmed that the Oracle database we are connecting to was available
> >> > throughout these periods. These times are also too long for any
> network
> >> > or
> >> > database timeouts, which makes me suspect that it's a problem with the
> >> > application.
> >> >
> >> > We have the following logging config in properties.xml
> >> >
> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
> >> >
> >> > The job failed again last night and when I checked at 10:40 AM this
> >> > morning
> >> > the last few lines of manifoldcf.log were:
> >> >
> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
> >> > 1352455005553 needs to be started; it was last checked at
> 1358733575454,
> >> > and
> >> > now it is 1358733585635
> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
> found
> >> > within interval 1358733575454 to 1358733585635
> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
> >> > 1352455005553 needs to be started; it was last checked at
> 1358733585636,
> >> > and
> >> > now it is 1358733595662
> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
> found
> >> > within interval 1358733585636 to 1358733595662
> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
> >> > 1352455005553 needs to be started; it was last checked at
> 1358733595663,
> >> > and
> >> > now it is 1358733605813
> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
> >> > within
> >> > interval 1358733595663 to 1358733605813
> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553'
> >> > is
> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
> >> > ms.)
> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
> >> > start
> >> > for job 1352455005553
> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
> >> > 1352455005553
> >> > for startup
> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
> string
> >> > is
> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >> >
> >> > After that - nothing. Once I restarted manifoldcf this morning the job
> >> > magically restarted itself and the following log messages were added
> >> > where
> >> > it had left off:
> >> >
> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
> >> > 1352455005553
> >> > back to 'ReadyForStartup' state
> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
> >> > 1352455005553
> >> > for startup
> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string
> is
> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
> >> > now
> >> > started
> >> > ... etc ...
> >> >
> >> > So it appears that the application is running fine with a Job start
> >> > thread
> >> > logging nicely every ten seconds until it tries to start this job and
> >> > hangs
> >> > entirely until the system is shutdown - even though it performed that
> >> > same
> >> > tasks perfectly the night before and later the same day.
> >> >
> >> > Can anyone advise on what might be happening here? We are running
> >> > ManifoldCF
> >> > version 1.0.1 on Ubuntu 10.04.
> >> >
> >> > Best wishes,
> >> > Anthony.
> >> >
> >> > --
> >> > Dr Anthony Leonard
> >> > System Integrator, Information Directorate
> >> > University of York, Heslington, York, UK, YO10 5DD
> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >> > Times Higher Education University of the Year 2010
> >
> >
>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Karl Wright <da...@gmail.com>.
kill -QUIT should not abort the agents process, just cause a thread
dump.  kill -9 is a different story.

You can also do the same thing by using jstack, in the jvm bin directory.

Karl


On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
<an...@york.ac.uk> wrote:
> Dear Karl,
>
> Many thanks for your insights. I'll do a kill -QUIT next time we have this
> issue which should hopefully give me the thread dump. However we've noticed
> that killing processes means we have to run the "locks-clean" script so it's
> not our favourite way of doing it.
>
> Also I definitely think a timeout for queries would be a good thing.
>
> I guess we go back to checking that the connection to the database should
> have been ok last night...
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> Hi Anthony,
>>
>> What happens between the framework recognizing that the job should be
>> started (which it does fine in both cases), and actually achieving a
>> correct job start, is the "seeding" phase, which is going to try to
>> execute the seeding query against your Oracle database.  If something
>> happens at that time to hang the JDBC connection's seeding query, then
>> it precisely explains the behavior you are seeing.
>>
>> It is also the case that the timeout on the queries that the JDBC
>> connector does is effectively infinite.  This makes me suspicious that
>> what is happening is an Oracle query is going out but there is no
>> response ever coming back.
>>
>> The other possibility is that the JDBC connector is in fact correctly
>> throwing a ServiceInterruption, but that the ManifoldCF code is either
>> not handling it properly, or the connector is not forming it properly.
>>  In that case, when you notice a hung job, the startup thread will be
>> a particular place in the code, and I can diagnose it that way.
>>
>> The first order of business is therefore to get a thread dump when the
>> system is "hung".  That will help confirm the picture.  There are a
>> number of additional questions here.
>>
>> (1) Why is this happening?  Is there any possibility that the Oracle
>> database you are crawling is (very occasionally) not able to properly
>> respond to a JDBC query?  I can imagine that, under some network
>> conditions, it might be possible for the Oracle JDBC driver to wind up
>> waiting indefinitely for a response that never comes.
>>
>> (2) Given that we can't always control the infrastructure we're trying
>> to crawl through, should we attempt to provide a reasonable
>> workaround?  For example, a timeout on JDBC connector queries, where
>> we throw a ServiceInterruption if the timeout is exceeded?
>>
>> Karl
>>
>> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>> <an...@york.ac.uk> wrote:
>> > Hi there,
>> >
>> > We have recently started running a nightly job 2AM in ManifoldCF to
>> > extract
>> > data from an Oracle repository and populate a Solr index. Most nights
>> > this
>> > works fine, but occasionally the job has been hanging at the "Starting
>> > up"
>> > phase. We have observed this on our test setup also occasionally. A
>> > restart
>> > of ManifoldCF usually solves this.
>> >
>> > Using the simple history reports today I looked up all records and
>> > sorted
>> > them by the "Time" column, largest first, and found the following:
>> >
>> > "Start Time","Activity","Identifier","Result
>> > Code","Bytes","Time","Result
>> > Description"
>> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>> > ...","ERROR","0","1926607529","Interrupted: null"
>> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>> > ...","ERROR","0","31644956","Interrupted: null"
>> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>> > ...","ERROR","0","31637594","Interrupted: null"
>> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>> > ...","OK","0","17511",""
>> > ... etc ...
>> >
>> > If the Time column is in millis that means the first query was hanging
>> > for
>> > 22 days! (This was in the period before we went live when our live
>> > server
>> > was sitting idle for a while.) The other two occasions it was hanging
>> > for
>> > about 8 hours until we arrived to restart the job in the morning. I have
>> > confirmed that the Oracle database we are connecting to was available
>> > throughout these periods. These times are also too long for any network
>> > or
>> > database timeouts, which makes me suspect that it's a problem with the
>> > application.
>> >
>> > We have the following logging config in properties.xml
>> >
>> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>> >
>> > The job failed again last night and when I checked at 10:40 AM this
>> > morning
>> > the last few lines of manifoldcf.log were:
>> >
>> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
>> > 1352455005553 needs to be started; it was last checked at 1358733575454,
>> > and
>> > now it is 1358733585635
>> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
>> > within interval 1358733575454 to 1358733585635
>> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
>> > 1352455005553 needs to be started; it was last checked at 1358733585636,
>> > and
>> > now it is 1358733595662
>> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
>> > within interval 1358733585636 to 1358733595662
>> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
>> > 1352455005553 needs to be started; it was last checked at 1358733595663,
>> > and
>> > now it is 1358733605813
>> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
>> > within
>> > interval 1358733595663 to 1358733605813
>> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553'
>> > is
>> > within run window at 1358733605813 ms. (which starts at 1358733600000
>> > ms.)
>> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
>> > start
>> > for job 1352455005553
>> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>> > 1352455005553
>> > for startup
>> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect string
>> > is
>> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >
>> > After that - nothing. Once I restarted manifoldcf this morning the job
>> > magically restarted itself and the following log messages were added
>> > where
>> > it had left off:
>> >
>> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>> > 1352455005553
>> > back to 'ReadyForStartup' state
>> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
>> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>> > 1352455005553
>> > for startup
>> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string is
>> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
>> > now
>> > started
>> > ... etc ...
>> >
>> > So it appears that the application is running fine with a Job start
>> > thread
>> > logging nicely every ten seconds until it tries to start this job and
>> > hangs
>> > entirely until the system is shutdown - even though it performed that
>> > same
>> > tasks perfectly the night before and later the same day.
>> >
>> > Can anyone advise on what might be happening here? We are running
>> > ManifoldCF
>> > version 1.0.1 on Ubuntu 10.04.
>> >
>> > Best wishes,
>> > Anthony.
>> >
>> > --
>> > Dr Anthony Leonard
>> > System Integrator, Information Directorate
>> > University of York, Heslington, York, UK, YO10 5DD
>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> > Times Higher Education University of the Year 2010
>
>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Anthony Leonard <an...@york.ac.uk>.
Dear Karl,

Many thanks for your insights. I'll do a kill -QUIT next time we have this
issue which should hopefully give me the thread dump. However we've noticed
that killing processes means we have to run the "locks-clean" script so
it's not our favourite way of doing it.

Also I definitely think a timeout for queries would be a good thing.

I guess we go back to checking that the connection to the database should
have been ok last night...

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010


On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Anthony,
>
> What happens between the framework recognizing that the job should be
> started (which it does fine in both cases), and actually achieving a
> correct job start, is the "seeding" phase, which is going to try to
> execute the seeding query against your Oracle database.  If something
> happens at that time to hang the JDBC connection's seeding query, then
> it precisely explains the behavior you are seeing.
>
> It is also the case that the timeout on the queries that the JDBC
> connector does is effectively infinite.  This makes me suspicious that
> what is happening is an Oracle query is going out but there is no
> response ever coming back.
>
> The other possibility is that the JDBC connector is in fact correctly
> throwing a ServiceInterruption, but that the ManifoldCF code is either
> not handling it properly, or the connector is not forming it properly.
>  In that case, when you notice a hung job, the startup thread will be
> a particular place in the code, and I can diagnose it that way.
>
> The first order of business is therefore to get a thread dump when the
> system is "hung".  That will help confirm the picture.  There are a
> number of additional questions here.
>
> (1) Why is this happening?  Is there any possibility that the Oracle
> database you are crawling is (very occasionally) not able to properly
> respond to a JDBC query?  I can imagine that, under some network
> conditions, it might be possible for the Oracle JDBC driver to wind up
> waiting indefinitely for a response that never comes.
>
> (2) Given that we can't always control the infrastructure we're trying
> to crawl through, should we attempt to provide a reasonable
> workaround?  For example, a timeout on JDBC connector queries, where
> we throw a ServiceInterruption if the timeout is exceeded?
>
> Karl
>
> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
> <an...@york.ac.uk> wrote:
> > Hi there,
> >
> > We have recently started running a nightly job 2AM in ManifoldCF to
> extract
> > data from an Oracle repository and populate a Solr index. Most nights
> this
> > works fine, but occasionally the job has been hanging at the "Starting
> up"
> > phase. We have observed this on our test setup also occasionally. A
> restart
> > of ManifoldCF usually solves this.
> >
> > Using the simple history reports today I looked up all records and sorted
> > them by the "Time" column, largest first, and found the following:
> >
> > "Start Time","Activity","Identifier","Result Code","Bytes","Time","Result
> > Description"
> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> > ...","ERROR","0","1926607529","Interrupted: null"
> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> > ...","ERROR","0","31644956","Interrupted: null"
> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> > ...","ERROR","0","31637594","Interrupted: null"
> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> > ...","OK","0","17511",""
> > ... etc ...
> >
> > If the Time column is in millis that means the first query was hanging
> for
> > 22 days! (This was in the period before we went live when our live server
> > was sitting idle for a while.) The other two occasions it was hanging for
> > about 8 hours until we arrived to restart the job in the morning. I have
> > confirmed that the Oracle database we are connecting to was available
> > throughout these periods. These times are also too long for any network
> or
> > database timeouts, which makes me suspect that it's a problem with the
> > application.
> >
> > We have the following logging config in properties.xml
> >
> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
> >
> > The job failed again last night and when I checked at 10:40 AM this
> morning
> > the last few lines of manifoldcf.log were:
> >
> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
> > 1352455005553 needs to be started; it was last checked at 1358733575454,
> and
> > now it is 1358733585635
> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
> > within interval 1358733575454 to 1358733585635
> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
> > 1352455005553 needs to be started; it was last checked at 1358733585636,
> and
> > now it is 1358733595662
> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
> > within interval 1358733585636 to 1358733595662
> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
> > 1352455005553 needs to be started; it was last checked at 1358733595663,
> and
> > now it is 1358733605813
> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
> within
> > interval 1358733595663 to 1358733605813
> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553' is
> > within run window at 1358733605813 ms. (which starts at 1358733600000
> ms.)
> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
> start
> > for job 1352455005553
> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job 1352455005553
> > for startup
> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect string
> is
> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >
> > After that - nothing. Once I restarted manifoldcf this morning the job
> > magically restarted itself and the following log messages were added
> where
> > it had left off:
> >
> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
> 1352455005553
> > back to 'ReadyForStartup' state
> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job 1352455005553
> > for startup
> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string is
> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is now
> > started
> > ... etc ...
> >
> > So it appears that the application is running fine with a Job start
> thread
> > logging nicely every ten seconds until it tries to start this job and
> hangs
> > entirely until the system is shutdown - even though it performed that
> same
> > tasks perfectly the night before and later the same day.
> >
> > Can anyone advise on what might be happening here? We are running
> ManifoldCF
> > version 1.0.1 on Ubuntu 10.04.
> >
> > Best wishes,
> > Anthony.
> >
> > --
> > Dr Anthony Leonard
> > System Integrator, Information Directorate
> > University of York, Heslington, York, UK, YO10 5DD
> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> > Times Higher Education University of the Year 2010
>

Re: Job hanging on "Starting up" with never ending external query.

Posted by Karl Wright <da...@gmail.com>.
Hi Anthony,

What happens between the framework recognizing that the job should be
started (which it does fine in both cases), and actually achieving a
correct job start, is the "seeding" phase, which is going to try to
execute the seeding query against your Oracle database.  If something
happens at that time to hang the JDBC connection's seeding query, then
it precisely explains the behavior you are seeing.

It is also the case that the timeout on the queries that the JDBC
connector does is effectively infinite.  This makes me suspicious that
what is happening is an Oracle query is going out but there is no
response ever coming back.

The other possibility is that the JDBC connector is in fact correctly
throwing a ServiceInterruption, but that the ManifoldCF code is either
not handling it properly, or the connector is not forming it properly.
 In that case, when you notice a hung job, the startup thread will be
a particular place in the code, and I can diagnose it that way.

The first order of business is therefore to get a thread dump when the
system is "hung".  That will help confirm the picture.  There are a
number of additional questions here.

(1) Why is this happening?  Is there any possibility that the Oracle
database you are crawling is (very occasionally) not able to properly
respond to a JDBC query?  I can imagine that, under some network
conditions, it might be possible for the Oracle JDBC driver to wind up
waiting indefinitely for a response that never comes.

(2) Given that we can't always control the infrastructure we're trying
to crawl through, should we attempt to provide a reasonable
workaround?  For example, a timeout on JDBC connector queries, where
we throw a ServiceInterruption if the timeout is exceeded?

Karl

On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
<an...@york.ac.uk> wrote:
> Hi there,
>
> We have recently started running a nightly job 2AM in ManifoldCF to extract
> data from an Oracle repository and populate a Solr index. Most nights this
> works fine, but occasionally the job has been hanging at the "Starting up"
> phase. We have observed this on our test setup also occasionally. A restart
> of ManifoldCF usually solves this.
>
> Using the simple history reports today I looked up all records and sorted
> them by the "Time" column, largest first, and found the following:
>
> "Start Time","Activity","Identifier","Result Code","Bytes","Time","Result
> Description"
> "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> ...","ERROR","0","1926607529","Interrupted: null"
> "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> ...","ERROR","0","31644956","Interrupted: null"
> "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> ...","ERROR","0","31637594","Interrupted: null"
> "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> ...","OK","0","17511",""
> ... etc ...
>
> If the Time column is in millis that means the first query was hanging for
> 22 days! (This was in the period before we went live when our live server
> was sitting idle for a while.) The other two occasions it was hanging for
> about 8 hours until we arrived to restart the job in the morning. I have
> confirmed that the Oracle database we are connecting to was available
> throughout these periods. These times are also too long for any network or
> database timeouts, which makes me suspect that it's a problem with the
> application.
>
> We have the following logging config in properties.xml
>
>   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>
> The job failed again last night and when I checked at 10:40 AM this morning
> the last few lines of manifoldcf.log were:
>
> DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
> 1352455005553 needs to be started; it was last checked at 1358733575454, and
> now it is 1358733585635
> DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
> within interval 1358733575454 to 1358733585635
> DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
> 1352455005553 needs to be started; it was last checked at 1358733585636, and
> now it is 1358733595662
> DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
> within interval 1358733585636 to 1358733595662
> DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
> 1352455005553 needs to be started; it was last checked at 1358733595663, and
> now it is 1358733605813
> DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND within
> interval 1358733595663 to 1358733605813
> DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553' is
> within run window at 1358733605813 ms. (which starts at 1358733600000 ms.)
> DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job start
> for job 1352455005553
> DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job 1352455005553
> for startup
> DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect string is
> 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>
> After that - nothing. Once I restarted manifoldcf this morning the job
> magically restarted itself and the following log messages were added where
> it had left off:
>
> DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job 1352455005553
> back to 'ReadyForStartup' state
> DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
> DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job 1352455005553
> for startup
> DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string is
> 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is now
> started
> ... etc ...
>
> So it appears that the application is running fine with a Job start thread
> logging nicely every ten seconds until it tries to start this job and hangs
> entirely until the system is shutdown - even though it performed that same
> tasks perfectly the night before and later the same day.
>
> Can anyone advise on what might be happening here? We are running ManifoldCF
> version 1.0.1 on Ubuntu 10.04.
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010