Posted to user@manifoldcf.apache.org by Michael Le <mi...@gmail.com> on 2012/05/07 07:25:58 UTC

JDBC Connection Exception

Hello,

Using a JDBC Repository connection to an Oracle 11g database, I've had
issues where, during the initial seeding stage, the connection to the
database closes in the middle of processing the result set.  The data
table I'm trying to index is about 10 million records, and with the
original code I could never get past about 750K records.

I spent some time tuning the parameters of the bitmechanic database
pool, but its API and source no longer seem to be available; even the
original author no longer has the code or specs.  The parameter changes
let me get through the first stage of processing a 2M-row subset, but
during the second stage, where the documents themselves are fetched, the
connections again started being closed.  I ended up replacing the
connection pool code with an Oracle implementation, and it's churning
through the documents happily.  As a footnote, on my sample subset of
about 400K documents, throughput went from about 10 documents/s to 19
documents/s, but that may just be a side effect of Oracle database load
or network traffic.

Has anyone else had issues processing a large Oracle repository?  I've
noted that the benchmarks were done with 300K documents, and even in our
initial testing with about 500K documents, no issues arose.

The second and more pressing issue is the jobqueue table.  While
debugging the database connection issues, jobs were started, stopped,
deleted, and aborted, and various WHERE clauses were applied to the
seeding queries/jobs.  MCF is now reporting long-running queries against
this table.  In the past I've simply truncated the jobqueue table, but
that had the side effect of pushing documents into Solr (the output
connector) multiple times.  What API calls or SQL can I run to clean up
the jobqueue table?  Should I just wait for all jobs to finish and then
truncate the table at that point?  I've broken my data into several
smaller subsets of around 1-2 million rows each, but that has the side
effect of a jobqueue table of 6-8 million rows.
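
To make the question concrete, the kind of inspection I'd like to be
able to do safely is something like the sketch below.  I'm assuming a
jobqueue table with a status column, which may not match the real MCF
schema, and the connection details are placeholders:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;

  public class JobQueueInspect {
    public static void main(String[] args) throws SQLException {
      // Points at the ManifoldCF database (PostgreSQL here), not the
      // Oracle repository being crawled.  URL/credentials are placeholders.
      try (Connection conn = DriverManager.getConnection(
               "jdbc:postgresql://localhost:5432/manifoldcf", "mcf", "secret");
           Statement stmt = conn.createStatement();
           // Assumes a 'status' column on 'jobqueue'; adjust to the real schema.
           ResultSet rs = stmt.executeQuery(
               "SELECT status, COUNT(*) FROM jobqueue GROUP BY status")) {
        while (rs.next())
          System.out.println(rs.getString(1) + ": " + rs.getLong(2));
      }
    }
  }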

Any support would be greatly appreciated.

Thanks,
-Michael Le

Re: JDBC Connection Exception

Posted by Karl Wright <da...@gmail.com>.
This was committed to trunk last week and seems to work well.
Karl

Re: JDBC Connection Exception

Posted by Karl Wright <da...@gmail.com>.
FWIW, the ticket is CONNECTORS-96.  I've created a branch to work on
it.  I'll let you know when I think it's ready to try out.

Karl

Re: JDBC Connection Exception

Posted by Karl Wright <da...@gmail.com>.
Also, there has long been a ticket open to replace the JDBC pool driver
with something more modern.  Many of the off-the-shelf pool drivers are
inadequate for various reasons, so I have one that I wrote myself, but
it is not yet committed.  So I am curious - which connections are timing
out?  The Oracle connections or the PostgreSQL ones?

Karl

Re: JDBC Connection Exception

Posted by Karl Wright <da...@gmail.com>.
What database are you using?  (Not the JDBC database, the underlying
one...)  If PostgreSQL, what version?  What version of ManifoldCF?  If
you could also post some of the long-running queries, that would be
good as well.

Depending on the database, ManifoldCF periodically re-analyzes/reindexes
the underlying database during the crawl.  When the table is large, that
can produce warnings about long-running queries, because database
performance is slowed while the reindex runs.  That's not usually a
problem, other than briefly slowing the crawl.  However, it's also
possible that there's a point where PostgreSQL's query plan is poor; we
should be able to see that, because the warning also dumps the plan.
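
To make that concrete: when the underlying database is PostgreSQL, the
periodic maintenance amounts to something like the sketch below.  This
is a hand-rolled illustration, not the actual ManifoldCF code, and the
connection details are placeholders:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.SQLException;
  import java.sql.Statement;

  public class JobQueueMaintenance {
    public static void main(String[] args) throws SQLException {
      try (Connection conn = DriverManager.getConnection(
               "jdbc:postgresql://localhost:5432/manifoldcf", "mcf", "secret");
           Statement stmt = conn.createStatement()) {
        // Refresh the planner's statistics for the jobqueue table.
        stmt.execute("ANALYZE jobqueue");
        // Rebuild the table's indexes; this takes locks that block other
        // queries, which is what surfaces as long-running-query warnings.
        stmt.execute("REINDEX TABLE jobqueue");
      }
    }
  }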

Truncating the jobqueue table is not recommended, since ManifoldCF then
has no idea what it has crawled and what it hasn't, and its incremental
properties tend to suffer.

Karl