You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@manifoldcf.apache.org by Baptiste Berthier <ba...@gmail.com> on 2014/09/10 16:07:15 UTC

Apache ManifoldCF Performance

Hi

I would like to know what is the maximum number of documents that you
managed to crawl with ManifoldCF and with how many connectors in parallel
it could works ?

Thanks for your answer

Baptiste

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

The patches I included should work (at worst with some code offsets)
against 1.6.1 code without problems.  I suggest you try that first before
going through an upgrade to 1.7.

Thanks,
Karl

On Thu, Sep 11, 2014 at 11:47 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> Upgrade is included automatically in MCF.
>
> If you are using the multiprocess example, just run the initialize script
> and it will do the upgrade.  The single process example does any required
> upgrades every time you start it.
>
> Thanks,
> Karl
>
>
>
> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Thanks - we've pulled down the branch and will test the changes. It looks
>> like a branch of 1.7 so it's going to take us a little while to test. We
>> need to migrate our connectors (there's some deprecated stuff that's now
>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>> to include the pipeline and any other schema changes. We'll have some
>> environment contention over the next week as our performance test
>> environment needs to remain on 1.6.1 while we test a release. Once that's
>> clear I'll move to 1.7
>>
>> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
>> way to migrate and existing database?
>>
>> Thanks,
>>
>> Paul
>>
>>
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Thanks -- I'll include that change as well then, in ticket
>>> CONNECTORS-1027.
>>>
>>>
>>> Karl
>>>
>>>
>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> That comes back immediately with 10001 rows:
>>>>
>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>>> 10001) t;
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------------------------------------------
>>>> ----------------------------------
>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>> time=9.125..9.125 rows=1 loops=1)
>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>> =0.031..3.257 rows=10001 loops=1)
>>>>                Heap Fetches: 725
>>>>  Total runtime: 9.157 ms
>>>> (5 rows)
>>>>
>>>>
>>>> Whereas:
>>>>
>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------------------------------------------
>>>> ----------------------------------------
>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>> time=5225.105..5225.106 rows=1 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>                Heap Fetches: 684297
>>>>  Total runtime: 5225.151 ms
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Could you try this query on your database please and tell me if it
>>>>> executes promptly:
>>>>>
>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>
>>>>>
>>>>> I vaguely remember that I had to change the form of this query in
>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>>> Postgresql optimization work.
>>>>>>
>>>>>> Please note that trunk code already has schema changes relative to
>>>>>> MCF 1.7, so you will not be able to work directly with this branch code.
>>>>>> I'll have to create patches for whatever changes you would need to try.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>>> doc.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>>> query encountered during crawling:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>
>>>>>>>>
>>>>>>>> These queries are all from the UI; it is what gets generated when
>>>>>>>> no limits are in place:
>>>>>>>>
>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>
>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>
>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>> let me know.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>
>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>
>>>>>>>>
>>>>>>>> The final query is the document stuffing query, which is perhaps
>>>>>>>> the most critical query in the whole system:
>>>>>>>>
>>>>>>>>  SELECT
>>>>>>>>  t0.id
>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>> t0.checktime
>>>>>>>>  <= 1407246846166
>>>>>>>>  AND EXISTS (
>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>> t1.priority=5
>>>>>>>>  )
>>>>>>>>  AND NOT EXISTS (
>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>  )
>>>>>>>>  AND NOT EXISTS (
>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>  )
>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>  LIMIT 480;
>>>>>>>>
>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>>>>>> that it's still taking a long time to return the job status page. Also in
>>>>>>>>> the attached are some sample other long running queries that we're
>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>> postgres.
>>>>>>>>>
>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>
>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>
>>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>>> one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>>>>> something.
>>>>>>>>>>>
>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>
>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>> 0: '1.000000001E9'
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>> 1: '1407144048075'
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update
>>>>>>>>>>> on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>> width=287)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <daddywri@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you
>>>>>>>>>>>> can set which limits the number of documents counted to at most a specified
>>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>>
>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>
>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>
>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming the
>>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>>>>> which has more than 10 million rows. That's going to take some time in
>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>
>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>> postgres) but we are increasingly seeing long running queries
>>>>>>>>>>>>> in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>>> for example).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>
>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>
>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes
>>>>>>>>>>>>>> or parallel connectors.  Overall database performance is the limiting
>>>>>>>>>>>>>> factor.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi Karl,

I've tracked it down - I mistakenly excluded the connector library for
repository connector of the running job in the patched deployment. Although
the job started StufferThread couldn't load the connector to process the
documents. It's up and processing and I'll go back to measuring the
performance improvement that we're seeing.

Thanks for your help,

Paul



VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Fri, Sep 12, 2014 at 7:36 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> Zookeeper is fine; file system multiprocess, though, was just broken by a
> recent windows update, so don't use that.
>
> Karl
>
>
> On Fri, Sep 12, 2014 at 2:34 PM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Hi,
>>
>> Will do. Am also putting in some extra debug to help narrow it down.
>>
>> We're using zookeeper multiprocess.
>>
>> Paul
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Fri, Sep 12, 2014 at 5:51 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> It's done actually; my new laptop with PostgreSQL 9.3 can do 111111
>>> documents in roughly 15 minutes.  No problems encountered; roughly 120
>>> documents per second.
>>>
>>> For sanity sake, could you try the following:
>>>
>>> - check out or unpack 1.6.1 sources
>>> - lay down downloaded lib dependencies
>>> - build using "ant build"
>>> - modify properties.xml and start the approparite example
>>>
>>> Please see if you have any problems doing this process *without* any
>>> patches.
>>>
>>> Also, what kind of synchronization are you using?  File based,
>>> zookeeper, or single-process?
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <da...@gmail.com>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> The query looks right; the database driver determines the maximum
>>>> number of clauses in a conjunction OR list, just like it does for an IN()
>>>> list.  In the case of Postgresql and OR, the limit is 25; for IN()'s it's
>>>> 100.
>>>>
>>>> The standard integration tests generally run small jobs but that is
>>>> typically sufficient to find query generation problems.  I have load tests
>>>> I can also run but they take several hours to complete.  I'll start one
>>>> now, but I may need to abort it before it finishes.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <paul.boichat@exonar.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm looking through the logs - can see the change from IN to OR in
>>>>> each query - and there's clearly a difference in execution path but it's
>>>>> quite verbose so will take a while.
>>>>>
>>>>> It may well be that document state has not been reprioritised or in
>>>>> some way inconsistent. However, I don't think it's that which is causing
>>>>> the issue - I can switch this behaviour on and off over by changing the
>>>>> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
>>>>> query isn't behaving the same way between IN and OR - I just can't isolate
>>>>> the particular query (yet).
>>>>>
>>>>> Have you tested with a job already in running state (on a restart)
>>>>> with a large document count? For example am seeing this kind of thing which
>>>>> looks messy but appears to execute as you'd expect:
>>>>>
>>>>> SELECT
>>>>> id,dockey,lastversion,lastoutputversion,authorityname,forcedparams FROM
>>>>> ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
>>>>> connectionname=?]
>>>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
>>>>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
>>>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
>>>>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> The tests in fact do multiple complete crawls, so it is extremely
>>>>>> unlikely that the stuffer query is broken.  If you look at the queries
>>>>>> generated, you should note that the only difference is that whenever an xxx
>>>>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>>>>>> These should be completely equivalent; if they don't look equivalent to you
>>>>>> in the log, then I will fix whatever is broken.  I'll make sure here that
>>>>>> the queries look right visually too.
>>>>>>
>>>>>> One possibility is that when you restarted the agents process, the
>>>>>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>>>>>> are fired all the time, but the running jobs must complete reprioritization
>>>>>> before the stuffer query will pick up any records.  I wonder if they may
>>>>>> not have managed to get to the right state before you aborted the
>>>>>> experiment?  You can tell what is happening by using jstack to get a thread
>>>>>> dump of the agents process.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> I stayed with base 1.6.1 and manually patched the code to include
>>>>>>> the two new methods in DBInterfacePostgreSQL
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The changes pass all tests here.  Is it possible that you attempted
>>>>>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code
>>>>>>>> version)?
>>>>>>>>
>>>>>>>> If you could let me know as exactly as possible what you did, I can
>>>>>>>> let you know if that should have worked or not.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Karl,
>>>>>>>>>
>>>>>>>>> We appear to be seeing an issue with the performance change to use
>>>>>>>>> an OR clause rather than IN. After making the change, when we restart
>>>>>>>>> manifoldcf (with one job in running state) documents in the running job are
>>>>>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>>>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>>>>>>> depending on which code base is deployed.
>>>>>>>>>
>>>>>>>>> We have logs that I could upload to the ticket if you recommend
>>>>>>>>> that we reopen the issue (or create a new one)?
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul --
>>>>>>>>>>
>>>>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>>>>>>>> at most to migrate to 1.7.
>>>>>>>>>>
>>>>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you
>>>>>>>>>> would have wound up in a schema state where the schema had more columns in
>>>>>>>>>> it than the branch knew how to deal with.  That's bad, and you will need to
>>>>>>>>>> do things to fix the situation.  I believe you should still be able to do
>>>>>>>>>> the following:
>>>>>>>>>>
>>>>>>>>>> - Download 1.7 source, or check out
>>>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>>>>>> - Apply the patches
>>>>>>>>>> - Build
>>>>>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>>>>>> - Run the upgrade (initialize.bat on the multi-process example,
>>>>>>>>>> or start the single-process example)
>>>>>>>>>>
>>>>>>>>>> You should then have a working 1.7 release, with code patches
>>>>>>>>>> applied.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks - we've pulled down the branch and will test the changes.
>>>>>>>>>>> It looks like a branch of 1.7 so it's going to take us a little while to
>>>>>>>>>>> test. We need to migrate our connectors (there's some deprecated stuff
>>>>>>>>>>> that's now been cleared in 1.7 .eg. getShareACL) and we'll need to patch
>>>>>>>>>>> the database to include the pipeline and any other schema changes. We'll
>>>>>>>>>>> have some environment contention over the next week as our performance test
>>>>>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>>>>>>>> clear I'll move to 1.7
>>>>>>>>>>>
>>>>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there
>>>>>>>>>>> a simple way to migrate and existing database?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <daddywri@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>>>>>
>>>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>>> ----------------------------------
>>>>>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>>>>>> (5 rows)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Whereas:
>>>>>>>>>>>>>
>>>>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>>>>>
>>>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>
>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>
>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you try this query on your database please and tell me
>>>>>>>>>>>>>> if it executes promptly:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I vaguely remember that I had to change the form of this
>>>>>>>>>>>>>> query in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based
>>>>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we do for
>>>>>>>>>>>>>>> large-scale Postgresql optimization work.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please note that trunk code already has schema changes
>>>>>>>>>>>>>>> relative to MCF 1.7, so you will not be able to work directly with this
>>>>>>>>>>>>>>> branch code.  I'll have to create patches for whatever changes you would
>>>>>>>>>>>>>>> need to try.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add
>>>>>>>>>>>>>>>> them to the doc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found
>>>>>>>>>>>>>>>>> a long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These queries are all from the UI; it is what gets
>>>>>>>>>>>>>>>>> generated when no limits are in place:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) -
>>>>>>>>>>>>>>>>> Found a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS
>>>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) -
>>>>>>>>>>>>>>>>> Found a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS
>>>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) -
>>>>>>>>>>>>>>>>> Found a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount
>>>>>>>>>>>>>>>>> FROM jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller
>>>>>>>>>>>>>>>>>> (10,000) doesn't seem to buy us that much on the table scan - in the
>>>>>>>>>>>>>>>>>> attached you'll see that it's still taking a long time to return the job
>>>>>>>>>>>>>>>>>> status page. Also in the attached are some sample other long running
>>>>>>>>>>>>>>>>>> queries that we're beginning to see more frequently. There's also an
>>>>>>>>>>>>>>>>>> example of a query that's frequently executed and regularly takes > 4 secs
>>>>>>>>>>>>>>>>>> (plus a suggested change to improve performance). This one in particular
>>>>>>>>>>>>>>>>>> would certainly benefit from a change to SSDs which should relieve the I/O
>>>>>>>>>>>>>>>>>> bound bottleneck on postgres.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so
>>>>>>>>>>>>>>>>>> would be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop
>>>>>>>>>>>>>>>>>>>> it from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll
>>>>>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's currently
>>>>>>>>>>>>>>>>>>>> aborting:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found
>>>>>>>>>>>>>>>>>>>> a long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39
>>>>>>>>>>>>>>>>>>>> rows=764916 width=287)
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>> Stats: n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may
>>>>>>>>>>>>>>>>>>>> be confidential and are intended solely for the use of the individual to
>>>>>>>>>>>>>>>>>>>> whom it is addressed. Any views or opinions expressed are solely those of
>>>>>>>>>>>>>>>>>>>> the author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a
>>>>>>>>>>>>>>>>>>>>> parameter you can set which limits the number of documents counted to at
>>>>>>>>>>>>>>>>>>>>> most a specified amount.  This uses a limit clause, which should prevent
>>>>>>>>>>>>>>>>>>>>> unbounded time doing these kinds of queries:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort
>>>>>>>>>>>>>>>>>>>>> has been spent in MCF to insure that this doesn't happen.  Specifically,
>>>>>>>>>>>>>>>>>>>>> the main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are
>>>>>>>>>>>>>>>>>>>>> long-running in your case so that I can see if that is what you are
>>>>>>>>>>>>>>>>>>>>> encountering or not.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count >
>>>>>>>>>>>>>>>>>>>>>> 10 million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to
>>>>>>>>>>>>>>>>>>>>>> document processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long
>>>>>>>>>>>>>>>>>>>>>> running queries in the logs. Our current thinking is that we'll need to
>>>>>>>>>>>>>>>>>>>>>> refactor JobQueue somewhat to optimise queries and,
>>>>>>>>>>>>>>>>>>>>>> potentially partition jobqueue into a subset of
>>>>>>>>>>>>>>>>>>>>>> tables (table per queue for example).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury,
>>>>>>>>>>>>>>>>>>>>>> Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may
>>>>>>>>>>>>>>>>>>>>>> be confidential and are intended solely for the use of the individual to
>>>>>>>>>>>>>>>>>>>>>> whom it is addressed. Any views or opinions expressed are solely those of
>>>>>>>>>>>>>>>>>>>>>> the author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe
>>>>>>>>>>>>>>>>>>>>>>> Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5
>>>>>>>>>>>>>>>>>>>>>>> million web documents, using Postgresql 8.3.  Postgresql 9.x is faster, and
>>>>>>>>>>>>>>>>>>>>>>> with modern SSD's, I expect that you will do even better.  In general, I'd
>>>>>>>>>>>>>>>>>>>>>>> say it was fine to shoot for 10M - 100M documents on ManifoldCF, provided
>>>>>>>>>>>>>>>>>>>>>>> that you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier
>>>>>>>>>>>>>>>>>>>>>>> <ba...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

Zookeeper is fine; file system multiprocess, though, was just broken by a
recent windows update, so don't use that.

Karl


On Fri, Sep 12, 2014 at 2:34 PM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi,
>
> Will do. Am also putting in some extra debug to help narrow it down.
>
> We're using zookeeper multiprocess.
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 5:51 PM, Karl Wright <da...@gmail.com> wrote:
>
>> It's done actually; my new laptop with PostgreSQL 9.3 can do 111111
>> documents in roughly 15 minutes.  No problems encountered; roughly 120
>> documents per second.
>>
>> For sanity sake, could you try the following:
>>
>> - check out or unpack 1.6.1 sources
>> - lay down downloaded lib dependencies
>> - build using "ant build"
>> - modify properties.xml and start the approparite example
>>
>> Please see if you have any problems doing this process *without* any
>> patches.
>>
>> Also, what kind of synchronization are you using?  File based, zookeeper,
>> or single-process?
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> The query looks right; the database driver determines the maximum number
>>> of clauses in a conjunction OR list, just like it does for an IN() list.
>>> In the case of Postgresql and OR, the limit is 25; for IN()'s it's 100.
>>>
>>> The standard integration tests generally run small jobs but that is
>>> typically sufficient to find query generation problems.  I have load tests
>>> I can also run but they take several hours to complete.  I'll start one
>>> now, but I may need to abort it before it finishes.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm looking through the logs - can see the change from IN to OR in each
>>>> query - and there's clearly a difference in execution path but it's quite
>>>> verbose so will take a while.
>>>>
>>>> It may well be that document state has not been reprioritised or in
>>>> some way inconsistent. However, I don't think it's that which is causing
>>>> the issue - I can switch this behaviour on and off over by changing the
>>>> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
>>>> query isn't behaving the same way between IN and OR - I just can't isolate
>>>> the particular query (yet).
>>>>
>>>> Have you tested with a job already in running state (on a restart) with
>>>> a large document count? For example am seeing this kind of thing which
>>>> looks messy but appears to execute as you'd expect:
>>>>
>>>> SELECT
>>>> id,dockey,lastversion,lastoutputversion,authorityname,forcedparams FROM
>>>> ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
>>>> connectionname=?]
>>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
>>>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
>>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
>>>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>>>>
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> The tests in fact do multiple complete crawls, so it is extremely
>>>>> unlikely that the stuffer query is broken.  If you look at the queries
>>>>> generated, you should note that the only difference is that whenever an xxx
>>>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>>>>> These should be completely equivalent; if they don't look equivalent to you
>>>>> in the log, then I will fix whatever is broken.  I'll make sure here that
>>>>> the queries look right visually too.
>>>>>
>>>>> One possibility is that when you restarted the agents process, the
>>>>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>>>>> are fired all the time, but the running jobs must complete reprioritization
>>>>> before the stuffer query will pick up any records.  I wonder if they may
>>>>> not have managed to get to the right state before you aborted the
>>>>> experiment?  You can tell what is happening by using jstack to get a thread
>>>>> dump of the agents process.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <
>>>>> paul.boichat@exonar.com> wrote:
>>>>>
>>>>>> I stayed with base 1.6.1 and manually patched the code to include the
>>>>>> two new methods in DBInterfacePostgreSQL
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The changes pass all tests here.  Is it possible that you attempted
>>>>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code
>>>>>>> version)?
>>>>>>>
>>>>>>> If you could let me know as exactly as possible what you did, I can
>>>>>>> let you know if that should have worked or not.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Karl,
>>>>>>>>
>>>>>>>> We appear to be seeing an issue with the performance change to use
>>>>>>>> an OR clause rather than IN. After making the change, when we restart
>>>>>>>> manifoldcf (with one job in running state) documents in the running job are
>>>>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>>>>>> depending on which code base is deployed.
>>>>>>>>
>>>>>>>> We have logs that I could upload to the ticket if you recommend
>>>>>>>> that we reopen the issue (or create a new one)?
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul --
>>>>>>>>>
>>>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>>>>>>> at most to migrate to 1.7.
>>>>>>>>>
>>>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you
>>>>>>>>> would have wound up in a schema state where the schema had more columns in
>>>>>>>>> it than the branch knew how to deal with.  That's bad, and you will need to
>>>>>>>>> do things to fix the situation.  I believe you should still be able to do
>>>>>>>>> the following:
>>>>>>>>>
>>>>>>>>> - Download 1.7 source, or check out
>>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>>>>> - Apply the patches
>>>>>>>>> - Build
>>>>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>>>>> start the single-process example)
>>>>>>>>>
>>>>>>>>> You should then have a working 1.7 release, with code patches
>>>>>>>>> applied.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks - we've pulled down the branch and will test the changes.
>>>>>>>>>> It looks like a branch of 1.7 so it's going to take us a little while to
>>>>>>>>>> test. We need to migrate our connectors (there's some deprecated stuff
>>>>>>>>>> that's now been cleared in 1.7 .eg. getShareACL) and we'll need to patch
>>>>>>>>>> the database to include the pipeline and any other schema changes. We'll
>>>>>>>>>> have some environment contention over the next week as our performance test
>>>>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>>>>>>> clear I'll move to 1.7
>>>>>>>>>>
>>>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there
>>>>>>>>>> a simple way to migrate and existing database?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>>>>
>>>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>>>>
>>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>> ----------------------------------
>>>>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>>>>> (5 rows)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Whereas:
>>>>>>>>>>>>
>>>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>>>>
>>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you try this query on your database please and tell me
>>>>>>>>>>>>> if it executes promptly:
>>>>>>>>>>>>>
>>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I vaguely remember that I had to change the form of this query
>>>>>>>>>>>>> in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based
>>>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we do for
>>>>>>>>>>>>>> large-scale Postgresql optimization work.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please note that trunk code already has schema changes
>>>>>>>>>>>>>> relative to MCF 1.7, so you will not be able to work directly with this
>>>>>>>>>>>>>> branch code.  I'll have to create patches for whatever changes you would
>>>>>>>>>>>>>> need to try.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them
>>>>>>>>>>>>>>> to the doc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> These queries are all from the UI; it is what gets
>>>>>>>>>>>>>>>> generated when no limits are in place:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) -
>>>>>>>>>>>>>>>> Found a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS
>>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) -
>>>>>>>>>>>>>>>> Found a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS
>>>>>>>>>>>>>>>> doccount FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) -
>>>>>>>>>>>>>>>> Found a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount
>>>>>>>>>>>>>>>> FROM jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so
>>>>>>>>>>>>>>>>> would be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop
>>>>>>>>>>>>>>>>>>> it from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll
>>>>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's currently
>>>>>>>>>>>>>>>>>>> aborting:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found
>>>>>>>>>>>>>>>>>>> a long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan: Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39
>>>>>>>>>>>>>>>>>>> rows=764916 width=287)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>> Stats: n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a
>>>>>>>>>>>>>>>>>>>> parameter you can set which limits the number of documents counted to at
>>>>>>>>>>>>>>>>>>>> most a specified amount.  This uses a limit clause, which should prevent
>>>>>>>>>>>>>>>>>>>> unbounded time doing these kinds of queries:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort
>>>>>>>>>>>>>>>>>>>> has been spent in MCF to insure that this doesn't happen.  Specifically,
>>>>>>>>>>>>>>>>>>>> the main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running
>>>>>>>>>>>>>>>>>>>> in your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count >
>>>>>>>>>>>>>>>>>>>>> 10 million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and,
>>>>>>>>>>>>>>>>>>>>> potentially partition jobqueue into a subset of
>>>>>>>>>>>>>>>>>>>>> tables (table per queue for example).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may
>>>>>>>>>>>>>>>>>>>>> be confidential and are intended solely for the use of the individual to
>>>>>>>>>>>>>>>>>>>>> whom it is addressed. Any views or opinions expressed are solely those of
>>>>>>>>>>>>>>>>>>>>> the author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe
>>>>>>>>>>>>>>>>>>>>>> Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with
>>>>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In general, I'd say
>>>>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi,

Will do. Am also putting in some extra debug to help narrow it down.

We're using zookeeper multiprocess.

Paul



VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Fri, Sep 12, 2014 at 5:51 PM, Karl Wright <da...@gmail.com> wrote:

> It's done actually; my new laptop with PostgreSQL 9.3 can do 111111
> documents in roughly 15 minutes.  No problems encountered; roughly 120
> documents per second.
>
> For sanity sake, could you try the following:
>
> - check out or unpack 1.6.1 sources
> - lay down downloaded lib dependencies
> - build using "ant build"
> - modify properties.xml and start the approparite example
>
> Please see if you have any problems doing this process *without* any
> patches.
>
> Also, what kind of synchronization are you using?  File based, zookeeper,
> or single-process?
>
> Thanks,
> Karl
>
>
> On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> The query looks right; the database driver determines the maximum number
>> of clauses in a conjunction OR list, just like it does for an IN() list.
>> In the case of Postgresql and OR, the limit is 25; for IN()'s it's 100.
>>
>> The standard integration tests generally run small jobs but that is
>> typically sufficient to find query generation problems.  I have load tests
>> I can also run but they take several hours to complete.  I'll start one
>> now, but I may need to abort it before it finishes.
>>
>> Karl
>>
>>
>> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm looking through the logs - can see the change from IN to OR in each
>>> query - and there's clearly a difference in execution path but it's quite
>>> verbose so will take a while.
>>>
>>> It may well be that document state has not been reprioritised or in some
>>> way inconsistent. However, I don't think it's that which is causing the
>>> issue - I can switch this behaviour on and off over by changing the
>>> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
>>> query isn't behaving the same way between IN and OR - I just can't isolate
>>> the particular query (yet).
>>>
>>> Have you tested with a job already in running state (on a restart) with
>>> a large document count? For example am seeing this kind of thing which
>>> looks messy but appears to execute as you'd expect:
>>>
>>> SELECT
>>> id,dockey,lastversion,lastoutputversion,authorityname,forcedparams FROM
>>> ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
>>> connectionname=?]
>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
>>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
>>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>>>
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> The tests in fact do multiple complete crawls, so it is extremely
>>>> unlikely that the stuffer query is broken.  If you look at the queries
>>>> generated, you should note that the only difference is that whenever an xxx
>>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>>>> These should be completely equivalent; if they don't look equivalent to you
>>>> in the log, then I will fix whatever is broken.  I'll make sure here that
>>>> the queries look right visually too.
>>>>
>>>> One possibility is that when you restarted the agents process, the
>>>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>>>> are fired all the time, but the running jobs must complete reprioritization
>>>> before the stuffer query will pick up any records.  I wonder if they may
>>>> not have managed to get to the right state before you aborted the
>>>> experiment?  You can tell what is happening by using jstack to get a thread
>>>> dump of the agents process.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <paul.boichat@exonar.com
>>>> > wrote:
>>>>
>>>>> I stayed with base 1.6.1 and manually patched the code to include the
>>>>> two new methods in DBInterfacePostgreSQL
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The changes pass all tests here.  Is it possible that you attempted
>>>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code
>>>>>> version)?
>>>>>>
>>>>>> If you could let me know as exactly as possible what you did, I can
>>>>>> let you know if that should have worked or not.
>>>>>>
>>>>>> Thanks!
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Karl,
>>>>>>>
>>>>>>> We appear to be seeing an issue with the performance change to use
>>>>>>> an OR clause rather than IN. After making the change, when we restart
>>>>>>> manifoldcf (with one job in running state) documents in the running job are
>>>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>>>>> depending on which code base is deployed.
>>>>>>>
>>>>>>> We have logs that I could upload to the ticket if you recommend that
>>>>>>> we reopen the issue (or create a new one)?
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul --
>>>>>>>>
>>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>>>>>> at most to migrate to 1.7.
>>>>>>>>
>>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you
>>>>>>>> would have wound up in a schema state where the schema had more columns in
>>>>>>>> it than the branch knew how to deal with.  That's bad, and you will need to
>>>>>>>> do things to fix the situation.  I believe you should still be able to do
>>>>>>>> the following:
>>>>>>>>
>>>>>>>> - Download 1.7 source, or check out
>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>>>> - Apply the patches
>>>>>>>> - Build
>>>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>>>> start the single-process example)
>>>>>>>>
>>>>>>>> You should then have a working 1.7 release, with code patches
>>>>>>>> applied.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks - we've pulled down the branch and will test the changes.
>>>>>>>>> It looks like a branch of 1.7 so it's going to take us a little while to
>>>>>>>>> test. We need to migrate our connectors (there's some deprecated stuff
>>>>>>>>> that's now been cleared in 1.7 .eg. getShareACL) and we'll need to patch
>>>>>>>>> the database to include the pipeline and any other schema changes. We'll
>>>>>>>>> have some environment contention over the next week as our performance test
>>>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>>>>>> clear I'll move to 1.7
>>>>>>>>>
>>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>>>>>> simple way to migrate and existing database?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>>>
>>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>>>
>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>> ----------------------------------
>>>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>>>> (5 rows)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Whereas:
>>>>>>>>>>>
>>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>>>
>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <
>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> Could you try this query on your database please and tell me if
>>>>>>>>>>>> it executes promptly:
>>>>>>>>>>>>
>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I vaguely remember that I had to change the form of this query
>>>>>>>>>>>> in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based
>>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we do for
>>>>>>>>>>>>> large-scale Postgresql optimization work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please note that trunk code already has schema changes
>>>>>>>>>>>>> relative to MCF 1.7, so you will not be able to work directly with this
>>>>>>>>>>>>> branch code.  I'll have to create patches for whatever changes you would
>>>>>>>>>>>>> need to try.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them
>>>>>>>>>>>>>> to the doc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These queries are all from the UI; it is what gets generated
>>>>>>>>>>>>>>> when no limits are in place:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found
>>>>>>>>>>>>>>> a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found
>>>>>>>>>>>>>>> a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found
>>>>>>>>>>>>>>> a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would
>>>>>>>>>>>>>>>> be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll
>>>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's currently
>>>>>>>>>>>>>>>>>> aborting:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39
>>>>>>>>>>>>>>>>>> rows=764916 width=287)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Stats: n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter
>>>>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to at most a
>>>>>>>>>>>>>>>>>>> specified amount.  This uses a limit clause, which should prevent unbounded
>>>>>>>>>>>>>>>>>>> time doing these kinds of queries:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort
>>>>>>>>>>>>>>>>>>> has been spent in MCF to insure that this doesn't happen.  Specifically,
>>>>>>>>>>>>>>>>>>> the main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running
>>>>>>>>>>>>>>>>>>> in your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count >
>>>>>>>>>>>>>>>>>>>> 10 million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per
>>>>>>>>>>>>>>>>>>>> queue for example).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may
>>>>>>>>>>>>>>>>>>>> be confidential and are intended solely for the use of the individual to
>>>>>>>>>>>>>>>>>>>> whom it is addressed. Any views or opinions expressed are solely those of
>>>>>>>>>>>>>>>>>>>> the author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe
>>>>>>>>>>>>>>>>>>>>> Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with
>>>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In general, I'd say
>>>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

A longer (63 minute) postgresql load test just completed, also without any
problem.

Are you sure you included the patch code correctly?

Karl


On Fri, Sep 12, 2014 at 12:51 PM, Karl Wright <da...@gmail.com> wrote:

> It's done actually; my new laptop with PostgreSQL 9.3 can do 111111
> documents in roughly 15 minutes.  No problems encountered; roughly 120
> documents per second.
>
> For sanity sake, could you try the following:
>
> - check out or unpack 1.6.1 sources
> - lay down downloaded lib dependencies
> - build using "ant build"
> - modify properties.xml and start the approparite example
>
> Please see if you have any problems doing this process *without* any
> patches.
>
> Also, what kind of synchronization are you using?  File based, zookeeper,
> or single-process?
>
> Thanks,
> Karl
>
>
> On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> The query looks right; the database driver determines the maximum number
>> of clauses in a conjunction OR list, just like it does for an IN() list.
>> In the case of Postgresql and OR, the limit is 25; for IN()'s it's 100.
>>
>> The standard integration tests generally run small jobs but that is
>> typically sufficient to find query generation problems.  I have load tests
>> I can also run but they take several hours to complete.  I'll start one
>> now, but I may need to abort it before it finishes.
>>
>> Karl
>>
>>
>> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm looking through the logs - can see the change from IN to OR in each
>>> query - and there's clearly a difference in execution path but it's quite
>>> verbose so will take a while.
>>>
>>> It may well be that document state has not been reprioritised or in some
>>> way inconsistent. However, I don't think it's that which is causing the
>>> issue - I can switch this behaviour on and off over by changing the
>>> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
>>> query isn't behaving the same way between IN and OR - I just can't isolate
>>> the particular query (yet).
>>>
>>> Have you tested with a job already in running state (on a restart) with
>>> a large document count? For example am seeing this kind of thing which
>>> looks messy but appears to execute as you'd expect:
>>>
>>> SELECT
>>> id,dockey,lastversion,lastoutputversion,authorityname,forcedparams FROM
>>> ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
>>> connectionname=?]
>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
>>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
>>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
>>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>>>
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> The tests in fact do multiple complete crawls, so it is extremely
>>>> unlikely that the stuffer query is broken.  If you look at the queries
>>>> generated, you should note that the only difference is that whenever an xxx
>>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>>>> These should be completely equivalent; if they don't look equivalent to you
>>>> in the log, then I will fix whatever is broken.  I'll make sure here that
>>>> the queries look right visually too.
>>>>
>>>> One possibility is that when you restarted the agents process, the
>>>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>>>> are fired all the time, but the running jobs must complete reprioritization
>>>> before the stuffer query will pick up any records.  I wonder if they may
>>>> not have managed to get to the right state before you aborted the
>>>> experiment?  You can tell what is happening by using jstack to get a thread
>>>> dump of the agents process.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <paul.boichat@exonar.com
>>>> > wrote:
>>>>
>>>>> I stayed with base 1.6.1 and manually patched the code to include the
>>>>> two new methods in DBInterfacePostgreSQL
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The changes pass all tests here.  Is it possible that you attempted
>>>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code
>>>>>> version)?
>>>>>>
>>>>>> If you could let me know as exactly as possible what you did, I can
>>>>>> let you know if that should have worked or not.
>>>>>>
>>>>>> Thanks!
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Karl,
>>>>>>>
>>>>>>> We appear to be seeing an issue with the performance change to use
>>>>>>> an OR clause rather than IN. After making the change, when we restart
>>>>>>> manifoldcf (with one job in running state) documents in the running job are
>>>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>>>>> depending on which code base is deployed.
>>>>>>>
>>>>>>> We have logs that I could upload to the ticket if you recommend that
>>>>>>> we reopen the issue (or create a new one)?
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul --
>>>>>>>>
>>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>>>>>> at most to migrate to 1.7.
>>>>>>>>
>>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you
>>>>>>>> would have wound up in a schema state where the schema had more columns in
>>>>>>>> it than the branch knew how to deal with.  That's bad, and you will need to
>>>>>>>> do things to fix the situation.  I believe you should still be able to do
>>>>>>>> the following:
>>>>>>>>
>>>>>>>> - Download 1.7 source, or check out
>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>>>> - Apply the patches
>>>>>>>> - Build
>>>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>>>> start the single-process example)
>>>>>>>>
>>>>>>>> You should then have a working 1.7 release, with code patches
>>>>>>>> applied.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks - we've pulled down the branch and will test the changes.
>>>>>>>>> It looks like a branch of 1.7 so it's going to take us a little while to
>>>>>>>>> test. We need to migrate our connectors (there's some deprecated stuff
>>>>>>>>> that's now been cleared in 1.7 .eg. getShareACL) and we'll need to patch
>>>>>>>>> the database to include the pipeline and any other schema changes. We'll
>>>>>>>>> have some environment contention over the next week as our performance test
>>>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>>>>>> clear I'll move to 1.7
>>>>>>>>>
>>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>>>>>> simple way to migrate and existing database?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>>>
>>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>>>
>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>> ----------------------------------
>>>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>>>> (5 rows)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Whereas:
>>>>>>>>>>>
>>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>>>
>>>>>>>>>>> QUERY PLAN
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <
>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> Could you try this query on your database please and tell me if
>>>>>>>>>>>> it executes promptly:
>>>>>>>>>>>>
>>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I vaguely remember that I had to change the form of this query
>>>>>>>>>>>> in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based
>>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we do for
>>>>>>>>>>>>> large-scale Postgresql optimization work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please note that trunk code already has schema changes
>>>>>>>>>>>>> relative to MCF 1.7, so you will not be able to work directly with this
>>>>>>>>>>>>> branch code.  I'll have to create patches for whatever changes you would
>>>>>>>>>>>>> need to try.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them
>>>>>>>>>>>>>> to the doc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These queries are all from the UI; it is what gets generated
>>>>>>>>>>>>>>> when no limits are in place:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found
>>>>>>>>>>>>>>> a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found
>>>>>>>>>>>>>>> a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found
>>>>>>>>>>>>>>> a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would
>>>>>>>>>>>>>>>> be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll
>>>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's currently
>>>>>>>>>>>>>>>>>> aborting:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39
>>>>>>>>>>>>>>>>>> rows=764916 width=287)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>> Stats: n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter
>>>>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to at most a
>>>>>>>>>>>>>>>>>>> specified amount.  This uses a limit clause, which should prevent unbounded
>>>>>>>>>>>>>>>>>>> time doing these kinds of queries:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort
>>>>>>>>>>>>>>>>>>> has been spent in MCF to insure that this doesn't happen.  Specifically,
>>>>>>>>>>>>>>>>>>> the main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running
>>>>>>>>>>>>>>>>>>> in your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count >
>>>>>>>>>>>>>>>>>>>> 10 million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per
>>>>>>>>>>>>>>>>>>>> queue for example).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may
>>>>>>>>>>>>>>>>>>>> be confidential and are intended solely for the use of the individual to
>>>>>>>>>>>>>>>>>>>> whom it is addressed. Any views or opinions expressed are solely those of
>>>>>>>>>>>>>>>>>>>> the author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe
>>>>>>>>>>>>>>>>>>>>> Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with
>>>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In general, I'd say
>>>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

It's done actually; my new laptop with PostgreSQL 9.3 can do 111111
documents in roughly 15 minutes.  No problems encountered; roughly 120
documents per second.

For sanity sake, could you try the following:

- check out or unpack 1.6.1 sources
- lay down downloaded lib dependencies
- build using "ant build"
- modify properties.xml and start the approparite example

Please see if you have any problems doing this process *without* any
patches.

Also, what kind of synchronization are you using?  File based, zookeeper,
or single-process?

Thanks,
Karl


On Fri, Sep 12, 2014 at 12:29 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> The query looks right; the database driver determines the maximum number
> of clauses in a conjunction OR list, just like it does for an IN() list.
> In the case of Postgresql and OR, the limit is 25; for IN()'s it's 100.
>
> The standard integration tests generally run small jobs but that is
> typically sufficient to find query generation problems.  I have load tests
> I can also run but they take several hours to complete.  I'll start one
> now, but I may need to abort it before it finishes.
>
> Karl
>
>
> On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Hi,
>>
>> I'm looking through the logs - can see the change from IN to OR in each
>> query - and there's clearly a difference in execution path but it's quite
>> verbose so will take a while.
>>
>> It may well be that document state has not been reprioritised or in some
>> way inconsistent. However, I don't think it's that which is causing the
>> issue - I can switch this behaviour on and off over by changing the
>> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
>> query isn't behaving the same way between IN and OR - I just can't isolate
>> the particular query (yet).
>>
>> Have you tested with a job already in running state (on a restart) with a
>> large document count? For example am seeing this kind of thing which looks
>> messy but appears to execute as you'd expect:
>>
>> SELECT id,dockey,lastversion,lastoutputversion,authorityname,forcedparams
>> FROM ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
>> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
>> connectionname=?]
>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
>> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
>> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
>> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>>
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> The tests in fact do multiple complete crawls, so it is extremely
>>> unlikely that the stuffer query is broken.  If you look at the queries
>>> generated, you should note that the only difference is that whenever an xxx
>>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>>> These should be completely equivalent; if they don't look equivalent to you
>>> in the log, then I will fix whatever is broken.  I'll make sure here that
>>> the queries look right visually too.
>>>
>>> One possibility is that when you restarted the agents process, the
>>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>>> are fired all the time, but the running jobs must complete reprioritization
>>> before the stuffer query will pick up any records.  I wonder if they may
>>> not have managed to get to the right state before you aborted the
>>> experiment?  You can tell what is happening by using jstack to get a thread
>>> dump of the agents process.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> I stayed with base 1.6.1 and manually patched the code to include the
>>>> two new methods in DBInterfacePostgreSQL
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> The changes pass all tests here.  Is it possible that you attempted
>>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code
>>>>> version)?
>>>>>
>>>>> If you could let me know as exactly as possible what you did, I can
>>>>> let you know if that should have worked or not.
>>>>>
>>>>> Thanks!
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <
>>>>> paul.boichat@exonar.com> wrote:
>>>>>
>>>>>> Karl,
>>>>>>
>>>>>> We appear to be seeing an issue with the performance change to use an
>>>>>> OR clause rather than IN. After making the change, when we restart
>>>>>> manifoldcf (with one job in running state) documents in the running job are
>>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>>>> depending on which code base is deployed.
>>>>>>
>>>>>> We have logs that I could upload to the ticket if you recommend that
>>>>>> we reopen the issue (or create a new one)?
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul --
>>>>>>>
>>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>>>>> at most to migrate to 1.7.
>>>>>>>
>>>>>>> IF you ALREADY tried an upgrade with the branch code, then you would
>>>>>>> have wound up in a schema state where the schema had more columns in it
>>>>>>> than the branch knew how to deal with.  That's bad, and you will need to do
>>>>>>> things to fix the situation.  I believe you should still be able to do the
>>>>>>> following:
>>>>>>>
>>>>>>> - Download 1.7 source, or check out
>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>>> - Apply the patches
>>>>>>> - Build
>>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>>> start the single-process example)
>>>>>>>
>>>>>>> You should then have a working 1.7 release, with code patches
>>>>>>> applied.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Thanks - we've pulled down the branch and will test the changes. It
>>>>>>>> looks like a branch of 1.7 so it's going to take us a little while to test.
>>>>>>>> We need to migrate our connectors (there's some deprecated stuff that's now
>>>>>>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>>>>>>>> to include the pipeline and any other schema changes. We'll have some
>>>>>>>> environment contention over the next week as our performance test
>>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>>>>> clear I'll move to 1.7
>>>>>>>>
>>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>>>>> simple way to migrate and existing database?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>>> CONNECTORS-1027.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>>
>>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>>
>>>>>>>>>> QUERY PLAN
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>> ----------------------------------
>>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>>> (5 rows)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Whereas:
>>>>>>>>>>
>>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>>
>>>>>>>>>> QUERY PLAN
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>>> ----------------------------------------
>>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <daddywri@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> Could you try this query on your database please and tell me if
>>>>>>>>>>> it executes promptly:
>>>>>>>>>>>
>>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I vaguely remember that I had to change the form of this query
>>>>>>>>>>> in order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <daddywri@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based
>>>>>>>>>>>> branch (branches/CONNECTORS-1027) for looking at any changes we do for
>>>>>>>>>>>> large-scale Postgresql optimization work.
>>>>>>>>>>>>
>>>>>>>>>>>> Please note that trunk code already has schema changes relative
>>>>>>>>>>>> to MCF 1.7, so you will not be able to work directly with this branch
>>>>>>>>>>>> code.  I'll have to create patches for whatever changes you would need to
>>>>>>>>>>>> try.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them
>>>>>>>>>>>>> to the doc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>
>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>
>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These queries are all from the UI; it is what gets generated
>>>>>>>>>>>>>> when no limits are in place:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found
>>>>>>>>>>>>>> a long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found
>>>>>>>>>>>>>> a long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found
>>>>>>>>>>>>>> a long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>>  )
>>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would
>>>>>>>>>>>>>>> be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll
>>>>>>>>>>>>>>>>> put together some samples. As an example, on a job that's currently
>>>>>>>>>>>>>>>>> aborting:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>> Plan:   ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39
>>>>>>>>>>>>>>>>> rows=764916 width=287)
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter
>>>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to at most a
>>>>>>>>>>>>>>>>>> specified amount.  This uses a limit clause, which should prevent unbounded
>>>>>>>>>>>>>>>>>> time doing these kinds of queries:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has
>>>>>>>>>>>>>>>>>> been spent in MCF to insure that this doesn't happen.  Specifically, the
>>>>>>>>>>>>>>>>>> main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running
>>>>>>>>>>>>>>>>>> in your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>>>>>>>> million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per
>>>>>>>>>>>>>>>>>>> queue for example).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
>>>>>>>>>>>>>>>>>>> number 06439969 at 14 West Mills, Newbury, Berkshire,
>>>>>>>>>>>>>>>>>>> RG14 5HG
>>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe
>>>>>>>>>>>>>>>>>>>> Chapter 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with
>>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In general, I'd say
>>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

The query looks right; the database driver determines the maximum number of
clauses in a conjunction OR list, just like it does for an IN() list.  In
the case of Postgresql and OR, the limit is 25; for IN()'s it's 100.

The standard integration tests generally run small jobs but that is
typically sufficient to find query generation problems.  I have load tests
I can also run but they take several hours to complete.  I'll start one
now, but I may need to abort it before it finishes.

Karl


On Fri, Sep 12, 2014 at 11:26 AM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi,
>
> I'm looking through the logs - can see the change from IN to OR in each
> query - and there's clearly a difference in execution path but it's quite
> verbose so will take a while.
>
> It may well be that document state has not been reprioritised or in some
> way inconsistent. However, I don't think it's that which is causing the
> issue - I can switch this behaviour on and off over by changing the
> DBInterfacePostgres class and restarting Manifold. That seems to suggest a
> query isn't behaving the same way between IN and OR - I just can't isolate
> the particular query (yet).
>
> Have you tested with a job already in running state (on a restart) with a
> large document count? For example am seeing this kind of thing which looks
> messy but appears to execute as you'd expect:
>
> SELECT id,dockey,lastversion,lastoutputversion,authorityname,forcedparams
> FROM ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
> dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
> connectionname=?]
> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
> '1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
> DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
> '1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'
>
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> The tests in fact do multiple complete crawls, so it is extremely
>> unlikely that the stuffer query is broken.  If you look at the queries
>> generated, you should note that the only difference is that whenever an xxx
>> IN(?,?) was generated before, a (xxx=? OR xxx=?) is generated instead.
>> These should be completely equivalent; if they don't look equivalent to you
>> in the log, then I will fix whatever is broken.  I'll make sure here that
>> the queries look right visually too.
>>
>> One possibility is that when you restarted the agents process, the
>> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
>> are fired all the time, but the running jobs must complete reprioritization
>> before the stuffer query will pick up any records.  I wonder if they may
>> not have managed to get to the right state before you aborted the
>> experiment?  You can tell what is happening by using jstack to get a thread
>> dump of the agents process.
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> I stayed with base 1.6.1 and manually patched the code to include the
>>> two new methods in DBInterfacePostgreSQL
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> The changes pass all tests here.  Is it possible that you attempted
>>>> some upgrade that failed (or didn't attempt upgrade but went to a new code
>>>> version)?
>>>>
>>>> If you could let me know as exactly as possible what you did, I can let
>>>> you know if that should have worked or not.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <paul.boichat@exonar.com
>>>> > wrote:
>>>>
>>>>> Karl,
>>>>>
>>>>> We appear to be seeing an issue with the performance change to use an
>>>>> OR clause rather than IN. After making the change, when we restart
>>>>> manifoldcf (with one job in running state) documents in the running job are
>>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>>> depending on which code base is deployed.
>>>>>
>>>>> We have logs that I could upload to the ticket if you recommend that
>>>>> we reopen the issue (or create a new one)?
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Paul --
>>>>>>
>>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>>>> at most to migrate to 1.7.
>>>>>>
>>>>>> IF you ALREADY tried an upgrade with the branch code, then you would
>>>>>> have wound up in a schema state where the schema had more columns in it
>>>>>> than the branch knew how to deal with.  That's bad, and you will need to do
>>>>>> things to fix the situation.  I believe you should still be able to do the
>>>>>> following:
>>>>>>
>>>>>> - Download 1.7 source, or check out
>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>>> - Apply the patches
>>>>>> - Build
>>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>>> start the single-process example)
>>>>>>
>>>>>> You should then have a working 1.7 release, with code patches applied.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Thanks - we've pulled down the branch and will test the changes. It
>>>>>>> looks like a branch of 1.7 so it's going to take us a little while to test.
>>>>>>> We need to migrate our connectors (there's some deprecated stuff that's now
>>>>>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>>>>>>> to include the pipeline and any other schema changes. We'll have some
>>>>>>> environment contention over the next week as our performance test
>>>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>>>> clear I'll move to 1.7
>>>>>>>
>>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>>>> simple way to migrate and existing database?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>>> CONNECTORS-1027.
>>>>>>>>
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>>
>>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>>> LIMIT 10001) t;
>>>>>>>>>
>>>>>>>>> QUERY PLAN
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>> ----------------------------------
>>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>>                Heap Fetches: 725
>>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>>> (5 rows)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Whereas:
>>>>>>>>>
>>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>>
>>>>>>>>> QUERY PLAN
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>>> ----------------------------------------
>>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>>                Heap Fetches: 684297
>>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> Could you try this query on your database please and tell me if
>>>>>>>>>> it executes promptly:
>>>>>>>>>>
>>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I vaguely remember that I had to change the form of this query in
>>>>>>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>>>>>>>> Postgresql optimization work.
>>>>>>>>>>>
>>>>>>>>>>> Please note that trunk code already has schema changes relative
>>>>>>>>>>> to MCF 1.7, so you will not be able to work directly with this branch
>>>>>>>>>>> code.  I'll have to create patches for whatever changes you would need to
>>>>>>>>>>> try.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to
>>>>>>>>>>>> the doc.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>>
>>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> These queries are all from the UI; it is what gets generated
>>>>>>>>>>>>> when no limits are in place:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>>
>>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>>
>>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>>
>>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>>  t0.id
>>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>>  )
>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>>  )
>>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>>  )
>>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would
>>>>>>>>>>>>>> be keen to work with you to optimise where possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>>> ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>>>>>>> width=287)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter
>>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to at most a
>>>>>>>>>>>>>>>>> specified amount.  This uses a limit clause, which should prevent unbounded
>>>>>>>>>>>>>>>>> time doing these kinds of queries:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has
>>>>>>>>>>>>>>>>> been spent in MCF to insure that this doesn't happen.  Specifically, the
>>>>>>>>>>>>>>>>> main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in
>>>>>>>>>>>>>>>>> your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>>>>>>> million. At that point, even with good postgres
>>>>>>>>>>>>>>>>>> vacuuming the jobqueue table is starting to become a
>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per
>>>>>>>>>>>>>>>>>> queue for example).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter
>>>>>>>>>>>>>>>>>>> 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million
>>>>>>>>>>>>>>>>>>> web documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with
>>>>>>>>>>>>>>>>>>> modern SSD's, I expect that you will do even better.  In general, I'd say
>>>>>>>>>>>>>>>>>>> it was fine to shoot for 10M - 100M documents on ManifoldCF, provided that
>>>>>>>>>>>>>>>>>>> you use a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi,

I'm looking through the logs - can see the change from IN to OR in each
query - and there's clearly a difference in execution path but it's quite
verbose so will take a while.

It may well be that document state has not been reprioritised or in some
way inconsistent. However, I don't think it's that which is causing the
issue - I can switch this behaviour on and off over by changing the
DBInterfacePostgres class and restarting Manifold. That seems to suggest a
query isn't behaving the same way between IN and OR - I just can't isolate
the particular query (yet).

Have you tested with a job already in running state (on a restart) with a
large document count? For example am seeing this kind of thing which looks
messy but appears to execute as you'd expect:

SELECT id,dockey,lastversion,lastoutputversion,authorityname,forcedparams
FROM ingeststatus WHERE  (dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR
dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=? OR dockey=?) AND
connectionname=?]
DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 0:
'1407144048431:F42CD76D66FA6BAD396FF8F8A409DD211C184E6A'
DEBUG 2014-09-12 15:01:27,052 (Thread-542) -   Parameter 1:
'1407144048431:FE66CC4054300E4EB2A84138DC9B62B80F59F5B9'




VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Fri, Sep 12, 2014 at 4:20 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> The tests in fact do multiple complete crawls, so it is extremely unlikely
> that the stuffer query is broken.  If you look at the queries generated,
> you should note that the only difference is that whenever an xxx IN(?,?)
> was generated before, a (xxx=? OR xxx=?) is generated instead.  These
> should be completely equivalent; if they don't look equivalent to you in
> the log, then I will fix whatever is broken.  I'll make sure here that the
> queries look right visually too.
>
> One possibility is that when you restarted the agents process, the
> jobqueue records did not yet finish getting reprioritized.  Stuffer queries
> are fired all the time, but the running jobs must complete reprioritization
> before the stuffer query will pick up any records.  I wonder if they may
> not have managed to get to the right state before you aborted the
> experiment?  You can tell what is happening by using jstack to get a thread
> dump of the agents process.
>
> Thanks,
> Karl
>
>
> On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> I stayed with base 1.6.1 and manually patched the code to include the two
>> new methods in DBInterfacePostgreSQL
>>
>> Paul
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> The changes pass all tests here.  Is it possible that you attempted some
>>> upgrade that failed (or didn't attempt upgrade but went to a new code
>>> version)?
>>>
>>> If you could let me know as exactly as possible what you did, I can let
>>> you know if that should have worked or not.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Karl,
>>>>
>>>> We appear to be seeing an issue with the performance change to use an
>>>> OR clause rather than IN. After making the change, when we restart
>>>> manifoldcf (with one job in running state) documents in the running job are
>>>> not picked up for processing by the stuffer thread. If we redploy base
>>>> 1.6.1 and restart documents are processed. This is consistently switchable
>>>> depending on which code base is deployed.
>>>>
>>>> We have logs that I could upload to the ticket if you recommend that we
>>>> reopen the issue (or create a new one)?
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Paul --
>>>>>
>>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>>> at most to migrate to 1.7.
>>>>>
>>>>> IF you ALREADY tried an upgrade with the branch code, then you would
>>>>> have wound up in a schema state where the schema had more columns in it
>>>>> than the branch knew how to deal with.  That's bad, and you will need to do
>>>>> things to fix the situation.  I believe you should still be able to do the
>>>>> following:
>>>>>
>>>>> - Download 1.7 source, or check out
>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>>> - Apply the patches
>>>>> - Build
>>>>> - Modify your properties.xml to point to your postgresql instance
>>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>>> start the single-process example)
>>>>>
>>>>> You should then have a working 1.7 release, with code patches applied.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <
>>>>> paul.boichat@exonar.com> wrote:
>>>>>
>>>>>> Thanks - we've pulled down the branch and will test the changes. It
>>>>>> looks like a branch of 1.7 so it's going to take us a little while to test.
>>>>>> We need to migrate our connectors (there's some deprecated stuff that's now
>>>>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>>>>>> to include the pipeline and any other schema changes. We'll have some
>>>>>> environment contention over the next week as our performance test
>>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>>> clear I'll move to 1.7
>>>>>>
>>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>>> simple way to migrate and existing database?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>>> CONNECTORS-1027.
>>>>>>>
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>>
>>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue
>>>>>>>> LIMIT 10001) t;
>>>>>>>>
>>>>>>>> QUERY PLAN
>>>>>>>>
>>>>>>>>
>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>> ----------------------------------
>>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>>                Heap Fetches: 725
>>>>>>>>  Total runtime: 9.157 ms
>>>>>>>> (5 rows)
>>>>>>>>
>>>>>>>>
>>>>>>>> Whereas:
>>>>>>>>
>>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>>
>>>>>>>> QUERY PLAN
>>>>>>>>
>>>>>>>>
>>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>>> ----------------------------------------
>>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0)
>>>>>>>> (actual time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>>                Heap Fetches: 684297
>>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> Could you try this query on your database please and tell me if it
>>>>>>>>> executes promptly:
>>>>>>>>>
>>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I vaguely remember that I had to change the form of this query in
>>>>>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>>>>>>> Postgresql optimization work.
>>>>>>>>>>
>>>>>>>>>> Please note that trunk code already has schema changes relative
>>>>>>>>>> to MCF 1.7, so you will not be able to work directly with this branch
>>>>>>>>>> code.  I'll have to create patches for whatever changes you would need to
>>>>>>>>>> try.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to
>>>>>>>>>>> the doc.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <
>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>>
>>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> These queries are all from the UI; it is what gets generated
>>>>>>>>>>>> when no limits are in place:
>>>>>>>>>>>>
>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>>
>>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>>
>>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>>
>>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>>> let me know.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>>
>>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The final query is the document stuffing query, which is
>>>>>>>>>>>> perhaps the most critical query in the whole system:
>>>>>>>>>>>>
>>>>>>>>>>>>  SELECT
>>>>>>>>>>>>  t0.id
>>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>>> t0.checktime
>>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>>  )
>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>>  )
>>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>>  )
>>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>>
>>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>
>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>
>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>>> ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>>>>>> width=287)
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter
>>>>>>>>>>>>>>>> you can set which limits the number of documents counted to at most a
>>>>>>>>>>>>>>>> specified amount.  This uses a limit clause, which should prevent unbounded
>>>>>>>>>>>>>>>> time doing these kinds of queries:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has
>>>>>>>>>>>>>>>> been spent in MCF to insure that this doesn't happen.  Specifically, the
>>>>>>>>>>>>>>>> main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in
>>>>>>>>>>>>>>>> your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming
>>>>>>>>>>>>>>>>> the jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's
>>>>>>>>>>>>>>>>> going to take some time in postgres.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per
>>>>>>>>>>>>>>>>> queue for example).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter
>>>>>>>>>>>>>>>>>> 2) that discusses this issue.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

The tests in fact do multiple complete crawls, so it is extremely unlikely
that the stuffer query is broken.  If you look at the queries generated,
you should note that the only difference is that whenever an xxx IN(?,?)
was generated before, a (xxx=? OR xxx=?) is generated instead.  These
should be completely equivalent; if they don't look equivalent to you in
the log, then I will fix whatever is broken.  I'll make sure here that the
queries look right visually too.

One possibility is that when you restarted the agents process, the jobqueue
records did not yet finish getting reprioritized.  Stuffer queries are
fired all the time, but the running jobs must complete reprioritization
before the stuffer query will pick up any records.  I wonder if they may
not have managed to get to the right state before you aborted the
experiment?  You can tell what is happening by using jstack to get a thread
dump of the agents process.

Thanks,
Karl


On Fri, Sep 12, 2014 at 11:05 AM, Paul Boichat <pa...@exonar.com>
wrote:

> I stayed with base 1.6.1 and manually patched the code to include the two
> new methods in DBInterfacePostgreSQL
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com> wrote:
>
>> The changes pass all tests here.  Is it possible that you attempted some
>> upgrade that failed (or didn't attempt upgrade but went to a new code
>> version)?
>>
>> If you could let me know as exactly as possible what you did, I can let
>> you know if that should have worked or not.
>>
>> Thanks!
>> Karl
>>
>>
>>
>>
>>
>> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Karl,
>>>
>>> We appear to be seeing an issue with the performance change to use an OR
>>> clause rather than IN. After making the change, when we restart manifoldcf
>>> (with one job in running state) documents in the running job are not picked
>>> up for processing by the stuffer thread. If we redploy base 1.6.1 and
>>> restart documents are processed. This is consistently switchable depending
>>> on which code base is deployed.
>>>
>>> We have logs that I could upload to the ticket if you recommend that we
>>> reopen the issue (or create a new one)?
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Paul --
>>>>
>>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of
>>>> trunk, which is MCF 2.0.  MCF 2.0 is not backwards compatible with any
>>>> previous MCF release, and indeed there is no upgrade from any 1.x release
>>>> to 2.0.  That's why I said to use the patches, and try to stay on 1.6.1 or
>>>> at most to migrate to 1.7.
>>>>
>>>> IF you ALREADY tried an upgrade with the branch code, then you would
>>>> have wound up in a schema state where the schema had more columns in it
>>>> than the branch knew how to deal with.  That's bad, and you will need to do
>>>> things to fix the situation.  I believe you should still be able to do the
>>>> following:
>>>>
>>>> - Download 1.7 source, or check out
>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>>> - Apply the patches
>>>> - Build
>>>> - Modify your properties.xml to point to your postgresql instance
>>>> - Run the upgrade (initialize.bat on the multi-process example, or
>>>> start the single-process example)
>>>>
>>>> You should then have a working 1.7 release, with code patches applied.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <paul.boichat@exonar.com
>>>> > wrote:
>>>>
>>>>> Thanks - we've pulled down the branch and will test the changes. It
>>>>> looks like a branch of 1.7 so it's going to take us a little while to test.
>>>>> We need to migrate our connectors (there's some deprecated stuff that's now
>>>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>>>>> to include the pipeline and any other schema changes. We'll have some
>>>>> environment contention over the next week as our performance test
>>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>>> clear I'll move to 1.7
>>>>>
>>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>>> simple way to migrate and existing database?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>>> CONNECTORS-1027.
>>>>>>
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> That comes back immediately with 10001 rows:
>>>>>>>
>>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>>>>>> 10001) t;
>>>>>>>
>>>>>>> QUERY PLAN
>>>>>>>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>> ----------------------------------
>>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>>                Heap Fetches: 725
>>>>>>>  Total runtime: 9.157 ms
>>>>>>> (5 rows)
>>>>>>>
>>>>>>>
>>>>>>> Whereas:
>>>>>>>
>>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>>
>>>>>>> QUERY PLAN
>>>>>>>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>>> ----------------------------------------
>>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>>> time=5225.105..5225.106 rows=1 loops=1)
>>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>>                Heap Fetches: 684297
>>>>>>>  Total runtime: 5225.151 ms
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> Could you try this query on your database please and tell me if it
>>>>>>>> executes promptly:
>>>>>>>>
>>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>>
>>>>>>>>
>>>>>>>> I vaguely remember that I had to change the form of this query in
>>>>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>>>>>> Postgresql optimization work.
>>>>>>>>>
>>>>>>>>> Please note that trunk code already has schema changes relative to
>>>>>>>>> MCF 1.7, so you will not be able to work directly with this branch code.
>>>>>>>>> I'll have to create patches for whatever changes you would need to try.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to
>>>>>>>>>> the doc.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <daddywri@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> Can you include the logged plan for this query; this is an
>>>>>>>>>>> actual query encountered during crawling:
>>>>>>>>>>>
>>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> These queries are all from the UI; it is what gets generated
>>>>>>>>>>> when no limits are in place:
>>>>>>>>>>>
>>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>>
>>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>>
>>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>>
>>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>>> let me know.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>>
>>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The final query is the document stuffing query, which is perhaps
>>>>>>>>>>> the most critical query in the whole system:
>>>>>>>>>>>
>>>>>>>>>>>  SELECT
>>>>>>>>>>>  t0.id
>>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>>> t0.checktime
>>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>>> t1.priority=5
>>>>>>>>>>>  )
>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>>  )
>>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>>  )
>>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>>
>>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>>> postgres.
>>>>>>>>>>>>
>>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>>
>>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>>
>>>>>>>>>>>>> This will still do a sequential scan, but it will be an
>>>>>>>>>>>>> aborted one, so you can control the maximum amount of time spent doing the
>>>>>>>>>>>>> query.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it
>>>>>>>>>>>>>> from count(*)-ing but I'll certainly have another look to see if we've
>>>>>>>>>>>>>> missed something.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>> Parameter 0: '1.000000001E9'
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -
>>>>>>>>>>>>>> Parameter 1: '1407144048075'
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>>> ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>>>>> width=287)
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you
>>>>>>>>>>>>>>> can set which limits the number of documents counted to at most a specified
>>>>>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has
>>>>>>>>>>>>>>> been spent in MCF to insure that this doesn't happen.  Specifically, the
>>>>>>>>>>>>>>> main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in
>>>>>>>>>>>>>>> your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming
>>>>>>>>>>>>>>>> the jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's going
>>>>>>>>>>>>>>>> to take some time in postgres.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per
>>>>>>>>>>>>>>>> queue for example).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents
>>>>>>>>>>>>>>>>> processes or parallel connectors.  Overall database performance is the
>>>>>>>>>>>>>>>>> limiting factor.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter
>>>>>>>>>>>>>>>>> 2) that discusses this issue.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

I stayed with base 1.6.1 and manually patched the code to include the two
new methods in DBInterfacePostgreSQL

Paul



VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Fri, Sep 12, 2014 at 4:01 PM, Karl Wright <da...@gmail.com> wrote:

> The changes pass all tests here.  Is it possible that you attempted some
> upgrade that failed (or didn't attempt upgrade but went to a new code
> version)?
>
> If you could let me know as exactly as possible what you did, I can let
> you know if that should have worked or not.
>
> Thanks!
> Karl
>
>
>
>
>
> On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Karl,
>>
>> We appear to be seeing an issue with the performance change to use an OR
>> clause rather than IN. After making the change, when we restart manifoldcf
>> (with one job in running state) documents in the running job are not picked
>> up for processing by the stuffer thread. If we redploy base 1.6.1 and
>> restart documents are processed. This is consistently switchable depending
>> on which code base is deployed.
>>
>> We have logs that I could upload to the ticket if you recommend that we
>> reopen the issue (or create a new one)?
>>
>> Paul
>>
>>
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Paul --
>>>
>>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of trunk,
>>> which is MCF 2.0.  MCF 2.0 is not backwards compatible with any previous
>>> MCF release, and indeed there is no upgrade from any 1.x release to 2.0.
>>> That's why I said to use the patches, and try to stay on 1.6.1 or at most
>>> to migrate to 1.7.
>>>
>>> IF you ALREADY tried an upgrade with the branch code, then you would
>>> have wound up in a schema state where the schema had more columns in it
>>> than the branch knew how to deal with.  That's bad, and you will need to do
>>> things to fix the situation.  I believe you should still be able to do the
>>> following:
>>>
>>> - Download 1.7 source, or check out
>>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>>> - Apply the patches
>>> - Build
>>> - Modify your properties.xml to point to your postgresql instance
>>> - Run the upgrade (initialize.bat on the multi-process example, or start
>>> the single-process example)
>>>
>>> You should then have a working 1.7 release, with code patches applied.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Thanks - we've pulled down the branch and will test the changes. It
>>>> looks like a branch of 1.7 so it's going to take us a little while to test.
>>>> We need to migrate our connectors (there's some deprecated stuff that's now
>>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>>>> to include the pipeline and any other schema changes. We'll have some
>>>> environment contention over the next week as our performance test
>>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>>> clear I'll move to 1.7
>>>>
>>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>>> simple way to migrate and existing database?
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks -- I'll include that change as well then, in ticket
>>>>> CONNECTORS-1027.
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <paul.boichat@exonar.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> That comes back immediately with 10001 rows:
>>>>>>
>>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>>>>> 10001) t;
>>>>>>
>>>>>> QUERY PLAN
>>>>>>
>>>>>>
>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>> ----------------------------------
>>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>>                Heap Fetches: 725
>>>>>>  Total runtime: 9.157 ms
>>>>>> (5 rows)
>>>>>>
>>>>>>
>>>>>> Whereas:
>>>>>>
>>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>>
>>>>>> QUERY PLAN
>>>>>>
>>>>>>
>>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>>> ----------------------------------------
>>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>>> time=5225.105..5225.106 rows=1 loops=1)
>>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>>                Heap Fetches: 684297
>>>>>>  Total runtime: 5225.151 ms
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Could you try this query on your database please and tell me if it
>>>>>>> executes promptly:
>>>>>>>
>>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>>
>>>>>>>
>>>>>>> I vaguely remember that I had to change the form of this query in
>>>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>>>>> Postgresql optimization work.
>>>>>>>>
>>>>>>>> Please note that trunk code already has schema changes relative to
>>>>>>>> MCF 1.7, so you will not be able to work directly with this branch code.
>>>>>>>> I'll have to create patches for whatever changes you would need to try.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to
>>>>>>>>> the doc.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>>>>> query encountered during crawling:
>>>>>>>>>>
>>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> These queries are all from the UI; it is what gets generated when
>>>>>>>>>> no limits are in place:
>>>>>>>>>>
>>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>>
>>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>>
>>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>>
>>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>>> let me know.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>>
>>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The final query is the document stuffing query, which is perhaps
>>>>>>>>>> the most critical query in the whole system:
>>>>>>>>>>
>>>>>>>>>>  SELECT
>>>>>>>>>>  t0.id
>>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>>> t0.checktime
>>>>>>>>>>  <= 1407246846166
>>>>>>>>>>  AND EXISTS (
>>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>>> t1.priority=5
>>>>>>>>>>  )
>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>>  )
>>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>>  )
>>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>>  LIMIT 480;
>>>>>>>>>>
>>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>>> postgres.
>>>>>>>>>>>
>>>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <daddywri@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> The jobstatus query that uses count(*) should be doing
>>>>>>>>>>>> something like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>>
>>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>>
>>>>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>>>>> one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>>>>>>> something.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>>>
>>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>>>> 0: '1.000000001E9'
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>>>> 1: '1407144048075'
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>>> ->  Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>>>> width=287)
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>
>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>
>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you
>>>>>>>>>>>>>> can set which limits the number of documents counted to at most a specified
>>>>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has
>>>>>>>>>>>>>> been spent in MCF to insure that this doesn't happen.  Specifically, the
>>>>>>>>>>>>>> main document queuing query is structured to read directly out of a
>>>>>>>>>>>>>> specific jobqueue index.  This is the crucial query that must work properly
>>>>>>>>>>>>>> for scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in
>>>>>>>>>>>>>> your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming
>>>>>>>>>>>>>>> the jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is
>>>>>>>>>>>>>>> executed when querying job status will do a full table scan of
>>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's going
>>>>>>>>>>>>>>> to take some time in postgres.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>>>>> for example).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes
>>>>>>>>>>>>>>>> or parallel connectors.  Overall database performance is the limiting
>>>>>>>>>>>>>>>> factor.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

The changes pass all tests here.  Is it possible that you attempted some
upgrade that failed (or didn't attempt upgrade but went to a new code
version)?

If you could let me know as exactly as possible what you did, I can let you
know if that should have worked or not.

Thanks!
Karl





On Fri, Sep 12, 2014 at 10:57 AM, Paul Boichat <pa...@exonar.com>
wrote:

> Karl,
>
> We appear to be seeing an issue with the performance change to use an OR
> clause rather than IN. After making the change, when we restart manifoldcf
> (with one job in running state) documents in the running job are not picked
> up for processing by the stuffer thread. If we redploy base 1.6.1 and
> restart documents are processed. This is consistently switchable depending
> on which code base is deployed.
>
> We have logs that I could upload to the ticket if you recommend that we
> reopen the issue (or create a new one)?
>
> Paul
>
>
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul --
>>
>> Just to be clear -- the branch for CONNECTORS-1027 is a branch of trunk,
>> which is MCF 2.0.  MCF 2.0 is not backwards compatible with any previous
>> MCF release, and indeed there is no upgrade from any 1.x release to 2.0.
>> That's why I said to use the patches, and try to stay on 1.6.1 or at most
>> to migrate to 1.7.
>>
>> IF you ALREADY tried an upgrade with the branch code, then you would have
>> wound up in a schema state where the schema had more columns in it than the
>> branch knew how to deal with.  That's bad, and you will need to do things
>> to fix the situation.  I believe you should still be able to do the
>> following:
>>
>> - Download 1.7 source, or check out
>> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
>> - Apply the patches
>> - Build
>> - Modify your properties.xml to point to your postgresql instance
>> - Run the upgrade (initialize.bat on the multi-process example, or start
>> the single-process example)
>>
>> You should then have a working 1.7 release, with code patches applied.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Thanks - we've pulled down the branch and will test the changes. It
>>> looks like a branch of 1.7 so it's going to take us a little while to test.
>>> We need to migrate our connectors (there's some deprecated stuff that's now
>>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>>> to include the pipeline and any other schema changes. We'll have some
>>> environment contention over the next week as our performance test
>>> environment needs to remain on 1.6.1 while we test a release. Once that's
>>> clear I'll move to 1.7
>>>
>>> On the database schema patch moving from 1.6.1 to 1.7 - is there a
>>> simple way to migrate and existing database?
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Thanks -- I'll include that change as well then, in ticket
>>>> CONNECTORS-1027.
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <pa...@exonar.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> That comes back immediately with 10001 rows:
>>>>>
>>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>>>> 10001) t;
>>>>>
>>>>> QUERY PLAN
>>>>>
>>>>>
>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>> ----------------------------------
>>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>>> time=9.125..9.125 rows=1 loops=1)
>>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>>> =0.031..3.257 rows=10001 loops=1)
>>>>>                Heap Fetches: 725
>>>>>  Total runtime: 9.157 ms
>>>>> (5 rows)
>>>>>
>>>>>
>>>>> Whereas:
>>>>>
>>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>>
>>>>> QUERY PLAN
>>>>>
>>>>>
>>>>> -----------------------------------------------------------------------------------------------------------------------
>>>>> ----------------------------------------
>>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>>> time=5225.105..5225.106 rows=1 loops=1)
>>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>>                Heap Fetches: 684297
>>>>>  Total runtime: 5225.151 ms
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> Could you try this query on your database please and tell me if it
>>>>>> executes promptly:
>>>>>>
>>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>>
>>>>>>
>>>>>> I vaguely remember that I had to change the form of this query in
>>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>>>> Postgresql optimization work.
>>>>>>>
>>>>>>> Please note that trunk code already has schema changes relative to
>>>>>>> MCF 1.7, so you will not be able to work directly with this branch code.
>>>>>>> I'll have to create patches for whatever changes you would need to try.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>>>> doc.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>>>> query encountered during crawling:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These queries are all from the UI; it is what gets generated when
>>>>>>>>> no limits are in place:
>>>>>>>>>
>>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>>
>>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>>
>>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>>> let me know.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>>
>>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The final query is the document stuffing query, which is perhaps
>>>>>>>>> the most critical query in the whole system:
>>>>>>>>>
>>>>>>>>>  SELECT
>>>>>>>>>  t0.id
>>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>>> t0.checktime
>>>>>>>>>  <= 1407246846166
>>>>>>>>>  AND EXISTS (
>>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>>> t1.priority=5
>>>>>>>>>  )
>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>>  )
>>>>>>>>>  AND NOT EXISTS (
>>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>>  )
>>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>>  LIMIT 480;
>>>>>>>>>
>>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
>>>>>>>>>> doesn't seem to buy us that much on the table scan - in the attached you'll
>>>>>>>>>> see that it's still taking a long time to return the job status page. Also
>>>>>>>>>> in the attached are some sample other long running queries that we're
>>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>>> postgres.
>>>>>>>>>>
>>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>>
>>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>>
>>>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>>>> one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>>>>>> something.
>>>>>>>>>>>>
>>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>>
>>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>>> 0: '1.000000001E9'
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>>> 1: '1407144048075'
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>>>> Update on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>>> width=287)
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you
>>>>>>>>>>>>> can set which limits the number of documents counted to at most a specified
>>>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>>
>>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in
>>>>>>>>>>>>> your case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming
>>>>>>>>>>>>>> the jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>>>>>> when querying job status will do a full table scan of
>>>>>>>>>>>>>> jobqueue which has more than 10 million rows. That's going
>>>>>>>>>>>>>> to take some time in postgres.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>>> postgres) but we are increasingly seeing long running
>>>>>>>>>>>>>> queries in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>>>> for example).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes
>>>>>>>>>>>>>>> or parallel connectors.  Overall database performance is the limiting
>>>>>>>>>>>>>>> factor.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to know what is the maximum number of
>>>>>>>>>>>>>>>> documents that you managed to crawl with ManifoldCF and with how many
>>>>>>>>>>>>>>>> connectors in parallel it could works ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Karl,

We appear to be seeing an issue with the performance change to use an OR
clause rather than IN. After making the change, when we restart manifoldcf
(with one job in running state) documents in the running job are not picked
up for processing by the stuffer thread. If we redploy base 1.6.1 and
restart documents are processed. This is consistently switchable depending
on which code base is deployed.

We have logs that I could upload to the ticket if you recommend that we
reopen the issue (or create a new one)?

Paul





VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Fri, Sep 12, 2014 at 6:05 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul --
>
> Just to be clear -- the branch for CONNECTORS-1027 is a branch of trunk,
> which is MCF 2.0.  MCF 2.0 is not backwards compatible with any previous
> MCF release, and indeed there is no upgrade from any 1.x release to 2.0.
> That's why I said to use the patches, and try to stay on 1.6.1 or at most
> to migrate to 1.7.
>
> IF you ALREADY tried an upgrade with the branch code, then you would have
> wound up in a schema state where the schema had more columns in it than the
> branch knew how to deal with.  That's bad, and you will need to do things
> to fix the situation.  I believe you should still be able to do the
> following:
>
> - Download 1.7 source, or check out
> https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
> - Apply the patches
> - Build
> - Modify your properties.xml to point to your postgresql instance
> - Run the upgrade (initialize.bat on the multi-process example, or start
> the single-process example)
>
> You should then have a working 1.7 release, with code patches applied.
>
> Thanks,
> Karl
>
>
>
>
> On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Thanks - we've pulled down the branch and will test the changes. It looks
>> like a branch of 1.7 so it's going to take us a little while to test. We
>> need to migrate our connectors (there's some deprecated stuff that's now
>> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
>> to include the pipeline and any other schema changes. We'll have some
>> environment contention over the next week as our performance test
>> environment needs to remain on 1.6.1 while we test a release. Once that's
>> clear I'll move to 1.7
>>
>> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
>> way to migrate and existing database?
>>
>> Thanks,
>>
>> Paul
>>
>>
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Thanks -- I'll include that change as well then, in ticket
>>> CONNECTORS-1027.
>>>
>>>
>>> Karl
>>>
>>>
>>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> That comes back immediately with 10001 rows:
>>>>
>>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>>> 10001) t;
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------------------------------------------
>>>> ----------------------------------
>>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>>> time=9.125..9.125 rows=1 loops=1)
>>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>>> time=0.033..6.945 rows=10001 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>>> =0.031..3.257 rows=10001 loops=1)
>>>>                Heap Fetches: 725
>>>>  Total runtime: 9.157 ms
>>>> (5 rows)
>>>>
>>>>
>>>> Whereas:
>>>>
>>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------------------------------------------
>>>> ----------------------------------------
>>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>> time=5225.107..5225.109 rows=1 loops=1)
>>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>>> time=5225.105..5225.106 rows=1 loops=1)
>>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>>> =0.108..3090.848 rows=10370209 loops=1)
>>>>                Heap Fetches: 684297
>>>>  Total runtime: 5225.151 ms
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Could you try this query on your database please and tell me if it
>>>>> executes promptly:
>>>>>
>>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>>
>>>>>
>>>>> I vaguely remember that I had to change the form of this query in
>>>>> order to support MySQL -- but first let's see if this helps.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>>> Postgresql optimization work.
>>>>>>
>>>>>> Please note that trunk code already has schema changes relative to
>>>>>> MCF 1.7, so you will not be able to work directly with this branch code.
>>>>>> I'll have to create patches for whatever changes you would need to try.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>>> doc.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>>> query encountered during crawling:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>>
>>>>>>>>
>>>>>>>> These queries are all from the UI; it is what gets generated when
>>>>>>>> no limits are in place:
>>>>>>>>
>>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>>
>>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>>
>>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>>
>>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>>> let me know.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Some queries show up in this list because MCF periodically
>>>>>>>> reindexes tables.  For example, this query goes only against the (small)
>>>>>>>> jobs table.  Its poor performance on occasion is likely due to something
>>>>>>>> else happening to the database, probably a reindex:
>>>>>>>>
>>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>>
>>>>>>>>
>>>>>>>> The final query is the document stuffing query, which is perhaps
>>>>>>>> the most critical query in the whole system:
>>>>>>>>
>>>>>>>>  SELECT
>>>>>>>>  t0.id
>>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>>> t0.checktime
>>>>>>>>  <= 1407246846166
>>>>>>>>  AND EXISTS (
>>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>>> t1.priority=5
>>>>>>>>  )
>>>>>>>>  AND NOT EXISTS (
>>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>>  )
>>>>>>>>  AND NOT EXISTS (
>>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>>  )
>>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>>  LIMIT 480;
>>>>>>>>
>>>>>>>> Your analysis of whether IN beats OR does not agree with
>>>>>>>> experiments I did on postgresql 8.7 which showed no difference.  What
>>>>>>>> Postgresql version are you using?  Also, I trust you have query plans that
>>>>>>>> demonstrate your claim?  In any case, whether IN vs. OR is generated is a
>>>>>>>> function of the MCF database driver, so this is trivial to experiment
>>>>>>>> with.  I'll create a ticket and a branch for experimentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>>>>>> that it's still taking a long time to return the job status page. Also in
>>>>>>>>> the attached are some sample other long running queries that we're
>>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>>> postgres.
>>>>>>>>>
>>>>>>>>> We're loading the system from 10mil towards 100mil so would be
>>>>>>>>> keen to work with you to optimise where possible.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>>
>>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>>
>>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>>> one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>>>>> something.
>>>>>>>>>>>
>>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>>
>>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>> 0: '1.000000001E9'
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter
>>>>>>>>>>> 1: '1407144048075'
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update
>>>>>>>>>>> on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>>> width=287)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:         Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:         ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
>>>>>>>>>>> rows=764916 width=0)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <daddywri@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you
>>>>>>>>>>>> can set which limits the number of documents counted to at most a specified
>>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>>
>>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>>
>>>>>>>>>>>> The documentation says that the default value for this
>>>>>>>>>>>> parameter is 10000, which however is incorrect.  The actual true default is
>>>>>>>>>>>> 500000.  You could set that lower for better UI performance (losing some
>>>>>>>>>>>> information, of course.)
>>>>>>>>>>>>
>>>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>>> million. At that point, even with good postgres vacuuming the
>>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>>>>> which has more than 10 million rows. That's going to take some time in
>>>>>>>>>>>>> postgres.
>>>>>>>>>>>>>
>>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>>> postgres) but we are increasingly seeing long running queries
>>>>>>>>>>>>> in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>>> for example).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>>
>>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>>
>>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes
>>>>>>>>>>>>>> or parallel connectors.  Overall database performance is the limiting
>>>>>>>>>>>>>> factor.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul --

Just to be clear -- the branch for CONNECTORS-1027 is a branch of trunk,
which is MCF 2.0.  MCF 2.0 is not backwards compatible with any previous
MCF release, and indeed there is no upgrade from any 1.x release to 2.0.
That's why I said to use the patches, and try to stay on 1.6.1 or at most
to migrate to 1.7.

IF you ALREADY tried an upgrade with the branch code, then you would have
wound up in a schema state where the schema had more columns in it than the
branch knew how to deal with.  That's bad, and you will need to do things
to fix the situation.  I believe you should still be able to do the
following:

- Download 1.7 source, or check out
https://svn.apache.org/repos/asf/manifoldcf/branches/release-1.7-branch
- Apply the patches
- Build
- Modify your properties.xml to point to your postgresql instance
- Run the upgrade (initialize.bat on the multi-process example, or start
the single-process example)

You should then have a working 1.7 release, with code patches applied.

Thanks,
Karl




On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <pa...@exonar.com>
wrote:

> Thanks - we've pulled down the branch and will test the changes. It looks
> like a branch of 1.7 so it's going to take us a little while to test. We
> need to migrate our connectors (there's some deprecated stuff that's now
> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
> to include the pipeline and any other schema changes. We'll have some
> environment contention over the next week as our performance test
> environment needs to remain on 1.6.1 while we test a release. Once that's
> clear I'll move to 1.7
>
> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
> way to migrate and existing database?
>
> Thanks,
>
> Paul
>
>
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Thanks -- I'll include that change as well then, in ticket
>> CONNECTORS-1027.
>>
>>
>> Karl
>>
>>
>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi,
>>>
>>> That comes back immediately with 10001 rows:
>>>
>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>> 10001) t;
>>>
>>> QUERY PLAN
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>> ----------------------------------
>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>> time=9.125..9.125 rows=1 loops=1)
>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>> time=0.033..6.945 rows=10001 loops=1)
>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>> =0.031..3.257 rows=10001 loops=1)
>>>                Heap Fetches: 725
>>>  Total runtime: 9.157 ms
>>> (5 rows)
>>>
>>>
>>> Whereas:
>>>
>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>
>>> QUERY PLAN
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>> ----------------------------------------
>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>> time=5225.107..5225.109 rows=1 loops=1)
>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>> time=5225.105..5225.106 rows=1 loops=1)
>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>> =0.108..3090.848 rows=10370209 loops=1)
>>>                Heap Fetches: 684297
>>>  Total runtime: 5225.151 ms
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Could you try this query on your database please and tell me if it
>>>> executes promptly:
>>>>
>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>
>>>>
>>>> I vaguely remember that I had to change the form of this query in order
>>>> to support MySQL -- but first let's see if this helps.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>> Postgresql optimization work.
>>>>>
>>>>> Please note that trunk code already has schema changes relative to MCF
>>>>> 1.7, so you will not be able to work directly with this branch code.  I'll
>>>>> have to create patches for whatever changes you would need to try.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <paul.boichat@exonar.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>> doc.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>> query encountered during crawling:
>>>>>>>
>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>
>>>>>>>
>>>>>>> These queries are all from the UI; it is what gets generated when no
>>>>>>> limits are in place:
>>>>>>>
>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>
>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>
>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>
>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>> let me know.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Some queries show up in this list because MCF periodically reindexes
>>>>>>> tables.  For example, this query goes only against the (small) jobs table.
>>>>>>> Its poor performance on occasion is likely due to something else happening
>>>>>>> to the database, probably a reindex:
>>>>>>>
>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>
>>>>>>>
>>>>>>> The final query is the document stuffing query, which is perhaps the
>>>>>>> most critical query in the whole system:
>>>>>>>
>>>>>>>  SELECT
>>>>>>>  t0.id
>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>> t0.checktime
>>>>>>>  <= 1407246846166
>>>>>>>  AND EXISTS (
>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>> t1.priority=5
>>>>>>>  )
>>>>>>>  AND NOT EXISTS (
>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>  )
>>>>>>>  AND NOT EXISTS (
>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>  )
>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>  LIMIT 480;
>>>>>>>
>>>>>>> Your analysis of whether IN beats OR does not agree with experiments
>>>>>>> I did on postgresql 8.7 which showed no difference.  What Postgresql
>>>>>>> version are you using?  Also, I trust you have query plans that demonstrate
>>>>>>> your claim?  In any case, whether IN vs. OR is generated is a function of
>>>>>>> the MCF database driver, so this is trivial to experiment with.  I'll
>>>>>>> create a ticket and a branch for experimentation.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>>>>> that it's still taking a long time to return the job status page. Also in
>>>>>>>> the attached are some sample other long running queries that we're
>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>> postgres.
>>>>>>>>
>>>>>>>> We're loading the system from 10mil towards 100mil so would be keen
>>>>>>>> to work with you to optimise where possible.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>
>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>
>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>> one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>>>> something.
>>>>>>>>>>
>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>
>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>>>>>>> '1.000000001E9'
>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>>>>>>> '1407144048075'
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update
>>>>>>>>>> on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>> width=287)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>> ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>>>>>>>> width=0)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can
>>>>>>>>>>> set which limits the number of documents counted to at most a specified
>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>
>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>
>>>>>>>>>>> The documentation says that the default value for this parameter
>>>>>>>>>>> is 10000, which however is incorrect.  The actual true default is 500000.
>>>>>>>>>>> You could set that lower for better UI performance (losing some
>>>>>>>>>>> information, of course.)
>>>>>>>>>>>
>>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>
>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>> million. At that point, even with good postgres vacuuming the
>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>
>>>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>>>> which has more than 10 million rows. That's going to take some time in
>>>>>>>>>>>> postgres.
>>>>>>>>>>>>
>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>> postgres) but we are increasingly seeing long running queries
>>>>>>>>>>>> in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>> for example).
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>
>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

Upgrade is included automatically in MCF.

If you are using the multiprocess example, just run the initialize script
and it will do the upgrade.  The single process example does any required
upgrades every time you start it.

Thanks,
Karl



On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <pa...@exonar.com>
wrote:

> Thanks - we've pulled down the branch and will test the changes. It looks
> like a branch of 1.7 so it's going to take us a little while to test. We
> need to migrate our connectors (there's some deprecated stuff that's now
> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
> to include the pipeline and any other schema changes. We'll have some
> environment contention over the next week as our performance test
> environment needs to remain on 1.6.1 while we test a release. Once that's
> clear I'll move to 1.7
>
> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
> way to migrate and existing database?
>
> Thanks,
>
> Paul
>
>
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Thanks -- I'll include that change as well then, in ticket
>> CONNECTORS-1027.
>>
>>
>> Karl
>>
>>
>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi,
>>>
>>> That comes back immediately with 10001 rows:
>>>
>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>> 10001) t;
>>>
>>> QUERY PLAN
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>> ----------------------------------
>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>> time=9.125..9.125 rows=1 loops=1)
>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>> time=0.033..6.945 rows=10001 loops=1)
>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>> =0.031..3.257 rows=10001 loops=1)
>>>                Heap Fetches: 725
>>>  Total runtime: 9.157 ms
>>> (5 rows)
>>>
>>>
>>> Whereas:
>>>
>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>
>>> QUERY PLAN
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>> ----------------------------------------
>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>> time=5225.107..5225.109 rows=1 loops=1)
>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>> time=5225.105..5225.106 rows=1 loops=1)
>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>> =0.108..3090.848 rows=10370209 loops=1)
>>>                Heap Fetches: 684297
>>>  Total runtime: 5225.151 ms
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Could you try this query on your database please and tell me if it
>>>> executes promptly:
>>>>
>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>
>>>>
>>>> I vaguely remember that I had to change the form of this query in order
>>>> to support MySQL -- but first let's see if this helps.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>> Postgresql optimization work.
>>>>>
>>>>> Please note that trunk code already has schema changes relative to MCF
>>>>> 1.7, so you will not be able to work directly with this branch code.  I'll
>>>>> have to create patches for whatever changes you would need to try.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <paul.boichat@exonar.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>> doc.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>> query encountered during crawling:
>>>>>>>
>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>
>>>>>>>
>>>>>>> These queries are all from the UI; it is what gets generated when no
>>>>>>> limits are in place:
>>>>>>>
>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>
>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>
>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>
>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>> sequential scan of the entire table when given a limit clause.  It
>>>>>>> certainly didn't used to do that.  If you have any other suggestions please
>>>>>>> let me know.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Some queries show up in this list because MCF periodically reindexes
>>>>>>> tables.  For example, this query goes only against the (small) jobs table.
>>>>>>> Its poor performance on occasion is likely due to something else happening
>>>>>>> to the database, probably a reindex:
>>>>>>>
>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>
>>>>>>>
>>>>>>> The final query is the document stuffing query, which is perhaps the
>>>>>>> most critical query in the whole system:
>>>>>>>
>>>>>>>  SELECT
>>>>>>>  t0.id
>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>> t0.checktime
>>>>>>>  <= 1407246846166
>>>>>>>  AND EXISTS (
>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>> t1.priority=5
>>>>>>>  )
>>>>>>>  AND NOT EXISTS (
>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>  )
>>>>>>>  AND NOT EXISTS (
>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>  )
>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>  LIMIT 480;
>>>>>>>
>>>>>>> Your analysis of whether IN beats OR does not agree with experiments
>>>>>>> I did on postgresql 8.7 which showed no difference.  What Postgresql
>>>>>>> version are you using?  Also, I trust you have query plans that demonstrate
>>>>>>> your claim?  In any case, whether IN vs. OR is generated is a function of
>>>>>>> the MCF database driver, so this is trivial to experiment with.  I'll
>>>>>>> create a ticket and a branch for experimentation.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>>>>> that it's still taking a long time to return the job status page. Also in
>>>>>>>> the attached are some sample other long running queries that we're
>>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>>> postgres.
>>>>>>>>
>>>>>>>> We're loading the system from 10mil towards 100mil so would be keen
>>>>>>>> to work with you to optimise where possible.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>
>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>
>>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>>> one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>>>> something.
>>>>>>>>>>
>>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>>
>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>>>>>>> '1.000000001E9'
>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>>>>>>> '1407144048075'
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update
>>>>>>>>>> on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>>> width=287)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>>> ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>>>>>>>> width=0)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can
>>>>>>>>>>> set which limits the number of documents counted to at most a specified
>>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>
>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>
>>>>>>>>>>> The documentation says that the default value for this parameter
>>>>>>>>>>> is 10000, which however is incorrect.  The actual true default is 500000.
>>>>>>>>>>> You could set that lower for better UI performance (losing some
>>>>>>>>>>> information, of course.)
>>>>>>>>>>>
>>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>>
>>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>>> million. At that point, even with good postgres vacuuming the
>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>
>>>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>>>> which has more than 10 million rows. That's going to take some time in
>>>>>>>>>>>> postgres.
>>>>>>>>>>>>
>>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>>> postgres) but we are increasingly seeing long running queries
>>>>>>>>>>>> in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue
>>>>>>>>>>>> for example).
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>
>>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Thanks - we've pulled down the branch and will test the changes. It looks
like a branch of 1.7 so it's going to take us a little while to test. We
need to migrate our connectors (there's some deprecated stuff that's now
been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
to include the pipeline and any other schema changes. We'll have some
environment contention over the next week as our performance test
environment needs to remain on 1.6.1 while we test a release. Once that's
clear I'll move to 1.7

On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
way to migrate and existing database?

Thanks,

Paul





VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <da...@gmail.com> wrote:

> Thanks -- I'll include that change as well then, in ticket CONNECTORS-1027.
>
>
> Karl
>
>
> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Hi,
>>
>> That comes back immediately with 10001 rows:
>>
>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>> 10001) t;
>>
>> QUERY PLAN
>>
>>
>> -----------------------------------------------------------------------------------------------------------------------
>> ----------------------------------
>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>> time=9.125..9.125 rows=1 loops=1)
>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>> time=0.033..6.945 rows=10001 loops=1)
>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>> =0.031..3.257 rows=10001 loops=1)
>>                Heap Fetches: 725
>>  Total runtime: 9.157 ms
>> (5 rows)
>>
>>
>> Whereas:
>>
>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>
>> QUERY PLAN
>>
>>
>> -----------------------------------------------------------------------------------------------------------------------
>> ----------------------------------------
>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>> time=5225.107..5225.109 rows=1 loops=1)
>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>> time=5225.105..5225.106 rows=1 loops=1)
>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>> =0.108..3090.848 rows=10370209 loops=1)
>>                Heap Fetches: 684297
>>  Total runtime: 5225.151 ms
>>
>> Thanks,
>>
>> Paul
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> Could you try this query on your database please and tell me if it
>>> executes promptly:
>>>
>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>
>>>
>>> I vaguely remember that I had to change the form of this query in order
>>> to support MySQL -- but first let's see if this helps.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>> Postgresql optimization work.
>>>>
>>>> Please note that trunk code already has schema changes relative to MCF
>>>> 1.7, so you will not be able to work directly with this branch code.  I'll
>>>> have to create patches for whatever changes you would need to try.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <pa...@exonar.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>> doc.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>> query encountered during crawling:
>>>>>>
>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>
>>>>>>
>>>>>> These queries are all from the UI; it is what gets generated when no
>>>>>> limits are in place:
>>>>>>
>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>
>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>
>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>
>>>>>> I honestly don't understand why PostgreSQL would execute a sequential
>>>>>> scan of the entire table when given a limit clause.  It certainly didn't
>>>>>> used to do that.  If you have any other suggestions please let me know.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Some queries show up in this list because MCF periodically reindexes
>>>>>> tables.  For example, this query goes only against the (small) jobs table.
>>>>>> Its poor performance on occasion is likely due to something else happening
>>>>>> to the database, probably a reindex:
>>>>>>
>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status IN
>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>
>>>>>>
>>>>>> The final query is the document stuffing query, which is perhaps the
>>>>>> most critical query in the whole system:
>>>>>>
>>>>>>  SELECT
>>>>>>  t0.id
>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
>>>>>>  <= 1407246846166
>>>>>>  AND EXISTS (
>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>> t1.priority=5
>>>>>>  )
>>>>>>  AND NOT EXISTS (
>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>  )
>>>>>>  AND NOT EXISTS (
>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>  )
>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>  LIMIT 480;
>>>>>>
>>>>>> Your analysis of whether IN beats OR does not agree with experiments
>>>>>> I did on postgresql 8.7 which showed no difference.  What Postgresql
>>>>>> version are you using?  Also, I trust you have query plans that demonstrate
>>>>>> your claim?  In any case, whether IN vs. OR is generated is a function of
>>>>>> the MCF database driver, so this is trivial to experiment with.  I'll
>>>>>> create a ticket and a branch for experimentation.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>>>> that it's still taking a long time to return the job status page. Also in
>>>>>>> the attached are some sample other long running queries that we're
>>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>>> postgres.
>>>>>>>
>>>>>>> We're loading the system from 10mil towards 100mil so would be keen
>>>>>>> to work with you to optimise where possible.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>
>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>
>>>>>>>> This will still do a sequential scan, but it will be an aborted
>>>>>>>> one, so you can control the maximum amount of time spent doing the query.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>>> something.
>>>>>>>>>
>>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>>
>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>>>>>> '1.000000001E9'
>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>>>>>> '1407144048075'
>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update
>>>>>>>>> on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>>> width=287)
>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>>> ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>>>>>>> width=0)
>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can
>>>>>>>>>> set which limits the number of documents counted to at most a specified
>>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>
>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>
>>>>>>>>>> The documentation says that the default value for this parameter
>>>>>>>>>> is 10000, which however is incorrect.  The actual true default is 500000.
>>>>>>>>>> You could set that lower for better UI performance (losing some
>>>>>>>>>> information, of course.)
>>>>>>>>>>
>>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>>
>>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> We're beginning to see issues with a document count > 10
>>>>>>>>>>> million. At that point, even with good postgres vacuuming the
>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>
>>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>>> which has more than 10 million rows. That's going to take some time in
>>>>>>>>>>> postgres.
>>>>>>>>>>>
>>>>>>>>>>> SSDs will certainly make a big difference to document
>>>>>>>>>>> processing through-put (which we see is largely I/O bound in
>>>>>>>>>>> postgres) but we are increasingly seeing long running queries
>>>>>>>>>>> in the logs. Our current thinking is that we'll need to refactor
>>>>>>>>>>> JobQueue somewhat to optimise queries and, potentially
>>>>>>>>>>> partition jobqueue into a subset of tables (table per queue for
>>>>>>>>>>> example).
>>>>>>>>>>>
>>>>>>>>>>> Paul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> VP Engineering,
>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>
>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>
>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>
>>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <daddywri@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>
>>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>>>>>>
>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>
>>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2)
>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>
>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Thanks -- I'll include that change as well then, in ticket CONNECTORS-1027.


Karl


On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi,
>
> That comes back immediately with 10001 rows:
>
> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
> 10001) t;
>
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------
> ----------------------------------
>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125
> rows=1 loops=1)
>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
> time=0.033..6.945 rows=10001 loops=1)
>          ->  Index Only Scan using jobqueue_pkey on jobqueue
> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
> =0.031..3.257 rows=10001 loops=1)
>                Heap Fetches: 725
>  Total runtime: 9.157 ms
> (5 rows)
>
>
> Whereas:
>
> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------
> ----------------------------------------
>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
> time=5225.107..5225.109 rows=1 loops=1)
>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
> time=5225.105..5225.106 rows=1 loops=1)
>          ->  Index Only Scan using jobqueue_pkey on jobqueue
> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
> =0.108..3090.848 rows=10370209 loops=1)
>                Heap Fetches: 684297
>  Total runtime: 5225.151 ms
>
> Thanks,
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> Could you try this query on your database please and tell me if it
>> executes promptly:
>>
>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>
>>
>> I vaguely remember that I had to change the form of this query in order
>> to support MySQL -- but first let's see if this helps.
>>
>> Karl
>>
>>
>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com> wrote:
>>
>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>> Postgresql optimization work.
>>>
>>> Please note that trunk code already has schema changes relative to MCF
>>> 1.7, so you will not be able to work directly with this branch code.  I'll
>>> have to create patches for whatever changes you would need to try.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>>>
>>>> Thanks
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Can you include the logged plan for this query; this is an actual
>>>>> query encountered during crawling:
>>>>>
>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>
>>>>>
>>>>> These queries are all from the UI; it is what gets generated when no
>>>>> limits are in place:
>>>>>
>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>
>>>>> This query is from the UI with a limit of 1000000:
>>>>>
>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>>> jobqueue t1 LIMIT 1000001]
>>>>>
>>>>> I honestly don't understand why PostgreSQL would execute a sequential
>>>>> scan of the entire table when given a limit clause.  It certainly didn't
>>>>> used to do that.  If you have any other suggestions please let me know.
>>>>>
>>>>>
>>>>>
>>>>> Some queries show up in this list because MCF periodically reindexes
>>>>> tables.  For example, this query goes only against the (small) jobs table.
>>>>> Its poor performance on occasion is likely due to something else happening
>>>>> to the database, probably a reindex:
>>>>>
>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
>>>>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
>>>>> UPDATE]
>>>>>
>>>>>
>>>>> The final query is the document stuffing query, which is perhaps the
>>>>> most critical query in the whole system:
>>>>>
>>>>>  SELECT
>>>>>  t0.id
>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>  t0.priorityset FROM jobqueue t0
>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
>>>>>  <= 1407246846166
>>>>>  AND EXISTS (
>>>>>    SELECT 'x' FROM jobs t1
>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>> t1.priority=5
>>>>>  )
>>>>>  AND NOT EXISTS (
>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>  )
>>>>>  AND NOT EXISTS (
>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>  )
>>>>>  ORDER BY t0.docpriority ASC
>>>>>  LIMIT 480;
>>>>>
>>>>> Your analysis of whether IN beats OR does not agree with experiments I
>>>>> did on postgresql 8.7 which showed no difference.  What Postgresql version
>>>>> are you using?  Also, I trust you have query plans that demonstrate your
>>>>> claim?  In any case, whether IN vs. OR is generated is a function of the
>>>>> MCF database driver, so this is trivial to experiment with.  I'll create a
>>>>> ticket and a branch for experimentation.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <paul.boichat@exonar.com
>>>>> > wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>>> that it's still taking a long time to return the job status page. Also in
>>>>>> the attached are some sample other long running queries that we're
>>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>>> to improve performance). This one in particular would certainly benefit
>>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>>> postgres.
>>>>>>
>>>>>> We're loading the system from 10mil towards 100mil so would be keen
>>>>>> to work with you to optimise where possible.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> The jobstatus query that uses count(*) should be doing something
>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>
>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>
>>>>>>> This will still do a sequential scan, but it will be an aborted one,
>>>>>>> so you can control the maximum amount of time spent doing the query.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>>> something.
>>>>>>>>
>>>>>>>> We're increasingly seeing long running threads and I'll put
>>>>>>>> together some samples. As an example, on a job that's currently aborting:
>>>>>>>>
>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>>>>> '1.000000001E9'
>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>>>>> '1407144048075'
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>>>>>>>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>>> width=287)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>>> ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>>>>>> width=0)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can
>>>>>>>>> set which limits the number of documents counted to at most a specified
>>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>>> doing these kinds of queries:
>>>>>>>>>
>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>
>>>>>>>>> The documentation says that the default value for this parameter
>>>>>>>>> is 10000, which however is incorrect.  The actual true default is 500000.
>>>>>>>>> You could set that lower for better UI performance (losing some
>>>>>>>>> information, of course.)
>>>>>>>>>
>>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>>
>>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> We're beginning to see issues with a document count > 10 million.
>>>>>>>>>> At that point, even with good postgres vacuuming the jobqueue
>>>>>>>>>> table is starting to become a bottleneck.
>>>>>>>>>>
>>>>>>>>>> For example select count(*) from jobqueue, which is executed
>>>>>>>>>> when querying job status will do a full table scan of jobqueue
>>>>>>>>>> which has more than 10 million rows. That's going to take some time in
>>>>>>>>>> postgres.
>>>>>>>>>>
>>>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>>>> through-put (which we see is largely I/O bound in postgres) but
>>>>>>>>>> we are increasingly seeing long running queries in the logs. Our current
>>>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to
>>>>>>>>>> optimise queries and, potentially partition jobqueue into a
>>>>>>>>>> subset of tables (table per queue for example).
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>>>>>
>>>>>>>>>>> I would read this:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>
>>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>>>>>>> discusses this issue.
>>>>>>>>>>>
>>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>
>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi,

That comes back immediately with 10001 rows:

explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001)
t;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
----------------------------------
 Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual time=9.125..9.125
rows=1 loops=1)
   ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
time=0.033..6.945 rows=10001 loops=1)
         ->  Index Only Scan using jobqueue_pkey on jobqueue
(cost=0.00..431189.31 rows=10290271 width=0) (actual time
=0.031..3.257 rows=10001 loops=1)
               Heap Fetches: 725
 Total runtime: 9.157 ms
(5 rows)


Whereas:

explain analyze SELECT count(*) FROM jobqueue limit 10001;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
----------------------------------------
 Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
time=5225.107..5225.109 rows=1 loops=1)
   ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
time=5225.105..5225.106 rows=1 loops=1)
         ->  Index Only Scan using jobqueue_pkey on jobqueue
(cost=0.00..431197.31 rows=10290271 width=0) (actual time
=0.108..3090.848 rows=10370209 loops=1)
               Heap Fetches: 684297
 Total runtime: 5225.151 ms

Thanks,

Paul



VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> Could you try this query on your database please and tell me if it
> executes promptly:
>
> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>
>
> I vaguely remember that I had to change the form of this query in order to
> support MySQL -- but first let's see if this helps.
>
> Karl
>
>
> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com> wrote:
>
>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>> Postgresql optimization work.
>>
>> Please note that trunk code already has schema changes relative to MCF
>> 1.7, so you will not be able to work directly with this branch code.  I'll
>> have to create patches for whatever changes you would need to try.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>>
>>> Thanks
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Can you include the logged plan for this query; this is an actual query
>>>> encountered during crawling:
>>>>
>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM
>>>> carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>
>>>>
>>>> These queries are all from the UI; it is what gets generated when no
>>>> limits are in place:
>>>>
>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>
>>>> This query is from the UI with a limit of 1000000:
>>>>
>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>>> jobqueue t1 LIMIT 1000001]
>>>>
>>>> I honestly don't understand why PostgreSQL would execute a sequential
>>>> scan of the entire table when given a limit clause.  It certainly didn't
>>>> used to do that.  If you have any other suggestions please let me know.
>>>>
>>>>
>>>>
>>>> Some queries show up in this list because MCF periodically reindexes
>>>> tables.  For example, this query goes only against the (small) jobs table.
>>>> Its poor performance on occasion is likely due to something else happening
>>>> to the database, probably a reindex:
>>>>
>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
>>>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
>>>> UPDATE]
>>>>
>>>>
>>>> The final query is the document stuffing query, which is perhaps the
>>>> most critical query in the whole system:
>>>>
>>>>  SELECT
>>>>  t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>  t0.priorityset FROM jobqueue t0
>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
>>>>  <= 1407246846166
>>>>  AND EXISTS (
>>>>    SELECT 'x' FROM jobs t1
>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND t1.priority=5
>>>>  )
>>>>  AND NOT EXISTS (
>>>>    SELECT 'x' FROM jobqueue t2
>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>  )
>>>>  AND NOT EXISTS (
>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>  )
>>>>  ORDER BY t0.docpriority ASC
>>>>  LIMIT 480;
>>>>
>>>> Your analysis of whether IN beats OR does not agree with experiments I
>>>> did on postgresql 8.7 which showed no difference.  What Postgresql version
>>>> are you using?  Also, I trust you have query plans that demonstrate your
>>>> claim?  In any case, whether IN vs. OR is generated is a function of the
>>>> MCF database driver, so this is trivial to experiment with.  I'll create a
>>>> ticket and a branch for experimentation.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <pa...@exonar.com>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Changing maxcountstatus to something much smaller (10,000) doesn't
>>>>> seem to buy us that much on the table scan - in the attached you'll see
>>>>> that it's still taking a long time to return the job status page. Also in
>>>>> the attached are some sample other long running queries that we're
>>>>> beginning to see more frequently. There's also an example of a query that's
>>>>> frequently executed and regularly takes > 4 secs (plus a suggested change
>>>>> to improve performance). This one in particular would certainly benefit
>>>>> from a change to SSDs which should relieve the I/O bound bottleneck on
>>>>> postgres.
>>>>>
>>>>> We're loading the system from 10mil towards 100mil so would be keen to
>>>>> work with you to optimise where possible.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> The jobstatus query that uses count(*) should be doing something like
>>>>>> this when the maxdocumentstatuscount value is set:
>>>>>>
>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>
>>>>>> This will still do a sequential scan, but it will be an aborted one,
>>>>>> so you can control the maximum amount of time spent doing the query.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>>> something.
>>>>>>>
>>>>>>> We're increasingly seeing long running threads and I'll put together
>>>>>>> some samples. As an example, on a job that's currently aborting:
>>>>>>>
>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>>>> '1.000000001E9'
>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>>>> '1407144048075'
>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>>>>>>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>>> width=287)
>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>>> ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>>>>> width=0)
>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> For the jobqueue scans from the UI, there is a parameter you can
>>>>>>>> set which limits the number of documents counted to at most a specified
>>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>>> doing these kinds of queries:
>>>>>>>>
>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>
>>>>>>>> The documentation says that the default value for this parameter is
>>>>>>>> 10000, which however is incorrect.  The actual true default is 500000.  You
>>>>>>>> could set that lower for better UI performance (losing some information, of
>>>>>>>> course.)
>>>>>>>>
>>>>>>>> As for long-running queries, a lot of time and effort has been
>>>>>>>> spent in MCF to insure that this doesn't happen.  Specifically, the main
>>>>>>>> document queuing query is structured to read directly out of a specific
>>>>>>>> jobqueue index.  This is the crucial query that must work properly for
>>>>>>>> scalability, since doing a query that is effectively just a sort on the
>>>>>>>> entire jobqueue would be a major problem.  There are some times where
>>>>>>>> Postgresql's optimizer fails to do the right thing here, mostly because it
>>>>>>>> makes a huge distinction between whether there's zero of something or one
>>>>>>>> of something, but you can work around that particular issue by setting the
>>>>>>>> analyze count to 1 if you start to see this problem -- which basically
>>>>>>>> means that reanalysis of the table has to occur on every stuffing query.
>>>>>>>>
>>>>>>>> I'd appreciate seeing the queries that are long-running in your
>>>>>>>> case so that I can see if that is what you are encountering or not.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> We're beginning to see issues with a document count > 10 million.
>>>>>>>>> At that point, even with good postgres vacuuming the jobqueue
>>>>>>>>> table is starting to become a bottleneck.
>>>>>>>>>
>>>>>>>>> For example select count(*) from jobqueue, which is executed when
>>>>>>>>> querying job status will do a full table scan of jobqueue which
>>>>>>>>> has more than 10 million rows. That's going to take some time in
>>>>>>>>> postgres.
>>>>>>>>>
>>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>>> through-put (which we see is largely I/O bound in postgres) but
>>>>>>>>> we are increasingly seeing long running queries in the logs. Our current
>>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to
>>>>>>>>> optimise queries and, potentially partition jobqueue into a
>>>>>>>>> subset of tables (table per queue for example).
>>>>>>>>>
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VP Engineering,
>>>>>>>>> Exonar Ltd
>>>>>>>>>
>>>>>>>>> T: +44 7940 567724
>>>>>>>>>
>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>> W: http://www.exonar.com
>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>
>>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>
>>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>>>>
>>>>>>>>>> I would read this:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>
>>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>>>>>> discusses this issue.
>>>>>>>>>>
>>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> I would like to know what is the maximum number of documents
>>>>>>>>>>> that you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>
>>>>>>>>>>> Baptiste
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

Could you try this query on your database please and tell me if it executes
promptly:

SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t


I vaguely remember that I had to change the form of this query in order to
support MySQL -- but first let's see if this helps.

Karl


On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <da...@gmail.com> wrote:

> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
> Postgresql optimization work.
>
> Please note that trunk code already has schema changes relative to MCF
> 1.7, so you will not be able to work directly with this branch code.  I'll
> have to create patches for whatever changes you would need to try.
>
> Thanks,
> Karl
>
>
> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Hi,
>>
>> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>>
>> Thanks
>>
>> Paul
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> Can you include the logged plan for this query; this is an actual query
>>> encountered during crawling:
>>>
>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running
>>> query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1,
>>> jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>
>>>
>>> These queries are all from the UI; it is what gets generated when no
>>> limits are in place:
>>>
>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>> FROM jobqueue t1 GROUP BY jobid]
>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>>> FROM jobqueue t1 GROUP BY jobid]
>>>
>>> This query is from the UI with a limit of 1000000:
>>>
>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>>> jobqueue t1 LIMIT 1000001]
>>>
>>> I honestly don't understand why PostgreSQL would execute a sequential
>>> scan of the entire table when given a limit clause.  It certainly didn't
>>> used to do that.  If you have any other suggestions please let me know.
>>>
>>>
>>>
>>> Some queries show up in this list because MCF periodically reindexes
>>> tables.  For example, this query goes only against the (small) jobs table.
>>> Its poor performance on occasion is likely due to something else happening
>>> to the database, probably a reindex:
>>>
>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
>>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
>>> UPDATE]
>>>
>>>
>>> The final query is the document stuffing query, which is perhaps the
>>> most critical query in the whole system:
>>>
>>>  SELECT
>>>  t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>  t0.priorityset FROM jobqueue t0
>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
>>>  <= 1407246846166
>>>  AND EXISTS (
>>>    SELECT 'x' FROM jobs t1
>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND t1.priority=5
>>>  )
>>>  AND NOT EXISTS (
>>>    SELECT 'x' FROM jobqueue t2
>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>  )
>>>  AND NOT EXISTS (
>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>  )
>>>  ORDER BY t0.docpriority ASC
>>>  LIMIT 480;
>>>
>>> Your analysis of whether IN beats OR does not agree with experiments I
>>> did on postgresql 8.7 which showed no difference.  What Postgresql version
>>> are you using?  Also, I trust you have query plans that demonstrate your
>>> claim?  In any case, whether IN vs. OR is generated is a function of the
>>> MCF database driver, so this is trivial to experiment with.  I'll create a
>>> ticket and a branch for experimentation.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Changing maxcountstatus to something much smaller (10,000) doesn't seem
>>>> to buy us that much on the table scan - in the attached you'll see that
>>>> it's still taking a long time to return the job status page. Also in the
>>>> attached are some sample other long running queries that we're beginning to
>>>> see more frequently. There's also an example of a query that's frequently
>>>> executed and regularly takes > 4 secs (plus a suggested change to improve
>>>> performance). This one in particular would certainly benefit from a change
>>>> to SSDs which should relieve the I/O bound bottleneck on postgres.
>>>>
>>>> We're loading the system from 10mil towards 100mil so would be keen to
>>>> work with you to optimise where possible.
>>>>
>>>> Thanks,
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> The jobstatus query that uses count(*) should be doing something like
>>>>> this when the maxdocumentstatuscount value is set:
>>>>>
>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>
>>>>> This will still do a sequential scan, but it will be an aborted one,
>>>>> so you can control the maximum amount of time spent doing the query.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <paul.boichat@exonar.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>>> something.
>>>>>>
>>>>>> We're increasingly seeing long running threads and I'll put together
>>>>>> some samples. As an example, on a job that's currently aborting:
>>>>>>
>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a
>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>>> '1.000000001E9'
>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>>> '1407144048075'
>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>>>>>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->
>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916
>>>>>> width=287)
>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
>>>>>> Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>>>> width=0)
>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> For the jobqueue scans from the UI, there is a parameter you can set
>>>>>>> which limits the number of documents counted to at most a specified
>>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>>> doing these kinds of queries:
>>>>>>>
>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>
>>>>>>> The documentation says that the default value for this parameter is
>>>>>>> 10000, which however is incorrect.  The actual true default is 500000.  You
>>>>>>> could set that lower for better UI performance (losing some information, of
>>>>>>> course.)
>>>>>>>
>>>>>>> As for long-running queries, a lot of time and effort has been spent
>>>>>>> in MCF to insure that this doesn't happen.  Specifically, the main document
>>>>>>> queuing query is structured to read directly out of a specific jobqueue
>>>>>>> index.  This is the crucial query that must work properly for scalability,
>>>>>>> since doing a query that is effectively just a sort on the entire jobqueue
>>>>>>> would be a major problem.  There are some times where Postgresql's
>>>>>>> optimizer fails to do the right thing here, mostly because it makes a huge
>>>>>>> distinction between whether there's zero of something or one of something,
>>>>>>> but you can work around that particular issue by setting the analyze count
>>>>>>> to 1 if you start to see this problem -- which basically means that
>>>>>>> reanalysis of the table has to occur on every stuffing query.
>>>>>>>
>>>>>>> I'd appreciate seeing the queries that are long-running in your case
>>>>>>> so that I can see if that is what you are encountering or not.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> We're beginning to see issues with a document count > 10 million.
>>>>>>>> At that point, even with good postgres vacuuming the jobqueue
>>>>>>>> table is starting to become a bottleneck.
>>>>>>>>
>>>>>>>> For example select count(*) from jobqueue, which is executed when
>>>>>>>> querying job status will do a full table scan of jobqueue which
>>>>>>>> has more than 10 million rows. That's going to take some time in
>>>>>>>> postgres.
>>>>>>>>
>>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>>> through-put (which we see is largely I/O bound in postgres) but we
>>>>>>>> are increasingly seeing long running queries in the logs. Our current
>>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to
>>>>>>>> optimise queries and, potentially partition jobqueue into a subset
>>>>>>>> of tables (table per queue for example).
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd. If
>>>>>>>> you are not the intended recipient of this email, you must neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Baptiste,
>>>>>>>>>
>>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>>>
>>>>>>>>> I would read this:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>
>>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>>>>> discusses this issue.
>>>>>>>>>
>>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I would like to know what is the maximum number of documents that
>>>>>>>>>> you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>>> parallel it could works ?
>>>>>>>>>>
>>>>>>>>>> Thanks for your answer
>>>>>>>>>>
>>>>>>>>>> Baptiste
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

I've created a ticket (CONNECTORS-1027) and a trunk-based branch
(branches/CONNECTORS-1027) for looking at any changes we do for large-scale
Postgresql optimization work.

Please note that trunk code already has schema changes relative to MCF 1.7,
so you will not be able to work directly with this branch code.  I'll have
to create patches for whatever changes you would need to try.

Thanks,
Karl


On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi,
>
> We're on Postgres 9.2. I'll get the query plans and add them to the doc.
>
> Thanks
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> Can you include the logged plan for this query; this is an actual query
>> encountered during crawling:
>>
>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running
>> query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1,
>> jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>
>>
>> These queries are all from the UI; it is what gets generated when no
>> limits are in place:
>>
>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
>> FROM jobqueue t1 GROUP BY jobid]
>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
>> FROM jobqueue t1 GROUP BY jobid]
>>
>> This query is from the UI with a limit of 1000000:
>>
>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
>> jobqueue t1 LIMIT 1000001]
>>
>> I honestly don't understand why PostgreSQL would execute a sequential
>> scan of the entire table when given a limit clause.  It certainly didn't
>> used to do that.  If you have any other suggestions please let me know.
>>
>>
>>
>> Some queries show up in this list because MCF periodically reindexes
>> tables.  For example, this query goes only against the (small) jobs table.
>> Its poor performance on occasion is likely due to something else happening
>> to the database, probably a reindex:
>>
>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
>> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
>> UPDATE]
>>
>>
>> The final query is the document stuffing query, which is perhaps the most
>> critical query in the whole system:
>>
>>  SELECT
>>  t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>  t0.priorityset FROM jobqueue t0
>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
>>  <= 1407246846166
>>  AND EXISTS (
>>    SELECT 'x' FROM jobs t1
>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND t1.priority=5
>>  )
>>  AND NOT EXISTS (
>>    SELECT 'x' FROM jobqueue t2
>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>  )
>>  AND NOT EXISTS (
>>    SELECT 'x' FROM prereqevents t3,events t4
>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>  )
>>  ORDER BY t0.docpriority ASC
>>  LIMIT 480;
>>
>> Your analysis of whether IN beats OR does not agree with experiments I
>> did on postgresql 8.7 which showed no difference.  What Postgresql version
>> are you using?  Also, I trust you have query plans that demonstrate your
>> claim?  In any case, whether IN vs. OR is generated is a function of the
>> MCF database driver, so this is trivial to experiment with.  I'll create a
>> ticket and a branch for experimentation.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> Changing maxcountstatus to something much smaller (10,000) doesn't seem
>>> to buy us that much on the table scan - in the attached you'll see that
>>> it's still taking a long time to return the job status page. Also in the
>>> attached are some sample other long running queries that we're beginning to
>>> see more frequently. There's also an example of a query that's frequently
>>> executed and regularly takes > 4 secs (plus a suggested change to improve
>>> performance). This one in particular would certainly benefit from a change
>>> to SSDs which should relieve the I/O bound bottleneck on postgres.
>>>
>>> We're loading the system from 10mil towards 100mil so would be keen to
>>> work with you to optimise where possible.
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> The jobstatus query that uses count(*) should be doing something like
>>>> this when the maxdocumentstatuscount value is set:
>>>>
>>>> select count(*) from jobqueue where xxx limit 500001
>>>>
>>>> This will still do a sequential scan, but it will be an aborted one, so
>>>> you can control the maximum amount of time spent doing the query.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <pa...@exonar.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>>> something.
>>>>>
>>>>> We're increasingly seeing long running threads and I'll put together
>>>>> some samples. As an example, on a job that's currently aborting:
>>>>>
>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
>>>>> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE
>>>>> jobid=?]
>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>>> '1.000000001E9'
>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>>> '1407144048075'
>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>>>>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap
>>>>> Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
>>>>> Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>>> width=0)
>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> For the jobqueue scans from the UI, there is a parameter you can set
>>>>>> which limits the number of documents counted to at most a specified
>>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>>> doing these kinds of queries:
>>>>>>
>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>
>>>>>> The documentation says that the default value for this parameter is
>>>>>> 10000, which however is incorrect.  The actual true default is 500000.  You
>>>>>> could set that lower for better UI performance (losing some information, of
>>>>>> course.)
>>>>>>
>>>>>> As for long-running queries, a lot of time and effort has been spent
>>>>>> in MCF to insure that this doesn't happen.  Specifically, the main document
>>>>>> queuing query is structured to read directly out of a specific jobqueue
>>>>>> index.  This is the crucial query that must work properly for scalability,
>>>>>> since doing a query that is effectively just a sort on the entire jobqueue
>>>>>> would be a major problem.  There are some times where Postgresql's
>>>>>> optimizer fails to do the right thing here, mostly because it makes a huge
>>>>>> distinction between whether there's zero of something or one of something,
>>>>>> but you can work around that particular issue by setting the analyze count
>>>>>> to 1 if you start to see this problem -- which basically means that
>>>>>> reanalysis of the table has to occur on every stuffing query.
>>>>>>
>>>>>> I'd appreciate seeing the queries that are long-running in your case
>>>>>> so that I can see if that is what you are encountering or not.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <
>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> We're beginning to see issues with a document count > 10 million. At
>>>>>>> that point, even with good postgres vacuuming the jobqueue table is
>>>>>>> starting to become a bottleneck.
>>>>>>>
>>>>>>> For example select count(*) from jobqueue, which is executed when
>>>>>>> querying job status will do a full table scan of jobqueue which has
>>>>>>> more than 10 million rows. That's going to take some time in
>>>>>>> postgres.
>>>>>>>
>>>>>>> SSDs will certainly make a big difference to document processing
>>>>>>> through-put (which we see is largely I/O bound in postgres) but we
>>>>>>> are increasingly seeing long running queries in the logs. Our current
>>>>>>> thinking is that we'll need to refactor JobQueue somewhat to
>>>>>>> optimise queries and, potentially partition jobqueue into a subset
>>>>>>> of tables (table per queue for example).
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> VP Engineering,
>>>>>>> Exonar Ltd
>>>>>>>
>>>>>>> T: +44 7940 567724
>>>>>>>
>>>>>>> twitter:@exonarco @pboichat
>>>>>>> W: http://www.exonar.com
>>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>>
>>>>>>> Exonar Limited, registered in the UK, registration number 06439969
>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>> confidential and are intended solely for the use of the individual to whom
>>>>>>> it is addressed. Any views or opinions expressed are solely those of the
>>>>>>> author and do not necessarily represent those of Exonar Ltd. If you
>>>>>>> are not the intended recipient of this email, you must neither take any
>>>>>>> action based upon its contents, nor copy or show it to anyone. Please
>>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Baptiste,
>>>>>>>>
>>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>>
>>>>>>>> I would read this:
>>>>>>>>
>>>>>>>>
>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>
>>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>>>> discusses this issue.
>>>>>>>>
>>>>>>>> Some five years ago, I successfully crawled 5 million web
>>>>>>>> documents, using Postgresql 8.3.  Postgresql 9.x is faster, and with modern
>>>>>>>> SSD's, I expect that you will do even better.  In general, I'd say it was
>>>>>>>> fine to shoot for 10M - 100M documents on ManifoldCF, provided that you use
>>>>>>>> a good database, and provided that you maintain it properly.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> I would like to know what is the maximum number of documents that
>>>>>>>>> you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>>> parallel it could works ?
>>>>>>>>>
>>>>>>>>> Thanks for your answer
>>>>>>>>>
>>>>>>>>> Baptiste
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi,

We're on Postgres 9.2. I'll get the query plans and add them to the doc.

Thanks

Paul



VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> Can you include the logged plan for this query; this is an actual query
> encountered during crawling:
>
> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running
> query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1,
> jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>
>
> These queries are all from the UI; it is what gets generated when no
> limits are in place:
>
>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a
> long-running query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount
> FROM jobqueue t1 GROUP BY jobid]
>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a
> long-running query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount
> FROM jobqueue t1 GROUP BY jobid]
>
> This query is from the UI with a limit of 1000000:
>
> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a
> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM
> jobqueue t1 LIMIT 1000001]
>
> I honestly don't understand why PostgreSQL would execute a sequential scan
> of the entire table when given a limit clause.  It certainly didn't used to
> do that.  If you have any other suggestions please let me know.
>
>
>
> Some queries show up in this list because MCF periodically reindexes
> tables.  For example, this query goes only against the (small) jobs table.
> Its poor performance on occasion is likely due to something else happening
> to the database, probably a reindex:
>
>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
> query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
> UPDATE]
>
>
> The final query is the document stuffing query, which is perhaps the most
> critical query in the whole system:
>
>  SELECT
>  t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>  t0.priorityset FROM jobqueue t0
>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
>  <= 1407246846166
>  AND EXISTS (
>    SELECT 'x' FROM jobs t1
>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND t1.priority=5
>  )
>  AND NOT EXISTS (
>    SELECT 'x' FROM jobqueue t2
>    WHERE t2.dochash=t0.dochash AND t2.status IN
>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>  )
>  AND NOT EXISTS (
>    SELECT 'x' FROM prereqevents t3,events t4
>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>  )
>  ORDER BY t0.docpriority ASC
>  LIMIT 480;
>
> Your analysis of whether IN beats OR does not agree with experiments I did
> on postgresql 8.7 which showed no difference.  What Postgresql version are
> you using?  Also, I trust you have query plans that demonstrate your
> claim?  In any case, whether IN vs. OR is generated is a function of the
> MCF database driver, so this is trivial to experiment with.  I'll create a
> ticket and a branch for experimentation.
>
> Thanks,
> Karl
>
>
>
>
> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Hi Karl,
>>
>> Changing maxcountstatus to something much smaller (10,000) doesn't seem
>> to buy us that much on the table scan - in the attached you'll see that
>> it's still taking a long time to return the job status page. Also in the
>> attached are some sample other long running queries that we're beginning to
>> see more frequently. There's also an example of a query that's frequently
>> executed and regularly takes > 4 secs (plus a suggested change to improve
>> performance). This one in particular would certainly benefit from a change
>> to SSDs which should relieve the I/O bound bottleneck on postgres.
>>
>> We're loading the system from 10mil towards 100mil so would be keen to
>> work with you to optimise where possible.
>>
>> Thanks,
>>
>> Paul
>>
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> The jobstatus query that uses count(*) should be doing something like
>>> this when the maxdocumentstatuscount value is set:
>>>
>>> select count(*) from jobqueue where xxx limit 500001
>>>
>>> This will still do a sequential scan, but it will be an aborted one, so
>>> you can control the maximum amount of time spent doing the query.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We've had a play with maxstatuscount and couldn't stop it from
>>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>>> something.
>>>>
>>>> We're increasingly seeing long running threads and I'll put together
>>>> some samples. As an example, on a job that's currently aborting:
>>>>
>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
>>>> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE
>>>> jobid=?]
>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>>> '1.000000001E9'
>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>>> '1407144048075'
>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>>>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap
>>>> Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
>>>> Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>>> width=0)
>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>>> Index Cond: (jobid = 1407144048075::bigint)
>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> For the jobqueue scans from the UI, there is a parameter you can set
>>>>> which limits the number of documents counted to at most a specified
>>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>>> doing these kinds of queries:
>>>>>
>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>
>>>>> The documentation says that the default value for this parameter is
>>>>> 10000, which however is incorrect.  The actual true default is 500000.  You
>>>>> could set that lower for better UI performance (losing some information, of
>>>>> course.)
>>>>>
>>>>> As for long-running queries, a lot of time and effort has been spent
>>>>> in MCF to insure that this doesn't happen.  Specifically, the main document
>>>>> queuing query is structured to read directly out of a specific jobqueue
>>>>> index.  This is the crucial query that must work properly for scalability,
>>>>> since doing a query that is effectively just a sort on the entire jobqueue
>>>>> would be a major problem.  There are some times where Postgresql's
>>>>> optimizer fails to do the right thing here, mostly because it makes a huge
>>>>> distinction between whether there's zero of something or one of something,
>>>>> but you can work around that particular issue by setting the analyze count
>>>>> to 1 if you start to see this problem -- which basically means that
>>>>> reanalysis of the table has to occur on every stuffing query.
>>>>>
>>>>> I'd appreciate seeing the queries that are long-running in your case
>>>>> so that I can see if that is what you are encountering or not.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <paul.boichat@exonar.com
>>>>> > wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> We're beginning to see issues with a document count > 10 million. At
>>>>>> that point, even with good postgres vacuuming the jobqueue table is
>>>>>> starting to become a bottleneck.
>>>>>>
>>>>>> For example select count(*) from jobqueue, which is executed when
>>>>>> querying job status will do a full table scan of jobqueue which has
>>>>>> more than 10 million rows. That's going to take some time in postgres
>>>>>> .
>>>>>>
>>>>>> SSDs will certainly make a big difference to document processing
>>>>>> through-put (which we see is largely I/O bound in postgres) but we
>>>>>> are increasingly seeing long running queries in the logs. Our current
>>>>>> thinking is that we'll need to refactor JobQueue somewhat to
>>>>>> optimise queries and, potentially partition jobqueue into a subset
>>>>>> of tables (table per queue for example).
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it is
>>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in error.
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Baptiste,
>>>>>>>
>>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>>
>>>>>>> I would read this:
>>>>>>>
>>>>>>>
>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>
>>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>>> discusses this issue.
>>>>>>>
>>>>>>> Some five years ago, I successfully crawled 5 million web documents,
>>>>>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I
>>>>>>> expect that you will do even better.  In general, I'd say it was fine to
>>>>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>>>>>> database, and provided that you maintain it properly.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I would like to know what is the maximum number of documents that
>>>>>>>> you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>>> parallel it could works ?
>>>>>>>>
>>>>>>>> Thanks for your answer
>>>>>>>>
>>>>>>>> Baptiste
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

Can you include the logged plan for this query; this is an actual query
encountered during crawling:

WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a long-running
query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid FROM carrydown t1,
jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=? AND
t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]


These queries are all from the UI; it is what gets generated when no limits
are in place:

 WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found a long-running
query (166845 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue
t1 GROUP BY jobid]
 WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found a long-running
query (107222 ms): [SELECT jobid,COUNT(dochash) AS doccount FROM jobqueue
t1 GROUP BY jobid]

This query is from the UI with a limit of 1000000:

WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found a long-running
query (254851 ms): [SELECT COUNT(dochash) AS doccount FROM jobqueue t1
LIMIT 1000001]

I honestly don't understand why PostgreSQL would execute a sequential scan
of the entire table when given a limit clause.  It certainly didn't used to
do that.  If you have any other suggestions please let me know.



Some queries show up in this list because MCF periodically reindexes
tables.  For example, this query goes only against the (small) jobs table.
Its poor performance on occasion is likely due to something else happening
to the database, probably a reindex:

 WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a long-running
query (592474 ms): [SELECT id FROM jobs WHERE status IN (?,?,?,?,?) FOR
UPDATE]


The final query is the document stuffing query, which is perhaps the most
critical query in the whole system:

 SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
 t0.priorityset FROM jobqueue t0
 WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND t0.checktime
 <= 1407246846166
 AND EXISTS (
   SELECT 'x' FROM jobs t1
   WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND t1.priority=5
 )
 AND NOT EXISTS (
   SELECT 'x' FROM jobqueue t2
   WHERE t2.dochash=t0.dochash AND t2.status IN
 ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
 )
 AND NOT EXISTS (
   SELECT 'x' FROM prereqevents t3,events t4
   WHERE t0.id=t3.owner AND t3.eventname=t4.name
 )
 ORDER BY t0.docpriority ASC
 LIMIT 480;

Your analysis of whether IN beats OR does not agree with experiments I did
on postgresql 8.7 which showed no difference.  What Postgresql version are
you using?  Also, I trust you have query plans that demonstrate your
claim?  In any case, whether IN vs. OR is generated is a function of the
MCF database driver, so this is trivial to experiment with.  I'll create a
ticket and a branch for experimentation.

Thanks,
Karl




On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi Karl,
>
> Changing maxcountstatus to something much smaller (10,000) doesn't seem to
> buy us that much on the table scan - in the attached you'll see that it's
> still taking a long time to return the job status page. Also in the
> attached are some sample other long running queries that we're beginning to
> see more frequently. There's also an example of a query that's frequently
> executed and regularly takes > 4 secs (plus a suggested change to improve
> performance). This one in particular would certainly benefit from a change
> to SSDs which should relieve the I/O bound bottleneck on postgres.
>
> We're loading the system from 10mil towards 100mil so would be keen to
> work with you to optimise where possible.
>
> Thanks,
>
> Paul
>
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> The jobstatus query that uses count(*) should be doing something like
>> this when the maxdocumentstatuscount value is set:
>>
>> select count(*) from jobqueue where xxx limit 500001
>>
>> This will still do a sequential scan, but it will be an aborted one, so
>> you can control the maximum amount of time spent doing the query.
>>
>> Karl
>>
>>
>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We've had a play with maxstatuscount and couldn't stop it from
>>> count(*)-ing but I'll certainly have another look to see if we've missed
>>> something.
>>>
>>> We're increasingly seeing long running threads and I'll put together
>>> some samples. As an example, on a job that's currently aborting:
>>>
>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
>>> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE
>>> jobid=?]
>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>>> '1.000000001E9'
>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>>> '1407144048075'
>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap
>>> Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
>>> Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>>> width=0)
>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>>> Index Cond: (jobid = 1407144048075::bigint)
>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats:
>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> For the jobqueue scans from the UI, there is a parameter you can set
>>>> which limits the number of documents counted to at most a specified
>>>> amount.  This uses a limit clause, which should prevent unbounded time
>>>> doing these kinds of queries:
>>>>
>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>
>>>> The documentation says that the default value for this parameter is
>>>> 10000, which however is incorrect.  The actual true default is 500000.  You
>>>> could set that lower for better UI performance (losing some information, of
>>>> course.)
>>>>
>>>> As for long-running queries, a lot of time and effort has been spent in
>>>> MCF to insure that this doesn't happen.  Specifically, the main document
>>>> queuing query is structured to read directly out of a specific jobqueue
>>>> index.  This is the crucial query that must work properly for scalability,
>>>> since doing a query that is effectively just a sort on the entire jobqueue
>>>> would be a major problem.  There are some times where Postgresql's
>>>> optimizer fails to do the right thing here, mostly because it makes a huge
>>>> distinction between whether there's zero of something or one of something,
>>>> but you can work around that particular issue by setting the analyze count
>>>> to 1 if you start to see this problem -- which basically means that
>>>> reanalysis of the table has to occur on every stuffing query.
>>>>
>>>> I'd appreciate seeing the queries that are long-running in your case so
>>>> that I can see if that is what you are encountering or not.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <pa...@exonar.com>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> We're beginning to see issues with a document count > 10 million. At
>>>>> that point, even with good postgres vacuuming the jobqueue table is
>>>>> starting to become a bottleneck.
>>>>>
>>>>> For example select count(*) from jobqueue, which is executed when
>>>>> querying job status will do a full table scan of jobqueue which has
>>>>> more than 10 million rows. That's going to take some time in postgres.
>>>>>
>>>>> SSDs will certainly make a big difference to document processing
>>>>> through-put (which we see is largely I/O bound in postgres) but we
>>>>> are increasingly seeing long running queries in the logs. Our current
>>>>> thinking is that we'll need to refactor JobQueue somewhat to optimise
>>>>> queries and, potentially partition jobqueue into a subset of tables
>>>>> (table per queue for example).
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>> VP Engineering,
>>>>> Exonar Ltd
>>>>>
>>>>> T: +44 7940 567724
>>>>>
>>>>> twitter:@exonarco @pboichat
>>>>> W: http://www.exonar.com
>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>
>>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>> and are intended solely for the use of the individual to whom it is
>>>>> addressed. Any views or opinions expressed are solely those of the author
>>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>>> the intended recipient of this email, you must neither take any action
>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>> contact the sender if you believe you have received this email in error.
>>>>>
>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Baptiste,
>>>>>>
>>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>>
>>>>>> I would read this:
>>>>>>
>>>>>>
>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>
>>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>>> discusses this issue.
>>>>>>
>>>>>> Some five years ago, I successfully crawled 5 million web documents,
>>>>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I
>>>>>> expect that you will do even better.  In general, I'd say it was fine to
>>>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>>>>> database, and provided that you maintain it properly.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I would like to know what is the maximum number of documents that
>>>>>>> you managed to crawl with ManifoldCF and with how many connectors in
>>>>>>> parallel it could works ?
>>>>>>>
>>>>>>> Thanks for your answer
>>>>>>>
>>>>>>> Baptiste
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi Karl,

Changing maxcountstatus to something much smaller (10,000) doesn't seem to
buy us that much on the table scan - in the attached you'll see that it's
still taking a long time to return the job status page. Also in the
attached are some sample other long running queries that we're beginning to
see more frequently. There's also an example of a query that's frequently
executed and regularly takes > 4 secs (plus a suggested change to improve
performance). This one in particular would certainly benefit from a change
to SSDs which should relieve the I/O bound bottleneck on postgres.

We're loading the system from 10mil towards 100mil so would be keen to work
with you to optimise where possible.

Thanks,

Paul




VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> The jobstatus query that uses count(*) should be doing something like this
> when the maxdocumentstatuscount value is set:
>
> select count(*) from jobqueue where xxx limit 500001
>
> This will still do a sequential scan, but it will be an aborted one, so
> you can control the maximum amount of time spent doing the query.
>
> Karl
>
>
> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Hi,
>>
>> We've had a play with maxstatuscount and couldn't stop it from
>> count(*)-ing but I'll certainly have another look to see if we've missed
>> something.
>>
>> We're increasingly seeing long running threads and I'll put together some
>> samples. As an example, on a job that's currently aborting:
>>
>> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
>> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE
>> jobid=?]
>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
>> '1.000000001E9'
>>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
>> '1407144048075'
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
>> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap
>> Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck
>> Cond: (jobid = 1407144048075::bigint)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
>> Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
>> width=0)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
>> Index Cond: (jobid = 1407144048075::bigint)
>>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0
>> most_common_vals={G,C,Z,P}
>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>>
>> Paul
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> For the jobqueue scans from the UI, there is a parameter you can set
>>> which limits the number of documents counted to at most a specified
>>> amount.  This uses a limit clause, which should prevent unbounded time
>>> doing these kinds of queries:
>>>
>>> org.apache.manifoldcf.ui.maxstatuscount
>>>
>>> The documentation says that the default value for this parameter is
>>> 10000, which however is incorrect.  The actual true default is 500000.  You
>>> could set that lower for better UI performance (losing some information, of
>>> course.)
>>>
>>> As for long-running queries, a lot of time and effort has been spent in
>>> MCF to insure that this doesn't happen.  Specifically, the main document
>>> queuing query is structured to read directly out of a specific jobqueue
>>> index.  This is the crucial query that must work properly for scalability,
>>> since doing a query that is effectively just a sort on the entire jobqueue
>>> would be a major problem.  There are some times where Postgresql's
>>> optimizer fails to do the right thing here, mostly because it makes a huge
>>> distinction between whether there's zero of something or one of something,
>>> but you can work around that particular issue by setting the analyze count
>>> to 1 if you start to see this problem -- which basically means that
>>> reanalysis of the table has to occur on every stuffing query.
>>>
>>> I'd appreciate seeing the queries that are long-running in your case so
>>> that I can see if that is what you are encountering or not.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <pa...@exonar.com>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> We're beginning to see issues with a document count > 10 million. At
>>>> that point, even with good postgres vacuuming the jobqueue table is
>>>> starting to become a bottleneck.
>>>>
>>>> For example select count(*) from jobqueue, which is executed when
>>>> querying job status will do a full table scan of jobqueue which has
>>>> more than 10 million rows. That's going to take some time in postgres.
>>>>
>>>> SSDs will certainly make a big difference to document processing
>>>> through-put (which we see is largely I/O bound in postgres) but we are
>>>> increasingly seeing long running queries in the logs. Our current thinking
>>>> is that we'll need to refactor JobQueue somewhat to optimise queries
>>>> and, potentially partition jobqueue into a subset of tables (table per
>>>> queue for example).
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> VP Engineering,
>>>> Exonar Ltd
>>>>
>>>> T: +44 7940 567724
>>>>
>>>> twitter:@exonarco @pboichat
>>>> W: http://www.exonar.com
>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>
>>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>> and are intended solely for the use of the individual to whom it is
>>>> addressed. Any views or opinions expressed are solely those of the author
>>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>>> the intended recipient of this email, you must neither take any action
>>>> based upon its contents, nor copy or show it to anyone. Please contact
>>>> the sender if you believe you have received this email in error.
>>>>
>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Baptiste,
>>>>>
>>>>> ManifoldCF is not limited by the number of agents processes or
>>>>> parallel connectors.  Overall database performance is the limiting factor.
>>>>>
>>>>> I would read this:
>>>>>
>>>>>
>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>
>>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>>> discusses this issue.
>>>>>
>>>>> Some five years ago, I successfully crawled 5 million web documents,
>>>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I
>>>>> expect that you will do even better.  In general, I'd say it was fine to
>>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>>>> database, and provided that you maintain it properly.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>>> ba.berthier@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I would like to know what is the maximum number of documents that you
>>>>>> managed to crawl with ManifoldCF and with how many connectors in parallel
>>>>>> it could works ?
>>>>>>
>>>>>> Thanks for your answer
>>>>>>
>>>>>> Baptiste
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

The jobstatus query that uses count(*) should be doing something like this
when the maxdocumentstatuscount value is set:

select count(*) from jobqueue where xxx limit 500001

This will still do a sequential scan, but it will be an aborted one, so you
can control the maximum amount of time spent doing the query.

Karl


On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi,
>
> We've had a play with maxstatuscount and couldn't stop it from
> count(*)-ing but I'll certainly have another look to see if we've missed
> something.
>
> We're increasingly seeing long running threads and I'll put together some
> samples. As an example, on a job that's currently aborting:
>
> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE
> jobid=?]
>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
> '1.000000001E9'
>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
> '1407144048075'
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap
> Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck
> Cond: (jobid = 1407144048075::bigint)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
> Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
> width=0)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
> Index Cond: (jobid = 1407144048075::bigint)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0
> most_common_vals={G,C,Z,P}
> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> For the jobqueue scans from the UI, there is a parameter you can set
>> which limits the number of documents counted to at most a specified
>> amount.  This uses a limit clause, which should prevent unbounded time
>> doing these kinds of queries:
>>
>> org.apache.manifoldcf.ui.maxstatuscount
>>
>> The documentation says that the default value for this parameter is
>> 10000, which however is incorrect.  The actual true default is 500000.  You
>> could set that lower for better UI performance (losing some information, of
>> course.)
>>
>> As for long-running queries, a lot of time and effort has been spent in
>> MCF to insure that this doesn't happen.  Specifically, the main document
>> queuing query is structured to read directly out of a specific jobqueue
>> index.  This is the crucial query that must work properly for scalability,
>> since doing a query that is effectively just a sort on the entire jobqueue
>> would be a major problem.  There are some times where Postgresql's
>> optimizer fails to do the right thing here, mostly because it makes a huge
>> distinction between whether there's zero of something or one of something,
>> but you can work around that particular issue by setting the analyze count
>> to 1 if you start to see this problem -- which basically means that
>> reanalysis of the table has to occur on every stuffing query.
>>
>> I'd appreciate seeing the queries that are long-running in your case so
>> that I can see if that is what you are encountering or not.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> We're beginning to see issues with a document count > 10 million. At
>>> that point, even with good postgres vacuuming the jobqueue table is
>>> starting to become a bottleneck.
>>>
>>> For example select count(*) from jobqueue, which is executed when
>>> querying job status will do a full table scan of jobqueue which has
>>> more than 10 million rows. That's going to take some time in postgres.
>>>
>>> SSDs will certainly make a big difference to document processing
>>> through-put (which we see is largely I/O bound in postgres) but we are
>>> increasingly seeing long running queries in the logs. Our current thinking
>>> is that we'll need to refactor JobQueue somewhat to optimise queries
>>> and, potentially partition jobqueue into a subset of tables (table per
>>> queue for example).
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Baptiste,
>>>>
>>>> ManifoldCF is not limited by the number of agents processes or parallel
>>>> connectors.  Overall database performance is the limiting factor.
>>>>
>>>> I would read this:
>>>>
>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>
>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>> discusses this issue.
>>>>
>>>> Some five years ago, I successfully crawled 5 million web documents,
>>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I
>>>> expect that you will do even better.  In general, I'd say it was fine to
>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>>> database, and provided that you maintain it properly.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>> ba.berthier@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I would like to know what is the maximum number of documents that you
>>>>> managed to crawl with ManifoldCF and with how many connectors in parallel
>>>>> it could works ?
>>>>>
>>>>> Thanks for your answer
>>>>>
>>>>> Baptiste
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

The particular query you point to is run when a job is changing state in a
major way, such as being aborted or started, and many or most of its
jobqueue records need to be updated accordingly.  Under that specific
context, 73 seconds does not seem like too long a time.

If you were to separate out the jobqueue table into one table per job, you
would STILL see times of about that length, simply because there are a lot
of records that need to be updated under those conditions.

I look forward to seeing other examples.

Karl


On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi,
>
> We've had a play with maxstatuscount and couldn't stop it from
> count(*)-ing but I'll certainly have another look to see if we've missed
> something.
>
> We're increasingly seeing long running threads and I'll put together some
> samples. As an example, on a job that's currently aborting:
>
> WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
> query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE
> jobid=?]
>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
> '1.000000001E9'
>  WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
> '1407144048075'
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
> jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap
> Heap Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck
> Cond: (jobid = 1407144048075::bigint)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
> Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
> width=0)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
> Index Cond: (jobid = 1407144048075::bigint)
>  WARN 2014-09-10 18:37:29,960 (Job reset thread) -
>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0
> most_common_vals={G,C,Z,P}
> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>  WARN 2014-09-10 18:37:30,140 (Job reset thread) -
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> For the jobqueue scans from the UI, there is a parameter you can set
>> which limits the number of documents counted to at most a specified
>> amount.  This uses a limit clause, which should prevent unbounded time
>> doing these kinds of queries:
>>
>> org.apache.manifoldcf.ui.maxstatuscount
>>
>> The documentation says that the default value for this parameter is
>> 10000, which however is incorrect.  The actual true default is 500000.  You
>> could set that lower for better UI performance (losing some information, of
>> course.)
>>
>> As for long-running queries, a lot of time and effort has been spent in
>> MCF to insure that this doesn't happen.  Specifically, the main document
>> queuing query is structured to read directly out of a specific jobqueue
>> index.  This is the crucial query that must work properly for scalability,
>> since doing a query that is effectively just a sort on the entire jobqueue
>> would be a major problem.  There are some times where Postgresql's
>> optimizer fails to do the right thing here, mostly because it makes a huge
>> distinction between whether there's zero of something or one of something,
>> but you can work around that particular issue by setting the analyze count
>> to 1 if you start to see this problem -- which basically means that
>> reanalysis of the table has to occur on every stuffing query.
>>
>> I'd appreciate seeing the queries that are long-running in your case so
>> that I can see if that is what you are encountering or not.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <pa...@exonar.com>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> We're beginning to see issues with a document count > 10 million. At
>>> that point, even with good postgres vacuuming the jobqueue table is
>>> starting to become a bottleneck.
>>>
>>> For example select count(*) from jobqueue, which is executed when
>>> querying job status will do a full table scan of jobqueue which has
>>> more than 10 million rows. That's going to take some time in postgres.
>>>
>>> SSDs will certainly make a big difference to document processing
>>> through-put (which we see is largely I/O bound in postgres) but we are
>>> increasingly seeing long running queries in the logs. Our current thinking
>>> is that we'll need to refactor JobQueue somewhat to optimise queries
>>> and, potentially partition jobqueue into a subset of tables (table per
>>> queue for example).
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Baptiste,
>>>>
>>>> ManifoldCF is not limited by the number of agents processes or parallel
>>>> connectors.  Overall database performance is the limiting factor.
>>>>
>>>> I would read this:
>>>>
>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>
>>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>>> discusses this issue.
>>>>
>>>> Some five years ago, I successfully crawled 5 million web documents,
>>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I
>>>> expect that you will do even better.  In general, I'd say it was fine to
>>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>>> database, and provided that you maintain it properly.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>>> ba.berthier@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I would like to know what is the maximum number of documents that you
>>>>> managed to crawl with ManifoldCF and with how many connectors in parallel
>>>>> it could works ?
>>>>>
>>>>> Thanks for your answer
>>>>>
>>>>> Baptiste
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi,

We've had a play with maxstatuscount and couldn't stop it from count(*)-ing
but I'll certainly have another look to see if we've missed something.

We're increasingly seeing long running threads and I'll put together some
samples. As an example, on a job that's currently aborting:

WARN 2014-09-10 18:37:29,900 (Job reset thread) - Found a long-running
query (72902 ms): [UPDATE jobqueue SET docpriority=?,priorityset=NULL WHERE
jobid=?]
 WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 0:
'1.000000001E9'
 WARN 2014-09-10 18:37:29,900 (Job reset thread) -   Parameter 1:
'1407144048075'
 WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan: Update on
jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
 WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:   ->  Bitmap Heap
Scan on jobqueue  (cost=18806.08..445770.39 rows=764916 width=287)
 WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         Recheck
Cond: (jobid = 1407144048075::bigint)
 WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:         ->
Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85 rows=764916
width=0)
 WARN 2014-09-10 18:37:29,960 (Job reset thread) -  Plan:
Index Cond: (jobid = 1407144048075::bigint)
 WARN 2014-09-10 18:37:29,960 (Job reset thread) -
 WARN 2014-09-10 18:37:30,140 (Job reset thread) -  Stats: n_distinct=4.0
most_common_vals={G,C,Z,P}
most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
 WARN 2014-09-10 18:37:30,140 (Job reset thread) -

Paul



VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Paul,
>
> For the jobqueue scans from the UI, there is a parameter you can set which
> limits the number of documents counted to at most a specified amount.  This
> uses a limit clause, which should prevent unbounded time doing these kinds
> of queries:
>
> org.apache.manifoldcf.ui.maxstatuscount
>
> The documentation says that the default value for this parameter is 10000,
> which however is incorrect.  The actual true default is 500000.  You could
> set that lower for better UI performance (losing some information, of
> course.)
>
> As for long-running queries, a lot of time and effort has been spent in
> MCF to insure that this doesn't happen.  Specifically, the main document
> queuing query is structured to read directly out of a specific jobqueue
> index.  This is the crucial query that must work properly for scalability,
> since doing a query that is effectively just a sort on the entire jobqueue
> would be a major problem.  There are some times where Postgresql's
> optimizer fails to do the right thing here, mostly because it makes a huge
> distinction between whether there's zero of something or one of something,
> but you can work around that particular issue by setting the analyze count
> to 1 if you start to see this problem -- which basically means that
> reanalysis of the table has to occur on every stuffing query.
>
> I'd appreciate seeing the queries that are long-running in your case so
> that I can see if that is what you are encountering or not.
>
> Thanks,
> Karl
>
>
>
>
> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <pa...@exonar.com>
> wrote:
>
>> Hi Karl,
>>
>> We're beginning to see issues with a document count > 10 million. At that
>> point, even with good postgres vacuuming the jobqueue table is starting
>> to become a bottleneck.
>>
>> For example select count(*) from jobqueue, which is executed when
>> querying job status will do a full table scan of jobqueue which has more
>> than 10 million rows. That's going to take some time in postgres.
>>
>> SSDs will certainly make a big difference to document processing
>> through-put (which we see is largely I/O bound in postgres) but we are
>> increasingly seeing long running queries in the logs. Our current thinking
>> is that we'll need to refactor JobQueue somewhat to optimise queries
>> and, potentially partition jobqueue into a subset of tables (table per
>> queue for example).
>>
>> Paul
>>
>>
>>
>> VP Engineering,
>> Exonar Ltd
>>
>> T: +44 7940 567724
>>
>> twitter:@exonarco @pboichat
>> W: http://www.exonar.com
>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>
>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>> West Mills, Newbury, Berkshire, RG14 5HG
>> DISCLAIMER: This email and any attachments to it may be confidential and
>> are intended solely for the use of the individual to whom it is addressed.
>> Any views or opinions expressed are solely those of the author and do not
>> necessarily represent those of Exonar Ltd. If you are not the intended
>> recipient of this email, you must neither take any action based upon its
>> contents, nor copy or show it to anyone. Please contact the sender if
>> you believe you have received this email in error.
>>
>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Baptiste,
>>>
>>> ManifoldCF is not limited by the number of agents processes or parallel
>>> connectors.  Overall database performance is the limiting factor.
>>>
>>> I would read this:
>>>
>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>
>>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>>> discusses this issue.
>>>
>>> Some five years ago, I successfully crawled 5 million web documents,
>>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I
>>> expect that you will do even better.  In general, I'd say it was fine to
>>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>>> database, and provided that you maintain it properly.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>>> ba.berthier@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I would like to know what is the maximum number of documents that you
>>>> managed to crawl with ManifoldCF and with how many connectors in parallel
>>>> it could works ?
>>>>
>>>> Thanks for your answer
>>>>
>>>> Baptiste
>>>>
>>>
>>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Paul,

For the jobqueue scans from the UI, there is a parameter you can set which
limits the number of documents counted to at most a specified amount.  This
uses a limit clause, which should prevent unbounded time doing these kinds
of queries:

org.apache.manifoldcf.ui.maxstatuscount

The documentation says that the default value for this parameter is 10000,
which however is incorrect.  The actual true default is 500000.  You could
set that lower for better UI performance (losing some information, of
course.)

As for long-running queries, a lot of time and effort has been spent in MCF
to insure that this doesn't happen.  Specifically, the main document
queuing query is structured to read directly out of a specific jobqueue
index.  This is the crucial query that must work properly for scalability,
since doing a query that is effectively just a sort on the entire jobqueue
would be a major problem.  There are some times where Postgresql's
optimizer fails to do the right thing here, mostly because it makes a huge
distinction between whether there's zero of something or one of something,
but you can work around that particular issue by setting the analyze count
to 1 if you start to see this problem -- which basically means that
reanalysis of the table has to occur on every stuffing query.

I'd appreciate seeing the queries that are long-running in your case so
that I can see if that is what you are encountering or not.

Thanks,
Karl

On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat <pa...@exonar.com>
wrote:

> Hi Karl,
>
> We're beginning to see issues with a document count > 10 million. At that
> point, even with good postgres vacuuming the jobqueue table is starting
> to become a bottleneck.
>
> For example select count(*) from jobqueue, which is executed when
> querying job status will do a full table scan of jobqueue which has more
> than 10 million rows. That's going to take some time in postgres.
>
> SSDs will certainly make a big difference to document processing
> through-put (which we see is largely I/O bound in postgres) but we are
> increasingly seeing long running queries in the logs. Our current thinking
> is that we'll need to refactor JobQueue somewhat to optimise queries and,
> potentially partition jobqueue into a subset of tables (table per queue
> for example).
>
> Paul
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Baptiste,
>>
>> ManifoldCF is not limited by the number of agents processes or parallel
>> connectors.  Overall database performance is the limiting factor.
>>
>> I would read this:
>>
>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>
>> Also, there's a section in ManifoldCF (I believe Chapter 2) that
>> discusses this issue.
>>
>> Some five years ago, I successfully crawled 5 million web documents,
>> using Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I
>> expect that you will do even better.  In general, I'd say it was fine to
>> shoot for 10M - 100M documents on ManifoldCF, provided that you use a good
>> database, and provided that you maintain it properly.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <
>> ba.berthier@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I would like to know what is the maximum number of documents that you
>>> managed to crawl with ManifoldCF and with how many connectors in parallel
>>> it could works ?
>>>
>>> Thanks for your answer
>>>
>>> Baptiste
>>>
>>
>>
>

Re: Apache ManifoldCF Performance

Posted by Paul Boichat <pa...@exonar.com>.

Hi Karl,

We're beginning to see issues with a document count > 10 million. At that
point, even with good postgres vacuuming the jobqueue table is starting to
become a bottleneck.

For example select count(*) from jobqueue, which is executed when querying
job status will do a full table scan of jobqueue which has more than 10
million rows. That's going to take some time in postgres.

SSDs will certainly make a big difference to document processing
through-put (which we see is largely I/O bound in postgres) but we are
increasingly seeing long running queries in the logs. Our current thinking
is that we'll need to refactor JobQueue somewhat to optimise queries and,
potentially partition jobqueue into a subset of tables (table per queue for
example).

Paul

VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com
Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>

Exonar Limited, registered in the UK, registration number 06439969 at 14
West Mills, Newbury, Berkshire, RG14 5HG
DISCLAIMER: This email and any attachments to it may be confidential and
are intended solely for the use of the individual to whom it is addressed.
Any views or opinions expressed are solely those of the author and do not
necessarily represent those of Exonar Ltd. If you are not the intended
recipient of this email, you must neither take any action based upon its
contents, nor copy or show it to anyone. Please contact the sender if you
believe you have received this email in error.

On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Baptiste,
>
> ManifoldCF is not limited by the number of agents processes or parallel
> connectors.  Overall database performance is the limiting factor.
>
> I would read this:
>
> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>
> Also, there's a section in ManifoldCF (I believe Chapter 2) that discusses
> this issue.
>
> Some five years ago, I successfully crawled 5 million web documents, using
> Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I expect
> that you will do even better.  In general, I'd say it was fine to shoot for
> 10M - 100M documents on ManifoldCF, provided that you use a good database,
> and provided that you maintain it properly.
>
> Thanks,
> Karl
>
>
>
>
>
> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <ba.berthier@gmail.com
> > wrote:
>
>> Hi
>>
>> I would like to know what is the maximum number of documents that you
>> managed to crawl with ManifoldCF and with how many connectors in parallel
>> it could works ?
>>
>> Thanks for your answer
>>
>> Baptiste
>>
>
>

Re: Apache ManifoldCF Performance

Posted by Karl Wright <da...@gmail.com>.

Hi Baptiste,

ManifoldCF is not limited by the number of agents processes or parallel
connectors.  Overall database performance is the limiting factor.

I would read this:

http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html

Also, there's a section in ManifoldCF (I believe Chapter 2) that discusses
this issue.

Some five years ago, I successfully crawled 5 million web documents, using
Postgresql 8.3.  Postgresql 9.x is faster, and with modern SSD's, I expect
that you will do even better.  In general, I'd say it was fine to shoot for
10M - 100M documents on ManifoldCF, provided that you use a good database,
and provided that you maintain it properly.

Thanks,
Karl

On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <ba...@gmail.com>
wrote:

> Hi
>
> I would like to know what is the maximum number of documents that you
> managed to crawl with ManifoldCF and with how many connectors in parallel
> it could works ?
>
> Thanks for your answer
>
> Baptiste
>