Posted to user@manifoldcf.apache.org by Tom Rees <tr...@chiliad.com> on 2014/05/21 22:31:57 UTC

Update on a now-fixed old problem and questions about database usage

Dear ManifoldCF:

First, I would like to report that switching to ManifoldCF 1.6 solved a
problem I encountered with version 1.4.1: whenever I ran two web crawls
simultaneously, both crawls would stop progressing within half an hour.
The 1.6 version works beautifully. Thank you for the excellent work.

Now I have a couple of issues with the database that I would appreciate
your feedback on. First, the two crawls that I mentioned finished and
pulled down a little over 255,000 documents. The size of the PostgreSQL
(version 9.3.2) database on disk, however, grew to a little over 8 GB, and
this is after running a full vacuum. This seems like a lot of space for
two medium-sized crawls. Is there a way to get the web crawler to use less
database space?

Secondly, when I ran two simultaneous web crawls with the NULL output
connector, the crawls worked without issue. When I ran the same two
simultaneous web crawls with a custom output connector that wrote the files
to a local file system, everything worked fine. However, when I used an
output connector that wrote the downloaded files to a file system and put
the path to each file on an ActiveMQ JMS queue, the crawl showed quirky
behavior. A few times the crawls stopped in their tracks, and then after
40 to 60 minutes a message was printed to the logfile saying that the SQL
queries took too long. The full dump of one set of these messages is
below, at the end of this email. The web crawls always recover, and they
are still running. I am using PostgreSQL 9.3.2 with ManifoldCF, and so far
it has not had many issues apart from the infrequent "SQL taking too long"
message. Do I need to use a different version of PostgreSQL? Or make some
other change?

Thank you for your help.

Tom Rees
Chiliad

WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running
query (2662579 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
t1.isnew=?))]
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 0: 'D'
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 1: '-1'
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 2:
'1400623413113'
 WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 3:
'A2EB225081B47722CCAEB3293A28EEB2F264E02C'
 WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 4: 'B'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Found a long-running
query (2625296 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
t1.isnew=?))]
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 0: 'D'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 1: '-1'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 2:
'1400623413113'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 3:
'D942516DE5623A6417FCB994186B507E8CDA30D6'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 4: 'B'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running
query (2675765 ms): [SELECT parentidhash FROM intrinsiclink WHERE jobid=?
AND parentidhash IN (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) AND
linktype=? AND childidhash=? FOR UPDATE]
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 0:
'1400623413113'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 1:
'054FC31ACF6FB96D2F8D19FF9CC230349E6A7A76'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 2:
'0774E538282FCA04F0FF95AC65D48EFC57CC6225'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 3:
'1027C9AF07AE2B419C31A1D3B20352E31867BBBB'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 4:
'1382DE9902A7CCC0012F043077E1739867CE00A4'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 5:
'2E8844A26FCD3096DF0D6BC3BB3D6648FCBCA7FA'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 6:
'34741F8B2706BCB202FDA72DABB94D916D497CD4'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 7:
'6A5E47B467A29A8614B473856F1D28EC8B30F5F3'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 8:
'71B865B0979B351279EFD9F99CA8AF700704400A'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 9:
'77C6E57EBDD811027F776BF895E0B43275AF3628'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 10:
'8267055C5CE6D7A1917F88B1FA310FC5082FD599'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 11:
'8F361A3EDA0CAC989812623441DA02BD42883C4F'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 12:
'956CCECF3FD5F508624E19270FD5EC28532B0922'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 13:
'9BAA3731F101B3908E4FFF4A5325601C57B4CD57'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 14:
'AD628D16A2708EECD1C33AA0E63D849BCB5DF417'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 15:
'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 16:
'D1F182BF5B49CB4FBF274A1B63B54C2F684EC059'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 17:
'D7FB0CB3AFE34BC258686368296AF0D896C5786E'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 18:
'D807BE55355A53CA84B4163F42081A896B323A81'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 19:
'EDED88E796389DEB5E8DA14F1FD56088CDA8BF98'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 20:
'FE4A24472BD3648F839FFAB7B5476915504A9755'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 21: 'link'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 22:
'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
 WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan: Update on
hopcount  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan: Update on
hopcount  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:   ->  Nested
Loop  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan:   ->  Nested
Loop  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:         ->
 HashAggregate  (cost=157.11..157.12 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
->  Hash Join  (cost=101.51..157.11 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->
 HashAggregate  (cost=157.11..157.12 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
    Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND
((t0.parentidhash)::text = (t1.parentidhash)::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:               ->
 Hash Join  (cost=101.51..157.11 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
    Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND
((t0.parentidhash)::text = (t1.parentidhash)::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
    ->  Index Scan using i1400371486543 on hopdeletedeps t0
 (cost=0.56..55.95 rows=27 width=109)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
    ->  Index Scan using i1400371486543 on hopdeletedeps t0
 (cost=0.56..55.95 rows=27 width=109)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
          Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
          Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
    ->  Hash  (cost=100.32..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
    ->  Hash  (cost=100.32..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
          ->  Index Scan using i1400371486547 on intrinsiclink t1
 (cost=0.56..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
                Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text)
AND (isnew = 'B'::bpchar))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
          ->  Index Scan using i1400371486547 on intrinsiclink t1
 (cost=0.56..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:         ->
 Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1
width=69)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
                Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text)
AND (isnew = 'B'::bpchar))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->
 Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1
width=69)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
Index Cond: (id = t0.ownerid)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
Index Cond: (id = t0.ownerid)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -
 WARN 2014-05-21 11:05:08,294 (Worker thread '40') -  Plan: LockRows
 (cost=0.56..101.40 rows=3 width=47) (actual time=0.041..0.041 rows=0
loops=1)
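
To put the durations in the dump above into more familiar units, here is a small Python sketch that pulls the worker thread and elapsed time out of each "Found a long-running query" warning and converts milliseconds to minutes. The sample strings are the first two wrapped lines of each warning, joined with a space. The regular expression and the function name are illustrative assumptions, not part of ManifoldCF.

```python
import re

# First two wrapped lines of each warning from the dump, joined with a space.
LOG_LINES = [
    "WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running "
    "query (2662579 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id",
    "WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Found a long-running "
    "query (2625296 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id",
    "WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running "
    "query (2675765 ms): [SELECT parentidhash FROM intrinsiclink WHERE jobid=?",
]

# Assumed pattern, written to match the warning format seen in this thread.
PATTERN = re.compile(
    r"\(Worker thread '(?P<thread>\d+)'\) - "
    r"Found a long-running query \((?P<ms>\d+) ms\)"
)


def long_running_queries(lines):
    """Return a (thread, minutes) pair for each long-running-query warning."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            minutes = int(match.group("ms")) / 60000
            results.append((match.group("thread"), minutes))
    return results


for thread, minutes in long_running_queries(LOG_LINES):
    print(f"worker thread {thread}: {minutes:.1f} min")
```

All three warnings come out at roughly 44 minutes, which agrees with the 40 to 60 minute stalls described in the message. They were also logged within the same second, and the plans shown have small estimated costs, which is at least consistent with the statements waiting on something rather than each doing 44 minutes of work.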

Re: Update on a now-fixed old problem and questions about database usage

Posted by Karl Wright <da...@gmail.com>.
Hi Tom,

What you are seeing is the result of hopcount logic, together with
ManifoldCF's periodic analysis and reindexing of sensitive tables.
Hopcount tracking in ManifoldCF is expensive, and if you don't actually
need it, I suggest you disable it (in your job, select "keep forever").
Periodically, MCF finds a document that causes the hopcount of many
already-crawled documents to shrink. The result is a great deal of
database activity. And, of course, while this is going on, MCF may well
decide it's time to reindex, which slows things down even more.

Karl
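
As a rough illustration of the point about hopcounts shrinking, the following Python sketch uses a toy in-memory link graph. It is not ManifoldCF's implementation, which maintains hopcounts incrementally in database tables such as hopcount, hopdeletedeps and intrinsiclink. The document names and links are made up for the example. Hop counts are computed by breadth-first search from a seed, and one newly discovered page that links deep into the site lowers the hop count of every document downstream of it.

```python
from collections import deque


def hop_counts(links, seed):
    """Breadth-first search: fewest links needed to reach each document."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        doc = queue.popleft()
        for child in links.get(doc, []):
            if child not in distance:
                distance[child] = distance[doc] + 1
                queue.append(child)
    return distance


# Hypothetical site: a simple chain seed -> a -> b -> c -> d -> e.
links = {"seed": ["a"], "a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
before = hop_counts(links, "seed")

# A newly crawled page "x", one hop from the seed, links straight to "c".
links["seed"].append("x")
links["x"] = ["c"]
after = hop_counts(links, "seed")

# Documents whose hop count got smaller, as (old, new) pairs.
shrunk = {
    doc: (before[doc], after[doc])
    for doc in before
    if after[doc] < before[doc]
}
print(shrunk)
```

In this toy case a single new link changes the stored distance for "c", "d" and "e". On a crawl of a few hundred thousand documents the same effect can touch a large number of rows at once, which matches the burst of database activity described above.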


Re: Update on a now-fixed old problem and questions about database usage

Posted by Karl Wright <da...@gmail.com>.
And, sorry, about the database size -- much of that is likely going into
your history table.  You can limit the amount of history stored, or disable
history entirely, by means of a configuration parameter.  Have a look at
the "how-to-build-and-deploy" page.

Karl
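
For a sense of scale, here is a back-of-the-envelope Python calculation using the two figures from the original message. Both are approximate ("a little over" 8 GB and "a little over" 255,000 documents), and treating 8 GB as 8 GiB is an assumption, so the result is only an order-of-magnitude estimate of database space per crawled document.

```python
def bytes_per_document(database_bytes, document_count):
    """Average database footprint per crawled document, in bytes."""
    return database_bytes / document_count


# Rounded figures from the original message; GB is taken as GiB here.
approx_db_bytes = 8 * 1024**3
approx_documents = 255_000

per_doc = bytes_per_document(approx_db_bytes, approx_documents)
print(f"{per_doc / 1024:.1f} KiB per document")
```

That is on the order of 30 KiB of database space per document. The database holds crawl bookkeeping such as link, hopcount and history records rather than document content, so how much of that footprint is history is something to confirm by checking table sizes before changing the configuration.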
