Posted to user@manifoldcf.apache.org by LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <Da...@developpement-durable.gouv.fr> on 2019/02/08 13:07:37 UTC

ManifoldCF + Postgresql - long freeze on job

Hello,

We use ManifoldCF v2.10 with PostgreSQL 9.6 to crawl our websites,
which amounts to approximately 1.2 million documents.
We split the crawl into 4 jobs that distribute their results across 3 Solr
collections.
The crawl performs well up to about 500,000 documents (25,000 to 30,000 docs /
hour), then performance degrades sharply as the crawl progresses. We observe
very long freezes, to the point that the crawl appears to have stopped.
We suspect a reindexing, notably of the intrinsiclink table, which is
very large (85 million rows).
Is it possible to disable the reindexing triggered by ManifoldCF?
Any other ideas?

best Regards
LIROT daniel
-- 

Re: ManifoldCF + Postgresql - long freeze on job

Posted by Karl Wright <da...@gmail.com>.
There is no such specific value.  But you can practically disable this
entirely by setting a very large value, e.g. 2000000000.

Karl
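
For readers who want to see what this suggestion might look like, here is a
minimal sketch of a properties.xml fragment. The reindex property name is the
one quoted elsewhere in this thread; the analyze property is an assumption
based on the same naming pattern and should be checked against the
"Advanced properties.xml properties" table before use.

```xml
<!-- Sketch only: lines to place alongside the other <property> entries in properties.xml. -->
<!-- A very large threshold, so that the reindex of intrinsiclink is in practice never triggered. -->
<property name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink" value="2000000000" />
<!-- Assumption: the analyze counterpart follows the same naming pattern. -->
<property name="org.apache.manifoldcf.db.postgres.analyze.intrinsiclink" value="2000000000" />
```

The same pattern would be repeated for each table whose maintenance should be
suppressed. Note that this only removes the ManifoldCF-triggered maintenance;
it does not address a bad query plan, if that is the actual cause.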

On Mon, Feb 11, 2019 at 7:43 AM LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <
Daniel.Lirot@developpement-durable.gouv.fr> wrote:

> Hi,
>
> We found the table "Advanced properties.xml properties" and use it to set
>   <property
> name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink"
> value="5000000" />
> for the intrinsiclink table, and we do the same for the other tables.
> Is there a value that disables the reindex and the analyze entirely,
> for example "-1" or "0"? I did not find one in the documentation.
>
> Thank you

Re: ManifoldCF + Postgresql - long freeze on job

Posted by LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <Da...@developpement-durable.gouv.fr>.
Hi,

We found the table "Advanced properties.xml properties" and use it to set
<property
name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink"
value="5000000" />
for the intrinsiclink table, and we do the same for the other tables.
Is there a value that disables the reindex and the analyze entirely,
for example "-1" or "0"? I did not find one in the documentation.

Thank you


On 11/02/2019 at 12:26, Karl Wright (via Internet, posted through
user-return-5690-daniel.lirot=developpement-durable.gouv.fr@manifoldcf.apache.org)
wrote:
> See: 
> https://manifoldcf.apache.org/release/release-1.10/en_US/how-to-build-and-deploy.html#file+properties
>
> Look at the table "Advanced properties.xml properties"
>
> Karl


Re: ManifoldCF + Postgresql - long freeze on job

Posted by Karl Wright <da...@gmail.com>.
See:
https://manifoldcf.apache.org/release/release-1.10/en_US/how-to-build-and-deploy.html#file+properties

Look at the table "Advanced properties.xml properties"

Karl


On Mon, Feb 11, 2019 at 4:16 AM LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <
Daniel.Lirot@developpement-durable.gouv.fr> wrote:

> Hello,
>
> 1/ The database we use is PostgreSQL version 9.6.
>
> 2/ I will look at what the logs say about the queries.
>
> 3/ We run a VACUUM FULL ANALYZE every 24 hours. For each table we set
> the reindex threshold to 5000000 (in properties.xml) with a line such as:
>  <property name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink"
> value="5000000" />
>
> Is there a setting that disables the reindex requested by
> ManifoldCF?
>
> thanks
>
> Daniel

Re: ManifoldCF + Postgresql - long freeze on job

Posted by LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <Da...@developpement-durable.gouv.fr>.
Hello,

1/ The database we use is PostgreSQL version 9.6.

2/ I will look at what the logs say about the queries.

3/ We run a VACUUM FULL ANALYZE every 24 hours. For each table we set
the reindex threshold to 5000000 (in properties.xml) with a line such as:
  <property
name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink"
value="5000000" />

Is there a setting that disables the reindex requested by
ManifoldCF?

thanks

Daniel


On 08/02/2019 at 16:00, Karl Wright (via Internet, posted through
user-return-5674-daniel.lirot=developpement-durable.gouv.fr@manifoldcf.apache.org)
wrote:
> Hello,
>
> (1) What database are you using for this?  Some databases require 
> maintenance periodically or have other heavy usage constraints.
> (2) Every time a query takes more than a minute to execute, it is
> logged, along with the query plan.  You need to look at the manifoldcf
> log to see which queries are problematic before concluding anything.
> (3) For every database table, you can individually configure
> approximately how many table operations occur before MCF re-analyzes the
> table.  However, it's likely that you have the opposite problem: a bad
> query plan for the query that queues documents for processing.  Preventing
> that may require more frequent analysis.  But we cannot tell until we
> understand which queries are taking a long time.
>
> Thanks,
> Karl


Re: ManifoldCF + Postgresql - long freeze on job

Posted by Karl Wright <da...@gmail.com>.
Hello,

(1) What database are you using for this?  Some databases require
maintenance periodically or have other heavy usage constraints.
(2) Every time a query takes more than a minute to execute, it is logged,
along with the query plan.  You need to look at the manifoldcf log to see
which queries are problematic before concluding anything.
(3) For every database table, you can individually configure approximately
how many table operations occur before MCF re-analyzes the table.  However,
it's likely that you have the opposite problem: a bad query plan for the
query that queues documents for processing.  Preventing that may require
more frequent analysis.  But we cannot tell until we understand which
queries are taking a long time.

Thanks,
Karl
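
As an illustration of point (3), a per-table analyze threshold might be
lowered in properties.xml roughly as follows. This is a sketch under
assumptions: the property pattern is taken from the "Advanced properties.xml
properties" documentation table, the choice of the jobqueue table (as the one
behind the document-queuing query) is an assumption, and the value 10000 is
purely illustrative, not a recommendation.

```xml
<!-- Sketch only: ask ManifoldCF to re-analyze a table more often. -->
<!-- Assumptions: "jobqueue" as the table name and 10000 as the value are illustrative. -->
<property name="org.apache.manifoldcf.db.postgres.analyze.jobqueue" value="10000" />
```

Whether a lower or a higher threshold helps depends on which queries the
manifoldcf log reports as slow, so the log should be checked first.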



On Fri, Feb 8, 2019 at 8:07 AM LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <
Daniel.Lirot@developpement-durable.gouv.fr> wrote:

> Hello,
>
> We use ManifoldCF v2.10 with PostgreSQL 9.6 to crawl our websites,
> which amounts to approximately 1.2 million documents.
> We split the crawl into 4 jobs that distribute their results across 3 Solr
> collections.
> The crawl performs well up to about 500,000 documents (25,000 to 30,000
> docs / hour), then performance degrades sharply as the crawl progresses.
> We observe very long freezes, to the point that the crawl appears to have
> stopped.
> We suspect a reindexing, notably of the intrinsiclink table, which is
> very large (85 million rows).
> Is it possible to disable the reindexing triggered by ManifoldCF?
> Any other ideas?
>
> best Regards
> LIROT daniel
> --
>