Posted to dev@manifoldcf.apache.org by "Mingchun Zhao (Jira)" <ji...@apache.org> on 2023/05/07 04:26:00 UTC

[jira] [Updated] (CONNECTORS-1746) Adding conditions to execute PostgreSQL's ANALYZE command to avoid crawling become extremely slow.

     [ https://issues.apache.org/jira/browse/CONNECTORS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingchun Zhao updated CONNECTORS-1746:
--------------------------------------
    Summary: Adding conditions to execute PostgreSQL's ANALYZE command to avoid crawling become extremely slow.  (was: Adding execution conditions of PostgreSQL's ANALYZE command to avoid crawling become extremely slow.)

> Adding conditions to execute PostgreSQL's ANALYZE command to avoid crawling become extremely slow.
> --------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1746
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1746
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>         Environment: I am using ManifoldCF 2.24 with PostgreSQL 12.14 as the database. 
>            Reporter: Mingchun Zhao
>            Priority: Major
>
> Sometimes crawling stops processing documents for a while, and nothing is logged about long-running queries. Performance can be restored by running the 'ANALYZE' command manually, which suggests that a bad query plan caused the problem.
> Therefore, in addition to the current configuration parameter org.apache.manifoldcf.db.postgres.analyze.<tablename>, I think 'ANALYZE' should also be executed in the following situations.
> 1. When, after the job starts, the number of records in the table exceeds the number required for creating a query plan.
> 2. When the crawling performance slows down. For example, if the document processing rate drops below a specified threshold. 
> How about adding two parameters to control the timing of 'ANALYZE' execution, as below?
> 1. `org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumrowcount`
> Specifies how many records must accumulate before 'ANALYZE' is run on the specified table for the first time. Defaults to 100.
> 2. `org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumprocessrate`
> Specifies the minimum number of documents processed in the last minute. If the actual processing rate falls below this, 'ANALYZE' will be run. Defaults to 1.
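
The two proposed conditions could be combined into a single check along the following lines. This is only a minimal sketch of the decision logic described above, not ManifoldCF code: the class `AnalyzeTrigger`, its method, and its arguments are hypothetical names introduced for illustration, and it assumes the caller already knows the table's current row count and the number of documents processed in the last minute.

```java
/**
 * Hypothetical sketch of the proposed ANALYZE conditions.
 * None of these names exist in ManifoldCF; they are illustrative only.
 */
final class AnalyzeTrigger {
    // Assumed to be read from the proposed ...<tablename>.minimumrowcount (default 100).
    private final long minimumRowCount;
    // Assumed to be read from the proposed ...<tablename>.minimumprocessrate (default 1).
    private final long minimumProcessRate;
    // Tracks whether the first, row-count-driven ANALYZE has already been requested.
    private boolean firstAnalyzeDone = false;

    AnalyzeTrigger(long minimumRowCount, long minimumProcessRate) {
        this.minimumRowCount = minimumRowCount;
        this.minimumProcessRate = minimumProcessRate;
    }

    /**
     * Returns true when ANALYZE should be run on the table.
     *
     * @param rowCount                current number of records in the table
     * @param docsProcessedLastMinute documents processed during the last minute
     */
    boolean shouldAnalyze(long rowCount, long docsProcessedLastMinute) {
        // Condition 1: request the first ANALYZE once enough rows have accumulated.
        if (!firstAnalyzeDone && rowCount >= minimumRowCount) {
            firstAnalyzeDone = true;
            return true;
        }
        // Condition 2: request ANALYZE when the processing rate falls below the threshold.
        return docsProcessedLastMinute < minimumProcessRate;
    }
}
```

With the proposed defaults, `new AnalyzeTrigger(100, 1)` would request one ANALYZE when the table first reaches 100 rows, and again whenever no documents were processed in the last minute. A real implementation would probably also need to skip the rate check while no job is running and to limit how often ANALYZE can be repeated; neither is covered by this sketch.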



--
This message was sent by Atlassian Jira
(v8.20.10#820010)