Posted to user@manifoldcf.apache.org by SAUNIER Maxence <MS...@citya.com> on 2017/12/22 15:07:51 UTC

Problems and evolutions (CITYA)

Hello,

I am Maxence SAUNIER and I work at CITYA Immobilier in France. We currently use ManifoldCF to crawl several tens of millions of documents, and we are encountering various problems with it.

ManifoldCF API for scripts:

  *   In the web service /json/jobstatuses
     *   the job 'name' field is named 'description'. Is this a mistake?
     *   Would it be possible to have a 'job_name' field in addition to the 'machine' field?
        *   Today, I have to request the URL /json/jobs and save the result in local files for each of my servers, in order to map job IDs to the names that the script displays to its user. This extra request takes a lot of time that could be avoided.
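
For reference, the workaround described above can be sketched roughly as follows. This is only an illustration, not ManifoldCF code: it assumes that /json/jobs returns an object with a "job" entry whose items carry "id" and "description", and that /json/jobstatuses returns a "jobstatus" entry whose items carry "job_id" and "status". Those key names, and the helper names as_list and label_statuses, are assumptions made for the example. The functions work on already-parsed JSON so that no network access is needed.

```python
def as_list(value):
    """Return value as a list.

    Assumption: a JSON serializer may emit a single object instead of a
    one-element list, so both shapes are accepted here.
    """
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def label_statuses(jobs_response, statuses_response):
    """Attach a job name to every job status entry.

    Assumed shapes (not confirmed by the original message):
      jobs_response     -> {"job": [{"id": ..., "description": ...}, ...]}
      statuses_response -> {"jobstatus": [{"job_id": ..., "status": ...}, ...]}
    """
    # Build the id -> name lookup once instead of saving it to local files.
    names = {
        job["id"]: job.get("description", "")
        for job in as_list(jobs_response.get("job"))
    }
    return [
        {
            "job_id": entry["job_id"],
            "name": names.get(entry["job_id"], "<unknown>"),
            "status": entry.get("status"),
        }
        for entry in as_list(statuses_response.get("jobstatus"))
    ]
```

Called with the two parsed responses, label_statuses returns one dictionary per job status with the name filled in. A 'job_name' field in /json/jobstatuses would make the first request and this join unnecessary.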

IOWait:

  *   After a certain amount of time, we constantly have IOWait problems on the virtual machine. Here are the specifications and details.
  *   Virtual machine specifications:
     *   15K disk, 140 GB
     *   12 GB RAM
     *   4 vCPU
     *   RAM allocated to PostgreSQL: 7 GB
     *   RAM allocated to ManifoldCF: 4 GB
     *   Debian system, using 130 MB of RAM
  *   I investigated, and the cause of the I/O appears to be PostgreSQL and its ANALYZE queries. Screenshots are attached to this email.
  *   Why are there EXPLAIN queries?

---
"postgres";"postgres";"10.37.98.147";"2017-11-15 11:05:42.855741+01";"active";"SELECT datname, usename, client_addr, query_start, state, REGEXP_REPLACE(query, E' *[\\n\\r]+ *', ' ', 'g') AS query FROM pg_stat_activity ORDER BY query_start DESC;"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:42.036039+01";"idle";"SELECT * FROM agents"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.442853+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.41319+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.410481+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.308415+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.308415+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.301668+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.301102+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.300208+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.288904+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"127.0.0.1";"2017-11-15 11:05:36.24823+01";"active";"ANALYZE jobqueue"
"postgres";"postgres";"10.37.98.147";"2017-11-15 11:00:42.07803+01";"idle";"SELECT 1 FROM pg_available_extensions WHERE name='adminpack'"
"postgres";"postgres";"10.10.198.4";"2017-11-15 10:59:24.869264+01";"idle";"SELECT version();"
"manifoldbdd";"postgres";"10.10.198.4";"2017-11-15 10:59:19.29933+01";"idle";"SELECT rolname FROM pg_roles WHERE rolcanlogin ORDER BY 1"
---
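
To make an export like the one above easier to read, the rows can be tallied by user, state and first keyword of the query. The following is only a small sketch: it assumes the export is semicolon-separated with double-quoted fields in the six-column order of the SELECT shown on the first row, and the function name summarize_activity is hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_activity(export_text):
    """Count rows by (user, state, first keyword of the query).

    Assumes six double-quoted, semicolon-separated columns per row:
    datname, usename, client_addr, query_start, state, query.
    Lines with a different number of columns (such as '---') are skipped.
    """
    reader = csv.reader(io.StringIO(export_text), delimiter=";", quotechar='"')
    counts = Counter()
    for row in reader:
        if len(row) != 6:
            continue
        _datname, usename, _client_addr, _query_start, state, query = row
        # Keep only the leading keyword, e.g. ANALYZE, COMMIT or SELECT.
        keyword = query.split(None, 1)[0].upper() if query.strip() else ""
        counts[(usename, state, keyword)] += 1
    return counts
```

Running it on several exports taken a few seconds apart would show how often an active ANALYZE appears next to the idle COMMIT sessions.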


Local Tika content text:

  *   We need to store the 'content_text' of the indexed files in Solr. Although we created the fields 'content_fr', 'content_en', 'content', 'text' and 'content_text' and added them to Solr, the content is either not sent by ManifoldCF or not stored by Solr. In ManifoldCF, the local Tika has been set to send all metadata, and I don't know whether the problem comes from Tika, ManifoldCF or Solr. Is some configuration missing? Do you have a procedure for adding this content_text field without taking the language into account? (All our documents are in French.)

Thanks for your help.


Best regards,

