Posted to dev@manifoldcf.apache.org by Guylaine BASSETTE <gu...@francelabs.com> on 2023/05/22 14:52:34 UTC

RE: Control over number of processed documents per thread

Hi all,

I’m following up on this thread: after some more testing, it turns out
the performance problem was on our side.

The repository connector in question was the CSV connector. In
CSVConnector.java, getMaxDocumentRequest() returned 1, so
processDocuments() was called with one document at a time. I have now
set it to return 20 by default, and performance has improved greatly.
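
As an illustration of why this value matters, here is a minimal,
self-contained sketch. It is not the attached class and does not use the
ManifoldCF API: BaseConnector, CsvLikeConnector and batches() are
hypothetical stand-ins, and the 24,000 identifiers are only an
illustrative figure taken from the quoted message below. It assumes the
caller hands the connector at most getMaxDocumentRequest() identifiers
per processDocuments() call.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSizeSketch {

    /** Hypothetical stand-in for a connector base class; not the ManifoldCF API. */
    static class BaseConnector {
        /** Maximum number of identifiers handed to one processDocuments() call. */
        public int getMaxDocumentRequest() {
            return 1;
        }
    }

    /** Hypothetical CSV-style connector that asks for larger batches. */
    static class CsvLikeConnector extends BaseConnector {
        @Override
        public int getMaxDocumentRequest() {
            return 20;
        }
    }

    /** Splits the identifiers into batches no larger than the connector's limit. */
    static List<List<String>> batches(List<String> ids, BaseConnector connector) {
        int max = Math.max(1, connector.getMaxDocumentRequest());
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += max) {
            int end = Math.min(start + max, ids.size());
            result.add(new ArrayList<>(ids.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 24000; i++) {
            ids.add("row-" + i);
        }
        // Each batch corresponds to one processDocuments() call in this sketch.
        System.out.println(batches(ids, new BaseConnector()).size());    // prints 24000
        System.out.println(batches(ids, new CsvLikeConnector()).size()); // prints 1200
    }
}
```

Under these assumptions, raising the limit from 1 to 20 cuts the number
of calls by a factor of 20, so any fixed per-call overhead is paid far
less often.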

Attached is the modified class.

Regards,
Guylaine

France Labs – Your knowledge, now
Datafari Enterprise Search – Discover our version 5
www.datafari.com <http://www.datafari.com>


On 2023/03/17 17:36:47 Julien Massiera wrote:
 > Hi Karl,
 >
 > I was debugging a repository connector because I was disappointed with
 > the performance, and I noticed that the processDocuments method is
 > called each time with only 1 document identifier instead of a batch,
 > although the seeding phase has referenced 24k ids. What can explain
 > that? Can we have control over the number of documentIdentifiers passed
 > per processDocuments thread? For instance, assuming we know the ideal
 > number of documents that an API can process at once, it would be very
 > useful to be able to set it per thread.
 >
 > Another thing: I also noticed that the seeding phase and the cleanup
 > phase seem to process documents in groups of 100/200 at a time. Again,
 > is this configured somewhere, and can we have control over it?
 >
 > Thanks,
 >
 > Julien