Posted to user@manifoldcf.apache.org by Aeham Abushwashi <ae...@exonar.com> on 2014/12/12 13:11:29 UTC

stuffamountfactor and getting more work done

Hi,

Are there any gotchas one should be aware of when configuring property
"org.apache.manifoldcf.crawler.stuffamountfactor"?

At times, I see the manifold nodes in my cluster (and the postgresql box)
not utilising all the resources they have. I have configured 30 worker
threads which tend to sit idle waiting for documents (continuous crawl).
This led me to tweak the batch size of the Stuffer thread indirectly using
"org.apache.manifoldcf.crawler.stuffamountfactor" and setting it to 20 (I
believe the default is 2).
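
For reference, that just means a standard property entry in properties.xml
(a minimal sketch, assuming the stock layout of that file):

  <property name="org.apache.manifoldcf.crawler.stuffamountfactor" value="20"/>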

I understand that increasing the batch size results in a bigger result set
coming back from the database. If the size is in the 1000s I doubt it would
cause problems. My hope is a bigger stuffer batch would allow worker
threads to operate more efficiently and handle more documents where
possible.

Please let me know if there are any particular concerns/guidelines over
tweaking this config property or if there are better ways for increasing
the width of the processing pipeline for each manifold instance.

Thanks,
Aeham

Re: stuffamountfactor and getting more work done

Posted by Karl Wright <da...@gmail.com>.
Yes, I believe it is.

Karl


On Fri, Dec 12, 2014 at 1:09 PM, Aeham Abushwashi <
aeham.abushwashi@exonar.com> wrote:
>
> Thanks Karl.
>
> I see that JobManager#fetchAndProcessDocuments invokes
> database.beginTransaction soon after acquiring the lock. Is that
> transaction necessary?
>

Re: stuffamountfactor and getting more work done

Posted by Aeham Abushwashi <ae...@exonar.com>.
Thanks Karl.

I see that JobManager#fetchAndProcessDocuments invokes
database.beginTransaction soon after acquiring the lock. Is that
transaction necessary?

Re: stuffamountfactor and getting more work done

Posted by Karl Wright <da...@gmail.com>.
Hi Aeham,

Given that your stuffer thread has to wait for multiple other machines to
finish stuffing before it runs, it may make sense to increase the amount
stuffed at one time.  Unfortunately the stuffer lock has to remain because
otherwise the same document could be stuffed twice.  Using a database
transaction is unworkable in this context because of the tendency to
deadlock.
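
To make the shape of the problem concrete, one stuffing pass looks roughly
like this (a simplified sketch with made-up names, not the actual ManifoldCF
classes; in the real system the stuffer lock is shared across all nodes, not
an in-process lock):

  import java.util.Collections;
  import java.util.List;
  import java.util.concurrent.locks.ReentrantLock;

  /** Simplified sketch of one stuffing pass; all names here are hypothetical. */
  class StufferSketch
  {
    // Stand-in for the cross-cluster stuffer lock that every node contends for.
    private final ReentrantLock stufferLock = new ReentrantLock();

    void stuffOnce(int batchSize)
    {
      stufferLock.lock();          // blocks while any other node's stuffer is inside
      try
      {
        // 1. Query up to batchSize documents that are eligible right now.
        List<String> batch = queryEligibleDocuments(batchSize);
        // 2. Mark them "about to process" inside the same critical section, so no
        //    other node can select the same rows before they are claimed.
        markAsAboutToProcess(batch);
        // 3. Hand the batch to the local worker threads.
        handToWorkers(batch);
      }
      finally
      {
        stufferLock.unlock();      // only now can the next node's stuffer proceed
      }
    }

    // Placeholders for the database and queue operations.
    List<String> queryEligibleDocuments(int n) { return Collections.emptyList(); }
    void markAsAboutToProcess(List<String> batch) { }
    void handToWorkers(List<String> batch) { }
  }

Because the query and the status update both happen under the one lock, a
bigger batch per pass means fewer passes, and therefore less time spent
contending for the lock relative to useful work.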

Thanks,
Karl


On Fri, Dec 12, 2014 at 12:46 PM, Aeham Abushwashi <
aeham.abushwashi@exonar.com> wrote:
>
> Thanks Karl.
>
> The stuffer thread query isn't doing too badly. Judging by stats from the
> pg_stat_activity table in postgresql, the stuffer query usually takes < 2
> seconds to return.
>
>
> >> In a continuous job, documents may well be scheduled to be crawled at
> some time in the future, and are ineligible for crawling until that future
> time arrives.
>
> Such documents would be excluded by the stuffer query, right?
>
> Thanks for the pointer to the queue status page. Using the root server
> name as an identifier class, I get the bulk of documents grouped under the
> "About to Process" and "Waiting for Processing" categories. For example, I
> have a job with 677,856 and 102,342 docs respectively. Another job has
> 320,804 and 443,596 docs respectively. All other status categories have 0
> docs.
>
>
> >>  If there are tons of idle worker threads AND your stuffer thread is
> waiting on Postgresql, that's a good sign it is not keeping up due to
> database reasons.
>
> Interestingly, the stuffer thread spends the majority of its time trying
> to acquire the stuffer lock. I have 3 nodes in the cluster and each node's
> stuffer thread spends ~ 2/3 of its time blocked waiting for the lock. Of
> course the SQL query itself and connection grabbing/releasing all happen
> within the scope of the lock. The effect is that the more nodes there are
> in the cluster, the less time each node has for stuffing documents.
>

Re: stuffamountfactor and getting more work done

Posted by Aeham Abushwashi <ae...@exonar.com>.
Thanks Karl.

The stuffer thread query isn't doing too badly. Judging by stats from the
pg_stat_activity table in postgresql, the stuffer query usually takes < 2
seconds to return.
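
For reference, this is roughly the check I'm running against pg_stat_activity
(the column names assume PostgreSQL 9.2 or later):

  -- non-idle backends and how long their current query has been running
  SELECT pid, now() - query_start AS duration, state, query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY duration DESC;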


>> In a continuous job, documents may well be scheduled to be crawled at
some time in the future, and are ineligible for crawling until that future
time arrives.

Such documents would be excluded by the stuffer query, right?

Thanks for the pointer to the queue status page. Using the root server name
as an identifier class, I get the bulk of documents grouped under the
"About to Process" and "Waiting for Processing" categories. For example, I
have a job with 677,856 and 102,342 docs respectively. Another job has
320,804 and 443,596 docs respectively. All other status categories have 0
docs.


>>  If there are tons of idle worker threads AND your stuffer thread is
waiting on Postgresql, that's a good sign it is not keeping up due to
database reasons.

Interestingly, the stuffer thread spends the majority of its time trying to
acquire the stuffer lock. I have 3 nodes in the cluster and each node's
stuffer thread spends ~ 2/3 of its time blocked waiting for the lock. Of
course the SQL query itself and connection grabbing/releasing all happen
within the scope of the lock. The effect is that the more nodes there are
in the cluster, the less time each node has for stuffing documents.

Re: stuffamountfactor and getting more work done

Posted by Karl Wright <da...@gmail.com>.
FWIW, you can diagnose a slow stuffer query by getting a thread dump.  If
there are tons of idle worker threads AND your stuffer thread is waiting on
Postgresql, that's a good sign it is not keeping up due to database reasons.
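
Concretely, that just means something like the following against the agents
process (jstack ships with the JDK; the exact thread names in your dump may
differ):

  # take a thread dump of the agents process, then look at what the stuffer
  # thread and the worker threads are blocked on
  jstack <agents-process-pid> > threads.txt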

Karl


On Fri, Dec 12, 2014 at 7:23 AM, Karl Wright <da...@gmail.com> wrote:
>
> Hi Aeham,
>
> Before you assume that stuffing is just not happening fast enough, you
> will want to confirm that you have enough documents that are *eligible* for
> processing.  In a continuous job, documents may well be scheduled to be
> crawled at some time in the future, and are ineligible for crawling until
> that future time arrives.  You can get a better sense of this by using the
> document and queue status reports.
>
> If you only have 30 worker threads on your machine, it's extremely
> unlikely that you would find yourself unable to stuff documents fast enough
> with the default parameters.  The only way that would not be true is if
> your stuffer queries are performing badly, and that would be important to
> know too.
>
> Thanks,
> Karl
>
>
>
>
> On Fri, Dec 12, 2014 at 7:11 AM, Aeham Abushwashi <
> aeham.abushwashi@exonar.com> wrote:
>>
>> Hi,
>>
>> Are there any gotchas one should be aware of when configuring property
>> "org.apache.manifoldcf.crawler.stuffamountfactor"?
>>
>> At times, I see the manifold nodes in my cluster (and the postgresql box)
>> not utilising all the resources they have. I have configured 30 worker
>> threads which tend to sit idle waiting for documents (continuous crawl).
>> This led me to tweak the batch size of the Stuffer thread indirectly using
>> "org.apache.manifoldcf.crawler.stuffamountfactor" and setting it to 20 (I
>> believe the default is 2).
>>
>> I understand that increasing the batch size results in a bigger result
>> set coming back from the database. If the size is in the 1000s I doubt it
>> would cause problems. My hope is a bigger stuffer batch would allow worker
>> threads to operate more efficiently and handle more documents where
>> possible.
>>
>> Please let me know if there are any particular concerns/guidelines over
>> tweaking this config property or if there are better ways for increasing
>> the width of the processing pipeline for each manifold instance.
>>
>> Thanks,
>> Aeham
>>
>

Re: stuffamountfactor and getting more work done

Posted by Karl Wright <da...@gmail.com>.
Hi Aeham,

Before you assume that stuffing is just not happening fast enough, you will
want to confirm that you have enough documents that are *eligible* for
processing.  In a continuous job, documents may well be scheduled to be
crawled at some time in the future, and are ineligible for crawling until
that future time arrives.  You can get a better sense of this by using the
document and queue status reports.

If you only have 30 worker threads on your machine, it's extremely unlikely
that you would find yourself unable to stuff documents fast enough with the
default parameters.  The only way that would not be true is if your stuffer
queries are performing badly, and that would be important to know too.

Thanks,
Karl




On Fri, Dec 12, 2014 at 7:11 AM, Aeham Abushwashi <
aeham.abushwashi@exonar.com> wrote:
>
> Hi,
>
> Are there any gotchas one should be aware of when configuring property
> "org.apache.manifoldcf.crawler.stuffamountfactor"?
>
> At times, I see the manifold nodes in my cluster (and the postgresql box)
> not utilising all the resources they have. I have configured 30 worker
> threads which tend to sit idle waiting for documents (continuous crawl).
> This led me to tweak the batch size of the Stuffer thread indirectly using
> "org.apache.manifoldcf.crawler.stuffamountfactor" and setting it to 20 (I
> believe the default is 2).
>
> I understand that increasing the batch size results in a bigger result set
> coming back from the database. If the size is in the 1000s I doubt it would
> cause problems. My hope is a bigger stuffer batch would allow worker
> threads to operate more efficiently and handle more documents where
> possible.
>
> Please let me know if there are any particular concerns/guidelines over
> tweaking this config property or if there are better ways for increasing
> the width of the processing pipeline for each manifold instance.
>
> Thanks,
> Aeham
>