Posted to dev@ofbiz.apache.org by Scott Gray <sc...@hotwaxsystems.com> on 2019/04/07 08:05:47 UTC

Re: JobManager/JobPoller issues

Job prioritization is committed in r1857071; thanks to everyone who provided
their thoughts.

Regards
Scott

On Sat, 16 Mar 2019, 22:30 Scott Gray, <sc...@hotwaxsystems.com> wrote:

> Patch available at https://issues.apache.org/jira/browse/OFBIZ-10865
>
> Reviews welcome, I probably won't have time to commit it for a few weeks
> so no rush.
>
> By the way, I was amazed to notice that polling is limited to 100 jobs per
> poll with a 30 second poll interval, which seems extremely conservative.
> The jobs would have to be very slow for the executor not to sit idle most
> of the time.  If no one objects I'd like to increase this to 2000 jobs with
> a 10 second poll interval.
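>
> As a back-of-the-envelope comparison (a sketch of the arithmetic only; the
> numbers are the ones quoted above rather than anything read back from
> serviceengine.xml):
>
>     // Rough upper bound on queued-job throughput under the two settings:
>     int currentJobsPerPoll = 100;
>     long currentPollIntervalMs = 30_000L;
>     double currentMaxPerMinute =
>             currentJobsPerPoll * (60_000.0 / currentPollIntervalMs);    // ~200/min
>
>     int proposedJobsPerPoll = 2000;
>     long proposedPollIntervalMs = 10_000L;
>     double proposedMaxPerMinute =
>             proposedJobsPerPoll * (60_000.0 / proposedPollIntervalMs);  // ~12000/min
>
> So the current ceiling is roughly 200 jobs per minute per instance, versus
> roughly 12000 with the proposed values.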
>
> Thanks
> Scott
>
> On Tue, 26 Feb 2019 at 09:13, Scott Gray <sc...@hotwaxsystems.com>
> wrote:
>
>> Hi Jacques,
>>
>> I'm working on implementing the priority queue approach at the moment for
>> a client.  All things going well it will be in production in a couple of
>> weeks and I'll report back then with a patch.
>>
>> Regards
>> Scott
>>
>> On Tue, 26 Feb 2019 at 03:11, Jacques Le Roux <
>> jacques.le.roux@les7arts.com> wrote:
>>
>>> Hi,
>>>
>>> I added that comment as part of OFBIZ-10002, trying to document why we
>>> have 5 as the hardcoded value of the /max-threads/ attribute in the
>>> /thread-pool/ element (serviceengine.xml). At that time Scott had already
>>> mentioned[1]:
>>>
>>>     /Honestly I think the topic is generic enough that OFBiz doesn't need
>>>     to provide any information at all. Thread pool sizing is not exclusive
>>>     to OFBiz and it would be strange for anyone to modify the numbers
>>>     without first researching sources that provide far more detail than a
>>>     few sentences in our config files will ever cover./
>>>
>>> I agree with Scott and Jacopo that jobs are more likely to be I/O bound
>>> rather than CPU bound. So I agree that we should take that into account,
>>> change the current algorithm and remove this somewhat misleading comment.
>>> Scott's suggestion in his 2nd email sounds good to me. So if I understood
>>> correctly, we could use a queue that is unbounded in implementation but
>>> still effectively limited, like it was before.
>>>
>>>     Although with all of that said, after a quick second look it appears
>>>     that the current implementation doesn't try to poll for more jobs than
>>>     the configured limit (minus already queued jobs) so we might be fine
>>>     with an unbounded queue implementation.  We'd just need to alter the
>>>     call to JobManager.poll(int limit) to not pass in
>>>     executor.getQueue().remainingCapacity() and instead pass in something
>>>     like (threadPool.getJobs() - executor.getQueue().size())
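>>>
>>> Expressed as code, the quoted suggestion might look roughly like this (a
>>> sketch only; the surrounding poller loop and the way jobs are handed to
>>> the executor are assumptions, not the actual JobPoller implementation):
>>>
>>>     // Hypothetical sketch of the suggested change in the poller loop:
>>>     int limit = threadPool.getJobs() - executor.getQueue().size();
>>>     if (limit > 0) {
>>>         for (Job job : jm.poll(limit)) {  // instead of poll(remainingCapacity)
>>>             executor.execute(job);        // assumes Job is Runnable here
>>>         }
>>>     }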
>>>
>>> I'm fine with that as it would continue to prevent hitting physical
>>> limitations and can be tweaked by users as it is now. Note, though, that
>>> it seems hard to tweak, as we have already received several "complaints"
>>> about it.
>>>
>>> Now one of the advantages of a PriorityBlockingQueue is priority. And to
>>> take advantage of that we can't rely on /natural ordering/ and need to
>>> implement Comparable (which does not seem easy). Nicolas provided some
>>> leads below and this should be discussed. Ideally that would be
>>> parametrised, of course.
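>>>
>>> For illustration, a comparator-based ordering might look something like
>>> this (only a sketch; it assumes a hypothetical getPriority() accessor on
>>> Job and is not necessarily what was committed):
>>>
>>>     // Needs java.util.Comparator and java.util.concurrent.*.
>>>     // Sketch: order the executor's work queue by an explicit job priority
>>>     // instead of relying on natural ordering (getPriority() is assumed).
>>>     Comparator<Runnable> byPriority =
>>>             Comparator.comparingLong(r -> ((Job) r).getPriority());
>>>     BlockingQueue<Runnable> runQueue = new PriorityBlockingQueue<>(100, byPriority);
>>>
>>> Note that PriorityBlockingQueue is unbounded, so the 100 above is only an
>>> initial capacity, not a limit; any limit would have to be enforced when
>>> polling, as discussed above.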
>>>
>>> My 2 cts
>>>
>>> [1] https://markmail.org/message/ixzluzd44rgloa2j
>>>
>>> Jacques
>>>
>>> On 06/02/2019 at 14:24, Nicolas Malin wrote:
>>> > Hello Scott,
>>> >
>>> > On a customer project we use the job manager massively, with an average
>>> > of one hundred thousand jobs per day.
>>> >
>>> > We have different cases: huge long jobs, async persistent jobs, fast
>>> > regular jobs. The main problem we detected has been (as you noted) the
>>> > long jobs that stick to the poller's threads, and when we restart OFBiz
>>> > (we are on continuous delivery) we had no window to do so without
>>> > crashing some jobs.
>>> >
>>> > To solve this, I tried with Gil to analyze whether we could add some
>>> > weighting to the job definition to help the job manager decide which
>>> > jobs from the pending queue it can push onto the run queue. We changed
>>> > our approach and created two pools: one for system maintenance and huge
>>> > long jobs, managed by two OFBiz instances, and another for user-activity
>>> > jobs, also managed by two instances. We also added a field on the
>>> > service definition to indicate the preferred pool.
>>> >
>>> > This isn't a big deal and doesn't resolve the stuck pool, but the jobs
>>> > that get blocked aren't vital for daily activity.
>>> >
>>> > For crashed jobs, we introduced in trunk a service lock that we set
>>> > before an update, and then wait for a window to restart.
>>> >
>>> > At this time, for every OOM detected we re-analyse the originating job
>>> > and try to decompose it into persistent async services to help spread
>>> > the load.
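>>> >
>>> > (As an illustration of that decomposition, a sketch assuming a
>>> > hypothetical "processChunk" service; the chunking itself is elided:)
>>> >
>>> >     // Instead of one huge job, schedule each chunk as its own
>>> >     // persistent async job ("processChunk" is hypothetical).
>>> >     for (List<GenericValue> chunk : chunks) {
>>> >         dispatcher.runAsync("processChunk",
>>> >                 UtilMisc.toMap("records", chunk),
>>> >                 true);  // persist=true: survives a restart as a JobSandbox record
>>> >     }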
>>> >
>>> > If I had more time, I would orient job improvement towards:
>>> >
>>> >  * Defining an execution plan rule to link services and pollers without
>>> >    touching any service definition
>>> >
>>> >  * Defining per-instance configuration for the job vacuum, to refine it
>>> >    by service volume
>>> >
>>> > This feedback is a bit muddled, Scott; maybe you will find something
>>> > interesting in it.
>>> >
>>> > Nicolas
>>> >
>>> > On 30/01/2019 20:47, Scott Gray wrote:
>>> >> Hi folks,
>>> >>
>>> >> Just jotting down some issues with the JobManager I've noticed over the
>>> >> last few days:
>>> >> 1. min-threads in serviceengine.xml is never exceeded unless the job
>>> >> count in the queue exceeds 5000 (or whatever is configured); see the
>>> >> executor sketch below this list.  Is this not obvious to anyone else?
>>> >> I don't think this was the behavior prior to a refactoring a few years
>>> >> ago.
>>> >> 2. The advice on the number of threads to use doesn't seem good to me;
>>> >> it assumes your jobs are CPU bound, when in my experience they are more
>>> >> likely to be I/O bound while making db or external API calls, sending
>>> >> emails etc.  With the default setup, it only takes two long running
>>> >> jobs to effectively block the processing of any others until the queue
>>> >> hits 5000 and the other threads are finally opened up.  If you're not
>>> >> quickly maxing out the queue then any other jobs are stuck until the
>>> >> slow jobs finally complete.
>>> >> 3. Purging old jobs doesn't seem to be well implemented to me; from
>>> >> what I've seen the system is only capable of clearing a few hundred per
>>> >> minute, and if you've filled the queue with them then regular jobs have
>>> >> to queue behind them and can take many minutes to finally be executed.
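>>> >>
>>> >> (Regarding point 1, a standalone sketch of the executor behaviour,
>>> >> plain java.util.concurrent rather than OFBiz code, with illustrative
>>> >> numbers taken from this thread: a ThreadPoolExecutor only grows past
>>> >> its core size once the work queue rejects an offer, so with a large
>>> >> bounded queue the extra threads appear very late.)
>>> >>
>>> >>     import java.util.concurrent.*;  // ThreadPoolExecutor, LinkedBlockingQueue, TimeUnit
>>> >>
>>> >>     ThreadPoolExecutor executor = new ThreadPoolExecutor(
>>> >>             2,                                 // core size ("min-threads")
>>> >>             5,                                 // max size ("max-threads")
>>> >>             60, TimeUnit.SECONDS,
>>> >>             new LinkedBlockingQueue<>(5000));  // bounded work queue
>>> >>     // Threads 3-5 are only created once 5000 tasks are already queued,
>>> >>     // so two long-running tasks can monopolise the pool for a long time.
>>> >>     // (For I/O bound work a common sizing heuristic is roughly
>>> >>     // threads ~= cores * (1 + waitTime / computeTime), i.e. well above
>>> >>     // the core count.)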
>>> >>
>>> >> I'm wondering if anyone has experimented with reducing the queue size?
>>> >> I'm considering reducing it to say 100 jobs per thread (along with
>>> >> increasing the thread count).  In theory it would reduce the time real
>>> >> jobs have to sit behind PurgeJobs and would also open up additional
>>> >> threads for use earlier.
>>> >>
>>> >> Alternatively I've pondered trying a PriorityBlockingQueue for the job
>>> >> queue (unfortunately the implementation is unbounded though so it isn't
>>> >> a drop-in replacement) so that PurgeJobs always sit at the back of the
>>> >> queue.  It might also allow prioritizing certain "user facing" jobs
>>> >> (such as asynchronous data imports) over lower priority less time
>>> >> critical jobs.  Maybe another option (or in conjunction) is some sort
>>> >> of "swim-lane" queue/executor that allocates jobs to threads based on
>>> >> prior execution speed so that slow running jobs can never use up all
>>> >> threads and block faster jobs.
>>> >>
>>> >> Any thoughts/experiences you have to share would be appreciated.
>>> >>
>>> >> Thanks
>>> >> Scott
>>> >>
>>> >
>>>
>>