Posted to dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2009/07/08 14:37:43 UTC

Per-repository thread pool in Jackrabbit

Hi,

Prompted by the JCR-1818 changes I started thinking whether it would
make sense to have a single per-repository thread pool for use by all
the various background jobs (index merging, text extraction, cluster
sync, etc.) that we have going on while a repository is running. We
could also add simple task scheduling functionality to the pool if
needed. Such a global thread pool would be easier to manage and
configure, though it would reduce the amount of control one has over
the scheduling and capacity of individual tasks.
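
A minimal sketch of what such a per-repository pool could look like, built on the JDK's ScheduledThreadPoolExecutor. The class name RepositoryThreadPool and its methods are hypothetical and not part of Jackrabbit; the sketch only illustrates one shared pool serving both one-off and periodic background jobs.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for the single pool of one repository instance.
final class RepositoryThreadPool {
    private final ScheduledExecutorService pool;

    RepositoryThreadPool(int threads) {
        this.pool = new ScheduledThreadPoolExecutor(threads);
    }

    // One-off background job, for example an index merge.
    void execute(Runnable job) {
        pool.execute(job);
    }

    // Periodic background job, for example a cluster sync.
    void scheduleWithFixedDelay(Runnable job, long delayMillis) {
        pool.scheduleWithFixedDelay(job, delayMillis, delayMillis, TimeUnit.MILLISECONDS);
    }

    // Called once when the repository shuts down. Already submitted one-off
    // jobs still run; periodic jobs are cancelled by the default policy.
    boolean shutdown(long timeoutMillis) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        RepositoryThreadPool pool = new RepositoryThreadPool(2);
        pool.execute(() -> System.out.println("index merge (illustrative)"));
        pool.scheduleWithFixedDelay(() -> System.out.println("cluster sync (illustrative)"), 50);
        System.out.println("terminated: " + pool.shutdown(1000));
    }
}
```

The trade-off named above shows up directly: every job shares the one `threads` setting, so no individual job gets its own capacity.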

WDYT?

BR,

Jukka Zitting

Re: Per-repository thread pool in Jackrabbit

Posted by Guo Du <mr...@gmail.com>.

Interesting discussion. As I understand it, we have been discussing two
problems.

1. A global/main pool or sub-pools for repository-wide thread management.
Marcel listed a few requirements, each of which may need a separate pool.
Every pool may have a different size and other configuration, and the pools
shouldn't affect each other. So a single repository-wide global/main pool
won't fit in this case. My vote is to have multiple managed pools for
repository work.

2. The place to manage the pools for the repository.
The pool settings will affect the performance of the repository significantly,
and thread pools are an expensive resource. The pools should certainly be
configurable in a standalone repository, because our tests run in
this model. But the repository may also live inside Sling or another OSGi or
application server container that already has pool management functionality,
in which case the repository pools would be managed by the container. The
implementation of the pool management needs to take this situation into account.
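
The two points above could be combined roughly as follows. This is only a sketch under assumptions: RepositoryPools, the purpose names, and the pool sizes are all hypothetical. A factory decides whether each purpose gets its own repository-owned pool (standalone) or whether every purpose is served by an executor that a container already manages.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical registry of named pools; not an existing Jackrabbit class.
final class RepositoryPools {
    private final Map<String, Executor> pools = new ConcurrentHashMap<>();
    private final Function<String, Executor> factory;
    private final boolean ownsPools;

    private RepositoryPools(Function<String, Executor> factory, boolean ownsPools) {
        this.factory = factory;
        this.ownsPools = ownsPools;
    }

    // Standalone mode: the repository creates one fixed-size pool per purpose.
    static RepositoryPools standalone(Map<String, Integer> sizes, int defaultSize) {
        return new RepositoryPools(
            purpose -> Executors.newFixedThreadPool(sizes.getOrDefault(purpose, defaultSize)),
            true);
    }

    // Container mode: all purposes share an executor the container manages.
    static RepositoryPools managedBy(Executor containerExecutor) {
        return new RepositoryPools(purpose -> containerExecutor, false);
    }

    // Returns the pool for a purpose, creating it on first use.
    Executor get(String purpose) {
        return pools.computeIfAbsent(purpose, factory);
    }

    // Shuts down only pools the repository created itself.
    void shutdown() {
        if (!ownsPools) {
            return;
        }
        for (Executor e : pools.values()) {
            if (e instanceof ExecutorService) {
                ((ExecutorService) e).shutdown();
            }
        }
    }

    public static void main(String[] args) {
        RepositoryPools pools = standalone(Collections.singletonMap("indexing", 2), 1);
        pools.get("indexing").execute(() -> System.out.println("indexing job (illustrative)"));
        pools.get("cluster").execute(() -> System.out.println("cluster job (illustrative)"));
        pools.shutdown();
    }
}
```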

Just my 2 cents :)

--Guo

On Sun, Jul 12, 2009 at 9:12 PM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> 2009/7/8 Marcel Reutegger <ma...@gmx.net>:
> > - paralleled execution of some work. this is primarily to make use of
> > multi-core processors. execution should be distributed over and
> > executed by N threads which is a factor of the available processors.
>
> If I recall correctly we debated this already earlier. My point was
> that limiting the number of tasks to the number of available
> processors may not be a good approach as the tasks may be IO-bound or
> block for other reasons, in which case having more task threads would
> give you better throughput. But I recall being proven wrong, did we
> have some benchmark for that? Do you remember where this discussion
> was?
>
> > - Timers used in TransactionContext and MultiIndex. This could be
> > turned into a scheduling mechanism that could also be used by the
> > ClusterNode sync. Other classes that use periodic checks in a
> > background thread: DatabaseJournal (ClusterRevisionJanitor),
> > CooperativeFileLock (watch dog).
>
> Yep. Perhaps we could also reuse some of the scheduling functionality in
> Sling.
>
> > the more I think about it, the more I like your idea. but we should be
> > careful with a maximum size for a repository wide pool. extensive use
> > of the pool by a module should not lock up another module just because
> > there are no more idle threads. maybe that global pool shouldn't have
> > a maximum size...
>
> That might make sense. Perhaps we should have some concept of
> sub-pools (that borrow from the main pool) with fixed limits for tasks
> that need them (see above).
>
> BR,
>
> Jukka Zitting
>

-- 
Kind regards,

Du, Guo
__________________________________________________
Phone     : +353-86-176 6186
Email     : online@duguo.com
__________________________________________________
http://duguo.com  - Career Life Balance

Re: Per-repository thread pool in Jackrabbit

Posted by Julian Sedding <js...@day.com>.

Hello

Since Java 5 there is java.util.concurrent, which provides a
ScheduledThreadPoolExecutor[1]. Maybe this suits the requirements. And
it does not introduce another dependency.
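
As a rough illustration of how a periodic check (like the watch dog or janitor tasks mentioned in this thread) might run on a ScheduledThreadPoolExecutor instead of a dedicated Timer thread: the class and method names below are made up for the example, and the check body is a placeholder.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PeriodicCheckDemo {
    // Schedules a periodic check, waits until it has fired `runs` times,
    // then cancels it. Returns the number of runs seen (at least `runs`),
    // or -1 if the wait timed out or was interrupted.
    static int runPeriodicCheck(int runs, long periodMillis) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        AtomicInteger counter = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);
        // The returned handle allows cancelling this one task without
        // touching other tasks on the same scheduler.
        ScheduledFuture<?> handle = scheduler.scheduleAtFixedRate(() -> {
            counter.incrementAndGet(); // placeholder for the real check
            done.countDown();
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            return done.await(10, TimeUnit.SECONDS) ? counter.get() : -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            handle.cancel(false);
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("check ran " + runPeriodicCheck(3, 20) + " times");
    }
}
```

One difference from java.util.Timer worth noting: an uncaught exception in a TimerTask terminates the Timer's thread and with it all other scheduled tasks, whereas here an exception only suppresses further runs of the task that threw.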

Regards
Julian

[1] http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html



On Mon, Jul 13, 2009 at 2:57 PM, Felix Meschberger<fm...@gmail.com> wrote:
> Hi,
>
> Marcel Reutegger schrieb:
>> Hi,
>>
>> 2009/7/12 Jukka Zitting <ju...@gmail.com>:
>>> Hi,
>>>
>>> 2009/7/8 Marcel Reutegger <ma...@gmx.net>:
>>>> - paralleled execution of some work. this is primarily to make use of
>>>> multi-core processors. execution should be distributed over and
>>>> executed by N threads which is a factor of the available processors.
>>> If I recall correctly we debated this already earlier. My point was
>>> that limiting the number of tasks to the number of available
>>> processors may not be a good approach as the tasks may be IO-bound or
>>> block for other reasons, in which case having more task threads would
>>> give you better throughput. But I recall being proven wrong, did we
>>> have some benchmark for that? Do you remember where this discussion
>>> was?
>>
>> I don't remember either... But let's just start a new one.
>>
>> I think this very much depends on the work that needs to be distributed. there
>> is no proof that one way is better than the other. for CPU intensive work we'd
>> probably want to limit the number of concurrent tasks. for I/O intensive work
>> the concurrency should be higher.
>>
>> my above point was rather related to CPU intensive work, e.g. creating a posting
>> list while content is indexed. but of course there might be other work that may
>> be parallelized more aggressively.
>>
>> I guess the actual pool shouldn't care about that. some utility on top of the
>> pool should provide that functionality, i.e. execute a number of tasks with a
>> given level of concurrency. the utility would then dispatch the tasks to the
>> pool accordingly.
>>
>>>> - Timers used in TransactionContext and MultiIndex. This could be
>>>> turned into a scheduling mechanism that could also be used by the
>>>> ClusterNode sync. Other classes that use periodic checks in a
>>>> background thread: DatabaseJournal (ClusterRevisionJanitor),
>>>> CooperativeFileLock (watch dog).
>>> Yep. Perhaps we could also reuse some of the scheduling functionality in Sling.
>>
>> I'm not sure this is needed. the java rt library already comes with
>> Timer and TimerTask classes. our needs are very simple and I'm not sure
>> that justifies a new dependency.
>
> Yes, AFAICT Java also has ThreadPool implementations. If not, I urge us
> still _not_ to reinvent the wheel and to take something existing even if it
> would add a single dependency.
>
> Regards
> Felix
>
>>
>>>> the more I think about it, the more I like your idea. but we should be
>>>> careful with a maximum size for a repository wide pool. extensive use
>>>> of the pool by a module should not lock up another module just because
>>>> there are no more idle threads. maybe that global pool shouldn't have
>>>> a maximum size...
>>> That might make sense. Perhaps we should have some concept of
>>> sub-pools (that borrow from the main pool) with fixed limits for tasks
>>> that need them (see above).
>>
>> hmm, that doesn't sound flexible and generic. I just thought again how cool
>> it would be if we could deploy Jackrabbit on Google App Engine. that however
>> requires that all background threads are removed. if we have that generic
>> pool and the client code is adjusted accordingly, it could be as easy as turning
>> the pool into a direct executor variant ;) well, that's very optimistic but
>> sounds promising to me...
>>
>> regards
>>  marcel
>>
>

Re: Per-repository thread pool in Jackrabbit

Posted by Felix Meschberger <fm...@gmail.com>.

Hi,

Marcel Reutegger schrieb:
> Hi,
> 
> 2009/7/12 Jukka Zitting <ju...@gmail.com>:
>> Hi,
>>
>> 2009/7/8 Marcel Reutegger <ma...@gmx.net>:
>>> - paralleled execution of some work. this is primarily to make use of
>>> multi-core processors. execution should be distributed over and
>>> executed by N threads which is a factor of the available processors.
>> If I recall correctly we debated this already earlier. My point was
>> that limiting the number of tasks to the number of available
>> processors may not be a good approach as the tasks may be IO-bound or
>> block for other reasons, in which case having more task threads would
>> give you better throughput. But I recall being proven wrong, did we
>> have some benchmark for that? Do you remember where this discussion
>> was?
> 
> I don't remember either... But let's just start a new one.
> 
> I think this very much depends on the work that needs to be distributed. there
> is no proof that one way is better than the other. for CPU intensive work we'd
> probably want to limit the number of concurrent tasks. for I/O intensive work
> the concurrency should be higher.
> 
> my above point was rather related to CPU intensive work, e.g. creating a posting
> list while content is indexed. but of course there might be other work that may
> be parallelized more aggressively.
> 
> I guess the actual pool shouldn't care about that. some utility on top of the
> pool should provide that functionality, i.e. execute a number of tasks with a
> given level of concurrency. the utility would then dispatch the tasks to the
> pool accordingly.
> 
>>> - Timers used in TransactionContext and MultiIndex. This could be
>>> turned into a scheduling mechanism that could also be used by the
>>> ClusterNode sync. Other classes that use periodic checks in a
>>> background thread: DatabaseJournal (ClusterRevisionJanitor),
>>> CooperativeFileLock (watch dog).
>> Yep. Perhaps we could also reuse some of the scheduling functionality in Sling.
> 
> I'm not sure this is needed. the java rt library already comes with
> Timer and TimerTask classes. our needs are very simple and I'm not sure
> that justifies a new dependency.

Yes, AFAICT Java also has ThreadPool implementations. If not, I urge us
still _not_ to reinvent the wheel and to take something existing even if it
would add a single dependency.

Regards
Felix

> 
>>> the more I think about it, the more I like your idea. but we should be
>>> careful with a maximum size for a repository wide pool. extensive use
>>> of the pool by a module should not lock up another module just because
>>> there are no more idle threads. maybe that global pool shouldn't have
>>> a maximum size...
>> That might make sense. Perhaps we should have some concept of
>> sub-pools (that borrow from the main pool) with fixed limits for tasks
>> that need them (see above).
> 
> hmm, that doesn't sound flexible and generic. I just thought again how cool
> it would be if we could deploy Jackrabbit on Google App Engine. that however
> requires that all background threads are removed. if we have that generic
> pool and the client code is adjusted accordingly, it could be as easy as turning
> the pool into a direct executor variant ;) well, that's very optimistic but
> sounds promising to me...
> 
> regards
>  marcel
>

Re: Per-repository thread pool in Jackrabbit

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi,

2009/7/12 Jukka Zitting <ju...@gmail.com>:
> Hi,
>
> 2009/7/8 Marcel Reutegger <ma...@gmx.net>:
>> - paralleled execution of some work. this is primarily to make use of
>> multi-core processors. execution should be distributed over and
>> executed by N threads which is a factor of the available processors.
>
> If I recall correctly we debated this already earlier. My point was
> that limiting the number of tasks to the number of available
> processors may not be a good approach as the tasks may be IO-bound or
> block for other reasons, in which case having more task threads would
> give you better throughput. But I recall being proven wrong, did we
> have some benchmark for that? Do you remember where this discussion
> was?

I don't remember either... But let's just start a new one.

I think this very much depends on the work that needs to be distributed. there
is no proof that one way is better than the other. for CPU intensive work we'd
probably want to limit the number of concurrent tasks. for I/O intensive work
the concurrency should be higher.

my above point was rather related to CPU intensive work, e.g. creating a posting
list while content is indexed. but of course there might be other work that may
be parallelized more aggressively.

I guess the actual pool shouldn't care about that. some utility on top of the
pool should provide that functionality, i.e. execute a number of tasks with a
given level of concurrency. the utility would then dispatch the tasks to the
pool accordingly.
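
One way such a utility might look, as a sketch only: ConcurrencyLimiter and runAll are hypothetical names, and the approach (submitting a fixed number of "drain" jobs that work through a shared queue) is just one possible way to cap concurrency on top of a pool that knows nothing about it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrencyLimiter {
    // Runs all tasks on the given pool with at most `level` of them in
    // flight at any time, and blocks until the work is finished.
    // A task that throws ends its drain job; the failure is rethrown here.
    static void runAll(ExecutorService pool, List<? extends Runnable> tasks, int level) {
        Queue<Runnable> pending = new ConcurrentLinkedQueue<>(tasks);
        int workers = Math.min(level, tasks.size());
        List<Future<?>> drains = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            // Each drain job occupies one pool thread and takes tasks
            // from the queue until none are left.
            drains.add(pool.submit(() -> {
                Runnable next;
                while ((next = pending.poll()) != null) {
                    next.run();
                }
            }));
        }
        for (Future<?> drain : drains) {
            try {
                drain.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            int id = i;
            tasks.add(() -> System.out.println("task " + id + " on " + Thread.currentThread().getName()));
        }
        // CPU-bound work: cap the level at the number of processors.
        runAll(pool, tasks, Runtime.getRuntime().availableProcessors());
        pool.shutdown();
    }
}
```

For I/O intensive work the caller would simply pass a higher level; the pool itself stays unchanged.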

>> - Timers used in TransactionContext and MultiIndex. This could be
>> turned into a scheduling mechanism that could also be used by the
>> ClusterNode sync. Other classes that use periodic checks in a
>> background thread: DatabaseJournal (ClusterRevisionJanitor),
>> CooperativeFileLock (watch dog).
>
> Yep. Perhaps we could also reuse some of the scheduling functionality in Sling.

I'm not sure this is needed. the java rt library already comes with
Timer and TimerTask classes. our needs are very simple and I'm not sure
that justifies a new dependency.

>> the more I think about it, the more I like your idea. but we should be
>> careful with a maximum size for a repository wide pool. extensive use
>> of the pool by a module should not lock up another module just because
>> there are no more idle threads. maybe that global pool shouldn't have
>> a maximum size...
>
> That might make sense. Perhaps we should have some concept of
> sub-pools (that borrow from the main pool) with fixed limits for tasks
> that need them (see above).

hmm, that doesn't sound flexible and generic. I just thought again how cool
it would be if we could deploy Jackrabbit on Google App Engine. that however
requires that all background threads are removed. if we have that generic
pool and the client code is adjusted accordingly, it could be as easy as turning
the pool into a direct executor variant ;) well, that's very optimistic but
sounds promising to me...
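
A small sketch of the direct executor idea, assuming client code depends only on the java.util.concurrent.Executor interface. IndexMergeTrigger is a hypothetical stand-in for any module with background work; the direct executor itself is the pattern described in the Executor javadoc, where execute() simply runs the task in the calling thread.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical module that hands its work to whatever Executor it is given.
final class IndexMergeTrigger {
    private final Executor executor;

    IndexMergeTrigger(Executor executor) {
        this.executor = executor;
    }

    void requestMerge(Runnable merge) {
        executor.execute(merge);
    }
}

final class DirectExecutorDemo {
    // Runs each task synchronously in the caller's thread, so no
    // background thread is ever created.
    static final Executor DIRECT = Runnable::run;

    public static void main(String[] args) {
        Runnable merge = () ->
            System.out.println("merge on " + Thread.currentThread().getName());

        // Normal deployment: work goes to a pool thread.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        new IndexMergeTrigger(pool).requestMerge(merge);
        pool.shutdown();

        // Thread-restricted environment: same client code, no extra thread.
        new IndexMergeTrigger(DIRECT).requestMerge(merge);
    }
}
```

The catch, as hinted above, is that work which used to be asynchronous now blocks the caller, so this only helps where the client code tolerates that.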

regards
 marcel

Re: Per-repository thread pool in Jackrabbit

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

2009/7/8 Marcel Reutegger <ma...@gmx.net>:
> - paralleled execution of some work. this is primarily to make use of
> multi-core processors. execution should be distributed over and
> executed by N threads which is a factor of the available processors.

If I recall correctly we debated this already earlier. My point was
that limiting the number of tasks to the number of available
processors may not be a good approach as the tasks may be IO-bound or
block for other reasons, in which case having more task threads would
give you better throughput. But I recall being proven wrong, did we
have some benchmark for that? Do you remember where this discussion
was?

> - Timers used in TransactionContext and MultiIndex. This could be
> turned into a scheduling mechanism that could also be used by the
> ClusterNode sync. Other classes that use periodic checks in a
> background thread: DatabaseJournal (ClusterRevisionJanitor),
> CooperativeFileLock (watch dog).

Yep. Perhaps we could also reuse some of the scheduling functionality in Sling.

> the more I think about it, the more I like your idea. but we should be
> careful with a maximum size for a repository wide pool. extensive use
> of the pool by a module should not lock up another module just because
> there are no more idle threads. maybe that global pool shouldn't have
> a maximum size...

That might make sense. Perhaps we should have some concept of
sub-pools (that borrow from the main pool) with fixed limits for tasks
that need them (see above).
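
To make the sub-pool idea a bit more concrete, here is one possible shape, offered as a sketch rather than a design: SubPool is a hypothetical Executor facade that borrows threads from a main pool but never lets more than `limit` of its own tasks run at once. The main pool in the demo is unbounded (a cached thread pool), in line with the suggestion that the global pool might have no maximum size.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sub-pool that borrows threads from a main pool.
final class SubPool implements Executor {
    private final Executor main;
    private final int limit;
    private final Queue<Runnable> waiting = new ArrayDeque<>();
    private int running; // guarded by `waiting`

    SubPool(Executor main, int limit) {
        this.main = main;
        this.limit = limit;
    }

    @Override
    public void execute(Runnable task) {
        synchronized (waiting) {
            waiting.add(task);
        }
        dispatchNext();
    }

    // Hands one waiting task to the main pool if the limit allows it.
    private void dispatchNext() {
        final Runnable next;
        synchronized (waiting) {
            if (running >= limit || waiting.isEmpty()) {
                return;
            }
            next = waiting.poll();
            running++;
        }
        main.execute(() -> {
            try {
                next.run();
            } finally {
                synchronized (waiting) {
                    running--;
                }
                // A slot is free again, so start the next waiting task.
                dispatchNext();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService mainPool = Executors.newCachedThreadPool();
        SubPool indexing = new SubPool(mainPool, 2);
        CountDownLatch done = new CountDownLatch(6);
        for (int i = 0; i < 6; i++) {
            int id = i;
            indexing.execute(() -> {
                System.out.println("task " + id + " on " + Thread.currentThread().getName());
                done.countDown();
            });
        }
        done.await();
        mainPool.shutdown();
    }
}
```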

BR,

Jukka Zitting

Re: Per-repository thread pool in Jackrabbit

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi,

I've been thinking about this whenever I needed a new pool ;) but soon
found out that the requirements are always a bit different. We might
still want to give it a try because it really gets a bit messy when
each module creates and maintains its own pool.

here are the different requirements:

- parallel execution of some work. this is primarily to make use of
multi-core processors. execution should be distributed over and
executed by N threads, where N is derived from the number of available
processors. This is currently implemented in DynamicPooledExecutor, but I guess
this could also be split apart, using a generic pooled executor and a
utility on top that controls how many commands are executed at a time.

- Timers used in TransactionContext and MultiIndex. This could be
turned into a scheduling mechanism that could also be used by the
ClusterNode sync. Other classes that use periodic checks in a
background thread: DatabaseJournal (ClusterRevisionJanitor),
CooperativeFileLock (watch dog).

- execute work asynchronously (this kind of work might take a longer
time to finish): ObservationDispatcher (event notification),
IndexMerger

the more I think about it, the more I like your idea. but we should be
careful with a maximum size for a repository wide pool. extensive use
of the pool by a module should not lock up another module just because
there are no more idle threads. maybe that global pool shouldn't have
a maximum size...

regards
 marcel

Re: Per-repository thread pool in Jackrabbit

Posted by Felix Meschberger <fm...@gmail.com>.

Hi,

Definitely +1, and (of course) trying to reuse an existing implementation
such as the one from Sling or even the one included in the JDK is probably
another good idea ;-)

Regards
Felix
