Posted to dev@ode.apache.org by Alex Boisvert <bo...@intalio.com> on 2007/06/06 23:24:51 UTC

Ode Performance: Round I

Howza,

I started testing a short-lived process implementing a single
request-response operation.  The process structure is as follows:

-Receive Purchase Order
-Do some assignments (schema mappings)
-Invoke CRM system to record the new PO
-Do more assignments (schema mappings)
-Invoke ERP system to record a new work order
-Send back an acknowledgment

Some deployment notes:
-All WS operations are SOAP/HTTP
-The process is deployed as "in-memory"
-The CRM and ERP systems are mocked as Axis2 services (as dumb as can be to
avoid bottlenecks)
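For illustration, such a mock can be as simple as an Axis2-style POJO service (the class and operation names here are made up, not the actual test services):

```java
// Hypothetical mock of the CRM endpoint: an Axis2-style POJO service
// that does the least possible work so it never becomes the bottleneck
// in the benchmark -- it just acknowledges the request.
public class CrmMockService {

    // Record a purchase order: immediately acknowledge, do nothing else.
    public String recordPurchaseOrder(String purchaseOrder) {
        return "<ack>OK</ack>";
    }
}
```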

After fixing a few minor issues (to handle the load) and a few obvious
code inefficiencies, which together gave us roughly a 20% gain, we are now
at near-100% CPU utilization.  (I'm testing on my dual-core system)   As it
stands, Ode clocks about 70 transactions per second.

Is this good?  I'd say there's room for improvement.  Based on previous work
in the field, I estimate we could get up to 300-400 transactions/second.

How do we improve this?  Well, looking at the end-to-end execution of the
process, I counted 4 thread-switches and 4 JTA transactions.  Those are not
really necessary, if you ask me.  I think significant improvements could be
made if we could run this process straight-through, meaning in a single
thread and a single transaction.  (Not to mention it would make things
easier to monitor and measure ;)

Also, to give you an idea, the top 3 areas where we spend most of our CPU
today are:

1) Serialization/deserialization of the Jacob state (I estimate about
40-50%)
2) XML marshaling/unmarshaling (About 10-20%)
3) XML processing:  XPath evaluation + assignments (About 10-20%)

(The rest would be about 20%; I need to load up JProbe or DTrace to provide
more accurate measurements.  My current estimates are a mix of
non-scientific statistical sampling of thread dumps and a quick run with the
JVM's built-in profiler)

So my general question is...  how do we get started on the single thread +
single transaction refactoring?    Has anybody already given some thought
to this?  Are there any pending design issues before we start?  How do we
work on this without disrupting other parts of the system?  Do we start a
new branch?

alex

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
>
> Also, to give you an idea, the top 3 areas where we spend most of our CPU
> today are:
>
> 1) Serialization/deserialization of the Jacob state (I'm evaluating about
> 40-50%)
> 2) XML marshaling/unmarshaling (About 10-20%)
> 3) XML processing:  XPath evaluation + assignments (About 10-20%)



Ok, so I hacked something to avoid serialization of the Jacob state for
in-memory processes, and now we're up to 160 transactions/second (up from 70
txns/sec before).   Yay!
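The shape of the optimization (a minimal illustrative sketch only, not the actual change -- class and method names are made up) is to keep the live object graph for in-memory instances instead of round-tripping it through serialization:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: for "in-memory" process instances the engine can
// keep the live Jacob execution state keyed by instance id, paying the
// serialization cost only on the persistent path.
class InstanceStateStore {
    private final Map<Long, Object> liveState = new ConcurrentHashMap<>();

    void save(long instanceId, Object jacobState, boolean inMemory) {
        if (inMemory) {
            liveState.put(instanceId, jacobState);      // no serialization
        } else {
            persist(instanceId, serialize(jacobState)); // old path
        }
    }

    Object load(long instanceId, boolean inMemory) {
        if (inMemory) {
            return liveState.get(instanceId);           // no deserialization
        }
        return deserialize(fetch(instanceId));
    }

    // Placeholders standing in for the persistent path.
    private byte[] serialize(Object state) { return new byte[0]; }
    private Object deserialize(byte[] bytes) { return null; }
    private void persist(long id, byte[] bytes) {}
    private byte[] fetch(long id) { return new byte[0]; }
}
```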

alex

Re: Ode Performance: Round I

Posted by Assaf Arkin <ar...@intalio.com>.
On 6/8/07, Alex Boisvert <bo...@intalio.com> wrote:
>
> As a first step, I was thinking of allowing the composition of work that
> is
> currently done in several unrelated threads into a single thread, by
> introducing a WorkQueue
>
> Right now we have code in the engine, such as
> org.apache.ode.axis2.ExternalService.invoke() -> afterCompletion() that
> uses
> ExecutorService.submit(...) and I'd like to convert this into
> WorkQueue.submit().
>
> For example, this means that org.apache.ode.axis2.OdeService would first
> execute the transaction around odeMex.invoke() and after commit it would
> dequeue and execute any pending items in the WorkQueue.  We would also
> need
> to do the same in BpelEngineImpl.onScheduledJob() and other similar engine
> entrypoints.
>
> The outcome of this is that we could execute all the "non-blocking" work
> related to an external event in a single thread, if desired.   Depending
> on
> the WorkQueue implementation, we could have pure serial processing,
> parallel
> processing (like now), or even a mix in-between (e.g. limiting concurrent
> processing to N threads for a given instance).   This would allow for
> optimizing response time or throughput based on the engine policy, or if
> we
> want to get sophisticated, by process model.
>
> I think this change is relatively straightforward that it could happen in
> the trunk without disrupting it.
>
> Thoughts?


I think it will be extremely confusing when code starts misbehaving because
it assumed the queue was thread-local and someone configured it differently,
or assumed it dequeued asynchronously and it stopped doing that.

I would prefer the least surprise: something like a ThreadLocalQueue that
you can only use within its thread, and a regular Queue that always runs
things in a separate thread.
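A minimal sketch of that least-surprise split (hypothetical classes, not anything in ODE today):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: a queue that is explicitly thread-local and
// drained only by its owning thread...
class ThreadLocalWorkQueue {
    private static final ThreadLocal<Queue<Runnable>> QUEUE =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Enqueue work; only the current thread will ever see or run it.
    static void submit(Runnable task) {
        QUEUE.get().add(task);
    }

    // Drain and run pending items in the calling thread, e.g. right
    // after the surrounding transaction commits.
    static void drain() {
        Queue<Runnable> q = QUEUE.get();
        Runnable task;
        while ((task = q.poll()) != null) {
            task.run();
        }
    }
}

// ...versus a regular queue whose items always run on a separate thread,
// so callers are never surprised by in-thread execution.
class AsyncWorkQueue {
    private final ExecutorService executor = Executors.newCachedThreadPool();

    void submit(Runnable task) {
        executor.submit(task);
    }

    void shutdown() {
        executor.shutdown();
    }
}
```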

Assaf

alex
>
> On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
> >
> > sure..
> >
> >
> > On 6/7/07, Alex Boisvert <bo...@intalio.com> wrote:
> > > Ok, got it.   Do you want to go ahead and create the
> "straight-through"
> > > branch?
> > >
> > > alex
> > >
> > >
> > > On 6/7/07, Maciej Szefler <mb...@intalio.com> wrote:
> > > >
> > > > If the IL supports ASYNC, then it is used, otherwise BLOCKING would
> be
> > > > used. We want to keep this, because if the IL does indeed use ASYNC
> > > > style (for example if this is a JMS ESB), then likely we don't have
> > > > much in the way of performance guarantees, i.e. the thread may end
> up
> > > > being blocked for a day, which would quickly lead to resource
> > > > problems.
> > > >
> > > > -mbs
> > > >
> > > > On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
> > > > > Maciej,
> > > > >
> > > > > I'm unclear about how the engine would choose between BLOCKING and
> > > > ASYNC.
> > > > >
> > > > > I tend to think we need only BLOCKING and the IL deals with the
> fact
> > > > that it
> > > > > might have to suspend and resume itself if the underlying
> invocation
> > is
> > > > > async (e.g. JBI).   What's the use-case for ASYNC?
> > > > >
> > > > > alex
> > > > >
> > > > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > > > >
> > > > > > Forwarding on behalf of Maciej (mistakingly replied privately):
> > > > > >
> > > > > >
> > > > > >
> > > >
> >
> -----------------------------------------------------------------------------------------------------------------
> > > > > >
> > > > > > ah yes. ok, here's my theory on getting the behavior alex wants;
> > this
> > > > > > i think is a fairly concrete way to get the different use cases
> we
> > > > > > outlined on the white board.
> > > > > >
> > > > > > 1) create the notion of an invocation style: BLOCKING, ASYNC,
> > > > > > RELIABLE, and TRANSACTED.
> > > > > > 2) add MessageExchangeContext.isStyleSupported(PartnerMex,
> Style)
> > > > method
> > > > > > 3) modify the MessageExchangeContext.invokePartner method to
> take
> > a
> > > > > > style parameter.
> > > > > >
> > > > > > In BLOCKING style the IL simply does the invoke, right then and
> > there,
> > > > > > blocking the thread. (our axis IL would support this style)
> > > > > >
> > > > > > In ASYNC style, the IL does not block; instead it sends us a
> > > > > > notification when the response is available. (JBI likes this
> style
> > the
> > > > > > most).
> > > > > >
> > > > > > In RELIABLE, the request would be enrolled in the current TX,
> > response
> > > > > > delievered asynch as above (in a new tx)
> > > > > >
> > > > > > in TRANSACTED, the behavior is like BLOCKING, but the TX context
> > is
> > > > > > propagted with the invocation.
> > > > > >
> > > > > > The engine would try to use the best style given the
> > circumstances.
> > > > > > For example, for in-mem processes it would prefer to use the
> > > > > > TRANSACTED style and it could do it "in-line", i.e. as part of
> the
> > > > > > <invoke> or right after it runs out of reductions.  If the style
> > is
> > > > > > not supported it could 'downgrade' to the BLOCKING style, which
> > would
> > > > > > work in the same way. If BLOCKING were not supported, then ASYNC
> > would
> > > > > > be the last resort, but this would force us to serialize.
> > > > > >
> > > > > > For persisted processes, we'd prefer RELIABLE in general,
> > TRANSACTED
> > > > > > when inside an atomic scope, otherwise either BLOCKING or ASYNC.
> > > > > > However, here use of BLOCKING or ASYNC would result in
> additional
> > > > > > transactions since we'd need to persist the fact that the
> > invocation
> > > > > > was made. Unless of course the operation is marked as
> "idempotent"
> > in
> > > > > > which case we could use the BLOCKING call without a checkpoint.
> > > > > >
> > > > > > How does that sound?
> > > > > > -mbs
> > > > > >
> > > > > >
> > > > > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > > > > >
> > > > > > > Actually for in-memory processes, it would save us all reads
> and
> > > > writes
> > > > > > > (we should never read or write it in that case). And for
> > persistent
> > > > > > > processes, then it will save a lot of reads (which are still
> > > > expensive
> > > > > > > because of deserialization).
> > > > > > >
> > > > > > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Two things:
> > > > > > > >
> > > > > > > > 1. We should also consider caching the Jacob state. Instead
> of
> > > > always
> > > > > > > > serializing / writing and reading / deserializing, caching
> > those
> > > > > > states
> > > > > > > > could save us a lot of reads.
> > > > > > > >
> > > > > > > > 2. Cutting down the transaction count is a significant
> > refactoring
> > > > so
> > > > > > I
> > > > > > > > would start a new branch for that (maybe ODE 2.0?). And
> we're
> > > > going to
> > > > > > > > need a lot of tests to chase regressions :)
> > > > > > > >
> > > > > > > > I think 1 could go without a branch. It's not trivial but I
> > don't
> > > > > > think
> > > > > > > > it would take more than a couple of weeks (I would have to
> get
> > > > deeper
> > > > > > into
> > > > > > > > the code to give a better evaluation).
> > > > > > > >
> > > > > > > > On 6/6/07, Alex Boisvert < boisvert@intalio.com> wrote:
> > > > > > > > >
> > > > > > > > > Howza,
> > > > > > > > >
> > > > > > > > > I started testing a short-lived process implementing a
> > single
> > > > > > > > > request-response operation.  The process structure is as
> > > > follows:
> > > > > > > > >
> > > > > > > > > -Receive Purchase Order
> > > > > > > > > -Do some assignments (schema mappings)
> > > > > > > > > -Invoke CRM system to record the new PO
> > > > > > > > > -Do more assignments (schema mappings)
> > > > > > > > > -Invoke ERP system to record a new work order
> > > > > > > > > -Send back an acknowledgment
> > > > > > > > >
> > > > > > > > > Some deployment notes:
> > > > > > > > > -All WS operations are SOAP/HTTP
> > > > > > > > > -The process is deployed as "in-memory"
> > > > > > > > > -The CRM and ERP systems are mocked as Axis2 services (as
> > dumb
> > > > as
> > > > > > can
> > > > > > > > > be to
> > > > > > > > > avoid bottlenecks)
> > > > > > > > >
> > > > > > > > > After fixing a few minor issues (to handle the load), and
> > fixing
> > > > a
> > > > > > few
> > > > > > > > >
> > > > > > > > > obvious code inefficiencies which gave us roughly a 20%
> > gain, we
> > > > are
> > > > > > > > > now
> > > > > > > > > near-100% CPU utilization.  (I'm testing on my dual-core
> > system)
> > > > > > As
> > > > > > > > > it
> > > > > > > > > stands, Ode clocks about 70 transactions per second.
> > > > > > > > >
> > > > > > > > > Is this good?  I'd say there's room for
> improvement.  Based
> > on
> > > > > > > > > previous work
> > > > > > > > > in the field, I estimate we could get up to 300-400
> > > > > > > > > transactions/second.
> > > > > > > > >
> > > > > > > > > How do we improve this?  Well, looking at the end-to-end
> > > > execution
> > > > > > of
> > > > > > > > > the
> > > > > > > > > process, I counted 4 thread-switches and 4 JTA
> > > > transactions.  Those
> > > > > > > > > are not
> > > > > > > > > really necessary, if you ask me.  I think significant
> > > > improvements
> > > > > > > > > could be
> > > > > > > > > made if we could run this process straight-through,
> meaning
> > in a
> > > > > > > > > single
> > > > > > > > > thread and a single transaction.  (Not to mention it would
> > make
> > > > > > things
> > > > > > > > >
> > > > > > > > > easier to monitor and measure ;)
> > > > > > > > >
> > > > > > > > > Also, to give you an idea, the top 3 areas where we spend
> > most
> > > > of
> > > > > > our
> > > > > > > > > CPU
> > > > > > > > > today are:
> > > > > > > > >
> > > > > > > > > 1) Serialization/deserialization of the Jacob state (I'm
> > > > evaluating
> > > > > > > > > about
> > > > > > > > > 40-50%)
> > > > > > > > > 2) XML marshaling/unmarshaling (About 10-20%)
> > > > > > > > > 3) XML processing:  XPath evaluation + assignments (About
> > > > 10-20%)
> > > > > > > > >
> > > > > > > > > (The rest would be about 20%; I need to load up JProbe or
> > DTrace
> > > > to
> > > > > > > > > provide
> > > > > > > > > more accurate measurements.  My current estimates are a
> mix
> > of
> > > > > > > > > non-scientific statistical sampling of thread dumps and a
> > quick
> > > > run
> > > > > > > > > with the
> > > > > > > > > JVM's built-in profiler)
> > > > > > > > >
> > > > > > > > > So my general question is...  how do we get started on the
> > > > single
> > > > > > > > > thread +
> > > > > > > > > single transaction refactoring?    Anybody already gave
> some
> > > > > > thoughts
> > > > > > > > > to
> > > > > > > > > this?  Are there any pending design issues before we
> > start?  How
> > > > do
> > > > > > we
> > > > > > > > > work
> > > > > > > > > on this without disrupting other parts of the system?  Do
> we
> > > > start a
> > > > > > > > > new
> > > > > > > > > branch?
> > > > > > > > >
> > > > > > > > > alex
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
On 6/8/07, Paul Brown <pa...@gmail.com> wrote:
>
> FWIW, working off a single thread shouldn't be that bad.  The Erlang
> VM design uses essentially this concept:
>
> http://www.erlang.se/euc/05/1710OTPupdate.ppt
>
>
+1 to Erlang being a good influence in the language and VM space.

The problem with a single thread [1] in the Java VM is that we still have
too many blocking APIs (JDBC, for example) and no lightweight continuation
support.

There's the consideration of porting Jacob to Scala, using the Actors
library... it's the best hack known to the JVM to get a decent mix of
concurrency, continuations, performance and brain not exploding.   Of
course, that raises the question of what is left in Jacob after that.  A
wrapper for pickler combinators?

alex

[1] Actually, 1 thread/cpu + 1

Re: Ode Performance: Round I

Posted by Paul Brown <pa...@gmail.com>.
On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
> That strikes me addressing the issue at the wrong level in the
> code---if we wants things to happen in one thread, then the engine
> should just do them in one thread, i.e. not call scheduler until it
> has given up on the thread. Introducing a new concept (work queue)
> that is shared between the engine and integration layer would be
> confusing... its bad enough that the IL uses the scheduler, which it
> really should not.

FWIW, working off a single thread shouldn't be that bad.  The Erlang
VM design uses essentially this concept:

http://www.erlang.se/euc/05/1710OTPupdate.ppt

(Threads for IO, thread for work.)

-- Paul

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
Ok, I'll measure the performance improvement of single-thread versus
multiple-thread, and then we'll see if it's worth it.

BTW, the serialization issue is already fixed (see
http://issues.apache.org/jira/browse/ODE-144).  Okay, it's a bit of a hack
but the payoff is there (2X throughput).  We can revisit the issue once the
BART work has been completed to see if the hack is still needed, or how we
can cleanly avoid the serialization.

alex

On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
>
> BART---nifty. So to that point,I think workqueue is not worth it. As a
> temporary measure it would add complexity and confusion (certain) and
> provide negligable performance improvement (since we are not really
> addressing the key issue of avoiding serialization). To address the
> serialization bit would make it as complicated as BART.
>
> On 6/8/07, Alex Boisvert <bo...@intalio.com> wrote:
> > I agree the proposal for BLOCKING, ASYNC, RELIABLE, TRANSACTIONAL (BART)
> is
> > a better way to go, and will make the WorkQueue idea irrelevant.    I
> was
> > looking for an incremental change that could be applied to the trunk,
> while
> > we're waiting for the BART branch to materialize and stabilize.
> >
> > Regardless, I think allowing the thread pool to be shared with the IL is
> a
> > good thing.  It means less threads in the system and better resilience
> to
> > load.  So for the ASYNC case, I hope the ILs can use the shared thread
> pool
> > whenever possible.
> >
> > alex
> >
> >
> > On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
> > >
> > > That strikes me addressing the issue at the wrong level in the
> > > code---if we wants things to happen in one thread, then the engine
> > > should just do them in one thread, i.e. not call scheduler until it
> > > has given up on the thread. Introducing a new concept (work queue)
> > > that is shared between the engine and integration layer would be
> > > confusing... its bad enough that the IL uses the scheduler, which it
> > > really should not.
> > >
> > > -mbs
> > >
> > > On 6/8/07, Alex Boisvert <bo...@intalio.com> wrote:
> > > > As a first step, I was thinking of allowing the composition of work
> that
> > > is
> > > > currently done in several unrelated threads into a single thread, by
> > > > introducing a WorkQueue
> > > >
> > > > Right now we have code in the engine, such as
> > > > org.apache.ode.axis2.ExternalService.invoke() -> afterCompletion()
> that
> > > uses
> > > > ExecutorService.submit(...) and I'd like to convert this into
> > > > WorkQueue.submit().
> > > >
> > > > For example, this means that org.apache.ode.axis2.OdeService would
> first
> > > > execute the transaction around odeMex.invoke() and after commit it
> would
> > > > dequeue and execute any pending items in the WorkQueue.  We would
> also
> > > need
> > > > to do the same in BpelEngineImpl.onScheduledJob() and other similar
> > > engine
> > > > entrypoints.
> > > >
> > > > The outcome of this is that we could execute all the "non-blocking"
> work
> > > > related to an external event in a single thread, if desired.
> Depending
> > > on
> > > > the WorkQueue implementation, we could have pure serial processing,
> > > parallel
> > > > processing (like now), or even a mix in-between (e.g. limiting
> > > concurrent
> > > > processing to N threads for a given instance).   This would allow
> for
> > > > optimizing response time or throughput based on the engine policy,
> or if
> > > we
> > > > want to get sophisticated, by process model.
> > > >
> > > > I think this change is relatively straightforward that it could
> happen
> > > in
> > > > the trunk without disrupting it.
> > > >
> > > > Thoughts?
> > > >
> > > > alex
> > >
> >
>

Re: Ode Performance: Round I

Posted by Maciej Szefler <mb...@intalio.com>.
BART---nifty. So to that point, I think the WorkQueue is not worth it. As a
temporary measure it would add complexity and confusion (certain) and
provide negligible performance improvement (since we are not really
addressing the key issue of avoiding serialization). Addressing the
serialization bit would make it as complicated as BART.

On 6/8/07, Alex Boisvert <bo...@intalio.com> wrote:
> I agree the proposal for BLOCKING, ASYNC, RELIABLE, TRANSACTIONAL (BART) is
> a better way to go, and will make the WorkQueue idea irrelevant.    I was
> looking for an incremental change that could be applied to the trunk, while
> we're waiting for the BART branch to materialize and stabilize.
>
> Regardless, I think allowing the thread pool to be shared with the IL is a
> good thing.  It means less threads in the system and better resilience to
> load.  So for the ASYNC case, I hope the ILs can use the shared thread pool
> whenever possible.
>
> alex
>
>
> On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
> >
> > That strikes me addressing the issue at the wrong level in the
> > code---if we wants things to happen in one thread, then the engine
> > should just do them in one thread, i.e. not call scheduler until it
> > has given up on the thread. Introducing a new concept (work queue)
> > that is shared between the engine and integration layer would be
> > confusing... its bad enough that the IL uses the scheduler, which it
> > really should not.
> >
> > -mbs
> >
> > On 6/8/07, Alex Boisvert <bo...@intalio.com> wrote:
> > > As a first step, I was thinking of allowing the composition of work that
> > is
> > > currently done in several unrelated threads into a single thread, by
> > > introducing a WorkQueue
> > >
> > > Right now we have code in the engine, such as
> > > org.apache.ode.axis2.ExternalService.invoke() -> afterCompletion() that
> > uses
> > > ExecutorService.submit(...) and I'd like to convert this into
> > > WorkQueue.submit().
> > >
> > > For example, this means that org.apache.ode.axis2.OdeService would first
> > > execute the transaction around odeMex.invoke() and after commit it would
> > > dequeue and execute any pending items in the WorkQueue.  We would also
> > need
> > > to do the same in BpelEngineImpl.onScheduledJob() and other similar
> > engine
> > > entrypoints.
> > >
> > > The outcome of this is that we could execute all the "non-blocking" work
> > > related to an external event in a single thread, if desired.   Depending
> > on
> > > the WorkQueue implementation, we could have pure serial processing,
> > parallel
> > > processing (like now), or even a mix in-between (e.g. limiting
> > concurrent
> > > processing to N threads for a given instance).   This would allow for
> > > optimizing response time or throughput based on the engine policy, or if
> > we
> > > want to get sophisticated, by process model.
> > >
> > > I think this change is relatively straightforward that it could happen
> > in
> > > the trunk without disrupting it.
> > >
> > > Thoughts?
> > >
> > > alex
> >
>

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
I agree the proposal for BLOCKING, ASYNC, RELIABLE, TRANSACTIONAL (BART) is
a better way to go, and will make the WorkQueue idea irrelevant.    I was
looking for an incremental change that could be applied to the trunk, while
we're waiting for the BART branch to materialize and stabilize.

Regardless, I think allowing the thread pool to be shared with the IL is a
good thing.  It means fewer threads in the system and better resilience to
load.  So for the ASYNC case, I hope the ILs can use the shared thread pool
whenever possible.

alex


On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
>
> That strikes me addressing the issue at the wrong level in the
> code---if we wants things to happen in one thread, then the engine
> should just do them in one thread, i.e. not call scheduler until it
> has given up on the thread. Introducing a new concept (work queue)
> that is shared between the engine and integration layer would be
> confusing... its bad enough that the IL uses the scheduler, which it
> really should not.
>
> -mbs
>
> On 6/8/07, Alex Boisvert <bo...@intalio.com> wrote:
> > As a first step, I was thinking of allowing the composition of work that
> is
> > currently done in several unrelated threads into a single thread, by
> > introducing a WorkQueue
> >
> > Right now we have code in the engine, such as
> > org.apache.ode.axis2.ExternalService.invoke() -> afterCompletion() that
> uses
> > ExecutorService.submit(...) and I'd like to convert this into
> > WorkQueue.submit().
> >
> > For example, this means that org.apache.ode.axis2.OdeService would first
> > execute the transaction around odeMex.invoke() and after commit it would
> > dequeue and execute any pending items in the WorkQueue.  We would also
> need
> > to do the same in BpelEngineImpl.onScheduledJob() and other similar
> engine
> > entrypoints.
> >
> > The outcome of this is that we could execute all the "non-blocking" work
> > related to an external event in a single thread, if desired.   Depending
> on
> > the WorkQueue implementation, we could have pure serial processing,
> parallel
> > processing (like now), or even a mix in-between (e.g. limiting
> concurrent
> > processing to N threads for a given instance).   This would allow for
> > optimizing response time or throughput based on the engine policy, or if
> we
> > want to get sophisticated, by process model.
> >
> > I think this change is relatively straightforward that it could happen
> in
> > the trunk without disrupting it.
> >
> > Thoughts?
> >
> > alex
>

Re: Ode Performance: Round I

Posted by Maciej Szefler <mb...@intalio.com>.
That strikes me as addressing the issue at the wrong level in the
code---if we want things to happen in one thread, then the engine
should just do them in one thread, i.e. not call the scheduler until it
has given up on the thread. Introducing a new concept (work queue)
that is shared between the engine and integration layer would be
confusing... it's bad enough that the IL uses the scheduler, which it
really should not.
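A rough sketch of that run-to-completion idea (hypothetical API, not actual ODE code): the engine keeps executing reductions in the calling thread and involves the scheduler only once a step reports it must block.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical sketch: each step returns true if the engine can keep
// going in this thread, false if it must block (e.g. waiting on an
// ASYNC partner response) and hand the remainder to the scheduler.
class InThreadEngine {
    private final Deque<Supplier<Boolean>> steps = new ArrayDeque<>();

    void enqueue(Supplier<Boolean> step) {
        steps.add(step);
    }

    // Run steps in the calling thread until one gives up the thread;
    // returns how many steps actually executed here.
    int runUntilBlocked() {
        int executed = 0;
        while (!steps.isEmpty()) {
            Supplier<Boolean> step = steps.poll();
            executed++;
            if (!step.get()) {
                break; // give up the thread; defer the rest to the scheduler
            }
        }
        return executed;
    }
}
```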

-mbs

On 6/8/07, Alex Boisvert <bo...@intalio.com> wrote:
> As a first step, I was thinking of allowing the composition of work that is
> currently done in several unrelated threads into a single thread, by
> introducing a WorkQueue
>
> Right now we have code in the engine, such as
> org.apache.ode.axis2.ExternalService.invoke() -> afterCompletion() that uses
> ExecutorService.submit(...) and I'd like to convert this into
> WorkQueue.submit().
>
> For example, this means that org.apache.ode.axis2.OdeService would first
> execute the transaction around odeMex.invoke() and after commit it would
> dequeue and execute any pending items in the WorkQueue.  We would also need
> to do the same in BpelEngineImpl.onScheduledJob() and other similar engine
> entrypoints.
>
> The outcome of this is that we could execute all the "non-blocking" work
> related to an external event in a single thread, if desired.   Depending on
> the WorkQueue implementation, we could have pure serial processing, parallel
> processing (like now), or even a mix in-between (e.g. limiting concurrent
> processing to N threads for a given instance).   This would allow for
> optimizing response time or throughput based on the engine policy, or if we
> want to get sophisticated, by process model.
>
> I think this change is relatively straightforward that it could happen in
> the trunk without disrupting it.
>
> Thoughts?
>
> alex
>
> On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
> >
> > sure..
> >
> >
> > On 6/7/07, Alex Boisvert <bo...@intalio.com> wrote:
> > > Ok, got it.   Do you want to go ahead and create the "straight-through"
> > > branch?
> > >
> > > alex
> > >
> > >
> > > On 6/7/07, Maciej Szefler <mb...@intalio.com> wrote:
> > > >
> > > > If the IL supports ASYNC, then it is used, otherwise BLOCKING would be
> > > > used. We want to keep this, because if the IL does indeed use ASYNC
> > > > style (for example if this is a JMS ESB), then likely we don't have
> > > > much in the way of performance guarantees, i.e. the thread may end up
> > > > being blocked for a day, which would quickly lead to resource
> > > > problems.
> > > >
> > > > -mbs
> > > >
> > > > On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
> > > > > Maciej,
> > > > >
> > > > > I'm unclear about how the engine would choose between BLOCKING and
> > > > ASYNC.
> > > > >
> > > > > I tend to think we need only BLOCKING and the IL deals with the fact
> > > > that it
> > > > > might have to suspend and resume itself if the underlying invocation
> > is
> > > > > async (e.g. JBI).   What's the use-case for ASYNC?
> > > > >
> > > > > alex
> > > > >
> > > > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > > > >
> > > > > > Forwarding on behalf of Maciej (mistakingly replied privately):
> > > > > >
> > > > > >
> > > > > >
> > > >
> > -----------------------------------------------------------------------------------------------------------------
> > > > > >
> > > > > > ah yes. ok, here's my theory on getting the behavior alex wants;
> > this
> > > > > > i think is a fairly concrete way to get the different use cases we
> > > > > > outlined on the white board.
> > > > > >
> > > > > > 1) create the notion of an invocation style: BLOCKING, ASYNC,
> > > > > > RELIABLE, and TRANSACTED.
> > > > > > 2) add MessageExchangeContext.isStyleSupported(PartnerMex, Style)
> > > > method
> > > > > > 3) modify the MessageExchangeContext.invokePartner method to take
> > a
> > > > > > style parameter.
> > > > > >
> > > > > > In BLOCKING style the IL simply does the invoke, right then and
> > there,
> > > > > > blocking the thread. (our axis IL would support this style)
> > > > > >
> > > > > > In ASYNC style, the IL does not block; instead it sends us a
> > > > > > notification when the response is available. (JBI likes this style
> > the
> > > > > > most).
> > > > > >
> > > > > > In RELIABLE, the request would be enrolled in the current TX,
> > response
> > > > > > delievered asynch as above (in a new tx)
> > > > > >
> > > > > > in TRANSACTED, the behavior is like BLOCKING, but the TX context
> > is
> > > > > > propagted with the invocation.
> > > > > >

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
As a first step, I was thinking of allowing work that is currently done
in several unrelated threads to be composed into a single thread by
introducing a WorkQueue.

Right now we have code in the engine, such as
org.apache.ode.axis2.ExternalService.invoke() -> afterCompletion(), that
uses ExecutorService.submit(...), and I'd like to convert these calls to
WorkQueue.submit().

For example, this means that org.apache.ode.axis2.OdeService would first
execute the transaction around odeMex.invoke() and after commit it would
dequeue and execute any pending items in the WorkQueue.  We would also need
to do the same in BpelEngineImpl.onScheduledJob() and other similar engine
entrypoints.

The outcome is that we could execute all the "non-blocking" work related
to an external event in a single thread, if desired.   Depending on the
WorkQueue implementation, we could have pure serial processing, parallel
processing (like now), or something in between (e.g. limiting concurrent
processing to N threads for a given instance).   This would let us
optimize for response time or throughput based on engine policy or, if
we want to get sophisticated, per process model.
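To make the idea concrete, here is a minimal sketch of what such a
WorkQueue might look like. The class name and the submit()/drain()
methods are hypothetical (not existing ODE APIs); the sketch shows the
pure-serial case, where work submitted during a transaction is buffered
and then run in the caller's thread after commit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of the proposed WorkQueue: work submitted while a
 * transaction is in flight is buffered, then drained in the caller's
 * thread after commit.  This shows pure serial processing; a variant
 * backed by a bounded executor could cap concurrency at N threads.
 */
public class WorkQueue {
    private final Deque<Runnable> pending = new ArrayDeque<>();

    /** Buffer work instead of handing it to an ExecutorService. */
    public synchronized void submit(Runnable work) {
        pending.add(work);
    }

    /**
     * Run all pending items in the current thread (called by the engine
     * entrypoint after the transaction commits).  Work submitted while
     * draining is picked up in the same pass.
     */
    public void drain() {
        Runnable next;
        while ((next = poll()) != null) {
            next.run();
        }
    }

    private synchronized Runnable poll() {
        return pending.poll();
    }
}
```

Under this sketch, an entrypoint such as OdeService would run the
transaction around odeMex.invoke(), commit, and then call drain() so all
follow-on work happens in the same thread.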

I think this change is straightforward enough that it could happen in
trunk without disrupting it.

Thoughts?

alex

On 6/8/07, Maciej Szefler <mb...@intalio.com> wrote:
>
> sure..
>

Re: Ode Performance: Round I

Posted by Maciej Szefler <mb...@intalio.com>.
sure..


On 6/7/07, Alex Boisvert <bo...@intalio.com> wrote:
> Ok, got it.   Do you want to go ahead and create the "straight-through"
> branch?
>
> alex
>
>
> On 6/7/07, Maciej Szefler <mb...@intalio.com> wrote:
> >
> > If the IL supports ASYNC, then it is used, otherwise BLOCKING would be
> > used. We want to keep this, because if the IL does indeed use ASYNC
> > style (for example if this is a JMS ESB), then likely we don't have
> > much in the way of performance guarantees, i.e. the thread may end up
> > being blocked for a day, which would quickly lead to resource
> > problems.
> >
> > -mbs
> >
> > On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
> > > Maciej,
> > >
> > > I'm unclear about how the engine would choose between BLOCKING and
> > ASYNC.
> > >
> > > I tend to think we need only BLOCKING and the IL deals with the fact
> > that it
> > > might have to suspend and resume itself if the underlying invocation is
> > > async (e.g. JBI).   What's the use-case for ASYNC?
> > >
> > > alex
> > >
> > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > >
> > > > Forwarding on behalf of Maciej (mistakingly replied privately):
> > > >
> > > >
> > > >
> > -----------------------------------------------------------------------------------------------------------------
> > > >
> > > > ah yes. ok, here's my theory on getting the behavior alex wants; this
> > > > i think is a fairly concrete way to get the different use cases we
> > > > outlined on the white board.
> > > >
> > > > 1) create the notion of an invocation style: BLOCKING, ASYNC,
> > > > RELIABLE, and TRANSACTED.
> > > > 2) add MessageExchangeContext.isStyleSupported(PartnerMex, Style)
> > method
> > > > 3) modify the MessageExchangeContext.invokePartner method to take a
> > > > style parameter.
> > > >
> > > > In BLOCKING style the IL simply does the invoke, right then and there,
> > > > blocking the thread. (our axis IL would support this style)
> > > >
> > > > In ASYNC style, the IL does not block; instead it sends us a
> > > > notification when the response is available. (JBI likes this style the
> > > > most).
> > > >
> > > > In RELIABLE, the request would be enrolled in the current TX, response
> > > > delievered asynch as above (in a new tx)
> > > >
> > > > in TRANSACTED, the behavior is like BLOCKING, but the TX context is
> > > > propagted with the invocation.
> > > >
> > > > The engine would try to use the best style given the circumstances.
> > > > For example, for in-mem processes it would prefer to use the
> > > > TRANSACTED style and it could do it "in-line", i.e. as part of the
> > > > <invoke> or right after it runs out of reductions.  If the style is
> > > > not supported it could 'downgrade' to the BLOCKING style, which would
> > > > work in the same way. If BLOCKING were not supported, then ASYNC would
> > > > be the last resort, but this would force us to serialize.
> > > >
> > > > For persisted processes, we'd prefer RELIABLE in general, TRANSACTED
> > > > when inside an atomic scope, otherwise either BLOCKING or ASYNC.
> > > > However, here use of BLOCKING or ASYNC would result in additional
> > > > transactions since we'd need to persist the fact that the invocation
> > > > was made. Unless of course the operation is marked as "idempotent" in
> > > > which case we could use the BLOCKING call without a checkpoint.
> > > >
> > > > How does that sound?
> > > > -mbs
> > > >
> > > >
> > > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > > >
> > > > > Actually for in-memory processes, it would save us all reads and
> > writes
> > > > > (we should never read or write it in that case). And for persistent
> > > > > processes, then it will save a lot of reads (which are still
> > expensive
> > > > > because of deserialization).
> > > > >
> > > > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > > > >
> > > > > > Two things:
> > > > > >
> > > > > > 1. We should also consider caching the Jacob state. Instead of
> > always
> > > > > > serializing / writing and reading / deserializing, caching those
> > > > states
> > > > > > could save us a lot of reads.
> > > > > >
> > > > > > 2. Cutting down the transaction count is a significant refactoring
> > so
> > > > I
> > > > > > would start a new branch for that (maybe ODE 2.0?). And we're
> > going to
> > > > > > need a lot of tests to chase regressions :)
> > > > > >
> > > > > > I think 1 could go without a branch. It's not trivial but I don't
> > > > think
> > > > > > it would take more than a couple of weeks (I would have to get
> > deeper
> > > > into
> > > > > > the code to give a better evaluation).
> > > > > >
> > > > > > On 6/6/07, Alex Boisvert < boisvert@intalio.com> wrote:
> > > > > > >
> > > > > > > Howza,
> > > > > > >
> > > > > > > I started testing a short-lived process implementing a single
> > > > > > > request-response operation.  The process structure is as
> > follows:
> > > > > > >
> > > > > > > -Receive Purchase Order
> > > > > > > -Do some assignments (schema mappings)
> > > > > > > -Invoke CRM system to record the new PO
> > > > > > > -Do more assignments (schema mappings)
> > > > > > > -Invoke ERP system to record a new work order
> > > > > > > -Send back an acknowledgment
> > > > > > >
> > > > > > > Some deployment notes:
> > > > > > > -All WS operations are SOAP/HTTP
> > > > > > > -The process is deployed as "in-memory"
> > > > > > > -The CRM and ERP systems are mocked as Axis2 services (as dumb
> > as
> > > > can
> > > > > > > be to
> > > > > > > avoid bottlenecks)
> > > > > > >
> > > > > > > After fixing a few minor issues (to handle the load), and fixing
> > a
> > > > few
> > > > > > >
> > > > > > > obvious code inefficiencies which gave us roughly a 20% gain, we
> > are
> > > > > > > now
> > > > > > > near-100% CPU utilization.  (I'm testing on my dual-core system)
> > > > As
> > > > > > > it
> > > > > > > stands, Ode clocks about 70 transactions per second.
> > > > > > >
> > > > > > > Is this good?  I'd say there's room for improvement.  Based on
> > > > > > > previous work
> > > > > > > in the field, I estimate we could get up to 300-400
> > > > > > > transactions/second.
> > > > > > >
> > > > > > > How do we improve this?  Well, looking at the end-to-end
> > execution
> > > > of
> > > > > > > the
> > > > > > > process, I counted 4 thread-switches and 4 JTA
> > transactions.  Those
> > > > > > > are not
> > > > > > > really necessary, if you ask me.  I think significant
> > improvements
> > > > > > > could be
> > > > > > > made if we could run this process straight-through, meaning in a
> > > > > > > single
> > > > > > > thread and a single transaction.  (Not to mention it would make
> > > > things
> > > > > > >
> > > > > > > easier to monitor and measure ;)
> > > > > > >
> > > > > > > Also, to give you an idea, the top 3 areas where we spend most
> > of
> > > > our
> > > > > > > CPU
> > > > > > > today are:
> > > > > > >
> > > > > > > 1) Serialization/deserialization of the Jacob state (I'm
> > evaluating
> > > > > > > about
> > > > > > > 40-50%)
> > > > > > > 2) XML marshaling/unmarshaling (About 10-20%)
> > > > > > > 3) XML processing:  XPath evaluation + assignments (About
> > 10-20%)
> > > > > > >
> > > > > > > (The rest would be about 20%; I need to load up JProbe or DTrace
> > to
> > > > > > > provide
> > > > > > > more accurate measurements.  My current estimates are a mix of
> > > > > > > non-scientific statistical sampling of thread dumps and a quick
> > run
> > > > > > > with the
> > > > > > > JVM's built-in profiler)
> > > > > > >
> > > > > > > So my general question is...  how do we get started on the
> > single
> > > > > > > thread +
> > > > > > > single transaction refactoring?    Anybody already gave some
> > > > thoughts
> > > > > > > to
> > > > > > > this?  Are there any pending design issues before we start?  How
> > do
> > > > we
> > > > > > > work
> > > > > > > on this without disrupting other parts of the system?  Do we
> > start a
> > > > > > > new
> > > > > > > branch?
> > > > > > >
> > > > > > > alex
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
Ok, got it.   Do you want to go ahead and create the "straight-through"
branch?

alex


On 6/7/07, Maciej Szefler <mb...@intalio.com> wrote:
>
> If the IL supports ASYNC, then it is used, otherwise BLOCKING would be
> used. We want to keep this, because if the IL does indeed use ASYNC
> style (for example if this is a JMS ESB), then likely we don't have
> much in the way of performance guarantees, i.e. the thread may end up
> being blocked for a day, which would quickly lead to resource
> problems.
>
> -mbs
>
> On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
> > Maciej,
> >
> > I'm unclear about how the engine would choose between BLOCKING and
> ASYNC.
> >
> > I tend to think we need only BLOCKING and the IL deals with the fact
> that it
> > might have to suspend and resume itself if the underlying invocation is
> > async (e.g. JBI).   What's the use-case for ASYNC?
> >
> > alex
> >
> > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > >
> > > Forwarding on behalf of Maciej (mistakenly replied privately):
> > >
> > >
> > >
> -----------------------------------------------------------------------------------------------------------------
> > >
> > > ah yes. ok, here's my theory on getting the behavior alex wants; this
> > > i think is a fairly concrete way to get the different use cases we
> > > outlined on the white board.
> > >
> > > 1) create the notion of an invocation style: BLOCKING, ASYNC,
> > > RELIABLE, and TRANSACTED.
> > > 2) add MessageExchangeContext.isStyleSupported(PartnerMex, Style)
> method
> > > 3) modify the MessageExchangeContext.invokePartner method to take a
> > > style parameter.
> > >
> > > In BLOCKING style the IL simply does the invoke, right then and there,
> > > blocking the thread. (our axis IL would support this style)
> > >
> > > In ASYNC style, the IL does not block; instead it sends us a
> > > notification when the response is available. (JBI likes this style the
> > > most).
> > >
> > > In RELIABLE, the request would be enrolled in the current TX, response
> > > delivered async as above (in a new tx)
> > >
> > > In TRANSACTED, the behavior is like BLOCKING, but the TX context is
> > > propagated with the invocation.
> > >
> > > The engine would try to use the best style given the circumstances.
> > > For example, for in-mem processes it would prefer to use the
> > > TRANSACTED style and it could do it "in-line", i.e. as part of the
> > > <invoke> or right after it runs out of reductions.  If the style is
> > > not supported it could 'downgrade' to the BLOCKING style, which would
> > > work in the same way. If BLOCKING were not supported, then ASYNC would
> > > be the last resort, but this would force us to serialize.
> > >
> > > For persisted processes, we'd prefer RELIABLE in general, TRANSACTED
> > > when inside an atomic scope, otherwise either BLOCKING or ASYNC.
> > > However, here use of BLOCKING or ASYNC would result in additional
> > > transactions since we'd need to persist the fact that the invocation
> > > was made. Unless of course the operation is marked as "idempotent" in
> > > which case we could use the BLOCKING call without a checkpoint.
> > >
> > > How does that sound?
> > > -mbs
> > >
> > >
> > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > >
> > > > Actually for in-memory processes, it would save us all reads and
> writes
> > > > (we should never read or write it in that case). And for persistent
> > > > processes, then it will save a lot of reads (which are still
> expensive
> > > > because of deserialization).
> > > >
> > > > On 6/6/07, Matthieu Riou <ma...@gmail.com> wrote:
> > > > >
> > > > > Two things:
> > > > >
> > > > > 1. We should also consider caching the Jacob state. Instead of
> always
> > > > > serializing / writing and reading / deserializing, caching those
> > > states
> > > > > could save us a lot of reads.
> > > > >
> > > > > 2. Cutting down the transaction count is a significant refactoring
> so
> > > I
> > > > > would start a new branch for that (maybe ODE 2.0?). And we're
> going to
> > > > > need a lot of tests to chase regressions :)
> > > > >
> > > > > I think 1 could go without a branch. It's not trivial but I don't
> > > think
> > > > > it would take more than a couple of weeks (I would have to get
> deeper
> > > into
> > > > > the code to give a better evaluation).
> > > > >
> > > > > On 6/6/07, Alex Boisvert < boisvert@intalio.com> wrote:
> > > > > >
> > > > > > Howza,
> > > > > >
> > > > > > I started testing a short-lived process implementing a single
> > > > > > request-response operation.  The process structure is as
> follows:
> > > > > >
> > > > > > -Receive Purchase Order
> > > > > > -Do some assignments (schema mappings)
> > > > > > -Invoke CRM system to record the new PO
> > > > > > -Do more assignments (schema mappings)
> > > > > > -Invoke ERP system to record a new work order
> > > > > > -Send back an acknowledgment
> > > > > >
> > > > > > Some deployment notes:
> > > > > > -All WS operations are SOAP/HTTP
> > > > > > -The process is deployed as "in-memory"
> > > > > > -The CRM and ERP systems are mocked as Axis2 services (as dumb
> as
> > > can
> > > > > > be to
> > > > > > avoid bottlenecks)
> > > > > >
> > > > > > After fixing a few minor issues (to handle the load), and fixing
> a
> > > few
> > > > > >
> > > > > > obvious code inefficiencies which gave us roughly a 20% gain, we
> are
> > > > > > now
> > > > > > near-100% CPU utilization.  (I'm testing on my dual-core system)
> > > As
> > > > > > it
> > > > > > stands, Ode clocks about 70 transactions per second.
> > > > > >
> > > > > > Is this good?  I'd say there's room for improvement.  Based on
> > > > > > previous work
> > > > > > in the field, I estimate we could get up to 300-400
> > > > > > transactions/second.
> > > > > >
> > > > > > How do we improve this?  Well, looking at the end-to-end
> execution
> > > of
> > > > > > the
> > > > > > process, I counted 4 thread-switches and 4 JTA
> transactions.  Those
> > > > > > are not
> > > > > > really necessary, if you ask me.  I think significant
> improvements
> > > > > > could be
> > > > > > made if we could run this process straight-through, meaning in a
> > > > > > single
> > > > > > thread and a single transaction.  (Not to mention it would make
> > > things
> > > > > >
> > > > > > easier to monitor and measure ;)
> > > > > >
> > > > > > Also, to give you an idea, the top 3 areas where we spend most
> of
> > > our
> > > > > > CPU
> > > > > > today are:
> > > > > >
> > > > > > 1) Serialization/deserialization of the Jacob state (I'm
> evaluating
> > > > > > about
> > > > > > 40-50%)
> > > > > > 2) XML marshaling/unmarshaling (About 10-20%)
> > > > > > 3) XML processing:  XPath evaluation + assignments (About
> 10-20%)
> > > > > >
> > > > > > (The rest would be about 20%; I need to load up JProbe or DTrace
> to
> > > > > > provide
> > > > > > more accurate measurements.  My current estimates are a mix of
> > > > > > non-scientific statistical sampling of thread dumps and a quick
> run
> > > > > > with the
> > > > > > JVM's built-in profiler)
> > > > > >
> > > > > > So my general question is...  how do we get started on the
> single
> > > > > > thread +
> > > > > > single transaction refactoring?    Anybody already gave some
> > > thoughts
> > > > > > to
> > > > > > this?  Are there any pending design issues before we start?  How
> do
> > > we
> > > > > > work
> > > > > > on this without disrupting other parts of the system?  Do we
> start a
> > > > > > new
> > > > > > branch?
> > > > > >
> > > > > > alex
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Ode Performance: Round I

Posted by Maciej Szefler <mb...@intalio.com>.
If the IL supports ASYNC, then it is used, otherwise BLOCKING would be
used. We want to keep this, because if the IL does indeed use ASYNC
style (for example if this is a JMS ESB), then likely we don't have
much in the way of performance guarantees, i.e. the thread may end up
being blocked for a day, which would quickly lead to resource
problems.

-mbs
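[A rough sketch of how this style selection and fallback could look.
The enum and the selection helper are hypothetical and the real
MessageExchangeContext signatures may differ; the preference order is
one reading of this thread (in-memory: TRANSACTED, then BLOCKING, then
ASYNC as last resort; persistent: RELIABLE first, preferring ASYNC over
BLOCKING so a slow partner cannot tie up a thread for a day).]

```java
import java.util.EnumSet;
import java.util.Set;

/**
 * Hypothetical sketch of invocation-style selection as discussed in
 * this thread.  The engine would query the integration layer (IL) via
 * something like MessageExchangeContext.isStyleSupported(...) and pick
 * the best supported style.
 */
public class StyleSelector {

    public enum InvocationStyle { BLOCKING, ASYNC, RELIABLE, TRANSACTED }

    /** Pick the best style the IL supports for this process type. */
    public static InvocationStyle choose(boolean inMemory,
                                         Set<InvocationStyle> supported) {
        // Preference order depends on whether the process is persisted.
        InvocationStyle[] prefs = inMemory
                ? new InvocationStyle[] { InvocationStyle.TRANSACTED,
                                          InvocationStyle.BLOCKING,
                                          InvocationStyle.ASYNC }
                : new InvocationStyle[] { InvocationStyle.RELIABLE,
                                          InvocationStyle.TRANSACTED,
                                          InvocationStyle.ASYNC,
                                          InvocationStyle.BLOCKING };
        for (InvocationStyle s : prefs) {
            if (supported.contains(s)) {
                return s;
            }
        }
        throw new IllegalStateException("IL supports no known style");
    }
}
```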

On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
> Maciej,
>
> I'm unclear about how the engine would choose between BLOCKING and ASYNC.
>
> I tend to think we need only BLOCKING and the IL deals with the fact that it
> might have to suspend and resume itself if the underlying invocation is
> async (e.g. JBI).   What's the use-case for ASYNC?
>
> alex

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
Maciej,

I'm unclear about how the engine would choose between BLOCKING and ASYNC.

I tend to think we need only BLOCKING and the IL deals with the fact that it
might have to suspend and resume itself if the underlying invocation is
async (e.g. JBI).   What's the use-case for ASYNC?

alex


Re: Ode Performance: Round I

Posted by Matthieu Riou <ma...@gmail.com>.
Forwarding on behalf of Maciej (mistakenly replied privately):

-----------------------------------------------------------------------------------------------------------------

ah yes. ok, here's my theory on getting the behavior alex wants; this
i think is a fairly concrete way to get the different use cases we
outlined on the white board.

1) create the notion of an invocation style: BLOCKING, ASYNC,
RELIABLE, and TRANSACTED.
2) add MessageExchangeContext.isStyleSupported(PartnerMex, Style) method
3) modify the MessageExchangeContext.invokePartner method to take a
style parameter.

In BLOCKING style the IL simply does the invoke, right then and there,
blocking the thread. (Our Axis2 IL would support this style.)

In ASYNC style, the IL does not block; instead it sends us a
notification when the response is available. (JBI likes this style the
most.)

In RELIABLE, the request would be enlisted in the current TX, with the
response delivered asynchronously as above (in a new TX).

In TRANSACTED, the behavior is like BLOCKING, but the TX context is
propagated with the invocation.

The engine would try to use the best style given the circumstances.
For example, for in-mem processes it would prefer to use the
TRANSACTED style and it could do it "in-line", i.e. as part of the
<invoke> or right after it runs out of reductions.  If the style is
not supported it could 'downgrade' to the BLOCKING style, which would
work in the same way. If BLOCKING were not supported, then ASYNC would
be the last resort, but this would force us to serialize.

For persisted processes, we'd prefer RELIABLE in general, TRANSACTED
when inside an atomic scope, otherwise either BLOCKING or ASYNC.
However, here use of BLOCKING or ASYNC would result in additional
transactions since we'd need to persist the fact that the invocation
was made. Unless of course the operation is marked as "idempotent" in
which case we could use the BLOCKING call without a checkpoint.
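
To make the negotiation concrete, a rough sketch in Java of how the engine
might pick a style for an in-memory process, downgrading in the order above.
The enum values come from this mail; the selection method and its signature
are simplified assumptions, not Ode's actual MessageExchangeContext API.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum InvocationStyle { TRANSACTED, RELIABLE, BLOCKING, ASYNC }

class StyleChooser {
    /** Preference order for an in-memory process: TRANSACTED first,
     *  then BLOCKING, then ASYNC as the last resort (which forces
     *  us to serialize the instance state). */
    static final List<InvocationStyle> IN_MEM_PREFERENCE = List.of(
            InvocationStyle.TRANSACTED,
            InvocationStyle.BLOCKING,
            InvocationStyle.ASYNC);

    /** Downgrade through the preference list until the IL supports a style
     *  (what isStyleSupported would report per partner mex). */
    static InvocationStyle choose(Set<InvocationStyle> supportedByIl) {
        for (InvocationStyle s : IN_MEM_PREFERENCE) {
            if (supportedByIl.contains(s)) return s;
        }
        throw new IllegalStateException("IL supports no usable invocation style");
    }
}
```

A persisted process would use a different preference list (RELIABLE first,
TRANSACTED inside an atomic scope), but the downgrade mechanics are the same.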

How does that sound?
-mbs



Re: Ode Performance: Round I

Posted by Matthieu Riou <ma...@gmail.com>.
Actually for in-memory processes, it would save us all reads and writes (we
should never read or write it in that case). And for persistent processes,
it will save a lot of reads (which are still expensive because of
deserialization).


Re: Ode Performance: Round I

Posted by Matthieu Riou <ma...@gmail.com>.
Two things:

1. We should also consider caching the Jacob state. Instead of always
serializing / writing and reading / deserializing, caching those states
could save us a lot of reads.

2. Cutting down the transaction count is a significant refactoring so I
would start a new branch for that (maybe ODE 2.0?). And we're going to need
a lot of tests to chase regressions :)

I think 1 could go without a branch. It's not trivial, but I don't think it
would take more than a couple of weeks (I would have to get deeper into the
code to give a better estimate).
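
For point 1, something like a read-through cache keyed by process-instance id,
so a hot instance skips deserialization entirely. This is only a sketch; the
class and method names are made up, and Ode's actual state classes differ.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

class JacobStateCache<S> {
    private final ConcurrentMap<Long, S> cache = new ConcurrentHashMap<>();

    /** Return the cached state, deserializing only on a miss. */
    S get(long instanceId, Function<Long, S> deserialize) {
        return cache.computeIfAbsent(instanceId, deserialize);
    }

    /** Drop an entry once the instance completes (or under eviction pressure). */
    void invalidate(long instanceId) {
        cache.remove(instanceId);
    }
}
```

The hard parts are the ones the sketch leaves out: eviction under memory
pressure, and keeping the cache coherent with what a rolled-back transaction
wrote, which is exactly why it's not trivial.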


Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
On 6/6/07, Assaf Arkin <ar...@intalio.com> wrote:
>
> There are certain state storage optimization strategies that will work
> very well for in-memory (where you optimize for CPU), but will not work
> very well for persistence (when you also care about scalability and the
> database).

Agreed.

> Given that in-memory processes don't want to store anything, trying to
> find an optimal strategy for both is pointless; it actually takes away
> from optimizing persistent processes. So I suggest we split the
> discussion.

Ok, I'll get some measurements for persistent processes.

alex

Re: Ode Performance: Round I

Posted by Assaf Arkin <ar...@intalio.com>.
On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
>
> On 6/6/07, Assaf Arkin <ar...@intalio.com> wrote:
> >
> > But exactly why would an in-memory process ever be serialized? That
> should
> > be down to 0%.
>
>
> I think that's pretty much what Matthieu was saying;  there's a quick &
> easy
> optimization to be made here that's not dependent on refactoring
> thread/transaction demarcation.


There are certain state storage optimization strategies that will work very
well for in-memory (where you optimize for CPU), but will not work very well
for persistence (when you also care about scalability and the database).

Given that in-memory processes don't want to store anything, trying to find
an optimal strategy for both is pointless; it actually takes away from
optimizing persistent processes. So I suggest we split the discussion.

For in-memory processes we already know the optimal storage strategy: store
nothing. Caching, transactions, etc are either irrelevant or
inconsequential.

And for persistent processes, we need to start running benchmarks, so we
can talk about the best persistence strategy for those.


> > It's persistent processes where we should find strategies to improve
> > performance/scalability by reducing serialization cost.
>
> Yep, the flyweight pattern <http://c2.com/cgi/wiki?FlyweightPattern> would
> be most efficient here (also suggested by Matthieu in different words)


That only works if you have a finite number of states.

assaf


> alex

Re: Ode Performance: Round I

Posted by Alex Boisvert <bo...@intalio.com>.
On 6/6/07, Assaf Arkin <ar...@intalio.com> wrote:
>
> But exactly why would an in-memory process ever be serialized? That should
> be down to 0%.


I think that's pretty much what Matthieu was saying;  there's a quick & easy
optimization to be made here that's not dependent on refactoring
thread/transaction demarcation.

> It's persistent processes where we should find strategies to improve
> performance/scalability by reducing serialization cost.


Yep, the flyweight pattern <http://c2.com/cgi/wiki?FlyweightPattern> would
be most efficient here (also suggested by Matthieu in different words)
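
For what it's worth, a minimal sketch of the idea in Java (names made up, not
Ode code): intern immutable, frequently repeated state fragments so equal
values share a single instance, and a single serialized copy. This only pays
off when the number of distinct fragments is small.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class SharedFragment {
    // Pool of canonical instances, keyed by the fragment's content.
    private static final ConcurrentMap<String, SharedFragment> POOL =
            new ConcurrentHashMap<>();

    final String xml;   // immutable payload shared across process instances

    private SharedFragment(String xml) {
        this.xml = xml;
    }

    /** Return the one canonical instance for this content. */
    static SharedFragment intern(String xml) {
        return POOL.computeIfAbsent(xml, SharedFragment::new);
    }
}
```

With interning in place, persisting an instance could write a short reference
to the shared fragment instead of re-serializing the fragment each time.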

alex

Re: Ode Performance: Round I

Posted by Assaf Arkin <ar...@intalio.com>.
On 6/6/07, Alex Boisvert <bo...@intalio.com> wrote:
>
> Howza,
>
> I started testing a short-lived process implementing a single
> request-response operation.  The process structure is as follows:
>
> -Receive Purchase Order
> -Do some assignments (schema mappings)
> -Invoke CRM system to record the new PO
> -Do more assignments (schema mappings)
> -Invoke ERP system to record a new work order
> -Send back an acknowledgment
>
> Some deployment notes:
> -All WS operations are SOAP/HTTP
> -The process is deployed as "in-memory"
> -The CRM and ERP systems are mocked as Axis2 services (as dumb as can be
> to
> avoid bottlenecks)


I missed that on first read. There are obvious interesting questions like
caching instances and reducing the number of unnecessary transactions.

But exactly why would an in-memory process ever be serialized? That should
be down to 0%.

It's persistent processes where we should find strategies to improve
performance/scalability by reducing serialization cost.

Assaf


