Posted to users@daffodil.apache.org by Patrick Grandjean <p....@gmail.com> on 2020/08/05 18:26:52 UTC

Caching, thread safety, optimizations

Hi,

I am looking to optimize applications that use Apache Daffodil and would
like to know which classes or functions are thread-safe, reusable, can be
cached in a singleton, etc. For instance, I believe that
ScalaXMLInfosetOutputter is reusable since it has a reset() function. Here
is a list of classes/functions/instances I am currently using:
- Daffodil.compiler()
- ProcessorFactory
- ProcessorFactory.onPath(String)
- DataProcessor
- ScalaXMLInfosetOutputter

I would like to avoid having to instantiate each class at every call.
Otherwise, what are the common optimizations that can be done when using
Apache Daffodil's Java/Scala API?

Patrick.

Re: Caching, thread safety, optimizations

Posted by Patrick Grandjean <p....@gmail.com>.
Thank you both!

Re: Caching, thread safety, optimizations

Posted by "Beckerle, Mike" <mb...@tresys.com> on Aug 6, 2020, 10:13 AM.
I would go further than Steve on this.

Only one thing in Daffodil is thread-safe, and that is by design: given a DataProcessor object, one may call its parse and unparse methods from multiple threads.

These are thread-safe because all the shared state of DataProcessor (the compiled schema) is read-only, and all structures allocated by a parse/unparse call are not shared at all, so they are private to the one thread running that call.
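
A minimal sketch of that property in plain Java, for illustration only. CompiledSchema and ToyProcessor are hypothetical stand-ins, not Daffodil classes: the shared object holds only immutable data, and each parse call allocates its own scratch state, so several threads can use one instance without locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

// Hypothetical stand-in for a compiled schema: immutable after construction.
final class CompiledSchema {
    private final String delimiter;
    CompiledSchema(String delimiter) { this.delimiter = delimiter; }
    String delimiter() { return delimiter; }
}

// Hypothetical stand-in for a processor whose only shared state is read-only.
final class ToyProcessor {
    private final CompiledSchema schema; // shared between threads, never mutated

    ToyProcessor(CompiledSchema schema) { this.schema = schema; }

    // Everything allocated here is local to the call, hence to the calling thread.
    List<String> parse(String data) {
        List<String> fields = new ArrayList<>(); // plain, non-thread-safe collection
        for (String f : data.split(Pattern.quote(schema.delimiter()))) {
            fields.add(f.trim());
        }
        return fields;
    }

    // Runs parse() on the same processor instance from a pool of threads.
    static List<List<String>> parseAll(ToyProcessor p, List<String> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String in : inputs) {
                futures.add(pool.submit(() -> p.parse(in)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get()); // results come back in input order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of the sketch is that parse() needs no synchronization, because nothing it writes to is visible to another thread.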

btw: There is currently one known thread-safety bug in Daffodil:
https://issues.apache.org/jira/browse/DAFFODIL-2216

Everywhere in Daffodil, developers are expected to avoid state or, where state is required, to use local state and *not* protect it from multi-thread access, because only one thread should ever be accessing it. Code is expected to use the faster, lower-overhead, non-thread-safe collection classes rather than worry about state sharing, and we look for this in code review.

The Daffodil compiler has a single global synchronized method lock. So I believe you can't compile schemas in parallel unless you run more than one JVM instance to do it. Compilation is serialized on purpose so that we don't have to worry about use of singleton objects.
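
To illustrate what a single global lock implies for callers, here is a small sketch in plain Java. ToyCompiler is a hypothetical stand-in and says nothing about how Daffodil's compiler is actually implemented. It uses a static synchronized method and counts how many threads are ever inside it at the same time.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a compiler whose entry point holds one global lock.
final class ToyCompiler {
    private static final AtomicInteger inside = new AtomicInteger();
    private static final AtomicInteger maxInside = new AtomicInteger();

    // A static synchronized method locks on ToyCompiler.class, so calls run one
    // at a time within a JVM, whichever thread makes them.
    static synchronized String compile(String schemaText) {
        int now = inside.incrementAndGet();
        maxInside.accumulateAndGet(now, Math::max);
        try {
            return schemaText.trim().toUpperCase(); // placeholder for real work
        } finally {
            inside.decrementAndGet();
        }
    }

    // Calls compile() from several threads and reports the highest number of
    // threads ever observed inside it simultaneously.
    static int maxConcurrentCompiles(int threads) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final String text = "schema" + i;
            workers[i] = new Thread(() -> compile(text));
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return maxInside.get();
    }
}
```

Under a lock like this, spreading compile calls over more threads does not shorten the total compile time, which is one more reason to compile each schema once and reuse the result.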


Re: Caching, thread safety, optimizations

Posted by Steve Lawrence <sl...@apache.org> on Thursday, August 6, 2020 9:04 AM.
I'm not 100% sure if the Compiler and ProcessorFactory are thread-safe.
We fix issues as they come up and try our best, but I'm not sure we
guarantee thread-safety. For example, there are definitely known issues
if you use the set*() functions. The newer with*() functions were added
to deal with these potential issues and should be used instead.
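
As a general illustration of why a with*() style tends to be safer than set*(), here is a sketch in plain Java. ToySettings and its fields are hypothetical and are not Daffodil's API. It also assumes that a with*() function returns a modified copy rather than changing the object it is called on.

```java
// Hypothetical settings object; the names are illustrative, not Daffodil's.
final class ToySettings {
    private final boolean validate;
    private final int maxDepth;

    ToySettings(boolean validate, int maxDepth) {
        this.validate = validate;
        this.maxDepth = maxDepth;
    }

    // with*() style: the receiver is left untouched and a modified copy is
    // returned, so an instance already shared with other threads never
    // changes underneath them.
    ToySettings withValidate(boolean v) { return new ToySettings(v, maxDepth); }
    ToySettings withMaxDepth(int d) { return new ToySettings(validate, d); }

    boolean validate() { return validate; }
    int maxDepth() { return maxDepth; }
}
```

For example, base.withValidate(true).withMaxDepth(3) yields a new object with both changes while base keeps its original values. A set*() method would instead mutate base in place, which is where sharing between threads goes wrong.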

The DataProcessor is thread-safe, and we work hard to make sure it stays
that way, since this is the thing that does most of the work. So every
DataProcessor parse() or unparse() call can definitely be made in
different threads without a problem.

The ScalaXMLInfosetOutputter (like most of the other
InfosetOutputters) is stateful, and so should not be shared among
different threads, but it can be reused by calling the reset()
function. I would recommend one InfosetOutputter per thread, calling
reset() between uses. Or just create a new one each time parse/unparse
is needed--these should be pretty lightweight to allocate.
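
One way to get "one per thread, reset between uses" is a ThreadLocal. The sketch below is plain Java. ToyOutputter is a hypothetical stateful class; only the reset() name is taken from this thread, and the other methods are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stateful outputter; only reset() mirrors a name from the thread.
final class ToyOutputter {
    private final List<String> events = new ArrayList<>(); // mutable, not thread-safe

    void emit(String event) { events.add(event); }
    List<String> result() { return new ArrayList<>(events); }
    void reset() { events.clear(); }
}

final class OutputterPerThread {
    // Each thread lazily gets its own instance; instances are never shared.
    private static final ThreadLocal<ToyOutputter> LOCAL =
        ThreadLocal.withInitial(ToyOutputter::new);

    // Reuses the calling thread's outputter, clearing leftover state first.
    static List<String> run(String... events) {
        ToyOutputter out = LOCAL.get();
        out.reset();
        for (String e : events) {
            out.emit(e);
        }
        return out.result();
    }
}
```

Calling reset() before each use, rather than after, means a previous call that failed part-way cannot leak state into the next one.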

In general, I would recommend a workflow of creating a unique
Compiler/ProcessorFactory/DataProcessor for each unique schema that you
want to parse/unparse data with. Once you have the DataProcessor, throw
away the Compiler/ProcessorFactory and cache and reuse that
DataProcessor anytime you need to parse/unparse data using that schema.
And then create/reset the InfosetOutputter as mentioned above.
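
The caching part of this workflow could be sketched as follows in plain Java. ProcessorCache is a hypothetical helper, not part of Daffodil. The type parameter P stands in for a thread-safe processor type, and the compile function is whatever expensive step produces one from a schema key (assumed here to be a String such as a file path).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Caches one processor per schema key. P stands in for a thread-safe processor type.
final class ProcessorCache<P> {
    private final ConcurrentMap<String, P> cache = new ConcurrentHashMap<>();
    private final Function<String, P> compile; // the expensive step, supplied by the caller
    private final AtomicInteger compiles = new AtomicInteger();

    ProcessorCache(Function<String, P> compile) {
        this.compile = compile;
    }

    // computeIfAbsent runs the compile step at most once per key, even when
    // several threads ask for the same schema at the same time.
    P get(String schemaKey) {
        return cache.computeIfAbsent(schemaKey, k -> {
            compiles.incrementAndGet();
            return compile.apply(k);
        });
    }

    // Number of times the compile step actually ran.
    int compileCount() {
        return compiles.get();
    }
}
```

With this shape, the intermediate compiler and factory objects live only inside the compile function and become garbage once it returns. Only the processor is retained.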

- Steve
