You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@commons.apache.org by "David Smiley (@MITRE.org)" <DS...@mitre.org> on 2012/05/01 17:23:28 UTC

Thread-safety of Commons-Codec Encoder

I am reviewing Apache Lucene's use of Commons-Codec phonetic Encoders for
thread-safety.  Unfortunately, there is barely any mention about
thread-safety in the javadocs.  I think that's a real problem.  I noticed
this old JIRA issue which touches on some of the related aspects of this: 
https://issues.apache.org/jira/browse/CODEC-55

I think the only safe thing for me to do is assume these Encoders are not
thread-safe.

~ David Smiley

-----
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tp4600956.html
Sent from the Commons - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org


Re: Thread-safety of Commons-Codec Encoder

Posted by sebb <se...@gmail.com>.
On 3 May 2012 19:30, Julius Davies <ju...@gmail.com> wrote:
>> So if we agree that the Encoders that exist are thread-safe (as we've defined it) and that they should be in general, then I plan to treat them that way in Lucene.
>
>
> I guess you're saying you'll do this:
>
> Pattern #1:  save some memory, lose some thread safety
> ---
> public class MyClass {
>  private static Encoder E = new Encoder();
>
>  public void myMethod() {
>    // need to encode or decode for some reason:
>    E.encode(stuff);
>  }
> }
> ---
>
> Instead of this?
>
>
> Pattern #2:  lose some memory (temporarily to hold encode table), gain
> some thread safety:
> ---
> public class MyClass {
>  public void myMethod() {
>    // need to encode or decode for some reason:
>    new Encoder().encode(stuff);
>  }
> }
> ---
>
>
> Base64 definitely requires pattern #2, but the others appear to be
> less complex, and might be okay.

Base64 has been made thread-safe in trunk (it was not thread-safe in 1.6).

> Interesting dilemma.   As an (inactive) codec developer, I prefer
> pattern #2 because it means less work for us, but, at the same time,
> there is no way to force developers to go with pattern #2 (even if we
> documented developers might not read it) and so coding defensively
> over here in case developers pick approach #1 could result in less
> work for us in the long run (fewer weird race condition bugs filed in
> Jira?).
>

> Mostly talking to myself here.
>
>
> --
> yours,
>
> Julius Davies
> 604-222-3310 (Home)
>
> $ sudo apt-get install cowsay
> $ echo "Moo." | cowsay | cowsay -n | cowsay -n
> http://juliusdavies.ca/cowsay/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
> For additional commands, e-mail: user-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org


Re: Thread-safety of Commons-Codec Encoder

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
On May 3, 2012, at 2:30 PM, Julius Davies [via Apache Commons] wrote:

> > So if we agree that the Encoders that exist are thread-safe (as we've defined it) and that they should be in general, then I plan to treat them that way in Lucene. 
> 
> 
> I guess you're saying you'll do this: 
> 
> Pattern #1:  save some memory, lose some thread safety 
> --- 
> public class MyClass { 
>   private static Encoder E = new Encoder(); 
> 
>   public void myMethod() { 
>     // need to encode or decode for some reason: 
>     E.encode(stuff); 
>   } 
> } 
> --- 

Sure.  There's the possibility of a static block configuring the encoder too.  Of course, this same pattern could be realized without using 'static', but static is a concise way of demonstrating a happens-before relationshipo in example code.  Another would be a volatile instance variable that is only set after the encoder is configured.  And I assume myMethod() is going to be invoked concurrently on the same MyClass instance by multiple threads.

> 
> Instead of this? 
> 
> 
> Pattern #2:  lose some memory (temporarily to hold encode table), gain 
> some thread safety: 
> --- 
> public class MyClass { 
>   public void myMethod() { 
>     // need to encode or decode for some reason: 
>     new Encoder().encode(stuff); 
>   } 
> } 
> --- 
> 

Right; not that.

> 
> Base64 definitely requires pattern #2, but the others appear to be 
> less complex, and might be okay. 

I've seen people say that Encoders should be made to be threadsafe and documented as such.  I've not heard a dissenting opinion to this proposal.  So if Base64 can't work in pattern #1, I suggest this be rectified.  All the phonetic encoders appear to be thread-safe already.

> Interesting dilemma.   As an (inactive) codec developer, I prefer 
> pattern #2 because it means less work for us, but, at the same time, 
> there is no way to force developers to go with pattern #2 (even if we 
> documented developers might not read it) and so coding defensively 
> over here in case developers pick approach #1 could result in less 
> work for us in the long run (fewer weird race condition bugs filed in 
> Jira?). 
> 
> 
> Mostly talking to myself here. 
> 

~ David

-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tp4600956p4606799.html
Sent from the Commons - User mailing list archive at Nabble.com.

Re: Thread-safety of Commons-Codec Encoder

Posted by Julius Davies <ju...@gmail.com>.
> So if we agree that the Encoders that exist are thread-safe (as we've defined it) and that they should be in general, then I plan to treat them that way in Lucene.


I guess you're saying you'll do this:

Pattern #1:  save some memory, lose some thread safety
---
public class MyClass {
  private static Encoder E = new Encoder();

  public void myMethod() {
    // need to encode or decode for some reason:
    E.encode(stuff);
  }
}
---

Instead of this?


Pattern #2:  lose some memory (temporarily to hold encode table), gain
some thread safety:
---
public class MyClass {
  public void myMethod() {
    // need to encode or decode for some reason:
    new Encoder().encode(stuff);
  }
}
---


Base64 definitely requires pattern #2, but the others appear to be
less complex, and might be okay.

Interesting dilemma.   As an (inactive) codec developer, I prefer
pattern #2 because it means less work for us, but, at the same time,
there is no way to force developers to go with pattern #2 (even if we
documented developers might not read it) and so coding defensively
over here in case developers pick approach #1 could result in less
work for us in the long run (fewer weird race condition bugs filed in
Jira?).


Mostly talking to myself here.



-- 
yours,

Julius Davies
604-222-3310 (Home)

$ sudo apt-get install cowsay
$ echo "Moo." | cowsay | cowsay -n | cowsay -n
http://juliusdavies.ca/cowsay/

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org


Re: Thread-safety of Commons-Codec Encoder

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
On May 2, 2012, at 3:43 PM, sebb-2-2 [via Apache Commons] wrote:

On 2 May 2012 20:16, David Smiley (@MITRE.org) <[hidden email]<x-msg://487/user/SendEmail.jtp?type=node&node=4604253&i=0>> wrote:
> The "kind of thread-safety" I would expect and want is for Encoder.encode()
> to be thread-safe.  I should be able to instantiate and configure an
> Encoder, then share it for multiple concurrent threads to use to call
> encode() as many times as they please at the same time.  I will assume it is
> up to me (the client/user of the API) to arrange for a happens-before
> relationship between instantiation/configuration and the encode() call, by
> using any one of a variety of means of doing so.  This is kind of a given
> but I want to be clear.

That is necessary for safe publication, but not sufficient if the
class changes the values of any instance variables as part of the
encode() processing.

Yes, obviously an encode() method that changes state is not thread-safe.  I'm establishing that encode() should be thread-safe and then elaborating on other matters aside from the thread-safety of encode(), namely Encoder configuration.

In the case of Metaphone#setLength(), the variable is not volatile, so
is not inherently thread-safe.
Provided that you set the length once initially, and used suitable
means to publish the instance to the different threads, it would then
be safe to use from multiple threads, so long as setLength() was not
called again. So you would need to create separate instances for each
length.

I'm not sure it would make sense to share an instance between multiple
threads and then change the length in one thread, so the fact that the
length variable is not volatile really only affects how the instance
needs to be shared initially.

It makes no sense to me either, to try and make initialization/configuration methods of class thread-safe (speaking generally here, not just for Encoder), and setLength and setMaxCodeLength are such methods.

So I think you can say that the Metaphone class is conditionally thread-safe.

Awesome.  Judging from Thomas's table in this thread every Encoder is already thread-safe by our definition of it.  I looked at BeiderMorseEncoder & PhoneticEngine and I think it's thread-safe.  I saw many cases where methods return new altered copies of objects instead of modifying internal object state.

> I haven't looked at any of the source behind the Encoders, but one area that
> may be problematic is any lazy-instantiation/initialization of state to
> instance variables in a non-thread-safe manner by the encode() method.  If
> Encoder had an init() method as part of its api, it would be easy to avoid
> this problem but alas, it doesn't.

A public init() method is but one way to do this, and would not help
where there are setters.
[Well, I suppose the init() commmand could "freeze" any mutable values]

Yes, that's what I implied with such a name -- it might freeze it or not and and users are expected to not change the state following initialization which I think of as a given in the absence of documentation otherwise.

So if we agree that the Encoders that exist are thread-safe (as we've defined it) and that they should be in general, then I plan to treat them that way in Lucene.

~ David


-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tp4600956p4604487.html
Sent from the Commons - User mailing list archive at Nabble.com.

Re: Thread-safety of Commons-Codec Encoder

Posted by sebb <se...@gmail.com>.
On 2 May 2012 20:16, David Smiley (@MITRE.org) <DS...@mitre.org> wrote:
> The "kind of thread-safety" I would expect and want is for Encoder.encode()
> to be thread-safe.  I should be able to instantiate and configure an
> Encoder, then share it for multiple concurrent threads to use to call
> encode() as many times as they please at the same time.  I will assume it is
> up to me (the client/user of the API) to arrange for a happens-before
> relationship between instantiation/configuration and the encode() call, by
> using any one of a variety of means of doing so.  This is kind of a given
> but I want to be clear.

That is necessary for safe publication, but not sufficient if the
class changes the values of any instance variables as part of the
encode() processing.

In the case of Metaphone#setLength(), the variable is not volatile, so
is not inherently thread-safe.
Provided that you set the length once initially, and used suitable
means to publish the instance to the different threads, it would then
be safe to use from multiple threads, so long as setLength() was not
called again. So you would need to create separate instances for each
length.

I'm not sure it would make sense to share an instance between multiple
threads and then change the length in one thread, so the fact that the
length variable is not volatile really only affects how the instance
needs to be shared initially.

So I think you can say that the Metaphone class is conditionally thread-safe.

> I haven't looked at any of the source behind the Encoders, but one area that
> may be problematic is any lazy-instantiation/initialization of state to
> instance variables in a non-thread-safe manner by the encode() method.  If
> Encoder had an init() method as part of its api, it would be easy to avoid
> this problem but alas, it doesn't.

A public init() method is but one way to do this, and would not help
where there are setters.
[Well, I suppose the init() commmand could "freeze" any mutable values]

> FYI a good book on this tricky stuff is "Java Concurrency In Practice" by
> Brian Goetz.

Yes, I've read it.

> ~ David
>
> -----
>  Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
> --
> View this message in context: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tp4600956p4604182.html
> Sent from the Commons - User mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
> For additional commands, e-mail: user-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org


Re: Thread-safety of Commons-Codec Encoder

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
The "kind of thread-safety" I would expect and want is for Encoder.encode()
to be thread-safe.  I should be able to instantiate and configure an
Encoder, then share it for multiple concurrent threads to use to call
encode() as many times as they please at the same time.  I will assume it is
up to me (the client/user of the API) to arrange for a happens-before
relationship between instantiation/configuration and the encode() call, by
using any one of a variety of means of doing so.  This is kind of a given
but I want to be clear.

I haven't looked at any of the source behind the Encoders, but one area that
may be problematic is any lazy-instantiation/initialization of state to
instance variables in a non-thread-safe manner by the encode() method.  If
Encoder had an init() method as part of its api, it would be easy to avoid
this problem but alas, it doesn't. 

FYI a good book on this tricky stuff is "Java Concurrency In Practice" by
Brian Goetz.

~ David

-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tp4600956p4604182.html
Sent from the Commons - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org


Re: Thread-safety of Commons-Codec Encoder

Posted by sebb <se...@gmail.com>.
On 2 May 2012 19:18, Thomas Neidhart <th...@gmail.com> wrote:
> On 05/01/2012 05:23 PM, David Smiley (@MITRE.org) wrote:
>> I am reviewing Apache Lucene's use of Commons-Codec phonetic Encoders for
>> thread-safety.  Unfortunately, there is barely any mention about
>> thread-safety in the javadocs.  I think that's a real problem.  I noticed
>> this old JIRA issue which touches on some of the related aspects of this:
>> https://issues.apache.org/jira/browse/CODEC-55
>>
>> I think the only safe thing for me to do is assume these Encoders are not
>> thread-safe.

Yes, that is the safest option.

What kind of thread-safety are you looking for?

To be able to use the same instances concurrently in different threads?
To be able to use different instances concurrently in different threads?
To have a pool of instances used singly by multiple threads?

> Hi David,
>
> unluckily not all of the encoders are thread-safe atm:
>
>  codec           | thread-safe | comment
>  -----------------------------------
>  Metaphone       |     N       | code-length may be changed
>  DoubleMetaphone |     N       | code-length may be changed
>  Nysiis          |     Y       |
>  Soundex         |     N       | maxLength may be changed
>  RefinedSoundex  |     Y       |
>  ColognePhonetic |     Y       |
>  Caverphone1     |     Y       |
>  Caverphone2     |     Y       |
>  BeiderMorseEnc /|     Y       | not 100% sure, but no members are
>  PhoneticEngine |             | changed when encoding and no setters
>                                 exist
>
> We should add this information to the javadoc and make all codecs
> thread-safe imho.

Agreed; this really needs to be done for (almost?) all Commons components.

> Best regards,
>
> Thomas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
> For additional commands, e-mail: user-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org


Re: Thread-safety of Commons-Codec Encoder

Posted by Thomas Neidhart <th...@gmail.com>.
On 05/01/2012 05:23 PM, David Smiley (@MITRE.org) wrote:
> I am reviewing Apache Lucene's use of Commons-Codec phonetic Encoders for
> thread-safety.  Unfortunately, there is barely any mention about
> thread-safety in the javadocs.  I think that's a real problem.  I noticed
> this old JIRA issue which touches on some of the related aspects of this: 
> https://issues.apache.org/jira/browse/CODEC-55
> 
> I think the only safe thing for me to do is assume these Encoders are not
> thread-safe.

Hi David,

unluckily not all of the encoders are thread-safe atm:

 codec           | thread-safe | comment
 -----------------------------------
 Metaphone       |     N       | code-length may be changed
 DoubleMetaphone |     N       | code-length may be changed
 Nysiis          |     Y       |
 Soundex         |     N       | maxLength may be changed
 RefinedSoundex  |     Y       |
 ColognePhonetic |     Y       |
 Caverphone1     |     Y       |
 Caverphone2     |     Y       |
 BeiderMorseEnc /|     Y       | not 100% sure, but no members are
  PhoneticEngine |             | changed when encoding and no setters
                                 exist

We should add this information to the javadoc and make all codecs
thread-safe imho.

Best regards,

Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org