You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by Peter Veentjer <al...@gmail.com> on 2008/11/16 22:09:05 UTC

Removing JMM issues

Hi Guys,

I have had a quick look at the source code of Hadoop and it appears
there there are some issues with the JMM. In some places it is done
correctly,  in some places partially and in some places it incorrect.
There also are some design issues with concurrency as well and I think
the Hadoop project could benefit from overall solution instead of just
putting out small fires. So who are the guys to get in touch with?
Together with the Hadoop developers I want to further improve the
quality of this very interesting project.

Some information about me:

my blog
http://pveentjer.wordpress.com

and some of my open source projects:
http://prometheus.codehaus.org
http://multiverse.googlecode.com
http://concurrency-detector.googlecode.com

Re: Removing JMM issues

Posted by Steve Loughran <st...@apache.org>.

Peter Veentjer wrote:
> Hi Ted,
> 
> one of the easy to find problems is spinning on a non volatile
> variable without changing the value in the loop and without additional
> synchronization (at least I didn't find it in a few seconds).
> 
> examples:
> AgentControllerSocketListener.closing
> HttpConnector.stopMe
> IndexUpdateReducer.closed
> SecundaryNodeName.shouldRun
> voTask.hasNext
> 
> These can all be fixed by making the field volatile.
> 
> These are the easy ones that can be found with static analysis tools,
> but I bet there are a lot of more harder to find ones.

One of the problems with concurrency issues is that they are hard to 
test for -it's hard to create tests to show that the problem exists. 
Another is that the main services -namenode, secondary namenode, etc, 
all run (in production) in their own processes, so can get away with 
concurrency risks and static shared code that aren't so appealing in 
shared processes.

> I think the Hadoop project would benefit from a structural approach to
> solving these problems instead of just fixing these bugs. That is what
> I want to help with but I can't do it without support of the
> leading-developers of the Hadoop community.

1. I don't see anyone being against this, though you would have to start 
with education. For example, it took me a bit to work out that you were 
using JMM as an acronym for Java Memory Model.

2. I think we'd need to prioritise where the biggest risks are.

> One of the things we need to agree upon is for example:
> making fields that only are set in the constructor, final. This makes
> analysis a lot easier.

It does, but it also makes subclassing trickier as subclassed instances 
don't get a look in or an opportunity to override the values -even if 
they have methods you can use to evaluate the subclassed values, the 
fact that these are called from the parent's constructor makes them a 
risk all on their own

Re: Removing JMM issues

Posted by Steve Loughran <st...@apache.org>.

Peter Veentjer wrote:
> Ok great,
> 
> I'll make the changes and check them in.

Submit patches, ideally with tests. If Hudson isn't happy, the code 
doesn't get in. Obviously concurrency problems are hard to test -for 
those we'll accept proofs of correctness in the formal notation of your 
choice :)

> 
> But this is just scratching the surface. Atm I'm trying to figure out
> which classes are the core of Hadoop so I can focus on that part
> first. There is so much source (and so little time) ;)

I would look at

* the RPC system
* HDFS

They are the underpinning, and if they have problems, they can be severe.

What may be good is to search for the code for
  -everywhere that uses Sleep(), as it's a sign of spinning on something
  -any/all catches of InterruptedException, especially the places that 
swallow them. It's a bad sign when code discards attempts to interrupt 
the thread, as it usually means some other part of the program wants you 
to stop, not keep looping

Re: Removing JMM issues

Posted by Peter Veentjer <al...@gmail.com>.

Ok great,

I'll make the changes and check them in.

But this is just scratching the surface. Atm I'm trying to figure out
which classes are the core of Hadoop so I can focus on that part
first. There is so much source (and so little time) ;)

On Mon, Nov 17, 2008 at 6:28 PM, Owen O'Malley <om...@apache.org> wrote:
> I've filed HADOOP-4671 with those examples. If you have more, obviously in
> need of volatile, please add them there.
>
> -- Owen
>

Re: Removing JMM issues

Posted by Owen O'Malley <om...@apache.org>.

I've filed HADOOP-4671 with those examples. If you have more, obviously in
need of volatile, please add them there.

-- Owen

Re: Removing JMM issues

Posted by Tom White <to...@gmail.com>.

This would be an interesting project and useful to have. I would
suggest starting by running a static analysis tool like Chord
(http://code.google.com/p/jchord/) and creating a series of Jira
issues to fix problems. Later on, we could integrate it into the
Hudson patch checking process, just like FindBugs is now. Creating a
wiki page describing coding best practices for concurrency might be
useful too.

Cheers,
Tom

On Mon, Nov 17, 2008 at 12:09 PM, Peter Veentjer <al...@gmail.com> wrote:
> Hi Ted,
>
> one of the easy to find problems is spinning on a non volatile
> variable without changing the value in the loop and without additional
> synchronization (at least I didn't find it in a few seconds).
>
> examples:
> AgentControllerSocketListener.closing
> HttpConnector.stopMe
> IndexUpdateReducer.closed
> SecundaryNodeName.shouldRun
> voTask.hasNext
>
> These can all be fixed by making the field volatile.
>
> These are the easy ones that can be found with static analysis tools,
> but I bet there are a lot of more harder to find ones.
>
> I think the Hadoop project would benefit from a structural approach to
> solving these problems instead of just fixing these bugs. That is what
> I want to help with but I can't do it without support of the
> leading-developers of the Hadoop community.
>
> One of the things we need to agree upon is for example:
> making fields that only are set in the constructor, final. This makes
> analysis a lot easier.
>
> On Sun, Nov 16, 2008 at 11:37 PM, Ted Dunning <te...@gmail.com> wrote:
>>
>> Can you give some examples?
>>
>> Sent from my iPhone
>>
>> On Nov 16, 2008, at 13:09, "Peter Veentjer" <al...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> I have had a quick look at the source code of Hadoop and it appears
>>> there there are some issues with the JMM. In some places it is done
>>> correctly,  in some places partially and in some places it incorrect.
>>> There also are some design issues with concurrency as well and I think
>>> the Hadoop project could benefit from overall solution instead of just
>>> putting out small fires. So who are the guys to get in touch with?
>>> Together with the Hadoop developers I want to further improve the
>>> quality of this very interesting project.
>>>
>>> Some information about me:
>>>
>>> my blog
>>> http://pveentjer.wordpress.com
>>>
>>> and some of my open source projects:
>>> http://prometheus.codehaus.org
>>> http://multiverse.googlecode.com
>>> http://concurrency-detector.googlecode.com
>>
>

Re: Removing JMM issues

Posted by Peter Veentjer <al...@gmail.com>.

Hi Ted,

one of the easy to find problems is spinning on a non volatile
variable without changing the value in the loop and without additional
synchronization (at least I didn't find it in a few seconds).

examples:
AgentControllerSocketListener.closing
HttpConnector.stopMe
IndexUpdateReducer.closed
SecundaryNodeName.shouldRun
voTask.hasNext

These can all be fixed by making the field volatile.

These are the easy ones that can be found with static analysis tools,
but I bet there are a lot of more harder to find ones.

I think the Hadoop project would benefit from a structural approach to
solving these problems instead of just fixing these bugs. That is what
I want to help with but I can't do it without support of the
leading-developers of the Hadoop community.

One of the things we need to agree upon is for example:
making fields that only are set in the constructor, final. This makes
analysis a lot easier.

On Sun, Nov 16, 2008 at 11:37 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Can you give some examples?
>
> Sent from my iPhone
>
> On Nov 16, 2008, at 13:09, "Peter Veentjer" <al...@gmail.com> wrote:
>
>> Hi Guys,
>>
>> I have had a quick look at the source code of Hadoop and it appears
>> there there are some issues with the JMM. In some places it is done
>> correctly,  in some places partially and in some places it incorrect.
>> There also are some design issues with concurrency as well and I think
>> the Hadoop project could benefit from overall solution instead of just
>> putting out small fires. So who are the guys to get in touch with?
>> Together with the Hadoop developers I want to further improve the
>> quality of this very interesting project.
>>
>> Some information about me:
>>
>> my blog
>> http://pveentjer.wordpress.com
>>
>> and some of my open source projects:
>> http://prometheus.codehaus.org
>> http://multiverse.googlecode.com
>> http://concurrency-detector.googlecode.com
>

Re: Removing JMM issues

Posted by Ted Dunning <te...@gmail.com>.

Can you give some examples?

Sent from my iPhone

On Nov 16, 2008, at 13:09, "Peter Veentjer" <al...@gmail.com>  
wrote:

> Hi Guys,
>
> I have had a quick look at the source code of Hadoop and it appears
> there there are some issues with the JMM. In some places it is done
> correctly,  in some places partially and in some places it incorrect.
> There also are some design issues with concurrency as well and I think
> the Hadoop project could benefit from overall solution instead of just
> putting out small fires. So who are the guys to get in touch with?
> Together with the Hadoop developers I want to further improve the
> quality of this very interesting project.
>
> Some information about me:
>
> my blog
> http://pveentjer.wordpress.com
>
> and some of my open source projects:
> http://prometheus.codehaus.org
> http://multiverse.googlecode.com
> http://concurrency-detector.googlecode.com

Re: Removing JMM issues

Posted by Owen O'Malley <om...@apache.org>.

On Nov 16, 2008, at 1:09 PM, Peter Veentjer wrote:

> I have had a quick look at the source code of Hadoop and it appears
> there there are some issues with the JMM. In some places it is done
> correctly,  in some places partially and in some places it incorrect.

That is believable. Clearly some of the problems have been fixed, but  
Hadoop is moving fast and new code is being added every week. There  
have been bug reports on concurrency stuff that turned out to be false  
positives. *smile* It is amazing how much testing code gets when you  
run it on 20,000 nodes 24x7 and even rare cases have been hit in  
production.

> There also are some design issues with concurrency as well and I think
> the Hadoop project could benefit from overall solution instead of just
> putting out small fires.

Yeah, we've talked about it for a long time. (See HADOOP-869.) The  
particularly problematic parts of the call graph in the JobTracker are  
when the lower levels call back into the higher layers. We've had to  
be careful to preserve lock order in those cases to avoid deadlock.

> So who are the guys to get in touch with?

This list is exactly the right guys.

> Together with the Hadoop developers I want to further improve the
> quality of this very interesting project.

Jump right in. *Smile*

-- Owen