Posted to common-dev@hadoop.apache.org by "Goel, Ankur" <an...@corp.aol.com> on 2008/09/08 09:51:47 UTC

Multithreaded reduce

Hi Folks,

I have a setup where I use a thread-pool implementation (from the
java.util.concurrent package) in the reduce phase to do database I/O and
speed up the database upload. The DB here is MySQL. I decided to go for
additional parallelism via threads because:

1. It considerably speeds up the upload while consuming fewer cluster
resources (i.e. fewer reducers are required).

2. The upload speed is limited not by the reduce task capacity of the
cluster but by the DB's ability to handle its maximum number of
simultaneous connections effectively.

 

Each reduce task has 2 thread pools. One does the DB I/O, and each of its
tasks returns a java.util.concurrent.FutureTask. The other pool fetches the
result from this future task and does the disk I/O, i.e.
outputCollector.collect(...).

When multiple threads from the second pool try to do disk I/O, I get
an AlreadyBeingCreatedException in the logs. If I set the second pool to
have only 1 thread, things work fine!

It looks like the output collector was never meant to be used from
multiple threads.
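
A minimal sketch of the arrangement described above, in the variant reported
to work: many threads do the DB inserts, but a single thread performs every
collect call. RecordSink and TwoStageUpload are hypothetical names standing in
for Hadoop's OutputCollector and the reducer code, and the DB insert is
simulated by building a string; none of this is taken from the original job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for Hadoop's OutputCollector, so the sketch runs without Hadoop.
interface RecordSink {
    void collect(String key, String value) throws Exception;
}

class TwoStageUpload {
    // Runs the DB work on a pool of dbThreads and funnels every result
    // through ONE collector thread, so sink.collect is never called concurrently.
    static void run(List<String> keys, int dbThreads, RecordSink sink) throws Exception {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService collectorPool = Executors.newSingleThreadExecutor();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String key : keys) {
                // Stage 1: the (simulated) DB insert, executed in parallel.
                Future<String> dbResult = dbPool.submit(() -> "inserted:" + key);
                // Stage 2: wait for that result and write it out, single-threaded.
                pending.add(collectorPool.submit(() -> {
                    sink.collect(key, dbResult.get());
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // propagates any failure from either stage
            }
        } finally {
            dbPool.shutdown();
            collectorPool.shutdown();
        }
    }
}
```

Because the collector pool has exactly one thread, results are written in
submission order and the sink needs no locking of its own. The trade-off is
that output throughput is capped at what one thread can write.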

 

Any thoughts on this?

 

Thanks

-Ankur

 


Re: Multithreaded reduce

Posted by Alejandro Abdelnur <tu...@gmail.com>.
AFAIK there is no multithreaded reducer runner.

You have to make sure that you create each output collector only once,
with no race condition in its creation.
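
One way to get "created only once, with no race" in current Java is sketched
below. It is an illustration under assumptions, not a description of Hadoop
internals: FakeWriter and WriterRegistry are hypothetical names, and a real
job would open its output file where the sketch constructs a FakeWriter. It
relies on ConcurrentHashMap.computeIfAbsent, which was added in Java 8 and so
was not available when this thread was written.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-file writer; a real job would open an output stream here.
class FakeWriter {
    final String path;
    FakeWriter(String path) { this.path = path; }
}

class WriterRegistry {
    private final ConcurrentHashMap<String, FakeWriter> writers = new ConcurrentHashMap<>();
    // Counts how many times the factory actually ran (for demonstration only).
    final AtomicInteger creations = new AtomicInteger();

    // ConcurrentHashMap.computeIfAbsent applies the factory at most once per key,
    // even when many threads ask for the same path at the same time.
    FakeWriter writerFor(String path) {
        return writers.computeIfAbsent(path, p -> {
            creations.incrementAndGet();
            return new FakeWriter(p);
        });
    }
}
```

Every thread that calls writerFor with the same path receives the same
instance, so the underlying file would be created by exactly one of them.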

A

On Tue, Sep 9, 2008 at 3:23 PM, Goel, Ankur <an...@corp.aol.com> wrote:
> Folks,
>      My implementation is a bit different. I am not using a multithreaded
> reduce runner. Instead, I am using thread pools to do DB and HDFS I/O from
> each of my reduce tasks. To give you an example from my setup, I have 3
> reduce tasks, each with a DB thread pool of 70 threads. This keeps me at
> roughly 200 threads hitting the DB, doing inserts into multiple tables.
>
> Set up MySQL with a large configuration, and this really makes the inserts
> go at breakneck speed.
>
> Now each of the threads returns a result that I want to collect on HDFS,
> so I tried collecting the result via outputCollector from these threads,
> which gave me the same exception. I also tried synchronizing on the
> outputCollector, which did not help.
>
> So then I decided to use a separate thread pool in each reduce task for
> doing output collection via outputCollector. When this pool was set to
> have only 1 thread, the exception did not occur. Setting it to 5 threads
> or more caused the exception to show up.
>
> I'll post the stack trace after reproducing the problem.
>
> Thanks
> -Ankur
>
> -----Original Message-----
> From: Alejandro Abdelnur [mailto:tucu00@gmail.com]
> Sent: Tuesday, September 09, 2008 9:15 AM
> To: core-dev@hadoop.apache.org
> Subject: Re: Multithreaded reduce
>
> Collectors are already properly synchronized. Maybe there is a race
> condition in the way the multithreaded reducer runner creates them.
>
> A
>
> On Tue, Sep 9, 2008 at 8:56 AM, Owen O'Malley <om...@apache.org>
> wrote:
>>
>> On Sep 8, 2008, at 4:12 AM, Goel, Ankur wrote:
>>
>>> They do not seem to work when used in the reduce phase.
>>> I can post the stack trace if required.
>>
>> I believe it. I don't think I've ever seen anyone do a multi-threaded
>> reduce. Of course the answer is easy, just add synchronization around the
>> output collector before calling collect.
>>
>> -- Owen
>>
>

RE: Multithreaded reduce

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Folks,
      My implementation is a bit different. I am not using a multithreaded
reduce runner. Instead, I am using thread pools to do DB and HDFS I/O from
each of my reduce tasks. To give you an example from my setup, I have 3
reduce tasks, each with a DB thread pool of 70 threads. This keeps me at
roughly 200 threads hitting the DB, doing inserts into multiple tables.
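
The sizing above can be sketched as a small calculation. With the quoted
figures, 3 tasks of 70 threads each come to 210 threads, slightly over a
200-connection budget, so a helper that derives the per-task pool size from
the budget may be safer. DbPoolSizing and its method names are hypothetical,
and the 200-connection cap is only the number mentioned in the message, not a
MySQL default.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class DbPoolSizing {
    // Largest pool size per reduce task such that
    // reduceTasks * size never exceeds the DB connection budget.
    static int threadsPerTask(int maxDbConnections, int reduceTasks) {
        if (reduceTasks <= 0 || maxDbConnections < reduceTasks) {
            throw new IllegalArgumentException("need at least one connection per reduce task");
        }
        return maxDbConnections / reduceTasks;
    }

    // Builds the fixed-size DB pool for one reduce task.
    static ExecutorService newDbPool(int maxDbConnections, int reduceTasks) {
        return Executors.newFixedThreadPool(threadsPerTask(maxDbConnections, reduceTasks));
    }
}
```

For a budget of 200 connections and 3 reduce tasks this yields 66 threads per
task, 198 in total.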

Set up MySQL with a large configuration, and this really makes the inserts
go at breakneck speed.

Now each of the threads returns a result that I want to collect on HDFS,
so I tried collecting the result via outputCollector from these threads,
which gave me the same exception. I also tried synchronizing on the
outputCollector, which did not help.

So then I decided to use a separate thread pool in each reduce task for
doing output collection via outputCollector. When this pool was set to
have only 1 thread, the exception did not occur. Setting it to 5 threads
or more caused the exception to show up.

I'll post the stack trace after reproducing the problem.

Thanks
-Ankur

-----Original Message-----
From: Alejandro Abdelnur [mailto:tucu00@gmail.com] 
Sent: Tuesday, September 09, 2008 9:15 AM
To: core-dev@hadoop.apache.org
Subject: Re: Multithreaded reduce

Collectors are already properly synchronized. Maybe there is a race
condition in the way the multithreaded reducer runner creates them.

A

On Tue, Sep 9, 2008 at 8:56 AM, Owen O'Malley <om...@apache.org>
wrote:
>
> On Sep 8, 2008, at 4:12 AM, Goel, Ankur wrote:
>
>> They do not seem to work when used in the reduce phase.
>> I can post the stack trace if required.
>
> I believe it. I don't think I've ever seen anyone do a multi-threaded
> reduce. Of course the answer is easy, just add synchronization around the
> output collector before calling collect.
>
> -- Owen
>

Re: Multithreaded reduce

Posted by Alejandro Abdelnur <tu...@gmail.com>.
Collectors are already properly synchronized. Maybe there is a race
condition in the way the multithreaded reducer runner creates them.

A

On Tue, Sep 9, 2008 at 8:56 AM, Owen O'Malley <om...@apache.org> wrote:
>
> On Sep 8, 2008, at 4:12 AM, Goel, Ankur wrote:
>
>> They do not seem to work when used in the reduce phase.
>> I can post the stack trace if required.
>
> I believe it. I don't think I've ever seen anyone do a multi-threaded
> reduce. Of course the answer is easy, just add synchronization around the
> output collector before calling collect.
>
> -- Owen
>

Re: Multithreaded reduce

Posted by Owen O'Malley <om...@apache.org>.
On Sep 8, 2008, at 4:12 AM, Goel, Ankur wrote:

> They do not seem to work when used in the reduce phase.
> I can post the stack trace if required.

I believe it. I don't think I've ever seen anyone do a multi-threaded  
reduce. Of course the answer is easy, just add synchronization around  
the output collector before calling collect.
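
A minimal sketch of what "synchronization around the output collector" could
look like, assuming a wrapper that every worker thread shares. Collector and
SynchronizedCollector are hypothetical names; Collector only mimics the shape
of Hadoop's OutputCollector so that the example compiles on its own. Note that
elsewhere in this thread Ankur reports that synchronizing did not cure the
exception in his setup, so this may address only part of the problem.

```java
// Hypothetical stand-in for Hadoop's OutputCollector interface.
interface Collector<K, V> {
    void collect(K key, V value) throws java.io.IOException;
}

// Wraps any collector so that only one thread at a time is inside collect().
class SynchronizedCollector<K, V> implements Collector<K, V> {
    private final Collector<K, V> delegate;
    private final Object lock = new Object();

    SynchronizedCollector(Collector<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void collect(K key, V value) throws java.io.IOException {
        // All threads must share this one wrapper instance for the lock to help.
        synchronized (lock) {
            delegate.collect(key, value);
        }
    }
}
```

The lock serializes calls to collect but does nothing about how the
underlying collector or its output file is created in the first place.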

-- Owen

RE: Multithreaded reduce

Posted by "Goel, Ankur" <an...@corp.aol.com>.
They do not seem to work when used in the reduce phase.
I can post the stack trace if required.


-----Original Message-----
From: Alejandro Abdelnur [mailto:tucu00@gmail.com] 
Sent: Monday, September 08, 2008 3:49 PM
To: core-dev@hadoop.apache.org
Subject: Re: Multithreaded reduce

OutputCollectors work fine when multithreaded; look at the
MultithreadedMapRunner.


On Mon, Sep 8, 2008 at 1:21 PM, Goel, Ankur <an...@corp.aol.com>
wrote:
> Hi Folks,
>
> I have a setup where I use a thread-pool implementation (from the
> java.util.concurrent package) in the reduce phase to do database I/O and
> speed up the database upload. The DB here is MySQL. I decided to go for
> additional parallelism via threads because:
>
> 1. It considerably speeds up the upload while consuming fewer cluster
> resources (i.e. fewer reducers are required).
>
> 2. The upload speed is limited not by the reduce task capacity of the
> cluster but by the DB's ability to handle its maximum number of
> simultaneous connections effectively.
>
>
>
> Each reduce task has 2 thread pools. One does the DB I/O, and each of its
> tasks returns a java.util.concurrent.FutureTask. The other pool fetches the
> result from this future task and does the disk I/O, i.e.
> outputCollector.collect(...).
>
> When multiple threads from the second pool try to do disk I/O, I get
> an AlreadyBeingCreatedException in the logs. If I set the second pool to
> have only 1 thread, things work fine!
>
> It looks like the output collector was never meant to be used from
> multiple threads.
>
>
>
> Any thoughts on this?
>
>
>
> Thanks
>
> -Ankur
>
>
>
>

Re: Multithreaded reduce

Posted by Alejandro Abdelnur <tu...@gmail.com>.
OutputCollectors work fine when multithreaded; look at the MultithreadedMapRunner.


On Mon, Sep 8, 2008 at 1:21 PM, Goel, Ankur <an...@corp.aol.com> wrote:
> Hi Folks,
>
> I have a setup where I use a thread-pool implementation (from the
> java.util.concurrent package) in the reduce phase to do database I/O and
> speed up the database upload. The DB here is MySQL. I decided to go for
> additional parallelism via threads because:
>
> 1. It considerably speeds up the upload while consuming fewer cluster
> resources (i.e. fewer reducers are required).
>
> 2. The upload speed is limited not by the reduce task capacity of the
> cluster but by the DB's ability to handle its maximum number of
> simultaneous connections effectively.
>
>
>
> Each reduce task has 2 thread pools. One does the DB I/O, and each of its
> tasks returns a java.util.concurrent.FutureTask. The other pool fetches the
> result from this future task and does the disk I/O, i.e.
> outputCollector.collect(...).
>
> When multiple threads from the second pool try to do disk I/O, I get
> an AlreadyBeingCreatedException in the logs. If I set the second pool to
> have only 1 thread, things work fine!
>
> It looks like the output collector was never meant to be used from
> multiple threads.
>
>
>
> Any thoughts on this?
>
>
>
> Thanks
>
> -Ankur
>
>
>
>