Posted to hdfs-user@hadoop.apache.org by Pedro Magalhaes <pe...@gmail.com> on 2015/08/23 01:38:27 UTC

MultithreadedMapper - Sharing Data Structure

I am developing a job whose input path has 30B records (File A).
I need to filter these records using another file that can have 30K to 180M
records (File B).
So for each record in File A, I make a lookup in File B.
I am using the distributed cache to share File B. The problem is that when
File B is very large (for example 180M records), I spend too much CPU time
loading it into a HashMap, and this load is repeated for every map task.

In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
MultithreadedMapper, making the HashMap thread-safe, and sharing this
read-only structure across the mappers.

Is this a good approach?
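
For illustration, here is a minimal sketch of what I have in mind (the class
name, the cache file name "fileB.txt", and the tab-separated record layout
are just placeholders). The lookup map is a static field, so under
MultithreadedMapper all threads of one map task would share the single copy
built in setup():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Static, so the threads created by MultithreadedMapper inside one map
  // task JVM all see the same copy; it is built only once per task.
  private static volatile Map<String, String> lookup;

  @Override
  protected void setup(Context context) throws IOException {
    synchronized (LookupMapper.class) {
      if (lookup == null) {
        Map<String, String> m = new HashMap<String, String>();
        // "fileB.txt" is the symlink name of the cache file added with
        // job.addCacheFile(...) in the driver (placeholder name).
        BufferedReader reader = new BufferedReader(new FileReader("fileB.txt"));
        try {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            m.put(parts[0], parts.length > 1 ? parts[1] : "");
          }
        } finally {
          reader.close();
        }
        lookup = Collections.unmodifiableMap(m);
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Keep a record from File A only if its key exists in File B.
    String recordKey = value.toString().split("\t", 2)[0];
    if (lookup.containsKey(recordKey)) {
      context.write(new Text(recordKey), value);
    }
  }
}

// Driver side (sketch):
//   job.setMapperClass(MultithreadedMapper.class);
//   MultithreadedMapper.setMapperClass(job, LookupMapper.class);
//   MultithreadedMapper.setNumberOfThreads(job, 8);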

Re: MultithreadedMapper - Sharing Data Structure

Posted by Harsh J <ha...@cloudera.com>.
Perhaps combining the MultithreadedMapper with a CombineFileInputFormat may
help (it reduces the total # of map tasks, but you get more threads per map
task).
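
As a rough sketch of that wiring with the new mapreduce API (the split size,
the thread count, and the hypothetical LookupMapper from the earlier sketch
are placeholder choices, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter-A-by-B");
    job.setJarByClass(FilterJobDriver.class);

    // Pack many input blocks into fewer, larger splits -> fewer map tasks.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // ~512 MB per combined split

    // Run several map() threads inside each of those larger tasks, all
    // sharing whatever the wrapped mapper loads once per task.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, LookupMapper.class); // the real mapper
    MultithreadedMapper.setNumberOfThreads(job, 8);

    job.setNumReduceTasks(0); // map-only filter
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // File A
    job.addCacheFile(new Path(args[1]).toUri());             // File B, via distributed cache
    FileOutputFormat.setOutputPath(job, new Path(args[2]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}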

On Mon, Aug 24, 2015 at 2:16 PM twinkle sachdeva <tw...@gmail.com>
wrote:

> Hi,
>
> We have been using the JVM reuse feature for the same reason of sharing
> the same structure across multiple map tasks. A multithreaded map task does
> that partially, as within its multiple threads the same copy is used.
>
>
> Depending upon the hardware availability, one can get the same performance.
>
> Thanks,
>
>
> On Mon, Aug 24, 2015 at 1:37 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> The MultiThreadedMapper won't solve your problem, as all it does is run
>> parallel maps within the same map task JVM as a non-MT one. Your data
>> structure won't be shared across the different map task JVMs on the host,
>> but just within the map task's own multiple threads running the map()
>> function over input records.
>>
>> Wouldn't doing reduce-side join for larger files be much faster?
>>
>> On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pe...@gmail.com>
>> wrote:
>>
>>> I am developing a job whose input path has 30B records (File A).
>>> I need to filter these records using another file that can have 30K to
>>> 180M records (File B).
>>> So for each record in File A, I make a lookup in File B.
>>> I am using the distributed cache to share File B. The problem is that when
>>> File B is very large (for example 180M records), I spend too much CPU time
>>> loading it into a HashMap, and this load is repeated for every map task.
>>>
>>> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
>>> MultithreadedMapper, making the HashMap thread-safe, and sharing this
>>> read-only structure across the mappers.
>>>
>>> Is this a good approach?
>>>
>>>
>>>
>>>
>>
>

Re: MultithreadedMapper - Sharing Data Structure

Posted by twinkle sachdeva <tw...@gmail.com>.
Hi,

We have been using the JVM reuse feature for the same reason of sharing the
same structure across multiple map tasks. A multithreaded map task does that
partially, as within its multiple threads the same copy is used.

Depending upon the hardware available, one can get the same performance.

Thanks,


On Mon, Aug 24, 2015 at 1:37 PM, Harsh J <ha...@cloudera.com> wrote:

> The MultiThreadedMapper won't solve your problem, as all it does is run
> parallel maps within the same map task JVM as a non-MT one. Your data
> structure won't be shared across the different map task JVMs on the host,
> but just within the map task's own multiple threads running the map()
> function over input records.
>
> Wouldn't doing reduce-side join for larger files be much faster?
>
> On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pe...@gmail.com>
> wrote:
>
>> I am developing a job whose input path has 30B records (File A).
>> I need to filter these records using another file that can have 30K to
>> 180M records (File B).
>> So for each record in File A, I make a lookup in File B.
>> I am using the distributed cache to share File B. The problem is that when
>> File B is very large (for example 180M records), I spend too much CPU time
>> loading it into a HashMap, and this load is repeated for every map task.
>>
>> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
>> MultithreadedMapper, making the HashMap thread-safe, and sharing this
>> read-only structure across the mappers.
>>
>> Is this a good approach?
>>
>>
>>
>>
>

Re: MultithreadedMapper - Sharing Data Structure

Posted by Harsh J <ha...@cloudera.com>.
The MultithreadedMapper won't solve your problem, as all it does is run
parallel map threads inside the same single map task JVM that a non-MT mapper
would use. Your data structure won't be shared across the different map task
JVMs on the host, but only within a map task's own multiple threads running
the map() function over input records.

Wouldn't doing a reduce-side join for the larger files be much faster?
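
For reference, a bare-bones sketch of such a reduce-side join (the class
names, the "A|"/"B|" tags, and the path-based input detection are
placeholders; with a secondary sort one could also avoid buffering the
File A records per key):

// TaggingMapper.java -- tags each record with which file it came from.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private String tag;

  @Override
  protected void setup(Context context) {
    // Tell File A and File B apart by their input path (placeholder convention).
    String path = ((FileSplit) context.getInputSplit()).getPath().toString();
    tag = path.contains("fileB") ? "B|" : "A|";
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit (joinKey, taggedRecord); the first tab-separated field is the key.
    String joinKey = value.toString().split("\t", 2)[0];
    context.write(new Text(joinKey), new Text(tag + value));
  }
}

// FilterReducer.java -- keeps File A records only if the key appears in File B.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FilterReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> aRecords = new ArrayList<String>();
    boolean keyInFileB = false;
    for (Text value : values) {
      String s = value.toString();
      if (s.startsWith("B|")) {
        keyInFileB = true;
      } else {
        aRecords.add(s.substring(2)); // strip the "A|" tag
      }
    }
    if (keyInFileB) {
      for (String record : aRecords) {
        context.write(key, new Text(record));
      }
    }
  }
}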

On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pe...@gmail.com> wrote:

> I am developing a job whose input path has 30B records (File A).
> I need to filter these records using another file that can have 30K to
> 180M records (File B).
> So for each record in File A, I make a lookup in File B.
> I am using the distributed cache to share File B. The problem is that when
> File B is very large (for example 180M records), I spend too much CPU time
> loading it into a HashMap, and this load is repeated for every map task.
>
> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
> MultithreadedMapper, making the HashMap thread-safe, and sharing this
> read-only structure across the mappers.
>
> Is this a good approach?
>
>
>
>
