Posted to common-user@hadoop.apache.org by Jay Vyas <ja...@gmail.com> on 2013/04/30 20:46:04 UTC

partition as block?

Hi guys:

I'm wondering: if I'm running MapReduce jobs on a cluster with large block
sizes, can I increase performance with any of the following:

1) A custom FileInputFormat

2) A custom partitioner

3) -DnumReducers

Clearly, (3) might be an issue, since it could overload tasks and network
traffic... but maybe (1) or (2) would be a precise way to "use" partitions
as a "poor man's" block.

Just a thought - not sure if anyone has tried (1) or (2) before in order to
simulate blocks and increase locality by utilizing the partition API.
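
To make (2) concrete, here's the rough shape of the partitioner I have in
mind - an untested sketch against the new (org.apache.hadoop.mapreduce)
API, with a placeholder class name:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Spreads keys over however many partitions the job is configured
    // with, so a deliberately high reduce count yields many small
    // output files -- the "poor man's block".
    public class FineGrainedPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Mask off the sign bit so the modulo is always non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Run with a high reduce count and each reducer's output file stays small.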

-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: partition as block?

Posted by Jay Vyas <ja...@gmail.com>.
What do you mean by "increasing the size"?  I'm talking more about increasing the number of partitions... which actually decreases individual file size.
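
Concretely, I mean just cranking up the reduce count so each partition file
comes out small - a rough, untested sketch (512 is an arbitrary example
value, and the class name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ManyPartitionsDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "many-partitions");
            // More reduce tasks => more, smaller partition files on output.
            job.setNumReduceTasks(512);
            // ... set mapper, reducer, input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }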

On Apr 30, 2013, at 4:09 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Increasing the size can help us to an extent, but increasing it further might cause problems during copy and shuffle. If the partitions are too big to be held in memory, we'll end up with a disk-based shuffle, which is going to be slower than a RAM-based shuffle, thus delaying the entire reduce phase. Furthermore, the network might get overwhelmed.
> 
> I think keeping it "considerably" high will definitely give you some boost, but it'll require a fair amount of tinkering.
> 
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com

Re: partition as block?

Posted by Mohammad Tariq <do...@gmail.com>.
Increasing the size can help us to an extent, but increasing it further
might cause problems during copy and shuffle. If the partitions are too
big to be held in memory, we'll end up with a disk-based shuffle, which
is going to be slower than a RAM-based shuffle, thus delaying the entire
reduce phase. Furthermore, the network might get overwhelmed.

I think keeping it "considerably" high will definitely give you some
boost, but it'll require a fair amount of tinkering.
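
The knobs I'm thinking of are the shuffle buffers - roughly like this
(Hadoop 1.x property names; an untested sketch, and the values are only
starting points to tune against your heap):

    import org.apache.hadoop.conf.Configuration;

    public class ShuffleTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Fraction of the reducer's heap used to buffer map outputs
            // during the copy phase; segments that fit stay in memory.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
            // Usage threshold of that buffer at which in-memory segments
            // get merged and spilled to disk.
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
            // Map-side sort buffer in MB; a bigger buffer means fewer spills.
            conf.setInt("io.sort.mb", 200);
            // ... hand conf to the Job as usual ...
        }
    }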

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, May 1, 2013 at 1:29 AM, Jay Vyas <ja...@gmail.com> wrote:

> Yes it is a problem at the first stage.  What I'm wondering, though, is
> whether the intermediate results - which come after the map phase - can
> be optimized.

Re: partition as block?

Posted by Jay Vyas <ja...@gmail.com>.
Yes it is a problem at the first stage.  What I'm wondering, though, is
whether the intermediate results - which come after the map phase - can
be optimized.
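
The only handles I know of on the intermediate data are compressing the
map output and adding a combiner - e.g., roughly (an untested sketch;
Hadoop 1.x property names, and MyReducer is a hypothetical class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class IntermediateTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress map output before it crosses the network.
            conf.setBoolean("mapred.compress.map.output", true);
            // Snappy assumes native libs; DefaultCodec works everywhere.
            conf.setClass("mapred.map.output.compression.codec",
                          SnappyCodec.class, CompressionCodec.class);
            Job job = new Job(conf, "intermediate-tuning");
            // A combiner shrinks the intermediate data before the shuffle;
            // MyReducer is hypothetical -- any associative reduce works.
            // job.setCombinerClass(MyReducer.class);
            // ... rest of the job setup as usual ...
        }
    }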


On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hmmm. I was actually thinking about the very first step: how are you going
> to create the map tasks? Suppose you are on a block-less filesystem and you
> have a custom Format that is going to give you the splits dynamically. This
> means that you are going to store the file as a whole and create the splits
> as you continue to read it. Wouldn't that be a bottleneck from the disk's
> point of view? Aren't you moving away from the distributed paradigm?
>
> Am I understanding it correctly? Please correct me if I am getting it
> wrong.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: partition as block?

Posted by Mohammad Tariq <do...@gmail.com>.
Hmmm. I was actually thinking about the very first step: how are you going
to create the map tasks? Suppose you are on a block-less filesystem and you
have a custom Format that is going to give you the splits dynamically. This
means that you are going to store the file as a whole and create the splits
as you continue to read it. Wouldn't that be a bottleneck from the disk's
point of view? Aren't you moving away from the distributed paradigm?

Am I understanding it correctly? Please correct me if I am getting it
wrong.
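
Something like this is what I imagine you'd end up writing - a rough,
untested sketch of a new-API FileInputFormat that chops files into
fixed-size splits, since there are no block boundaries to follow (class
name and split size are just placeholders):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Chops every input file into fixed-size splits, ignoring block
    // boundaries entirely -- what a block-less filesystem would need.
    public class FixedSizeInputFormat
            extends FileInputFormat<LongWritable, Text> {
        private static final long SPLIT_BYTES = 64L * 1024 * 1024; // fake "block"

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus file : listStatus(job)) {
                long remaining = file.getLen();
                long offset = 0;
                while (remaining > 0) {
                    long len = Math.min(SPLIT_BYTES, remaining);
                    // No block locations to report: empty host list.
                    splits.add(new FileSplit(file.getPath(), offset, len,
                                             new String[0]));
                    offset += len;
                    remaining -= len;
                }
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Reads lines and handles records that straddle split edges.
            return new LineRecordReader();
        }
    }

Note the empty host list: without blocks there's no locality to hand the
scheduler, which is exactly the bottleneck I'm worried about.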

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, May 1, 2013 at 12:34 AM, Jay Vyas <ja...@gmail.com> wrote:

> Well, to be more clear, I'm wondering how hadoop-mapreduce can be
> optimized on a block-less filesystem... and am thinking about
> application-tier ways to simulate blocks - i.e. by making the granularity
> of partitions smaller.
>
> Wondering if there is a way to hack an increased number of partitions as
> a mechanism to simulate blocks - or whether this is just a bad idea
> altogether :)

Re: partition as block?

Posted by Jay Vyas <ja...@gmail.com>.
Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized
on a block-less filesystem... and am thinking about application-tier ways
to simulate blocks - i.e. by making the granularity of partitions smaller.

Wondering if there is a way to hack an increased number of partitions as
a mechanism to simulate blocks - or whether this is just a bad idea
altogether :)
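
At its simplest I guess that's just the split-size knob - an untested
sketch (class name and 32 MB value are arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallSplitsDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "small-splits");
            // Cap splits well below the (real or simulated) block size, so
            // each map task gets a smaller, more "block-like" chunk.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
            // ... formats, paths, mapper/reducer as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }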




On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Jay,
>
>     What are you going to do in your custom InputFormat and partitioner? Is
> your InputFormat going to create larger splits which will overlap with
> larger blocks? If that is the case, IMHO, you are going to reduce the
> no. of mappers, thus reducing the parallelism. Also, a much larger block
> size will put extra overhead when it comes to disk I/O.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: partition as block?

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Jay,

    What are you going to do in your custom InputFormat and partitioner? Is
your InputFormat going to create larger splits which will overlap with
larger blocks? If that is the case, IMHO, you are going to reduce the
no. of mappers, thus reducing the parallelism. Also, a much larger block
size will put extra overhead when it comes to disk I/O.
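
For instance, the quickest way to get those larger splits would be
something like this (an untested sketch; class name and 512 MB value are
arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class BigSplitsExample {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "big-splits");
            // Forcing every split to be at least 512 MB means fewer,
            // larger map tasks -- i.e. less parallelism, as above.
            FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
        }
    }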

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, May 1, 2013 at 12:16 AM, Jay Vyas <ja...@gmail.com> wrote:

> Hi guys:
>
> I'm wondering: if I'm running MapReduce jobs on a cluster with large block
> sizes, can I increase performance with any of the following:
>
> 1) A custom FileInputFormat
>
> 2) A custom partitioner
>
> 3) -DnumReducers
>
> Clearly, (3) might be an issue, since it could overload tasks and network
> traffic... but maybe (1) or (2) would be a precise way to "use" partitions
> as a "poor man's" block.
>
> Just a thought - not sure if anyone has tried (1) or (2) before in order
> to simulate blocks and increase locality by utilizing the partition API.
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>
