Posted to common-user@hadoop.apache.org by Barnet Wagman <b....@comcast.net> on 2009/04/20 04:23:31 UTC

Are SequenceFiles split? If so, how?

Suppose a SequenceFile (containing keys and values that are 
BytesWritable) is used as input. Will it be divided into InputSplits?  
If so, what criteria are used for splitting?

I'm interested in this because I need to control the number of map tasks 
used, which (if I understand it correctly) is equal to the number of 
InputSplits.

thanks,

bw

Re: Are SequenceFiles split? If so, how?

Posted by Shevek <ha...@anarres.org>.
On Thu, 2009-04-23 at 17:56 +0900, Aaron Kimball wrote:
> Explicitly controlling your splits will be very challenging. Taking the case
> where you have expensive (X) and cheap (C) objects to process, you may have
> a file where the records are lined up X C X C X C X X X X X C C C. In this
> case, you'll need to scan through the whole file and build splits such that
> the lengthy run of expensive objects is broken up into separate splits, but
> the run of cheap objects is consolidated. I'm skeptical that you can do
> this without scanning through the data (which is what often constitutes the
> bulk of the time in a mapreduce program).

I would also like the ability to stream the data and shuffle it into
buckets; when any bucket reaches a fixed cost (currently assessed as
byte size), it would be shipped as a task.

In practice, in the Hadoop architecture, this causes an extra level of
I/O, since all the data must be read into the shuffler and re-sorted.
Also, it breaks the ability to run map tasks on systems hosting the
data. However, it is a subject about which I am doing some thinking.
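As a toy model of that bucketing idea (illustrative Java only, not Hadoop code; the class name, cost threshold, and record sizes below are all made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of streaming records into buckets and "shipping" a bucket as a
// task whenever its accumulated cost (here, byte size) crosses a threshold.
public class CostBucketer {
    // Returns the record count of each task that would be shipped.
    static List<Integer> bucket(int[] recordSizes, int costThreshold) {
        List<Integer> tasks = new ArrayList<>();
        int cost = 0, count = 0;
        for (int size : recordSizes) {
            cost += size;
            count++;
            if (cost >= costThreshold) {   // bucket full: ship it as a task
                tasks.add(count);
                cost = 0;
                count = 0;
            }
        }
        if (count > 0) tasks.add(count);   // ship the partial final bucket
        return tasks;
    }

    public static void main(String[] args) {
        int[] sizes = { 10, 10, 50, 5, 5, 60, 10 };
        System.out.println(bucket(sizes, 60)); // prints [3, 3, 1]
    }
}
```

Note the trade-off discussed above: the costs can only be assessed by streaming every record past the bucketer, which is exactly the extra pass (and loss of data locality) Shevek mentions.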

> But how much data are you using? I would imagine that if you're operating at
> the scale where Hadoop makes sense, then the high- and low-cost objects will
> -- on average -- balance out and tasks will be roughly evenly proportioned.

True, dat.

But it's still worth thinking about stream splitting, since the
theoretical complexity overhead is an increased constant on a linear
term.

Will get more into architecture first.

S.


Re: Are SequenceFiles split? If so, how?

Posted by Barnet Wagman <b....@comcast.net>.
Aaron Kimball wrote:
> Explicitly controlling your splits will be very challenging. Taking the case
> where you have expensive (X) and cheap (C) objects to process, you may have
> a file where the records are lined up X C X C X C X X X X X C C C. In this
> case, you'll need to scan through the whole file and build splits such that
> the lengthy run of expensive objects is broken up into separate splits, but
> the run of cheap objects is consolidated. 
^ I'm not concerned about the variation in processing time of objects; 
there isn't enough variation to worry about. I'm primarily concerned 
with having enough map tasks to utilize all nodes (and cores).
> In general, I would just dodge the problem by making sure your splits are
> relatively small compared to the size of your input data. 
^ This sounds like the right solution.  I'll still need to extend 
SequenceFileInputFormat, but it should be relatively simple to put a 
fixed number of objects into each split.
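A sketch of that fixed-number-of-objects-per-split arithmetic (illustrative only, not Hadoop code; a real implementation would override getSplits() in a SequenceFileInputFormat subclass and align each split to the file's sync points to find record boundaries):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models only the grouping step: pack a fixed number of records into each
// split, with a smaller final split for the remainder.
public class FixedCountSplits {
    // Returns the record count of each planned split.
    static List<Long> planSplits(long totalRecords, long recordsPerSplit) {
        List<Long> plan = new ArrayList<>();
        long remaining = totalRecords;
        while (remaining > 0) {
            long n = Math.min(recordsPerSplit, remaining);
            plan.add(n);
            remaining -= n;
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(planSplits(70, 20)); // prints [20, 20, 20, 10]
    }
}
```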

thanks

Re: Are SequenceFiles split? If so, how?

Posted by Aaron Kimball <aa...@cloudera.com>.
Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated. I'm skeptical that you can do
this without scanning through the data (which is what often constitutes the
bulk of the time in a mapreduce program).

But how much data are you using? I would imagine that if you're operating at
the scale where Hadoop makes sense, then the high- and low-cost objects will
-- on average -- balance out and tasks will be roughly evenly proportioned.

In general, I would just dodge the problem by making sure your splits are
relatively small compared to the size of your input data. If you have 5
million objects to process, then make each split be roughly equal to say
20,000 of them. Then even if some splits take a long time to process and
others finish quickly, one CPU may dispatch a dozen cheap splits in the
time it takes one unlucky JVM to process a single very expensive split.
Now you haven't had to manually balance anything, and you still get
to keep all your CPUs full.
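A toy illustration of that balancing effect (an assumption, not Hadoop's actual scheduler; the worker count, cost range, and seed are made up): hand each split greedily to the least-loaded worker and compare the busiest worker's load to a perfect even division of the total.

```java
import java.util.PriorityQueue;
import java.util.Random;

// With many small splits of varying cost, greedy assignment keeps the
// busiest worker within one split's cost of the ideal balance.
public class BalanceDemo {
    // Ratio of the busiest worker's load to total/workers; 1.0 == perfect.
    static double makespanRatio(int numSplits, int workers, long seed) {
        Random rnd = new Random(seed);
        PriorityQueue<Double> load = new PriorityQueue<>(); // min-heap of worker loads
        for (int i = 0; i < workers; i++) load.add(0.0);
        double total = 0;
        for (int i = 0; i < numSplits; i++) {
            double cost = 1 + 9 * rnd.nextDouble(); // cheap..expensive (10x spread)
            total += cost;
            load.add(load.poll() + cost);           // give to least-loaded worker
        }
        double max = 0;
        for (double l : load) max = Math.max(max, l);
        return max / (total / workers);
    }

    public static void main(String[] args) {
        // Many small splits: busiest worker ends up within a few percent of ideal.
        System.out.println(makespanRatio(1000, 4, 42) < 1.05);
    }
}
```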

- Aaron


On Mon, Apr 20, 2009 at 11:25 PM, Barnet Wagman <b....@comcast.net> wrote:

> Thanks Aaron, that really helps.  I probably do need to control the number
> of splits.  My input 'data' consists of  Java objects and their size (in
> bytes) doesn't necessarily reflect the amount of time needed for each map
> operation.   I need to ensure that I have enough map tasks so that all cpus
> are utilized and the job gets done in a reasonable amount of time.
>  (Currently I'm creating multiple input files and making them unsplittable,
> but subclassing SequenceFileInputFormat to explicitly control the number of
> splits sounds like a better approach).
>
> Barnet
>
>
> Aaron Kimball wrote:
>
>> Yes, there can be more than one InputSplit per SequenceFile. The file will
>> be split more-or-less along 64 MB boundaries. (the actual "edges" of the
>> splits will be adjusted to hit the next block of key-value pairs, so it
>> might be a few kilobytes off.)
>>
>> The SequenceFileInputFormat regards mapred.map.tasks
>> (conf.setNumMapTasks())
>> as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
>> is always 100% user-controlled.) If you need exact control over the number
>> of map tasks, you'll need to subclass it and modify this behavior. That
>> having been said -- are you sure you actually need to precisely control
>> this
>> value? Or is it enough to know how many splits were created?
>>
>> - Aaron
>>
>> On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b....@comcast.net>
>> wrote:
>>
>>
>>
>>> Suppose a SequenceFile (containing keys and values that are
>>> BytesWritable)
>>> is used as input. Will it be divided into InputSplits?  If so, what
>>> criteria are used for splitting?
>>>
>>> I'm interested in this because I need to control the number of map tasks
>>> used, which (if I understand it correctly) is equal to the number of
>>> InputSplits.
>>>
>>> thanks,
>>>
>>> bw
>>>
>>>
>>>
>>
>>
>>
>
>

Re: Are SequenceFiles split? If so, how?

Posted by Barnet Wagman <b....@comcast.net>.
Thanks Aaron, that really helps.  I probably do need to control the 
number of splits.  My input 'data' consists of  Java objects and their 
size (in bytes) doesn't necessarily reflect the amount of time needed 
for each map operation.   I need to ensure that I have enough map tasks 
so that all cpus are utilized and the job gets done in a reasonable 
amount of time.  (Currently I'm creating multiple input files and making 
them unsplittable, but subclassing SequenceFileInputFormat to explicitly 
control the number of splits sounds like a better approach).

Barnet

Aaron Kimball wrote:
> Yes, there can be more than one InputSplit per SequenceFile. The file will
> be split more-or-less along 64 MB boundaries. (the actual "edges" of the
> splits will be adjusted to hit the next block of key-value pairs, so it
> might be a few kilobytes off.)
>
> The SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks())
> as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
> is always 100% user-controlled.) If you need exact control over the number
> of map tasks, you'll need to subclass it and modify this behavior. That
> having been said -- are you sure you actually need to precisely control this
> value? Or is it enough to know how many splits were created?
>
> - Aaron
>
> On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b....@comcast.net> wrote:
>
>   
>> Suppose a SequenceFile (containing keys and values that are BytesWritable)
>> is used as input. Will it be divided into InputSplits?  If so, what
>> criteria are used for splitting?
>>
>> I'm interested in this because I need to control the number of map tasks
>> used, which (if I understand it correctly) is equal to the number of
>> InputSplits.
>>
>> thanks,
>>
>> bw
>>
>>     
>
>   


Re: Are SequenceFiles split? If so, how?

Posted by Jim Twensky <ji...@gmail.com>.
In addition to what Aaron mentioned, you can configure the minimum split
size in hadoop-site.xml to have smaller or larger input splits depending on
your application.
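For example (the old-style mapred.map.tasks-era key for this is mapred.min.split.size; the 128 MB value below is purely illustrative):

```xml
<!-- hadoop-site.xml: raise the minimum split size so that fewer, larger
     input splits (and hence fewer map tasks) are created.
     134217728 bytes = 128 MB. -->
<property>
  <name>mapred.min.split.size</name>
  <value>134217728</value>
</property>
```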

-Jim

On Mon, Apr 20, 2009 at 12:18 AM, Aaron Kimball <aa...@cloudera.com> wrote:

> Yes, there can be more than one InputSplit per SequenceFile. The file will
> be split more-or-less along 64 MB boundaries. (the actual "edges" of the
> splits will be adjusted to hit the next block of key-value pairs, so it
> might be a few kilobytes off.)
>
> The SequenceFileInputFormat regards mapred.map.tasks
> (conf.setNumMapTasks())
> as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
> is always 100% user-controlled.) If you need exact control over the number
> of map tasks, you'll need to subclass it and modify this behavior. That
> having been said -- are you sure you actually need to precisely control
> this
> value? Or is it enough to know how many splits were created?
>
> - Aaron
>
> On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b....@comcast.net>
> wrote:
>
> > Suppose a SequenceFile (containing keys and values that are
> BytesWritable)
> > is used as input. Will it be divided into InputSplits?  If so, what
> > criteria are used for splitting?
> >
> > I'm interested in this because I need to control the number of map tasks
> > used, which (if I understand it correctly) is equal to the number of
> > InputSplits.
> >
> > thanks,
> >
> > bw
> >
>

Re: Are SequenceFiles split? If so, how?

Posted by Aaron Kimball <aa...@cloudera.com>.
Yes, there can be more than one InputSplit per SequenceFile. The file will
be split more-or-less along 64 MB boundaries. (the actual "edges" of the
splits will be adjusted to hit the next block of key-value pairs, so it
might be a few kilobytes off.)

The SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks())
as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
is always 100% user-controlled.) If you need exact control over the number
of map tasks, you'll need to subclass it and modify this behavior. That
having been said -- are you sure you actually need to precisely control this
value? Or is it enough to know how many splits were created?

- Aaron

On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b....@comcast.net> wrote:

> Suppose a SequenceFile (containing keys and values that are BytesWritable)
> is used as input. Will it be divided into InputSplits?  If so, what
> criteria are used for splitting?
>
> I'm interested in this because I need to control the number of map tasks
> used, which (if I understand it correctly) is equal to the number of
> InputSplits.
>
> thanks,
>
> bw
>