Posted to common-user@hadoop.apache.org by Deepak Nettem <de...@gmail.com> on 2012/04/26 23:07:53 UTC

Re: Changing the Java heap

HDFS doesn't care about the contents of the file. The file gets divided
into fixed-size blocks, 64 MB by default.

For example, if your input file contains data in a custom format (say,
paragraphs) and you want it split on paragraph boundaries, HDFS isn't
responsible for that - and rightly so.

The application developer needs to provide a custom InputFormat, which
internally uses an InputSplit and a RecordReader. The default TextInputFormat
makes sure that your mappers get each line as an input. Lines that span two
blocks are handled when the split is read: the RecordReader pulls in the
necessary bytes from both blocks and converts that byte view into
(key, value) pairs.
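
To make the custom InputFormat part concrete, here is a rough, untested
sketch against the newer org.apache.hadoop.mapreduce API. The class names
(ParagraphInputFormat, ParagraphRecordReader) and the blank-line-separated
paragraph rule are just assumptions for illustration, not anything Hadoop
ships with. It wraps the stock LineRecordReader, so lines crossing block
boundaries are still handled for us; note it does NOT handle a paragraph
that straddles a split boundary.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: one record per blank-line-separated paragraph.
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ParagraphRecordReader();
    }

    public static class ParagraphRecordReader
            extends RecordReader<LongWritable, Text> {

        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            StringBuilder paragraph = new StringBuilder();
            boolean started = false;
            while (lines.nextKeyValue()) {
                String line = lines.getCurrentValue().toString();
                if (line.trim().isEmpty()) {
                    if (started) break;   // blank line ends the paragraph
                    continue;             // skip blank lines before it starts
                }
                if (!started) {
                    // key = byte offset of the paragraph's first line
                    key.set(lines.getCurrentKey().get());
                    started = true;
                } else {
                    paragraph.append(' ');
                }
                paragraph.append(line);
            }
            if (!started) {
                return false;             // no more paragraphs in this split
            }
            value.set(paragraph.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException {
            return lines.getProgress();
        }

        @Override
        public void close() throws IOException {
            lines.close();
        }
    }
}

You would then point your job at it with
job.setInputFormatClass(ParagraphInputFormat.class).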



On Thu, Apr 26, 2012 at 4:59 PM, Barry, Sean F <se...@intel.com> wrote:

> I guess what I meant to say was: how does Hadoop make 64 MB blocks without
> cutting off parts of words at the end of each block? Does it only make
> blocks at whitespace?
>
> -SB
>
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Thursday, April 26, 2012 1:56 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Changing the Java heap
>
> Not sure of your question.
>
> Java child heap size is independent of how files are split on HDFS.
>
> I suggest you look at Tom White's book on HDFS and how files are split
> into blocks.
>
> Blocks are split at a set size, 64 MB by default.
> Your record boundaries are not necessarily on block boundaries, so one
> task may read the last record in block A and complete reading it at the
> start of block B. A different task may start with block B and skip the
> first n bytes until it hits the start of a record.
>
> HTH
>
> -Mike
>
> On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote:
>
> > Within my small 2-node cluster I set up my 4-core slave node to have 4
> task trackers, and I also limited my Java heap size to -Xmx1024m
> >
> > Is there a possibility that when the data gets broken up, it will be
> broken at a place in the file that is not whitespace? Or is that
> already handled when the data on HDFS is broken up into blocks?
> >
> > -SB
>
>


-- 
Warm Regards,
Deepak Nettem <http://www.cs.stonybrook.edu/%7Ednettem/>

Re: Changing the Java heap

Posted by Harsh J <ha...@cloudera.com>.
Deepak is right here. The line-reading technique is explained in
further detail at http://wiki.apache.org/hadoop/HadoopMapReduce.
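
If it helps to see that rule in isolation, here is a small, self-contained
toy program (plain Java, not actual Hadoop source) that mimics the idea: a
reader whose split does not start at offset 0 skips its leading partial
line, and every reader may run past the end of its split to finish the last
line it started, so each line ends up being read exactly once.

import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {

    public static void main(String[] args) {
        String data = "first line\nsecond line\nthird line\nfourth line\n";
        int splitSize = 17;  // deliberately cuts lines mid-word, like a block

        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            System.out.println("split [" + start + "," + end + "): "
                    + readSplit(data, start, end));
        }
    }

    // Returns the complete lines "owned" by the split [start, end).
    static List<String> readSplit(String data, int start, int end) {
        List<String> lines = new ArrayList<String>();
        int pos = start;

        // Not at the start of the file: the previous split's reader owns the
        // line we are in the middle of, so skip ahead to the next line start.
        if (pos != 0) {
            while (pos < data.length() && data.charAt(pos - 1) != '\n') {
                pos++;
            }
        }

        // Emit every line that *starts* inside the split; the last one may
        // extend past 'end' (in HDFS, those bytes come from the next block).
        while (pos < end) {
            int eol = data.indexOf('\n', pos);
            if (eol < 0) {
                eol = data.length();
            }
            lines.add(data.substring(pos, eol));
            pos = eol + 1;
        }
        return lines;
    }
}

Running it prints each line exactly once, with "second line" belonging to
the first split even though its bytes spill into the second one.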

On Fri, Apr 27, 2012 at 2:37 AM, Deepak Nettem <de...@gmail.com> wrote:
> HDFS doesn't care about the contents of the file. The file gets divided
> into fixed-size blocks, 64 MB by default.
>
> For example, if your input file contains data in a custom format (say,
> paragraphs) and you want it split on paragraph boundaries, HDFS isn't
> responsible for that - and rightly so.
>
> The application developer needs to provide a custom InputFormat, which
> internally uses an InputSplit and a RecordReader. The default TextInputFormat
> makes sure that your mappers get each line as an input. Lines that span two
> blocks are handled when the split is read: the RecordReader pulls in the
> necessary bytes from both blocks and converts that byte view into
> (key, value) pairs.
>
>
>
> On Thu, Apr 26, 2012 at 4:59 PM, Barry, Sean F <se...@intel.com> wrote:
>
>> I guess what I meant to say was: how does Hadoop make 64 MB blocks without
>> cutting off parts of words at the end of each block? Does it only make
>> blocks at whitespace?
>>
>> -SB
>>
>> -----Original Message-----
>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>> Sent: Thursday, April 26, 2012 1:56 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Changing the Java heap
>>
>> Not sure of your question.
>>
>> Java child heap size is independent of how files are split on HDFS.
>>
>> I suggest you look at Tom White's book on HDFS and how files are split
>> into blocks.
>>
>> Blocks are split at a set size, 64 MB by default.
>> Your record boundaries are not necessarily on block boundaries, so one
>> task may read the last record in block A and complete reading it at the
>> start of block B. A different task may start with block B and skip the
>> first n bytes until it hits the start of a record.
>>
>> HTH
>>
>> -Mike
>>
>> On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote:
>>
>> > Within my small 2-node cluster I set up my 4-core slave node to have 4
>> task trackers, and I also limited my Java heap size to -Xmx1024m
>> >
>> > Is there a possibility that when the data gets broken up, it will be
>> broken at a place in the file that is not whitespace? Or is that
>> already handled when the data on HDFS is broken up into blocks?
>> >
>> > -SB
>>
>>
>
>
> --
> Warm Regards,
> Deepak Nettem <http://www.cs.stonybrook.edu/%7Ednettem/>



-- 
Harsh J