Posted to user@hadoop.apache.org by Majid Azimi <ma...@gmail.com> on 2012/11/18 06:55:25 UTC

HDFS Block size vs Input Split Size

hi guys,

I want to get confirmation that I have understood this topic
correctly. HDFS block size is the number of bytes at which HDFS splits a
large file into blocks. Input split size is the number of bytes each mapper
will actually process. It may be less or more than the HDFS block size. Am I
right?

Suppose we want to load a 110MB text file into HDFS. The HDFS block size and
input split size are both set to 64MB.

1. The number of mappers is based on the number of input splits, not the
number of HDFS blocks. Right?

2. When we set the HDFS block size to 64MB, is this exactly 67108864
(64*1024*1024) bytes? I mean, it doesn't matter if the file is split in the
middle of a line?

3. Now we have 2 input splits (so two maps). The last line of the first block
and the first line of the second block are not meaningful on their own.
TextInputFormat is responsible for reading meaningful lines and giving them
to the map tasks. What TextInputFormat does is:

   - In the second block it seeks to the second line, which is a complete
   line, reads from there, and gives that to the second mapper.
   - The first mapper reads until the end of the first block and also
   processes the (last incomplete line of the first block + first incomplete
   line of the second block).

So the input split size of the first mapper is not exactly 64MB; it is a bit
more than that (by the first incomplete line of the second block). Likewise,
the input split size of the second mapper is a bit less than 64MB. Am I right?
So the HDFS block size is an exact number, but the input split size depends
on our data's record boundaries and may differ a little from the configured
number? Right?

Re: HDFS Block size vs Input Split Size

Posted by Harsh J <ha...@cloudera.com>.
Hi,

On Sun, Nov 18, 2012 at 11:25 AM, Majid Azimi <ma...@gmail.com> wrote:
> hi guys,
>
> I want to get confirmation that I have understood this topic correctly. HDFS
> block size is the number of bytes at which HDFS splits a large file into
> blocks. Input split size is the number of bytes each mapper will actually
> process. It may be less or more than the HDFS block size. Am I right?

Yes.

> Suppose we want to load a 110MB text file into HDFS. The HDFS block size and
> input split size are both set to 64MB.
>
> 1. The number of mappers is based on the number of input splits, not the
> number of HDFS blocks. Right?

Correct. Although the default logic derives the split size from the block
size, there is no hard requirement that they match.
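
For reference, the default sizing in FileInputFormat amounts to clamping the
block size between the configured minimum and maximum split sizes. A minimal
sketch of that logic (illustrative; close to, but not verbatim, the Hadoop
source):

    // Default split sizing, as in FileInputFormat: the block size,
    // clamped by the configured min/max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

With 64MB blocks and no min/max overrides this returns 64MB, which is why
splits usually line up with blocks; setting the min or max moves the split
size away from the block size.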

> 2. When we set the HDFS block size to 64MB, is this exactly 67108864
> (64*1024*1024) bytes? I mean, it doesn't matter if the file is split in the
> middle of a line?

Yes, HDFS cuts blocks at exact byte offsets and does not concern itself with
a file's contents (just as a regular filesystem doesn't). The reader is
expected to handle record boundaries properly (i.e. read on until the last
newline, etc.). See http://wiki.apache.org/hadoop/HadoopMapReduce for how MR
reads records without a break in between, even if a block boundary has
broken a record.
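
To make the byte math concrete, here is the arithmetic for the 110MB example
(plain Java; the numbers are just the example's sizes):

    long fileSize   = 110L * 1024 * 1024;    // 115343360 bytes
    long blockSize  =  64L * 1024 * 1024;    //  67108864 bytes
    long fullBlocks = fileSize / blockSize;  // 1 full block
    long lastBlock  = fileSize % blockSize;  // 48234496 bytes = 46MB

So the file is stored as one 64MB block and one 46MB block, cut at byte
67108864 no matter what character happens to sit there.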

> 3. Now we have 2 input splits (so two maps). The last line of the first block
> and the first line of the second block are not meaningful on their own.
> TextInputFormat is responsible for reading meaningful lines and giving them
> to the map tasks. What TextInputFormat does is:
>
> In the second block it seeks to the second line, which is a complete line,
> reads from there, and gives that to the second mapper.
> The first mapper reads until the end of the first block and also processes
> the (last incomplete line of the first block + first incomplete line of the
> second block).

Yes, this is explained at http://wiki.apache.org/hadoop/HadoopMapReduce as well.
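
If it helps, below is a small self-contained sketch of that boundary
handling. It mimics the idea behind TextInputFormat's LineRecordReader (a
non-first split discards its first, possibly partial, line; every split
reads its last line to completion, even past its own end), but it is an
illustration, not Hadoop's actual code:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class SplitReaderSketch {

        // Lines a reader for the split [start, end) would emit.
        static List<String> readSplit(byte[] file, int start, int end)
                throws IOException {
            InputStream in =
                new ByteArrayInputStream(file, start, file.length - start);
            int pos = start;
            if (start != 0) {
                // Not the first split: unconditionally discard the first
                // (possibly partial) line; the previous split's reader
                // reads past its own end to pick it up.
                int c;
                while ((c = in.read()) != -1) { pos++; if (c == '\n') break; }
            }
            List<String> lines = new ArrayList<>();
            // Read every line that STARTS within the split; the last one
            // may run past 'end', i.e. into the next block's bytes.
            while (pos <= end && pos < file.length) {
                StringBuilder line = new StringBuilder();
                int c;
                while ((c = in.read()) != -1) {
                    pos++;
                    if (c == '\n') break;
                    line.append((char) c);
                }
                lines.add(line.toString());
            }
            return lines;
        }

        public static void main(String[] args) throws IOException {
            byte[] data =
                "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.US_ASCII);
            // Split mid-way through "bravo" (byte 8 of 20):
            System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
            System.out.println(readSplit(data, 8, 20));  // [charlie]
        }
    }

Applied to the 110MB example with end = 67108864, the first reader finishes
the record that straddles the block boundary, and the second reader starts
at the first complete line after it.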

> So the input split size of the first mapper is not exactly 64MB; it is a bit
> more than that (by the first incomplete line of the second block). Likewise,
> the input split size of the second mapper is a bit less than 64MB. Am I right?
> So the HDFS block size is an exact number, but the input split size depends
> on our data's record boundaries and may differ a little from the configured
> number? Right?

Yes, all correct.
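
One last practical note: because the split size is independent of the block
size, you can override it per job. A hedged example using the new MapReduce
API (the job name is a placeholder; on newer releases you would use
Job.getInstance instead of the Job constructor):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "split-size-demo");
            // Cap splits at 32MB even though the block size is 64MB;
            // the 110MB example file then yields 4 splits instead of 2.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        }
    }

Equivalently, set the mapred.max.split.size property (renamed
mapreduce.input.fileinputformat.split.maxsize in later releases) in the job
configuration.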

-- 
Harsh J
