Posted to user@hadoop.apache.org by SP <sa...@gmail.com> on 2015/02/21 00:59:50 UTC

BLOCK and Split size question

Hello everyone,

I have a couple of doubts; can anyone please point me in the right
direction?

1) What exactly happens when I copy a 1 TB file to a Hadoop cluster
using the copyFromLocal command?

2) What will be the split size? Will it be the same as the block size?

3) What is a block and what is a split?


If we have a 100 MB file and a block size of 64 MB, it will be divided
into two blocks of 64 MB and 36 MB. The second block still has 28 MB
of space left; what happens to that free space?
Will the cluster have unequal block sizes, or will the remaining space
be occupied by another file?


4) Let's say a 64 MB block is on node A and replicated to two other
nodes (B, C), and the input split size for the MapReduce program is
64 MB. Will this split have a location only for node A, or will it
have locations for all three nodes A, B, C?


5) How is it handled if the input split size is greater or smaller
than the block size?


Can anyone please help?

Thanks,

SP

Re: BLOCK and Split size question

Posted by Ulul <ha...@ulul.org>.
Hi

As of Hadoop 2.6, the default block size is 128 MB (look for
dfs.blocksize):
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
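
If you want to check what your cluster actually uses, here is a
minimal sketch with the Java API (assuming the client classpath
carries your cluster's *-site.xml files; the path "/" is just an
example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockSize {
        public static void main(String[] args) throws Exception {
            // Loads core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Default block size for new files created under this path
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("dfs.blocksize = " + blockSize + " bytes");
            fs.close();
        }
    }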

About the 100 MB file with a 64 MB block size: as said, there are two
blocks, of 64 MB and 36 MB. There is no 28 MB of wasted space; the
second block is simply smaller. HDFS doesn't "reserve" entire blocks
the way a traditional FS does. That can be deceptive, by the way: if
you stuff your cluster with small files, you won't see it in your FS
filling up, but in your NN running out of memory or in poor job
performance (since unless you activate uberized jobs - which is not
yet supported on HDP at least - many JVMs will be spawned to deal
with many tiny splits).
In the case of small files you can use CombineFileInputFormat, which
also gives an example of an input split not being aligned with
blocks, since multiple blocks are combined to make one input split;
see the sketch below.
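
For instance, a hedged sketch of a job driver using
CombineTextInputFormat, the concrete text-file subclass (the 128 MB
cap and the input/output paths are just examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFiles {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(),
                                      "combine-small-files");
            job.setJarByClass(CombineSmallFiles.class);
            // Pack many small files into each split, instead of
            // spawning one mapper JVM per tiny file
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap a combined split at 128 MB (tune to your block size)
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Identity mapper by default; this only demonstrates
            // the input format wiring
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With one mapper per combined split rather than one per file, the
JVM-spawning problem described above goes away.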

To complete the answer about non-aligned blocks: MapReduce will fetch
the missing part of a record for you from another DN.

Cheers
Ulul

On 22/02/2015 03:19, Ahmed Ossama wrote:
> [...]


Re: BLOCK and Split size question

Posted by Ahmed Ossama <ah...@aossama.com>.
Hi,

Answering the first question:

What happens is that the client on the machine issuing the
copyFromLocal command first creates a new instance of
DistributedFileSystem, which makes an RPC call to the NameNode to
create a new file in the filesystem's namespace. The NN performs
various checks and, if they pass, creates a record for the file.
DistributedFileSystem then returns an FSDataOutputStream to the
client, which handles the communication with the DataNodes.

The file is then split into packets and written to the data queue by
the DataStreamer, which asks the NN to allocate a new block whenever
one fills up and to pick a list of DNs to store its replicas on. The
list of DNs forms a pipeline, one stage per replica: the DataStreamer
streams each packet to the first DataNode in the pipeline, which
stores the packet and forwards it to the second DataNode in the
pipeline, and so on until all the replicas are written.

When the client finishes the stream, it sends a confirmation to the
NN.
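
In client code, all of this machinery sits behind a single call. A
minimal sketch (the paths are hypothetical) of what copyFromLocal
does under the hood:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyIn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Returns a DistributedFileSystem when fs.defaultFS is hdfs://
            FileSystem fs = FileSystem.get(conf);
            // Wraps the create() -> FSDataOutputStream -> DataStreamer
            // pipeline described above
            fs.copyFromLocalFile(new Path("file:///data/bigfile"),
                                 new Path("/user/sp/bigfile"));
            fs.close();
        }
    }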

Second question: the split size is used during the processing of a
file; in other words, splits are consumed by the mappers, not during
HDFS operations. Blocks are what the HDFS client consumes during read
and write operations. The default block size in Hadoop 2.6 is 128 MB
(dfs.blocksize).
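
For reference, this is how FileInputFormat actually derives the split
size in Hadoop 2.x, which also answers question 5:

    // From org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
    // the block size, clamped between the configured min and max
    protected long computeSplitSize(long blockSize, long minSize,
                                    long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

With the defaults the split size equals the block size; raising
mapreduce.input.fileinputformat.split.minsize above the block size
makes a split span several blocks, and lowering
mapreduce.input.fileinputformat.split.maxsize below it produces
several splits per block.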

Third question: a split is the logical representation of the data in
a block; a block is the physical representation of the data. As I
said, splits are consumed when processing the data: a mapper reads
data from a block through a split.

Consider the following content of a file:

    000
    111
    222
    333
    444
    555
    666
    777
    888
    999
    aaa
    bbb
    ccc
    ddd
    eee
    fff


The block representation of this file could look something like this

    block-0

        000
        111
        222
        333
        444
        55

    block-1

        5
        666
        777
        888
        999
        aa

    block-2

        a
        bbb
        ccc
        ddd
        eee
        fff

If you notice, the blocks aren't aligned with the records; this
happens because a block ends wherever its byte budget runs out,
regardless of record boundaries. The splits put the records back
together before they are processed by a mapper, as the toy example
below shows.
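
Here is a toy demo in plain Java (no Hadoop classes, and the boundary
position is made up) of the convention Hadoop's LineRecordReader
follows: a split that doesn't start at byte 0 skips its first,
partial line, and every reader runs past its own end to finish its
last record:

    import java.util.ArrayList;
    import java.util.List;

    public class SplitDemo {
        static List<String> readSplit(byte[] data, int start, int end) {
            List<String> records = new ArrayList<>();
            int pos = start;
            if (start != 0) {
                // Skip the partial first line: the previous split owns it
                while (pos < data.length && data[pos - 1] != '\n') pos++;
            }
            while (pos < end && pos < data.length) {
                int lineStart = pos;
                // Run past 'end' if needed to finish the current record
                while (pos < data.length && data[pos] != '\n') pos++;
                records.add(new String(data, lineStart, pos - lineStart));
                pos++; // step over the '\n'
            }
            return records;
        }

        public static void main(String[] args) {
            byte[] data = "000\n111\n222\n333\n".getBytes();
            // Pretend the block boundary falls at byte 10, inside "222"
            System.out.println(readSplit(data, 0, 10));  // [000, 111, 222]
            System.out.println(readSplit(data, 10, 16)); // [333]
        }
    }

The first split completes the record "222" by reading into the second
block's bytes, and the second split skips those same bytes, just as
described above.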

Unlike traditional filesystems, HDFS stores each block as a plain
file on the DataNode that only grows up to the HDFS block size,
taking just the space its data needs. Note that there are two block
sizes here: the underlying filesystem's block size and the HDFS block
size.
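
You can see both the real block lengths and the replica locations for
yourself; a sketch (the path comes from the command line) that also
answers question 4, since getHosts() lists every DataNode holding a
replica of the block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path(args[0]));
            // One BlockLocation per block; lengths are actual sizes,
            // so the last block is usually shorter than dfs.blocksize
            for (BlockLocation b :
                     fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }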

On 02/21/2015 01:59 AM, SP wrote:
> [...]

-- 
Regards,
Ahmed Ossama

