Posted to common-user@hadoop.apache.org by shanmukhan battinapati <sh...@gmail.com> on 2010/12/29 05:57:10 UTC

HDFS Structure

Hi,

I have a small doubt about how HDFS manages files internally.

Assume I have a NameNode and 2 DataNodes. I have copied an 80 MB CSV file
into HDFS using the 'hadoop fs -copyFromLocal' command.

How will this file be stored in HDFS?

Will it be split into two blocks of 64 MB (the default block size) and the
remaining 16 MB, and copied to the 2 DataNodes?

If that is the case, and I run a map-reduce job over the two DataNodes, I may
get unexpected results, since the split points do not respect line boundaries.

How can this type of issue be solved? Please help me.
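(For reference, the block arithmetic above works out as follows. This is a
minimal Python sketch of the size calculation only, not actual HDFS code:)

```python
def hdfs_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return the sizes of the blocks an HDFS file is cut into.

    HDFS cuts files at exact byte offsets, ignoring line boundaries;
    only the last block may be shorter than the configured block size.
    """
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

# An 80 MB file becomes one full 64 MB block plus a 16 MB block:
print(hdfs_blocks(80 * 1024 * 1024))  # [67108864, 16777216]
```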



Thanks & Regards
Shanmukhan.B

Re: HDFS Structure

Posted by Harsh J <qw...@gmail.com>.
TextInputFormat's record reader takes care of line boundaries across
splits; you don't need to worry about that.

Each mapper works on a FileSplit, which carries a starting offset and a
length. The record reader honors line boundaries at runtime: a split that
begins mid-line skips ahead to the start of the next line (the previous
split's reader finishes the partial one), and a split that ends mid-line
reads past its end to complete its last line (pulling the extra bytes
from the DataNode that has them).
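The reader logic is roughly this (a simplified Python sketch of the idea
behind Hadoop's LineRecordReader, operating on an in-memory byte string
rather than a real HDFS stream, not the actual implementation):

```python
def read_split(data: bytes, start: int, length: int):
    """Yield the complete lines owned by the split [start, start+length).

    A split that begins mid-file skips the partial first line (the
    previous split's reader finishes it), and reads past its own end
    to complete the last line it started.
    """
    pos = start
    if start > 0:
        # Skip the (possibly partial) line the previous split owns.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    end = start + length
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        yield data[pos:nxt].rstrip(b"\n")
        pos = nxt

data = b"alpha\nbravo\ncharlie\ndelta\n"
# A split boundary at byte 8 falls inside "bravo", yet every line is
# read exactly once across the two splits:
print(list(read_split(data, 0, 8)))              # [b'alpha', b'bravo']
print(list(read_split(data, 8, len(data) - 8)))  # [b'charlie', b'delta']
```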

Similarly, SequenceFiles achieve this with a special sync marker embedded
between logical blocks of data, which lets a reader starting at an
arbitrary offset resynchronize to the next record boundary.
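The sync-marker idea can be sketched like this (a toy illustration of the
mechanism only: the marker value and record layout here are made up, and
real SequenceFiles use a randomly generated 16-byte sync per file):

```python
# Hypothetical marker; real SequenceFiles embed a random 16-byte sync.
SYNC = b"\x00SYNC"

def next_record_boundary(data: bytes, offset: int) -> int:
    """Return the first record boundary at or after `offset`,
    found by scanning forward for the sync marker."""
    i = data.find(SYNC, offset)
    if i == -1:
        return len(data)
    return i + len(SYNC)

# Records of arbitrary binary content separated by sync markers:
data = SYNC + b"rec-one" + SYNC + b"rec-two" + SYNC + b"rec-three"
# A split starting at byte 10 (mid-record) resynchronizes at the
# next marker and begins reading with the following record:
boundary = next_record_boundary(data, 10)
print(data[boundary:].split(SYNC)[0])  # b'rec-two'
```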

On Wed, Dec 29, 2010 at 10:27 AM, shanmukhan battinapati
<sh...@gmail.com> wrote:
> [original message quoted above, snipped]



-- 
Harsh J
www.harshj.com