Posted to hdfs-user@hadoop.apache.org by Grandl Robert <rg...@yahoo.com> on 2012/08/01 15:44:51 UTC

HDFS splits based on content semantics

Hi,

This question has probably been answered many times, but I could not find a clear answer after searching on Google.


Does HDFS split the input solely on a fixed block size, or does it take the content's semantics into consideration?
For example, if I have a binary file, or I want a block not to cut lines of text in half, can I instruct HDFS where each block should end?

Thanks,
Robert

Re: HDFS splits based on content semantics

Posted by Grandl Robert <rg...@yahoo.com>.
Thank you guys.

Really helpful.




Re: HDFS splits based on content semantics

Posted by Harsh J <ha...@cloudera.com>.
To add onto David's response, also read
http://search-hadoop.com/m/ydCoSysmTd1 for some more info.




-- 
Harsh J

Re: HDFS splits based on content semantics

Posted by David Rosenstrauch <da...@darose.net>.

Hadoop natively understands text-based data, as long as it's in a 
one-record-per-line format.
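
To make that concrete, here is a minimal sketch in plain Python of the idea 
(this is not Hadoop's actual LineRecordReader, just the same scheme): HDFS 
blocks cut the file at arbitrary byte offsets, and the record reader 
compensates — a reader that does not start at offset 0 skips the partial 
first line, and every reader keeps going past its end offset to finish the 
line it is in the middle of, so no record is ever split or lost:

```python
# Toy model of line-record splitting over fixed-size byte ranges.
# Not Hadoop code; it only illustrates how record boundaries are
# reconciled with block boundaries.

def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the complete lines 'owned' by the split [start, start+length)."""
    pos = start
    if start != 0:
        # Skip the tail of a line that belongs to the previous split.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    end = start + length
    lines = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])  # last line, no trailing newline
            break
        lines.append(data[pos:nl])    # may read past 'end' to finish a line
        pos = nl + 1
    return lines

data = b"alpha\nbravo\ncharlie\ndelta\n"
block = 8  # a boundary that cuts "bravo" in half
splits = [read_split(data, off, block) for off in range(0, len(data), block)]
# Every line comes out exactly once, even the ones a boundary cut through.
```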

It obviously does not understand custom binary formats (e.g., Microsoft 
Word files).

However, Hadoop does provide a framework for creating your own binary 
formats that it can understand.  Hadoop's SequenceFile class lets you 
create binary files that are broken up into logical blocks that Hadoop 
can split on.
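
The idea behind such splittable binary formats can be sketched with a toy 
example in plain Python: length-prefixed records separated by a sync marker, 
so a reader dropped at an arbitrary byte offset can resynchronize at the 
next record boundary.  The marker and framing below are made up for 
illustration — this is not the real SequenceFile on-disk format, which uses 
a random per-file 16-byte sync and richer headers:

```python
# Toy splittable binary format in the spirit of SequenceFile (simplified:
# assumes the marker never appears inside record payloads).
import struct

SYNC = b"\x00SYNCSYNC\x00"  # made-up marker, not the real SequenceFile sync

def write_records(records: list[bytes]) -> bytes:
    out = bytearray()
    for rec in records:
        out += SYNC
        out += struct.pack(">I", len(rec))  # 4-byte big-endian length prefix
        out += rec
    return bytes(out)

def read_from(data: bytes, offset: int) -> list[bytes]:
    """Seek to an arbitrary offset, resync on the marker, read records."""
    pos = data.find(SYNC, offset)
    records = []
    while pos != -1:
        pos += len(SYNC)
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        records.append(data[pos:pos + length])
        pos = data.find(SYNC, pos + length)
    return records

blob = write_records([b"rec-one", b"rec-two", b"rec-three"])
# A reader starting mid-file skips the partial record it landed in and
# picks up cleanly at the next sync marker.
tail = read_from(blob, 5)
```

This is exactly what makes such a file splittable: a map task handed the 
byte range starting at offset 5 can still find a record boundary on its own.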

HTH,

DR