Posted to common-user@hadoop.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2012/06/14 03:11:50 UTC

How HDFS splits blocks on record boundaries

I have a text file which doesn't have any newline characters. The
records are separated by a special character (e.g. $). If I push a
single 5 GB file to HDFS, how will it identify the boundaries on
which the file should be split?

What options do I have in such a scenario so that I can run MapReduce jobs:

1. Replace the record separator with a newline? (Not very convincing, as I
have newlines in the data.)

2. Create 64 MB chunks by some preprocessing? (Would love to know if
this can be avoided.)

3. I can definitely write my own custom loader for my MapReduce jobs, but
even then, is it possible to reach across HDFS nodes if the block
boundaries are not aligned with record boundaries?

Thanks,
Prasenjit

-- 
Sent from my mobile device

Re: How HDFS splits blocks on record boundaries

Posted by Harsh J <ha...@cloudera.com>.
Sachin,

That would require knowledge of record boundaries in the file, a
solution that wouldn't scale for very large files or for a large number
of files. You don't really have to do that; it's the hard way. Please
see my previous response for a proper MR way of doing this.

On Thu, Jun 21, 2012 at 10:45 AM, Sachin Aggarwal
<di...@gmail.com> wrote:
> When you store data in HDFS, it is split into 64 MB blocks automatically.
>
> Use these to control the number of mappers you want, by split size in bytes:
>
>    FileInputFormat.setMaxInputSplitSize(job, 2097152);
>    FileInputFormat.setMinInputSplitSize(job, 1048576);
>
> Then you can read each record and split it into fields:
>    String[] fields = line.split(",");
>
> On Thu, Jun 14, 2012 at 10:56 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> You may use TextInputFormat with "textinputformat.record.delimiter"
>> config set to the character you use. This feature is available in the
>> Apache Hadoop 2.0.0 release (and perhaps in other distributions that
>> carry backports).
>>
>> In case you don't have a Hadoop cluster with this feature
>> (MAPREDUCE-2254), you can read up on how \n is handled and handle your
>> files in the same way (swapping \n in LineReader with your character,
>> essentially what the above feature does):
>> http://wiki.apache.org/hadoop/HadoopMapReduce (See the Map section for
>> the logic)
>>
>> Does this help?
>>
>> On Thu, Jun 14, 2012 at 6:41 AM, prasenjit mukherjee
>> <pr...@gmail.com> wrote:
>> > I have a text file which doesn't have any newline characters. The
>> > records are separated by a special character (e.g. $). If I push a
>> > single 5 GB file to HDFS, how will it identify the boundaries on
>> > which the file should be split?
>> >
>> > What options do I have in such a scenario so that I can run MapReduce
>> > jobs:
>> >
>> > 1. Replace the record separator with a newline? (Not very convincing, as I
>> > have newlines in the data.)
>> >
>> > 2. Create 64 MB chunks by some preprocessing? (Would love to know if
>> > this can be avoided.)
>> >
>> > 3. I can definitely write my own custom loader for my MapReduce jobs, but
>> > even then, is it possible to reach across HDFS nodes if the block
>> > boundaries are not aligned with record boundaries?
>> >
>> > Thanks,
>> > Prasenjit
>> >
>> > --
>> > Sent from my mobile device
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772



-- 
Harsh J

Re: How HDFS splits blocks on record boundaries

Posted by Sachin Aggarwal <di...@gmail.com>.
When you store data in HDFS, it is split into 64 MB blocks automatically.

Use these to control the number of mappers you want, by split size in bytes:

    FileInputFormat.setMaxInputSplitSize(job, 2097152);
    FileInputFormat.setMinInputSplitSize(job, 1048576);

Then you can read each record and split it into fields:
    String[] fields = line.split(",");
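
A rough sketch of a mapper along those lines (minimal and untested; the
class name, field layout, and output types here are illustrative
assumptions, not part of the original code):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Each value is one record as delivered by the input format;
        // split it into comma-separated fields and emit the first one.
        String[] fields = value.toString().split(",");
        if (fields.length > 0) {
          context.write(new Text(fields[0]), NullWritable.get());
        }
      }
    }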

On Thu, Jun 14, 2012 at 10:56 AM, Harsh J <ha...@cloudera.com> wrote:

> You may use TextInputFormat with "textinputformat.record.delimiter"
> config set to the character you use. This feature is available in the
> Apache Hadoop 2.0.0 release (and perhaps in other distributions that
> carry backports).
>
> In case you don't have a Hadoop cluster with this feature
> (MAPREDUCE-2254), you can read up on how \n is handled and handle your
> files in the same way (swapping \n in LineReader with your character,
> essentially what the above feature does):
> http://wiki.apache.org/hadoop/HadoopMapReduce (See the Map section for
> the logic)
>
> Does this help?
>
> On Thu, Jun 14, 2012 at 6:41 AM, prasenjit mukherjee
> <pr...@gmail.com> wrote:
> > I have a text file which doesn't have any newline characters. The
> > records are separated by a special character (e.g. $). If I push a
> > single 5 GB file to HDFS, how will it identify the boundaries on
> > which the file should be split?
> >
> > What options do I have in such a scenario so that I can run MapReduce
> > jobs:
> >
> > 1. Replace the record separator with a newline? (Not very convincing, as I
> > have newlines in the data.)
> >
> > 2. Create 64 MB chunks by some preprocessing? (Would love to know if
> > this can be avoided.)
> >
> > 3. I can definitely write my own custom loader for my MapReduce jobs, but
> > even then, is it possible to reach across HDFS nodes if the block
> > boundaries are not aligned with record boundaries?
> >
> > Thanks,
> > Prasenjit
> >
> > --
> > Sent from my mobile device
>
>
>
> --
> Harsh J
>



-- 

Thanks & Regards

Sachin Aggarwal
7760502772

Re: How HDFS splits blocks on record boundaries

Posted by Harsh J <ha...@cloudera.com>.
You may use TextInputFormat with "textinputformat.record.delimiter"
config set to the character you use. This feature is available in the
Apache Hadoop 2.0.0 release (and perhaps in other distributions that
carry backports).
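
For example, the job setup could look roughly like this (a minimal
sketch assuming Hadoop 2.0.0+ and the new mapreduce API; the class name
and argument handling are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DollarDelimitedJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Treat '$' instead of '\n' as the record delimiter (MAPREDUCE-2254).
        conf.set("textinputformat.record.delimiter", "$");

        Job job = Job.getInstance(conf, "dollar-delimited");
        job.setJarByClass(DollarDelimitedJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        // ... set your mapper/reducer and output key/value classes here ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Each map() call then receives one record as its value; the '$'
delimiter itself is stripped, just as '\n' normally is.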

In case you don't have a Hadoop cluster with this feature
(MAPREDUCE-2254), you can read up on how \n is handled and handle your
files in the same way (swapping \n in LineReader with your character,
essentially what the above feature does):
http://wiki.apache.org/hadoop/HadoopMapReduce (See the Map section for
the logic)

Does this help?

On Thu, Jun 14, 2012 at 6:41 AM, prasenjit mukherjee
<pr...@gmail.com> wrote:
> I have a text file which doesn't have any newline characters. The
> records are separated by a special character (e.g. $). If I push a
> single 5 GB file to HDFS, how will it identify the boundaries on
> which the file should be split?
>
> What options do I have in such a scenario so that I can run MapReduce jobs:
>
> 1. Replace the record separator with a newline? (Not very convincing, as I
> have newlines in the data.)
>
> 2. Create 64 MB chunks by some preprocessing? (Would love to know if
> this can be avoided.)
>
> 3. I can definitely write my own custom loader for my MapReduce jobs, but
> even then, is it possible to reach across HDFS nodes if the block
> boundaries are not aligned with record boundaries?
>
> Thanks,
> Prasenjit
>
> --
> Sent from my mobile device



-- 
Harsh J