Posted to common-user@hadoop.apache.org by Teodor Macicas <te...@epfl.ch> on 2010/08/23 12:38:16 UTC

Control the file splits size

Hi all,

Can anyone please tell me how to control the split size? I have one 
big file which will be split according to the number of maps. The input 
file is binary and contains some objects. I do not want an object to be 
cut across 2 separate splits, for sure.
I overrode the computeSplitSize() method and forced the size to be a 
multiple of my object size. It worked, but it seems that objects are 
missing at certain points of the output file. And now I am thinking 
that the splitting could be my problem.

Has anyone faced this problem before?

Thank you.
Regards,
Teodor
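
For readers following the thread, here is a minimal sketch of the rounding described above. It is standalone Java and uses no Hadoop classes. The first method reproduces the arithmetic of computeSplitSize() in the old org.apache.hadoop.mapred API; the second rounds that result down to a whole number of objects. The class name, the helper names and the fixed 1000-byte object size are hypothetical, chosen only for illustration.

```java
// Standalone sketch: plain arithmetic only, no Hadoop dependency.
final class AlignedSplitSize {

    // Same arithmetic as computeSplitSize(goalSize, minSize, blockSize)
    // in the old mapred FileInputFormat: max(minSize, min(goalSize, blockSize)).
    static long defaultSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Round the default size down to a whole number of records,
    // but never below one record. recordSize is an assumed fixed object size.
    static long alignedSplitSize(long goalSize, long minSize, long blockSize, long recordSize) {
        if (recordSize <= 0) {
            throw new IllegalArgumentException("recordSize must be positive");
        }
        long size = defaultSplitSize(goalSize, minSize, blockSize);
        long aligned = (size / recordSize) * recordSize;
        return Math.max(aligned, recordSize);
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // a 64 MB block, for illustration
        long recordSize = 1000;             // hypothetical object size in bytes
        long size = alignedSplitSize(Long.MAX_VALUE, 1, blockSize, recordSize);
        System.out.println(size + " bytes = " + (size / recordSize) + " records per split");
    }
}
```

This keeps split boundaries on object boundaries only if the file has no header and every object has the same size; otherwise the boundary handling has to live in the record reader.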

Re: Control the file splits size

Posted by Teodor Macicas <te...@epfl.ch>.
Thanks guys for your replies.
It seems that this wasn't my problem. Overriding computeSplitSize() and 
forcing the size to be a multiple of my object size worked.

But now I have another question: how can I control the comparators used 
by the sorting algorithms? I mean the sorting of the keys before a 
reducer starts. Since I have objects, I want a custom comparator to 
distinguish them.

Best,
Teodor
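
On the comparator question: Hadoop lets a job supply its own sort comparator, via Job.setSortComparatorClass() in the new API or JobConf.setOutputKeyComparatorClass() in the old one, and both expect a RawComparator implementation that compares serialized keys. The sketch below is standalone Java with no Hadoop classes; it only shows the shape of such a comparison. The 12-byte key layout (a 4-byte id followed by an 8-byte timestamp) and all names are assumptions made for illustration, not something from this thread.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;

final class ObjectKeyOrder {
    // Hypothetical serialized key: 4-byte big-endian id, then 8-byte big-endian timestamp.
    static final int KEY_LENGTH = 12;

    // Same parameter shape as RawComparator.compare(b1, s1, l1, b2, s2, l2):
    // compare two keys in place inside byte buffers, without deserializing objects.
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int id1 = readInt(b1, s1);
        int id2 = readInt(b2, s2);
        if (id1 != id2) {
            return Integer.compare(id1, id2);
        }
        // Ids are equal, so order by timestamp.
        return Long.compare(readLong(b1, s1 + 4), readLong(b2, s2 + 4));
    }

    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    static long readLong(byte[] b, int off) {
        return ((long) readInt(b, off) << 32) | (readInt(b, off + 4) & 0xffffffffL);
    }

    // Build a serialized key; ByteBuffer writes big-endian by default.
    static byte[] key(int id, long timestamp) {
        return ByteBuffer.allocate(KEY_LENGTH).putInt(id).putLong(timestamp).array();
    }

    static final Comparator<byte[]> BY_ID_THEN_TIME =
        (a, b) -> compareRaw(a, 0, a.length, b, 0, b.length);

    public static void main(String[] args) {
        byte[][] keys = { key(2, 10), key(1, 99), key(2, 5) };
        Arrays.sort(keys, BY_ID_THEN_TIME);
        for (byte[] k : keys) {
            System.out.println(readInt(k, 0) + " @ " + readLong(k, 4));
        }
    }
}
```

In a real job the same logic would sit in a class implementing RawComparator; an alternative is to make the key type itself implement WritableComparable and put the ordering in compareTo().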



Re: Control the file splits size

Posted by Harsh J <qw...@gmail.com>.
Ah yes I overlooked that part, sorry. I haven't tried out custom
splits yet, so can't comment further on what may be going down.

On Tue, Aug 24, 2010 at 12:44 AM, Michael Segel
<mi...@hotmail.com> wrote:
>
>
> Uhm...
>
> There may be more to the initial question.
>
> The OP indicated that this was a 'binary file' and that the records may not be based on an end-of-line.
> So he may want to look at how to handle different types of input too.
>
>
>> From: qwertymaniac@gmail.com
>> Date: Mon, 23 Aug 2010 18:39:48 +0530
>> Subject: Re: Control the file splits size
>> To: common-user@hadoop.apache.org
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path)
>>
>> The isSplitable is the method you're looking for -- return false for
>> this in your custom input format (derived from FIF or etc.).
>>
>> On Mon, Aug 23, 2010 at 4:08 PM, Teodor Macicas <te...@epfl.ch> wrote:
>> > Hi all,
>> >
>> > Can anyone please tell me how to control the splits size ? I have one big
>> > file which will be splitted by the number of maps. The input file is binary
>> > and contains some objects. I do not want to split an object into 2 separate
>> > files, for sure.
>> > I overwrite the computeSplitSize() file and I forced the size to be a
>> > multiple of my objects size. It worked, but it seems that on certain points
>> > of the output file objects are missing. And now I am thinking that this
>> > could be my problem.
Your output file is the result of an MR job, if I am correct? Can you verify
at the input of your mapper whether your objects are being read properly,
based on the split you've computed for it?
>> >
>> > Have anyone faced this problem before ?
>> >
>> > Thank you.
>> > Regards,
>> > Teodor
>> >
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>



-- 
Harsh J
www.harshj.com

RE: Control the file splits size

Posted by Michael Segel <mi...@hotmail.com>.

Uhm...

There may be more to the initial question.

The OP indicated that this was a 'binary file' and that the records may not be based on an end-of-line.
So he may want to look at how to handle different types of input too.



Re: Control the file splits size

Posted by Gang Luo <lg...@yahoo.com.cn>.
You'd better modify the RecordReader instead of focusing on the split size. If 
there is some token indicating the boundary (a start/end token) of an object, the 
RecordReader will read the entire object once it sees the start token, even if 
the object crosses the split boundary. If the reader sees only an end token with 
no start token before it, it will not read that incomplete object. This is 
exactly like the reader in TextInputFormat, which reads one line at a time as a 
record, even if the line is logically split into two parts.

-Gang
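
A minimal sketch of the ownership rule described above, adapted to fixed-size objects rather than start/end tokens (an assumption; the thread only says that objects have a known size). Each reader emits every record whose first byte falls inside its split and reads past the split end to finish the last one, so no record is lost or duplicated even when the split size is not a multiple of the record size. It is standalone Java with no Hadoop classes, and all names and sizes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitRecordOwnership {

    // Offsets of the records that the reader for [splitStart, splitEnd) should emit.
    // Assumes fixed-size records laid out back to back from offset 0.
    static List<Long> ownedRecordOffsets(long splitStart, long splitEnd,
                                         long recordSize, long fileLength) {
        List<Long> offsets = new ArrayList<>();
        // Skip forward to the first record that starts at or after splitStart;
        // a record that started earlier belongs to the previous split's reader.
        long pos = ((splitStart + recordSize - 1) / recordSize) * recordSize;
        // Emit every complete record that starts inside the split.
        // The last one may extend past splitEnd, which is intended.
        while (pos < splitEnd && pos + recordSize <= fileLength) {
            offsets.add(pos);
            pos += recordSize;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long recordSize = 100;
        long fileLength = 1000;
        long splitSize = 250; // deliberately not a multiple of recordSize
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            System.out.println("[" + start + ", " + end + ") -> "
                + ownedRecordOffsets(start, end, recordSize, fileLength));
        }
    }
}
```

With this rule the split size only affects how work is divided among the maps; correctness no longer depends on it being a multiple of the object size.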




----- Original message ----
From: Teodor Macicas <te...@epfl.ch>
To: common-user@hadoop.apache.org
Sent: 2010/8/23 (Mon) 12:35:25 PM
Subject: Re: Control the file splits size

Thank you. But I think this won't help me. I want to split the big 
input, but to control where the file will be splitted. I have some 
objects in this file and I want to be sure that one object will be 
entirely in one split.
Does it make sense for you ?

Best,
Teodor


Re: Control the file splits size

Posted by Teodor Macicas <te...@epfl.ch>.
Thank you, but I think this won't help me. I want to split the big 
input, but control where the file is split. I have some 
objects in this file and I want to be sure that each object lies 
entirely within one split.
Does that make sense to you?

Best,
Teodor



Re: Control the file splits size

Posted by Harsh J <qw...@gmail.com>.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path)

isSplitable is the method you're looking for -- return false from it in
your custom input format (derived from FileInputFormat or similar), and the
file will not be split at all.




-- 
Harsh J
www.harshj.com