Posted to common-user@hadoop.apache.org by Teodor Macicas <te...@epfl.ch> on 2010/08/23 12:38:16 UTC
Control the file splits size
Hi all,
Can anyone please tell me how to control the split size? I have one
big file which will be split according to the number of maps. The input
file is binary and contains some objects. I do not want an object to be
cut across 2 separate splits, for sure.
I overrode the computeSplitSize() method and forced the size to be a
multiple of my object size. It worked, but it seems that at certain
points of the output file objects are missing. And now I am thinking
that the splitting could be my problem.
Has anyone faced this problem before?
Thank you.
Regards,
Teodor
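For fixed-size records, the alignment described above comes down to rounding the candidate split size to a whole number of records. Below is a minimal sketch in plain Java, assuming fixed-size records that start at offset 0 with no file header. The names AlignedSplitSize, align and recordsPerSplit are hypothetical and this is not Hadoop's own code; in a real job the same rounding would presumably sit inside the overridden computeSplitSize().

```java
final class AlignedSplitSize {
    /** Rounds a candidate split size down to a whole number of records (at least one). */
    static long align(long candidateSize, long recordSize) {
        if (recordSize <= 0) {
            throw new IllegalArgumentException("recordSize must be positive");
        }
        long records = Math.max(1, candidateSize / recordSize);
        return records * recordSize;
    }

    /** Counts the complete records in each split when the file is cut every splitSize bytes. */
    static long[] recordsPerSplit(long fileLength, long splitSize, long recordSize) {
        int n = (int) ((fileLength + splitSize - 1) / splitSize);
        long[] counts = new long[n];
        for (int i = 0; i < n; i++) {
            long start = i * splitSize;
            long end = Math.min(start + splitSize, fileLength);
            counts[i] = (end - start) / recordSize;
        }
        return counts;
    }
}
```

Summing recordsPerSplit and comparing the total with fileLength / recordSize is a cheap way to check that no object is lost at a boundary. The sketch models evenly cut splits only; Hadoop's getSplits() may size the final split differently, so the last split is worth checking separately.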
Re: Control the file splits size
Posted by Teodor Macicas <te...@epfl.ch>.
Thanks guys for your replies.
It seems that this wasn't my problem. Overriding computeSplitSize() and
forcing the returned size to be a multiple of my object size
worked.
But now I have another question: how can I control the comparators used
by the sorting algorithms? I mean the sorting of the keys before a
reducer starts. Since I have objects, I want a custom comparator to
distinguish them.
Best,
Teodor
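On the comparator question: in Hadoop the sort order of map output keys can be replaced by registering a RawComparator on the job (JobConf.setOutputKeyComparatorClass in the old API, Job.setSortComparatorClass in the new one). The sketch below is plain Java and only mirrors the shape of RawComparator's byte-level compare method. The key layout (an 8-byte big-endian id followed by a payload) and the names ObjectKeyComparator and serialize are assumptions made for illustration, not something stated in this thread.

```java
import java.nio.ByteBuffer;

final class ObjectKeyComparator {
    /**
     * Compares two serialized keys by their leading 8-byte id only.
     * b1/b2 are the buffers, s1/s2 the start offsets, l1/l2 the key lengths.
     */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        long id1 = ByteBuffer.wrap(b1, s1, l1).getLong();  // big-endian by default
        long id2 = ByteBuffer.wrap(b2, s2, l2).getLong();
        return Long.compare(id1, id2);
    }

    /** Hypothetical serialization: id first, then the object payload. */
    static byte[] serialize(long id, byte[] payload) {
        return ByteBuffer.allocate(8 + payload.length).putLong(id).put(payload).array();
    }
}
```

Comparing on the serialized bytes avoids deserializing every object during the sort; whether that is worth it depends on how the objects are actually laid out.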
On 08/23/2010 09:32 PM, Harsh J wrote:
> Ah yes I overlooked that part, sorry. I haven't tried out custom
> splits yet, so can't comment further on what may be going down.
Re: Control the file splits size
Posted by Harsh J <qw...@gmail.com>.
Ah yes I overlooked that part, sorry. I haven't tried out custom
splits yet, so can't comment further on what may be going down.
On Tue, Aug 24, 2010 at 12:44 AM, Michael Segel
<mi...@hotmail.com> wrote:
>
>
> Uhm...
>
> There may be more to the initial question.
>
> The OP indicated that this was a 'binary file' and that the records may not be based on an end-of-line.
> So he may want to look at how to handle different types of input too.
>
>
>> On Mon, Aug 23, 2010 at 4:08 PM, Teodor Macicas <te...@epfl.ch> wrote:
>> > Hi all,
>> >
>> > Can anyone please tell me how to control the split size? I have one big
>> > file which will be split according to the number of maps. The input file is binary
>> > and contains some objects. I do not want an object to be cut across 2 separate
>> > splits, for sure.
>> > I overrode the computeSplitSize() method and forced the size to be a
>> > multiple of my object size. It worked, but it seems that at certain points
>> > of the output file objects are missing. And now I am thinking that the
>> > splitting could be my problem.
Your output file is the result of a MapReduce job, if I am correct? Can you
verify at the input of your mapper whether your objects are being read
properly, based on the split you've computed for it?
--
Harsh J
www.harshj.com
RE: Control the file splits size
Posted by Michael Segel <mi...@hotmail.com>.
Uhm...
There may be more to the initial question.
The OP indicated that this was a 'binary file' and that the records may not be based on an end-of-line.
So he may want to look at how to handle different types of input too.
> From: qwertymaniac@gmail.com
> Date: Mon, 23 Aug 2010 18:39:48 +0530
> Subject: Re: Control the file splits size
> To: common-user@hadoop.apache.org
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path)
>
> isSplitable is the method you're looking for -- return false from
> it in your custom input format (derived from FileInputFormat or similar).
Re: Control the file splits size
Posted by Gang Luo <lg...@yahoo.com.cn>.
You'd better modify the RecordReader instead of focusing on the split size. If
there is some token indicating the boundary (start/end token) of an object, the
RecordReader will read the entire object once it sees the start token, even if
the object crosses the split boundary. If the reader sees only an end token
without a preceding start token, it will skip that incomplete object. This is
exactly like the reader in TextInputFormat, which reads one line at a time as a
record, even when a line is cut in two by a split boundary.
-Gang
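The rule described above (an object belongs to the split that contains its start token, and the reader runs past the split end to finish it) can be sketched in plain Java. This is a simplified model, not a Hadoop RecordReader: a String stands in for the byte stream, '<' and '>' are assumed start/end tokens that never occur inside an object, and the names TokenBoundaryReader and readSplit are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenBoundaryReader {
    static final char START = '<';
    static final char END = '>';

    /** Returns the objects whose start token lies inside [splitStart, splitEnd). */
    static List<String> readSplit(String data, int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (pos < splitEnd) {
            // Searching forward skips the tail of an object started in the previous split.
            int open = data.indexOf(START, pos);
            if (open < 0 || open >= splitEnd) {
                break;  // the next object belongs to a later split
            }
            int close = data.indexOf(END, open);  // may lie beyond splitEnd
            if (close < 0) {
                break;  // truncated object at the end of the data
            }
            records.add(data.substring(open + 1, close));
            pos = close + 1;
        }
        return records;
    }
}
```

With this rule the split size no longer has to match the object size: for the data "<aaa><bbbb><cc>" cut at offset 8, the first split yields aaa and bbbb (reading past the cut), and the second split yields only cc.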
----- Original Message ----
From: Teodor Macicas <te...@epfl.ch>
To: common-user@hadoop.apache.org
Sent: 2010/8/23 (Mon) 12:35:25 PM
Subject: Re: Control the file splits size
Thank you. But I think this won't help me. I want to split the big
input, but to control where the file will be split. I have some
objects in this file and I want to be sure that each object will be
entirely in one split.
Does that make sense to you?
Best,
Teodor
Re: Control the file splits size
Posted by Teodor Macicas <te...@epfl.ch>.
Thank you. But I think this won't help me. I want to split the big
input, but to control where the file will be split. I have some
objects in this file and I want to be sure that each object will be
entirely in one split.
Does that make sense to you?
Best,
Teodor
On 08/23/2010 03:09 PM, Harsh J wrote:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path)
>
> isSplitable is the method you're looking for -- return false from
> it in your custom input format (derived from FileInputFormat or similar).
Re: Control the file splits size
Posted by Harsh J <qw...@gmail.com>.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path)
isSplitable is the method you're looking for -- return false from
it in your custom input format (derived from FileInputFormat or similar).
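To make the effect of that flag concrete, here is a simplified model in plain Java of how a file is carved into splits when it is or is not splittable. It is only a sketch: SplitPlanner and plan are hypothetical names, the even cutting is an assumption, and the real FileInputFormat logic has more cases than this.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlanner {
    /** Returns {offset, length} pairs describing the splits of one file. */
    static List<long[]> plan(long fileLength, long splitSize, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (!splittable || fileLength <= splitSize) {
            // The whole file goes to a single map task, so no object is ever cut.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return splits;
    }
}
```

Returning false keeps every object intact, but at the cost of parallelism: one big file is then processed by one mapper.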
On Mon, Aug 23, 2010 at 4:08 PM, Teodor Macicas <te...@epfl.ch> wrote:
> Hi all,
>
> Can anyone please tell me how to control the split size? I have one big
> file which will be split according to the number of maps. The input file is binary
> and contains some objects. I do not want an object to be cut across 2 separate
> splits, for sure.
> I overrode the computeSplitSize() method and forced the size to be a
> multiple of my object size. It worked, but it seems that at certain points
> of the output file objects are missing. And now I am thinking that the
> splitting could be my problem.
>
> Has anyone faced this problem before?
>
> Thank you.
> Regards,
> Teodor
>
--
Harsh J
www.harshj.com