Posted to mapreduce-user@hadoop.apache.org by Mohammad Tariq <do...@gmail.com> on 2012/07/10 15:01:42 UTC

WholeFileInputFormat format

Hello list,

       What is the approximate maximum size of the files that can be
handled using WholeFileInputFormat? I mean, if the file is very big,
is it feasible to use WholeFileInputFormat, given that the entire load
will go to one mapper? Many thanks.

Regards,
    Mohammad Tariq

Re: WholeFileInputFormat format

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Harsh,

          Does Hadoop 0.20.205.0 (new API) have Avro support?

Regards,
    Mohammad Tariq



Re: WholeFileInputFormat format

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Harsh,

          I am sorry to pester you with questions, but I am somewhat
stuck. I have to write my MapReduce job such that the outputs from
both mappers are compared in order. That is, in one mapper I read a
line from the file and extract the desired fields, and in the second
mapper I read the values from the HBase table, which must then be
compared with the fields read in the first mapper. I am wondering how
to achieve that, since the reduce phase does not start until all the
mappers are done.
          Perhaps some elaboration of my use case will make the
problem clearer. I have a file that contains several fields, and I
have created columns for these fields in my HBase table. I extract
the value of each field from the file and store it in the
corresponding HBase column. I also have a 'support file' for the same
file, whose values are already stored in HBase but in a totally
different format. However, the order of fields in the original file
and the order of lines (containing the corresponding fields) in the
support file are exactly the same. So I am trying to read one line
from the support file and extract the field of interest in one
mapper, read the same field from the HBase table in the second
mapper, and send both values to the reducer, where the comparison
concludes the test.
          Please help me out with your guidance; being a novice, I am
not able to tackle the situation on my own. (Pardon my ignorance.)

Many thanks.

Regards,
    Mohammad Tariq
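One point worth noting about the ordering worry above: the reducer
does not need the mappers to run in any particular order. If both
mappers emit the same key for the two records that must be compared
(for instance a field name or a line number), the shuffle groups both
values under that key no matter when each mapper ran. A
dependency-free sketch of that grouping (the keys and values here are
illustrative, not from the actual job):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// The reducer cannot rely on mapper ordering, but it does not need to:
// if both mappers emit the same key for records that belong together,
// the shuffle delivers both values to a single reduce() call.
public class KeyedPairing {

    // Sketch: group emitted (key, value) pairs the way the shuffle would.
    static Map<String, List<String>> shuffle(List<String[]> emitted) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : emitted)
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> emitted = Arrays.asList(
                new String[]{"age", "file:31"},   // from the file-side mapper
                new String[]{"name", "file:bob"},
                new String[]{"age", "hbase:31"},  // from the HBase-side mapper
                new String[]{"name", "hbase:bob"});
        // Each key now carries exactly the two values to compare.
        System.out.println(shuffle(emitted));
    }
}
```

Each reduce() call then sees both the file value and the HBase value
for one field, regardless of which mapper finished first.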



Re: WholeFileInputFormat format

Posted by Harsh J <ha...@cloudera.com>.
I don't see why you'd have to use WholeFileInputFormat for such a
task. Your task is very similar to a join; see the section "General
reducer-side join" in Ricky's article,
http://horicky.blogspot.in/2010/08/designing-algorithmis-for-map-reduce.html,
for what your overall logic should look like.
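The reducer-side join that section describes boils down to: each
mapper tags its output values, both emit under a shared join key, and
the reducer separates the grouped values by tag and compares them. A
dependency-free sketch of that reduce step (the tags and key are
illustrative, not part of any Hadoop API):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the "general reducer-side join": the file-side mapper
// would emit ("FILE", value) and the HBase-side mapper ("HBASE",
// value) under the same key; the shuffle groups them, and reduce()
// compares the two tagged values.
public class ReduceJoinSketch {

    static String reduce(String key, List<String[]> taggedValues) {
        String fileVal = null, hbaseVal = null;
        for (String[] tv : taggedValues) {
            if (tv[0].equals("FILE")) fileVal = tv[1];
            else if (tv[0].equals("HBASE")) hbaseVal = tv[1];
        }
        boolean match = fileVal != null && fileVal.equals(hbaseVal);
        return key + ":" + (match ? "MATCH" : "MISMATCH");
    }

    public static void main(String[] args) {
        List<String[]> grouped = Arrays.asList(
                new String[]{"FILE", "42"}, new String[]{"HBASE", "42"});
        System.out.println(reduce("row-7", grouped));  // row-7:MATCH
    }
}
```

In the real job the tagging would live in the two map() methods (e.g.
via MultipleInputs), and reduce() would receive the grouped values
from the framework instead of a List built by hand.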




-- 
Harsh J

Re: WholeFileInputFormat format

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Harsh,

         Thank you so much for the quick response. I have a use case
in which I must compare values coming from two mappers in one
reducer, and for that I plan to use the MultipleInputs class. In one
mapper I read a text file (these files may contain 100,000 to 200,000
lines), and from each line I have to extract bytes 2-13, 20-25,
32-38, and so on. In the second mapper I read values from an HBase
table whose columns correspond to the fields I am reading from the
text file in the first mapper.
        In the reducer I have to compare the results coming from both
mappers and generate the final result. I need your guidance. Many
thanks.

Regards,
    Mohammad Tariq
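The fixed-width extraction described above (bytes 2-13, 20-25, 32-38
of each line) is the heart of the file-side mapper. A dependency-free
sketch of that per-line logic, assuming the ranges are 1-based and
inclusive (adjust the index arithmetic if the real spec is 0-based):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fixed-width extraction the file-side mapper would
// apply to each input line. The ranges come from the mail; treating
// them as 1-based inclusive character positions is an assumption.
public class FieldExtractor {

    static final int[][] RANGES = {{2, 13}, {20, 25}, {32, 38}};

    static List<String> extractFields(String line) {
        List<String> fields = new ArrayList<>();
        for (int[] r : RANGES) {
            // Convert 1-based inclusive positions to substring() indices.
            fields.add(line.substring(r[0] - 1, r[1]));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "0123456789012345678901234567890123456789";
        // Prints [123456789012, 901234, 1234567]
        System.out.println(extractFields(line));
    }
}
```

In the actual mapper this would run inside map(), with the extracted
fields emitted against whatever join key pairs them with the HBase
rows.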



Re: WholeFileInputFormat format

Posted by Harsh J <ha...@cloudera.com>.
It depends on what you need. If your file is not splittable, or if you
need to read the whole file in a single mapper (i.e. you do not
_want_ it to be split), then use a WholeFileInputFormat. Otherwise,
you get more parallelism with regular splitting.
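The non-splittable behaviour described here comes down to two pieces
in the Java API: an InputFormat whose isSplitable() returns false, and
a RecordReader that hands the entire file to the mapper as a single
record. The core of that record-reading step can be sketched without
any Hadoop dependencies (the class and method names below are
illustrative, not Hadoop's):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of what a WholeFileRecordReader does: one record per file,
// with the file's entire contents as the value.
public class WholeFileSketch {

    // In a real RecordReader this would fill a BytesWritable from an
    // FSDataInputStream; here we just slurp a local file.
    static byte[] readWholeFile(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("whole", ".txt");
        Files.write(tmp, "entire file becomes one record".getBytes());
        byte[] value = readWholeFile(tmp);
        // The single mapper receives every byte at once -- which is
        // why very large files make this format a poor fit.
        System.out.println(value.length);
        Files.delete(tmp);
    }
}
```

This also makes the size question above concrete: the whole file must
fit in one task's memory and be processed by one mapper, so the
practical limit is set by task heap and acceptable runtime, not by the
format itself.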




-- 
Harsh J