Posted to user@hadoop.apache.org by hitarth trivedi <t....@gmail.com> on 2015/01/05 21:55:14 UTC

Write and Read file through map reduce

Hi,

I have a 6-node cluster, and the scenario is as follows:

I have one map reduce job which will write file1 to HDFS.
I have another map reduce job which will write file2 to HDFS.
In the third map reduce job I need to use file1 and file2 to do some
computation and output the value.

What is the best way to store file1 and file2 in HDFS so that they can be
used in the third map reduce job?

Thanks,
Hitarth

Re: Write and Read file through map reduce

Posted by Raj K Singh <ra...@gmail.com>.
You can configure your third MapReduce job with MultipleInputs and read
both files into the job. If the files are small, you can consider the
DistributedCache, which will give you optimal performance if you are
joining the datasets of file1 and file2. I would also recommend using a
job-scheduling API such as Oozie to make sure the third job kicks off only
when file1 and file2 are available on HDFS (the same can be done with a
shell script or a JobControl implementation).
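
For the JobControl route, a minimal sketch (assuming job1, job2, and job3
are the already-configured Job instances for the three steps):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class Pipeline {
      public static void run(Job job1, Job job2, Job job3) throws Exception {
        ControlledJob cj1 = new ControlledJob(job1.getConfiguration());
        ControlledJob cj2 = new ControlledJob(job2.getConfiguration());
        ControlledJob cj3 = new ControlledJob(job3.getConfiguration());

        // Hold the third job back until both producers have completed.
        cj3.addDependingJob(cj1);
        cj3.addDependingJob(cj2);

        JobControl control = new JobControl("file1-file2-pipeline");
        control.addJob(cj1);
        control.addJob(cj2);
        control.addJob(cj3);

        // JobControl is a Runnable; drive it from its own thread and poll.
        new Thread(control).start();
        while (!control.allFinished()) {
          Thread.sleep(5000);
        }
        control.stop();
      }
    }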

::::::::::::::::::::::::::::::::::::::::
Raj K Singh
http://in.linkedin.com/in/rajkrrsingh
http://www.rajkrrsingh.blogspot.com
Mobile  Tel: +91 (0)9899821370

Re: Write and Read file through map reduce

Posted by Shahab Yunus <sh...@gmail.com>.
DistributedCache has been deprecated for a while. You can use the new
mechanism, which is functionally the same thing, discussed in this thread:
http://stackoverflow.com/questions/21239722/hadoop-distributedcache-is-deprecated-what-is-the-preferred-api
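
A minimal sketch of a map-side join on top of that replacement API; the
driver call is shown as a comment, and the HDFS path and the tab-separated
record layout are assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

      // Driver side (path is hypothetical):
      //   job.addCacheFile(new URI("/user/hitarth/out1/part-r-00000#file1"));
      // The '#file1' fragment makes the framework create a symlink named
      // "file1" in each task's working directory.

      private final Map<String, String> lookup = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("file1"))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] kv = line.split("\t", 2);  // assumes tab-separated records
            lookup.put(kv[0], kv.length > 1 ? kv[1] : "");
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String k = value.toString().split("\t", 2)[0];
        String match = lookup.get(k);
        if (match != null) {
          context.write(new Text(k), new Text(match));
        }
      }
    }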

Regards,
Shahab

Re: Write and Read file through map reduce

Posted by unmesha sreeveni <un...@gmail.com>.
Hi Hitarth,

If file1 and file2 are small, you can go with the distributed cache, as
mentioned in [1].

Or you can go with MultipleInputs, as mentioned in [2].

[1]
http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html
[2]
http://unmeshasreeveni.blogspot.in/2014/12/joining-two-files-using-multipleinput.html
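
A minimal driver fragment for the MultipleInputs route; File1Mapper,
File2Mapper, and the paths are hypothetical, and each input gets its own
mapper so records can be tagged for a reduce-side join:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // In the third job's driver: one mapper class per input.
    MultipleInputs.addInputPath(job, new Path("/user/hitarth/out1/file1"),
        TextInputFormat.class, File1Mapper.class);
    MultipleInputs.addInputPath(job, new Path("/user/hitarth/out2/file2"),
        TextInputFormat.class, File2Mapper.class);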


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: Write and Read file through map reduce

Posted by Ted Yu <yu...@gmail.com>.
Hitarth:
You can also consider MultiFileInputFormat (and its concrete
implementations).
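
MultiFileInputFormat is part of the old mapred API; in the newer mapreduce
API the closest concrete implementation is CombineTextInputFormat. A rough
driver fragment (the paths and split size are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Pack blocks from many files into fewer splits so a handful of
    // mappers can cover both inputs.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 134217728L); // 128 MB
    FileInputFormat.addInputPath(job, new Path("/user/hitarth/out1"));
    FileInputFormat.addInputPath(job, new Path("/user/hitarth/out2"));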

Cheers

Re: Write and Read file through map reduce

Posted by Corey Nolet <cj...@gmail.com>.
Hitarth,

I don't know how much direction you are looking for with regard to the
formats of the files, but you can certainly read both files into the third
mapreduce job with FileInputFormat by comma-separating the paths to the
files. The blocks of both files will essentially be unioned together and
the mappers scheduled across your cluster.
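
For example, a driver fragment along those lines (the paths are
hypothetical):

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Comma-separated paths: both files feed the same mapper class, and
    // their splits are simply scheduled side by side.
    FileInputFormat.addInputPaths(job,
        "/user/hitarth/out1/file1,/user/hitarth/out2/file2");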
