Posted to user@hadoop.apache.org by Siddharth Dawar <si...@gmail.com> on 2016/06/07 09:17:29 UTC
How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2
Hi,
I wrote a program which creates MapReduce jobs in an iterative fashion, as follows:

while (true) {
    JobConf conf2 = new JobConf(getConf(), graphMining.class);
    conf2.setJobName("sid");
    conf2.setMapperClass(mapperMiner.class);
    conf2.setReducerClass(reducerMiner.class);
    conf2.setInputFormat(SequenceFileInputFormat.class);
    conf2.setOutputFormat(SequenceFileOutputFormat.class);
    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(MapWritable.class);
    conf2.setOutputKeyClass(Text.class);
    conf2.setOutputValueClass(BytesWritable.class);
    conf2.setNumMapTasks(Integer.parseInt(args[3]));
    conf2.setNumReduceTasks(Integer.parseInt(args[4]));
    FileInputFormat.addInputPath(conf2, new Path(input));
    FileOutputFormat.setOutputPath(conf2, new Path(output));
    RunningJob job = JobClient.runJob(conf2);
}
Now, I want the first job to write something to the distributed cache, and the jobs created after the first job to read from the distributed cache.

I learned that the DistributedCache.addCacheFile() method is deprecated, and the documentation suggests using the per-job Job.addCacheFile() method instead.

However, I am unable to get a handle on the currently running job, because JobClient.runJob(conf2) submits the job internally.

How can I make the content written by the first job in this while loop available, via the distributed cache, to the jobs created in later iterations of the loop?
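Since JobClient.runJob() leaves no point at which cache files can be registered, one option is to port the driver to the newer org.apache.hadoop.mapreduce API, where cache files are added to the Job object before it is submitted. A minimal sketch of that idea, under assumptions not in the original post: the HDFS path /tmp/shared/first-job-output.seq is hypothetical, and the first iteration is assumed to have written that file before later iterations register it.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical HDFS path: written by the first iteration, consumed by
// every later iteration through the distributed cache.
Path sharedFile = new Path("/tmp/shared/first-job-output.seq");

boolean firstIteration = true;
while (true) {
    Job job = Job.getInstance(getConf(), "sid");
    job.setJarByClass(graphMining.class);
    job.setMapperClass(mapperMiner.class);
    job.setReducerClass(reducerMiner.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(MapWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    job.setNumReduceTasks(Integer.parseInt(args[4]));

    if (!firstIteration) {
        // Cache files must be registered before submission; each task
        // of this job then sees a localized copy of the file.
        job.addCacheFile(sharedFile.toUri());
    }

    FileInputFormat.addInputPath(job, new Path(input));
    FileOutputFormat.setOutputPath(job, new Path(output));
    job.waitForCompletion(true);
    firstIteration = false;
}
```

Note that in the new API the number of map tasks is derived from the input splits, so setNumMapTasks has no direct equivalent; setNumReduceTasks is still available on Job.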
Re: How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2
Posted by Siddharth Dawar <si...@gmail.com>.
Hi Arun,
Thanks for your prompt reply. Actually, I want to add cache files to the job that runs internally inside JobClient.runJob(conf2), but I am unable to find a way to get a handle on that running job.

The method Job.getInstance(conf) creates a new job, whereas I want to add files to the job that is already running.
On Tue, Jun 7, 2016 at 6:36 PM, Arun Natva <ar...@gmail.com> wrote:
> If you use an instance of the Job class, you can add files to the
> distributed cache like this:
> Job job = Job.getInstance(conf);
> job.addCacheFile(new Path(filepath).toUri());
Re: How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2
Posted by Arun Natva <ar...@gmail.com>.
If you use an instance of the Job class, you can add files to the distributed cache like this:

Job job = Job.getInstance(conf);
job.addCacheFile(new Path(filepath).toUri());
Sent from my iPhone
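For completeness, the task side can then read the cached file, typically in setup(). A sketch assuming the new-API Mapper; the class name, key/value types, and file contents are hypothetical:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of reading a file that the driver registered with
// job.addCacheFile(...). Registered files are localized into the
// task's working directory, so they can be opened by base name.
public class MinerMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        URI[] cached = context.getCacheFiles();  // URIs registered by the driver
        if (cached != null && cached.length > 0) {
            String localName = new Path(cached[0].getPath()).getName();
            try (BufferedReader in = new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // ... load the shared data into memory for use in map() ...
                }
            }
        }
    }
}
```

If the cached file is a SequenceFile, as in the original driver, it would instead be opened with SequenceFile.Reader against the local file system rather than read line by line.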