Posted to mapreduce-user@hadoop.apache.org by Narayanan K <kn...@gmail.com> on 2011/07/01 07:58:22 UTC

[Doubt]: Submission of Mapreduce from outside Hadoop Cluster

Hi all,


We are basically working on a research project and I require some help
regarding this.



I have a few basic doubts regarding the submission of MapReduce jobs in Hadoop.



   1. How do I submit a MapReduce job from outside the cluster, i.e. from a
   different machine outside the Hadoop cluster?
   2. If the above can be done, how can I schedule MapReduce jobs to run in
   Hadoop, like crontab, from a different machine?
   3. Are there any webservice APIs that I can leverage to access a Hadoop
   cluster from outside and submit jobs or read/write data from HDFS?


Many Thanks,

Narayanan

Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

Posted by Yaozhen Pan <it...@gmail.com>.
Narayanan,

Regarding the client installation, you should make sure that the client and
the server use the same Hadoop version for submitting jobs and transferring
data.
If you use a different user on the client than the one that runs the Hadoop
jobs, configure the Hadoop UGI property (sorry, I forget the exact name).
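
For illustration only, a rough sketch of what setting that property could look
like on an old, pre-security Hadoop release (where it was likely called
hadoop.job.ugi); the user and group values are hypothetical placeholders:

    import org.apache.hadoop.conf.Configuration;

    public class UgiConfExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // On pre-security (0.20-era) Hadoop the submitting identity could be
            // overridden with "hadoop.job.ugi" ("user,group1,group2,...").
            // The values below are placeholders.
            conf.set("hadoop.job.ugi", "hadoopuser,hadoopgroup");
            // ...pass this conf to the Job/JobClient that submits the job.
        }
    }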

On Jul 1, 2011, at 15:28, "Narayanan K" <kn...@gmail.com> wrote:
> Hi Harsh
>
> Thanks for the quick response...
>
> I have a few clarifications regarding the first point:
>
> Let me give the background first.
>
> We have actually set up a Hadoop cluster with HBase installed. We are
> planning to load HBase with data, perform some computations on the data,
> and present the data in a report format.
> The report should be accessible from outside the cluster and accepts
> certain parameters to show data; these parameters will in turn be passed
> on to the Hadoop master server, where a MapReduce job will be run that
> queries HBase to retrieve the data.
>
> So the report will be run from a different machine outside the cluster.
> We need a way to pass the parameters to the Hadoop cluster (master) and
> initiate a MapReduce job dynamically. Similarly, the output of the
> MapReduce job needs to be tunneled back to the machine from where the
> report was run.
>
> One more clarification I need: does the machine (outside the cluster)
> which runs the report require something like a client installation that
> will talk with the Hadoop master server via TCP? Or can it run a job on
> the Hadoop cluster by using passwordless SCP to the master machine, or
> something of the like?
>
>
> Regards,
> Narayanan
>
>
>
>
> On Fri, Jul 1, 2011 at 11:41 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Narayanan,
>>
>>
>> On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K <kn...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > We are basically working on a research project and I require some help
>> > regarding this.
>>
>> Always glad to see research work being done! What're you working on? :)
>>
>> > How do I submit a MapReduce job from outside the cluster, i.e. from a
>> > different machine outside the Hadoop cluster?
>>
>> If you use the Java APIs, use the Job#submit(…) method and/or the
>> JobClient.runJob(…) method.
>> Basically, Hadoop will try to create a jar with all requisite classes
>> within and will push it out to the JobTracker's filesystem (HDFS, if
>> you run HDFS). From there on, it's like a regular operation.
>>
>> This even happens on the Hadoop nodes themselves, so doing it from an
>> external place, as long as that place has access to Hadoop's JT and
>> HDFS, should be no different at all.
>>
>> If you are packing custom libraries along, don't forget to use
>> DistributedCache. If you are packing custom MR Java code, don't forget
>> to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
>> API methods.
>>
>> > If the above can be done, how can I schedule MapReduce jobs to run in
>> > Hadoop, like crontab, from a different machine?
>> > Are there any webservice APIs that I can leverage to access a Hadoop
>> > cluster from outside and submit jobs or read/write data from HDFS?
>>
>> For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
>> It is well supported and is very useful for writing MR workflows (which
>> is a common requirement). You also get coordinator features and can
>> schedule jobs with crontab-like functionality.
>>
>> For HDFS r/w over the web, I'm not sure of an existing web app specifically
>> for this purpose without limitations, but there is a contrib/thriftfs
>> you can leverage (if you're not writing your own webserver in Java, in
>> which case it's as simple as using the HDFS APIs).
>>
>> Also have a look at the pretty mature Hue project which aims to
>> provide a great frontend that lets you design jobs, submit jobs,
>> monitor jobs and upload files or browse the filesystem (among several
>> other things): http://cloudera.github.com/hue/
>>
>> --
>> Harsh J
>>

Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

Posted by Harsh J <ha...@cloudera.com>.
Narayanan,

On Fri, Jul 1, 2011 at 12:57 PM, Narayanan K <kn...@gmail.com> wrote:
> So the report will be run from a different machine outside the cluster.
> We need a way to pass the parameters to the Hadoop cluster (master) and
> initiate a MapReduce job dynamically. Similarly, the output of the
> MapReduce job needs to be tunneled back to the machine from where the
> report was run.
>
> One more clarification I need: does the machine (outside the cluster)
> which runs the report require something like a client installation that
> will talk with the Hadoop master server via TCP? Or can it run a job on
> the Hadoop cluster by using passwordless SCP to the master machine, or
> something of the like?

The regular way is to let the client talk to your nodes over TCP ports.
This is what Hadoop's plain ol' submitter process does for you.

Have you tried running any simple "hadoop jar <your jar>" from a
remote client machine?

If that works, so should invoking the same from your code (with
appropriate configurations set), because it's basically the plain ol'
RunJar submission process either way.

If not, maybe you need to think of opening ports to let things happen
(if there's a firewall here).
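
As a rough sketch of what "appropriate configurations set" could mean for a
pre-YARN cluster like this one: point the client's Configuration at the
NameNode and JobTracker. The hostnames and ports below are placeholders, not
values from this thread:

    import org.apache.hadoop.conf.Configuration;

    public class RemoteClusterConf {
        public static Configuration create() {
            Configuration conf = new Configuration();
            // Placeholder endpoints: substitute your cluster's NameNode and
            // JobTracker addresses, or simply put the cluster's core-site.xml
            // and mapred-site.xml on the client's classpath instead.
            conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
            conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
            return conf;
        }
    }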

Hadoop does not use SSH/SCP to move code around. Please give this a
read if you believe you're confused about how SSH+Hadoop is integrated
(or not): http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

-- 
Harsh J

Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

Posted by Narayanan K <kn...@gmail.com>.
Hi Harsh

Thanks for the quick response...

I have a few clarifications regarding the first point:

Let me give the background first.

We have actually set up a Hadoop cluster with HBase installed. We are
planning to load HBase with data, perform some computations on the data,
and present the data in a report format.
The report should be accessible from outside the cluster and accepts
certain parameters to show data; these parameters will in turn be passed
on to the Hadoop master server, where a MapReduce job will be run that
queries HBase to retrieve the data.

So the report will be run from a different machine outside the cluster.
We need a way to pass the parameters to the Hadoop cluster (master) and
initiate a MapReduce job dynamically. Similarly, the output of the
MapReduce job needs to be tunneled back to the machine from where the
report was run.

One more clarification I need: does the machine (outside the cluster)
which runs the report require something like a client installation that
will talk with the Hadoop master server via TCP? Or can it run a job on
the Hadoop cluster by using passwordless SCP to the master machine, or
something of the like?


Regards,
Narayanan




On Fri, Jul 1, 2011 at 11:41 AM, Harsh J <ha...@cloudera.com> wrote:

> Narayanan,
>
>
> On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K <kn...@gmail.com>
> wrote:
> > Hi all,
> >
> > We are basically working on a research project and I require some help
> > regarding this.
>
> Always glad to see research work being done! What're you working on? :)
>
> > How do I submit a MapReduce job from outside the cluster, i.e. from a
> > different machine outside the Hadoop cluster?
>
> If you use the Java APIs, use the Job#submit(…) method and/or the
> JobClient.runJob(…) method.
> Basically, Hadoop will try to create a jar with all requisite classes
> within and will push it out to the JobTracker's filesystem (HDFS, if
> you run HDFS). From there on, it's like a regular operation.
>
> This even happens on the Hadoop nodes themselves, so doing it from an
> external place, as long as that place has access to Hadoop's JT and
> HDFS, should be no different at all.
>
> If you are packing custom libraries along, don't forget to use
> DistributedCache. If you are packing custom MR Java code, don't forget
> to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
> API methods.
>
> > If the above can be done, how can I schedule MapReduce jobs to run in
> > Hadoop, like crontab, from a different machine?
> > Are there any webservice APIs that I can leverage to access a Hadoop
> > cluster from outside and submit jobs or read/write data from HDFS?
>
> For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
> It is well supported and is very useful for writing MR workflows (which
> is a common requirement). You also get coordinator features and can
> schedule jobs with crontab-like functionality.
>
> For HDFS r/w over the web, I'm not sure of an existing web app specifically
> for this purpose without limitations, but there is a contrib/thriftfs
> you can leverage (if you're not writing your own webserver in Java, in
> which case it's as simple as using the HDFS APIs).
>
> Also have a look at the pretty mature Hue project which aims to
> provide a great frontend that lets you design jobs, submit jobs,
> monitor jobs and upload files or browse the filesystem (among several
> other things): http://cloudera.github.com/hue/
>
> --
> Harsh J
>

Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

Posted by Harsh J <ha...@cloudera.com>.
Narayanan,


On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K <kn...@gmail.com> wrote:
> Hi all,
>
> We are basically working on a research project and I require some help
> regarding this.

Always glad to see research work being done! What're you working on? :)

> How do I submit a MapReduce job from outside the cluster, i.e. from a
> different machine outside the Hadoop cluster?

If you use the Java APIs, use the Job#submit(…) method and/or the
JobClient.runJob(…) method.
Basically, Hadoop will try to create a jar with all requisite classes
within and will push it out to the JobTracker's filesystem (HDFS, if
you run HDFS). From there on, it's like a regular operation.

This even happens on the Hadoop nodes themselves, so doing it from an
external place, as long as that place has access to Hadoop's JT and
HDFS, should be no different at all.

If you are packing custom libraries along, don't forget to use
DistributedCache. If you are packing custom MR Java code, don't forget
to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
API methods.
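
As an illustration of the above, a minimal remote-submission sketch against
the new (org.apache.hadoop.mapreduce) API; the input/output paths are
placeholders, and the identity Mapper/Reducer base classes are used only to
keep the sketch self-contained:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RemoteSubmit {
        public static void main(String[] args) throws Exception {
            // conf must point at the cluster's JobTracker and NameNode,
            // e.g. via *-site.xml files on the client's classpath.
            Configuration conf = new Configuration();
            Job job = new Job(conf, "remote-submit-sketch");
            // Ship the jar that contains this class to the cluster.
            job.setJarByClass(RemoteSubmit.class);
            // Identity mapper/reducer so the sketch compiles on its own;
            // a real job would set its own classes here.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            // With the default TextInputFormat, identity map output is
            // (LongWritable offset, Text line).
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/tmp/sketch-input"));    // placeholder
            FileOutputFormat.setOutputPath(job, new Path("/tmp/sketch-output")); // placeholder
            job.submit();                   // returns immediately;
            // job.waitForCompletion(true); // or block and print progress
        }
    }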

> If the above can be done, how can I schedule MapReduce jobs to run in
> Hadoop, like crontab, from a different machine?
> Are there any webservice APIs that I can leverage to access a Hadoop cluster
> from outside and submit jobs or read/write data from HDFS?

For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
It is well supported and is very useful for writing MR workflows (which
is a common requirement). You also get coordinator features and can
schedule jobs with crontab-like functionality.
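
For reference, a hedged sketch of kicking off a workflow through Oozie's Java
client API (OozieClient); the server URL, application path, and the
nameNode/jobTracker parameter names are placeholders that depend on how the
(hypothetical) workflow app is written:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class OozieSubmitSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder Oozie server URL.
            OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");
            Properties props = oozie.createConfiguration();
            // Placeholder HDFS path of a deployed workflow/coordinator application.
            props.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode.example.com:9000/apps/report-workflow");
            // Parameters referenced by the hypothetical workflow definition.
            props.setProperty("nameNode", "hdfs://namenode.example.com:9000");
            props.setProperty("jobTracker", "jobtracker.example.com:9001");
            String jobId = oozie.run(props);   // submit and start the job
            System.out.println("Submitted Oozie job: " + jobId);
        }
    }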

For HDFS r/w over the web, I'm not sure of an existing web app specifically
for this purpose without limitations, but there is a contrib/thriftfs
you can leverage (if you're not writing your own webserver in Java, in
which case it's as simple as using the HDFS APIs).
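
And a small sketch of that "just use the HDFS APIs" route from a remote Java
client; the file path and contents are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            // conf must carry fs.default.name for the remote NameNode,
            // e.g. via core-site.xml on the client's classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file (placeholder path and contents).
            Path path = new Path("/tmp/report-params.txt");
            FSDataOutputStream out = fs.create(path, true);
            out.writeBytes("param1=value1\n");
            out.close();

            // Read it back line by line.
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
            fs.close();
        }
    }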

Also have a look at the pretty mature Hue project which aims to
provide a great frontend that lets you design jobs, submit jobs,
monitor jobs and upload files or browse the filesystem (among several
other things): http://cloudera.github.com/hue/

-- 
Harsh J