Posted to common-user@hadoop.apache.org by Amit_Gupta <am...@gmail.com> on 2008/11/11 19:50:35 UTC
Hadoop Streaming - running a jar file
Hi
I have a jar file which takes input from stdin and writes something on
stdout. i.e. When I run
java -jar A.jar < input
It prints the required output.
However, when I run it as a mapper in hadoop streaming using the command
$HADOOP_HOME/bin/hadoop jar ....streaming.jar -input .. -output ... -mapper
'java -jar A.jar' -reducer NONE
I get a broken pipe exception.
The error message is:
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar:
[/mnt/hadoop/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-hadoop/hadoop-unjar45410/]
[] /tmp/streamjob45411.jar tmpDir=null
08/11/11 23:20:14 INFO mapred.FileInputFormat: Total input paths to process
: 1
08/11/11 23:20:14 INFO streaming.StreamJob: getLocalDirs():
[/mnt/hadoop/HADOOP/hadoop-0.16.3/tmp/mapred]
08/11/11 23:20:14 INFO streaming.StreamJob: Running job:
job_200811111724_0014
08/11/11 23:20:14 INFO streaming.StreamJob: To kill this job, run:
08/11/11 23:20:14 INFO streaming.StreamJob:
/mnt/hadoop/HADOOP/hadoop-0.16.3/bin/../bin/hadoop job
-Dmapred.job.tracker=10.105.41.25:54311 -kill job_200811111724_0014
08/11/11 23:20:15 INFO streaming.StreamJob: Tracking URL:
http://sayali:50030/jobdetails.jsp?jobid=job_200811111724_0014
08/11/11 23:20:16 INFO streaming.StreamJob: map 0% reduce 0%
08/11/11 23:21:00 INFO streaming.StreamJob: map 100% reduce 100%
08/11/11 23:21:00 INFO streaming.StreamJob: To kill this job, run:
08/11/11 23:21:00 INFO streaming.StreamJob:
/mnt/hadoop/HADOOP/hadoop-0.16.3/bin/../bin/hadoop job
-Dmapred.job.tracker=10.105.41.25:54311 -kill job_200811111724_0014
08/11/11 23:21:00 INFO streaming.StreamJob: Tracking URL:
http://sayali:50030/jobdetails.jsp?jobid=job_200811111724_0014
08/11/11 23:21:00 ERROR streaming.StreamJob: Job not Successful!
08/11/11 23:21:00 INFO streaming.StreamJob: killJob...
Streaming Job Failed!
Could someone please help me with any ideas or pointers?
regards
Amit
--
View this message in context: http://www.nabble.com/Hadoop-Streaming----running-a-jar-file-tp20445877p20445877.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: Best practice for using third party libraries in MapReduce Jobs?
Posted by Johannes Zillmann <jz...@101tec.com>.
You could use the DistributedCache to put multiple jars onto the
classpath. Of course, you would have to write your own job-submission
logic for that....
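A minimal sketch of what that job-submission logic could look like, using the old org.apache.hadoop.filecache.DistributedCache API from releases of that era (the HDFS jar paths and class names here are made-up placeholders, not anything from this thread):

```java
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class JobWithExtraJars {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JobWithExtraJars.class);
        // Each call adds one jar (which must already be in HDFS) to the
        // classpath of every map and reduce task. Paths are hypothetical.
        DistributedCache.addFileToClassPath(new Path("/libs/thirdpartyjar1.jar"), conf);
        DistributedCache.addFileToClassPath(new Path("/libs/jar2.jar"), conf);
        // ... then set input/output paths and mapper/reducer classes,
        // and submit with JobClient.runJob(conf);
    }
}
```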
Johannes
On Dec 3, 2008, at 3:19 PM, Scott Whitecross wrote:
> What's the best way to use third party libraries with Hadoop? For
> example, I want to run a job with both a jar file containing the
> job, and also extra libraries. I noticed a couple of solutions with a
> search, but I'm hoping for something better:
>
> - Merge the third party jar libraries into the job jar
> - Distribute the third party libraries across the cluster in the
> local boxes classpath.
>
> What I'd really like is a way to add an extra option to the hadoop
> jar command, such as hadoop/bin/hadoop jar myJar.jar myJobClass
> -classpath thirdpartyjar1.jar:jar2.jar:etc args
>
> Does anything like this exist?
>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec GmbH
Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com
Re: Best practice for using third party libraries in MapReduce Jobs?
Posted by tim robertson <ti...@gmail.com>.
Exactly. I'm no expert on Maven either, but I like its convenience for
classpath handling.
Attached are my scripts.
- Hadoop-installer allows me to install different versions of Hadoop
into the local repo
- The pom has an assembly plugin (change mainClass and packageName to
your target)
- The assembly does the packaging. Run it with:
  mvn assembly:assembly -Dmaven.test.skip=true
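For reference, a pom fragment along those lines might look like the following. This is a sketch, not the files Tim attached (attachments don't survive the archive); the mainClass value is a placeholder:

```xml
<!-- Sketch of an assembly setup as described above; values are placeholders. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- Builds a single jar containing all dependencies -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.example.MyJobClass</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
```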
The way I work: I manage all dependencies in the pom and use "mvn
eclipse:eclipse" to keep the Eclipse build path correct. Then I run
everything in Eclipse with small input files until I am happy that it
works, build the jar with dependencies, and copy it up to EC2 to run
on the cluster. It might not be the best way, but it seems fairly
efficient for me.
Cheers,
Tim
On Wed, Dec 3, 2008 at 10:42 PM, Scott Whitecross <sc...@dataxu.com> wrote:
> Thanks Tim.
>
> We use Maven, though I'm not an expert on it. So you are using Maven
> to collect the dependencies and package them into one large jar?
> I'm assuming it unjars the contents of each dependency and bundles
> them with your code?
>
>
> On Dec 3, 2008, at 9:25 AM, tim robertson wrote:
>
>> Can't answer your question exactly, but can let you know what I do.
>>
>> I build all dependencies into one jar, and by using Maven for my build
>> environment, when I assemble my jar I am 100% sure all my
>> dependencies are collected together. This is working very nicely for
>> me, and I have used the same scripts for around 20 different jars that
>> I run on EC2 - each had different dependencies which would have been a
>> pain to manage separately, but Maven simplifies this massively.
>>
>> Let me know if you want any of my maven config for assembly etc if you
>> are a maven user...
>>
>> Cheers,
>>
>> Tim
Re: Best practice for using third party libraries in MapReduce Jobs?
Posted by tim robertson <ti...@gmail.com>.
I can't answer your question exactly, but I can let you know what I do.
I build all dependencies into one jar, and by using Maven for my build
environment, when I assemble my jar I am 100% sure all my
dependencies are collected together. This is working very nicely for
me, and I have used the same scripts for around 20 different jars that
I run on EC2 - each had different dependencies which would have been a
pain to manage separately, but Maven simplifies this massively.
Let me know if you want any of my maven config for assembly etc if you
are a maven user...
Cheers,
Tim
On Wed, Dec 3, 2008 at 3:19 PM, Scott Whitecross <sc...@dataxu.com> wrote:
> What's the best way to use third party libraries with Hadoop? For example,
> I want to run a job with both a jar file containing the job, and also extra
> libraries. I noticed a couple of solutions with a search, but I'm hoping for
> something better:
>
> - Merge the third party jar libraries into the job jar
> - Distribute the third party libraries across the cluster in the local boxes
> classpath.
>
> What I'd really like is a way to add an extra option to the hadoop jar
> command, such as hadoop/bin/hadoop jar myJar.jar myJobClass -classpath
> thirdpartyjar1.jar:jar2.jar:etc args
>
> Does anything like this exist?
>
Best practice for using third party libraries in MapReduce Jobs?
Posted by Scott Whitecross <sc...@dataxu.com>.
What's the best way to use third party libraries with Hadoop? For
example, I want to run a job with both a jar file containing the job,
and also extra libraries. I noticed a couple of solutions with a search,
but I'm hoping for something better:
- Merge the third party jar libraries into the job jar
- Distribute the third party libraries across the cluster in the local
boxes classpath.
What I'd really like is a way to add an extra option to the hadoop jar
command, such as hadoop/bin/hadoop jar myJar.jar myJobClass -classpath
thirdpartyjar1.jar:jar2.jar:etc args
Does anything like this exist?
Re: Hadoop Streaming - running a jar file
Posted by Milind Bhandarkar <mi...@yahoo-inc.com>.
You should specify A.jar on the bin/hadoop command line with "-file A.jar",
so that streaming knows to copy that file to the tasktracker nodes.
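In other words, something like the following (sketching against the command from the original post; the elided streaming jar path and the input/output arguments are left as placeholders):

```shell
# Same command as before, plus -file so streaming ships A.jar to each
# tasktracker node, where the mapper command can find it in the task's
# working directory.
$HADOOP_HOME/bin/hadoop jar ....streaming.jar \
    -input .. -output ... \
    -mapper 'java -jar A.jar' \
    -reducer NONE \
    -file A.jar
```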
- milind
On 11/11/08 10:50 AM, "Amit_Gupta" <am...@gmail.com> wrote:
>
>
> Hi
>
> I have a jar file which takes input from stdin and writes something on
> stdout. i.e. When I run
>
> java -jar A.jar < input
>
> It prints the required output.
>
> However, when I run it as a mapper in hadoop streaming using the command
>
> $HADOOP_HOME/bin/hadoop jar ....streaming.jar -input .. -output ... -mapper
> 'java -jar A.jar' -reducer NONE
>
> i get the broken pipe exception.
>
--
Milind Bhandarkar
Y!IM: GridSolutions
408-349-2136
(milindb@yahoo-inc.com)