Posted to common-user@hadoop.apache.org by Amit_Gupta <am...@gmail.com> on 2008/11/11 19:50:35 UTC

Hadoop Streaming - running a jar file

Hi

I have a jar file that takes input from stdin and writes its output to
stdout, i.e. when I run

java -jar A.jar < input

it prints the required output.

However, when I run it as a mapper in Hadoop Streaming using the command

$HADOOP_HOME/bin/hadoop jar ....streaming.jar -input .. -output ...  -mapper
'java -jar A.jar'  -reducer NONE

I get a broken-pipe exception.


The error message is:

additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/mnt/hadoop/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-hadoop/hadoop-unjar45410/] [] /tmp/streamjob45411.jar tmpDir=null
08/11/11 23:20:14 INFO mapred.FileInputFormat: Total input paths to process : 1
08/11/11 23:20:14 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/HADOOP/hadoop-0.16.3/tmp/mapred]
08/11/11 23:20:14 INFO streaming.StreamJob: Running job: job_200811111724_0014
08/11/11 23:20:14 INFO streaming.StreamJob: To kill this job, run:
08/11/11 23:20:14 INFO streaming.StreamJob: /mnt/hadoop/HADOOP/hadoop-0.16.3/bin/../bin/hadoop job -Dmapred.job.tracker=10.105.41.25:54311 -kill job_200811111724_0014
08/11/11 23:20:15 INFO streaming.StreamJob: Tracking URL: http://sayali:50030/jobdetails.jsp?jobid=job_200811111724_0014
08/11/11 23:20:16 INFO streaming.StreamJob:  map 0%  reduce 0%
08/11/11 23:21:00 INFO streaming.StreamJob:  map 100%  reduce 100%
08/11/11 23:21:00 INFO streaming.StreamJob: To kill this job, run:
08/11/11 23:21:00 INFO streaming.StreamJob: /mnt/hadoop/HADOOP/hadoop-0.16.3/bin/../bin/hadoop job -Dmapred.job.tracker=10.105.41.25:54311 -kill job_200811111724_0014
08/11/11 23:21:00 INFO streaming.StreamJob: Tracking URL: http://sayali:50030/jobdetails.jsp?jobid=job_200811111724_0014
08/11/11 23:21:00 ERROR streaming.StreamJob: Job not Successful!
08/11/11 23:21:00 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

Could someone please help me with any ideas or pointers?

regards
Amit




Re: Best practice for using third party libraries in MapReduce Jobs?

Posted by Johannes Zillmann <jz...@101tec.com>.
You could use the DistributedCache to put multiple jars on the
classpath. Of course, you would have to write your own job-submission
logic for that....
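
A minimal sketch with the old mapred API (the class name and HDFS paths
here are made up for illustration; the jars must already be uploaded to
HDFS):

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class SubmitWithLibs {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SubmitWithLibs.class);
      conf.setJobName("job-with-third-party-libs");
      // Each call adds one jar (already sitting in HDFS) to the
      // classpath of every task JVM on the cluster.
      DistributedCache.addFileToClassPath(new Path("/libs/third-party-1.jar"), conf);
      DistributedCache.addFileToClassPath(new Path("/libs/third-party-2.jar"), conf);
      // ... set input/output paths, mapper and reducer classes as usual ...
      JobClient.runJob(conf);
    }
  }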

Johannes

On Dec 3, 2008, at 3:19 PM, Scott Whitecross wrote:

> What's the best way to use third party libraries with Hadoop?  For  
> example, I want to run a job with both a jar file containing the  
> job, and also extra libraries.  I noticed a couple of solutions with a  
> search, but I'm hoping for something better:
>
> - Merge the third party jar libraries into the job jar
> - Distribute the third party libraries across the cluster in the  
> local boxes classpath.
>
> What I'd really like is a way to add an extra option to the hadoop  
> jar command, such as hadoop/bin/hadoop jar myJar.jar myJobClass  
> -classpath thirdpartyjar1.jar:jar2.jar:etc  args
>
> Anything exist like this?
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec GmbH
Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com


Re: Best practice for using third party libraries in MapReduce Jobs?

Posted by tim robertson <ti...@gmail.com>.
Exactly.  I'm no expert on Maven either, but I like its convenience for
classpath handling.

Attached are my scripts:
- Hadoop-installer lets me install different versions of Hadoop into
the local repository
- the pom has an assembly plugin (change mainClass and packageName to
match your target; a sketch follows below)
- the assembly does the packaging.  Run it with:
  mvn assembly:assembly -Dmaven.test.skip=true
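
Roughly, the assembly plugin section of such a pom looks like this -- a
sketch using the stock jar-with-dependencies descriptor rather than my
exact scripts, with a placeholder mainClass:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
      <!-- bundle all compile-scope dependencies into one jar -->
      <descriptorRefs>
        <descriptorRef>jar-with-dependencies</descriptorRef>
      </descriptorRefs>
      <archive>
        <manifest>
          <!-- placeholder: your job's entry point -->
          <mainClass>com.example.MyJobClass</mainClass>
        </manifest>
      </archive>
    </configuration>
  </plugin>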

The way I work is to manage all dependencies in the pom and use "mvn
eclipse:eclipse" to keep the Eclipse build path correct.  Then I just run
everything in Eclipse with small input files until I am happy that it
works.  Then I build the jar with dependencies and copy it up to EC2
to run on the cluster.  It might not be the best way, but it seems fairly
efficient for me.
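
In commands, that cycle is roughly the following (the artifact name and
host are placeholders -- yours depend on the pom):

  mvn eclipse:eclipse
  mvn assembly:assembly -Dmaven.test.skip=true
  scp target/myjob-jar-with-dependencies.jar user@ec2-host: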

Cheers,

Tim


On Wed, Dec 3, 2008 at 10:42 PM, Scott Whitecross <sc...@dataxu.com> wrote:
> Thanks Tim.
>
> We use Maven, though I'm not an expert on it.  So you are using Maven
> to take the dependencies and package them into one large jar?  I'm assuming
> it unjars the contents of each dependency jar and bundles them with your code?
>
>
> On Dec 3, 2008, at 9:25 AM, tim robertson wrote:
>
>> Can't answer your question exactly, but can let you know what I do.
>>
>> I build all dependencies into 1 jar, and by using Maven for my build
>> environment, when I assemble my jar, I am 100% sure all my
>> dependencies are collected together.  This is working very nicely for
>> me and I have used the same scripts for around 20 different jars that
>> I run on EC2 - each had different dependencies which would have been a
>> pain to manage separately, but Maven simplifies this massively.
>>
>> Let me know if you want any of my maven config for assembly etc if you
>> are a maven user...
>>
>> Cheers,
>>
>> Tim
>>
>>
>> On Wed, Dec 3, 2008 at 3:19 PM, Scott Whitecross <sc...@dataxu.com> wrote:
>>> [quoted original message snipped]

Re: Best practice for using third party libraries in MapReduce Jobs?

Posted by tim robertson <ti...@gmail.com>.
Can't answer your question exactly, but can let you know what I do.

I build all dependencies into 1 jar, and by using Maven for my build
environment, when I assemble my jar, I am 100% sure all my
dependencies are collected together.  This is working very nicely for
me and I have used the same scripts for around 20 different jars that
I run on EC2 - each had different dependencies which would have been a
pain to manage separately, but Maven simplifies this massively.

Let me know if you want any of my maven config for assembly etc if you
are a maven user...

Cheers,

Tim


On Wed, Dec 3, 2008 at 3:19 PM, Scott Whitecross <sc...@dataxu.com> wrote:
> What's the best way to use third party libraries with Hadoop?  For example,
> I want to run a job with both a jar file containing the job, and also extra
> libraries.  I noticed a couple of solutions with a search, but I'm hoping for
> something better:
>
> - Merge the third party jar libraries into the job jar
> - Distribute the third party libraries across the cluster in the local boxes
> classpath.
>
> What I'd really like is a way to add an extra option to the hadoop jar
> command, such as hadoop/bin/hadoop jar myJar.jar myJobClass -classpath
> thirdpartyjar1.jar:jar2.jar:etc  args
>
> Anything exist like this?
>

Best practice for using third party libraries in MapReduce Jobs?

Posted by Scott Whitecross <sc...@dataxu.com>.
What's the best way to use third party libraries with Hadoop?  For  
example, I want to run a job with both a jar file containing the job,  
and also extra libraries.  I noticed a couple of solutions with a search,  
but I'm hoping for something better:

- Merge the third party jar libraries into the job jar
- Distribute the third party libraries across the cluster in the local  
boxes classpath.

What I'd really like is a way to add an extra option to the hadoop jar  
command, such as hadoop/bin/hadoop jar myJar.jar myJobClass -classpath  
thirdpartyjar1.jar:jar2.jar:etc  args

Anything exist like this?

Re: Hadoop Streaming - running a jar file

Posted by Milind Bhandarkar <mi...@yahoo-inc.com>.
You should specify A.jar on the bin/hadoop command line with "-file A.jar",
so that streaming knows to copy that file to the tasktracker nodes.
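
For example, keeping the placeholders from your command (a sketch -- only
the "-file A.jar" part is new):

  $HADOOP_HOME/bin/hadoop jar ....streaming.jar -input .. -output ... \
      -mapper 'java -jar A.jar' -file A.jar -reducer NONE

With "-file A.jar", streaming ships A.jar into each task's working
directory, so the mapper command can find it there.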

- milind


On 11/11/08 10:50 AM, "Amit_Gupta" <am...@gmail.com> wrote:

> 
> Hi
> 
> I have a jar file that takes input from stdin and writes its output to
> stdout, i.e. when I run
> 
> java -jar A.jar < input
> 
> it prints the required output.
> 
> However, when I run it as a mapper in Hadoop Streaming using the command
> 
> $HADOOP_HOME/bin/hadoop jar ....streaming.jar -input .. -output ...  -mapper
> 'java -jar A.jar'  -reducer NONE
> 
> I get a broken-pipe exception.
> 
> [error log snipped]


-- 
Milind Bhandarkar
Y!IM: GridSolutions
408-349-2136 
(milindb@yahoo-inc.com)