Posted to common-user@hadoop.apache.org by yang song <ha...@gmail.com> on 2009/08/24 08:49:42 UTC

How to speed up the copy phase?

Hello, everyone

When I submit a big job (e.g. maptasks:10000, reducetasks:500), I find that
the copy phase lasts a very long time. From the WebUI, the message
"reduce > copy (xxxx of 10000 at 0.01 MB/s) >" tells me the transfer speed
is just 0.01 MB/s. Is this a normal value? How can I fix it?

Thank you!

P.S. The hadoop version is 0.19.1. The cluster has 20 nodes. The heap size of
the JT is 6G while the others are default settings.

Re: How to speed up the copy phase?

Posted by Owen O'Malley <ow...@gmail.com>.
There is an index with the offset of each reduce's first byte. The  
index is written to disk, but is also cached by the task tracker.
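Conceptually it is something like this rough sketch (the names are made up
for illustration; this is not the actual Hadoop source):

    // One index record per reduce partition, kept next to the single
    // map-output file described below.
    class IndexRecord {
        long startOffset; // where this reduce's partition starts in the file
        long rawLength;   // decompressed length of the partition
        long partLength;  // on-disk length of the partition
    }

    // To serve reduce r, the task tracker seeks to index[r].startOffset
    // in the map-output file and streams index[r].partLength bytes.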

-- Owen

On Aug 27, 2009, at 17:12, George Porter <gm...@gmail.com> wrote:

> Interesting.  In this case, how does Jetty dole out the proper
> partitions of the intermediate data to the appropriate reducers if
> they are located in the same files?
>
> Thanks,
> George
>
> On Thu, Aug 27, 2009 at 11:31 AM, Arun C Murthy<ac...@yahoo-inc.com>  
> wrote:
>>
>> On Aug 24, 2009, at 5:49 PM, Aaron Kimball wrote:
>>
>>> If you've got 20 nodes, then you want to have 20-ish reduce tasks.
>>> Maybe 40 if you want it to run in two waves. (Assuming 1 core/node.
>>> Multiply by N for N cores...) As it is, each node has 500-ish map tasks
>>> that it has to read from and for each of these, it needs to generate
>>> 500 separate reduce task output files.  That's going to take Hadoop a
>>> long time to do.
>>
>> Maps do not produce one output file per reduce, the entire map-output
>> is in a single file.
>>
>> Arun
>>

Re: How to speed up the copy phase?

Posted by George Porter <gm...@gmail.com>.
Interesting.  In this case, how does Jetty dole out the proper
partitions of the intermediate data to the appropriate reducers if
they are located in the same files?

Thanks,
George

On Thu, Aug 27, 2009 at 11:31 AM, Arun C Murthy<ac...@yahoo-inc.com> wrote:
>
> On Aug 24, 2009, at 5:49 PM, Aaron Kimball wrote:
>
>> If you've got 20 nodes, then you want to have 20-ish reduce tasks. Maybe
>> 40 if you want it to run in two waves. (Assuming 1 core/node. Multiply by
>> N for N cores...) As it is, each node has 500-ish map tasks that it has to
>> read from and for each of these, it needs to generate 500 separate reduce
>> task output files.  That's going to take Hadoop a long time to do.
>
> Maps do not produce one output file per reduce, the entire map-output is in
> a single file.
>
> Arun
>

Re: Why "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by "Oliver B. Fischer" <o....@swe-blog.net>.
Hello Steve,

I assume that java.io.FileOutputStream uses /tmp as its temp dir. As you
can see, the error occurs in a native method. As far as I know, /tmp is the
standard temp directory on UNIX systems and is automatically used by many
native library calls. Maybe you can set $TMPDIR
(http://en.wikipedia.org/wiki/TMPDIR) to another directory?

Best regards,

Oliver

Steve Gao wrote:
> 
> The hadoop version is 0.18.3. Recently we got an "out of space" issue. It's
> from "java.util.zip.ZipOutputStream". We found that /tmp was full, and after
> cleaning /tmp the problem was solved.
>
> However, why does hadoop need to use /tmp? We had already configured the
> hadoop tmp dir to a large local disk in hadoop-site.xml:
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value> ... some large local disk ... </value>
> </property>
>
> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
> configured hadoop.tmp.dir to a large local disk?
> 
> The error log is here FYI:
> 
> java.io.IOException: No space left on device         
> at java.io.FileOutputStream.write(Native Method)        
>  at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)         
> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)         
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)         
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)         
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)         
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)         
> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)         
> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)         
> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)         
> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)         
> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)         
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)         
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)         
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)         
> at java.lang.reflect.Method.invoke(Method.java:597)         
> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)         
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)         
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)         
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)         
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)  
> Executing Hadoop job failure
> 
> 
> 
> 
>       


-- 
Oliver B. Fischer, Schönhauser Allee 64, 10437 Berlin
Tel. +49 30 44793251, Mobil: +49 178 7903538
Mail: o.b.fischer@swe-blog.net Blog: http://www.swe-blog.net


Re: Why "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by "Oliver B. Fischer" <o....@swe-blog.net>.
Hello Steve,

I assume what the java.io.FileOutputStream uses /tmp as tempdir. As you
can see, the errors occurs in a native method. As far I know, /tmp is
standard temp directory on UNIX systems automatically used by many
native library calls. May you can set $TEMPDIR
(http://en.wikipedia.org/wiki/TMPDIR) to another directory?

Best regards,

Oliver

Steve Gao schrieb:
> 
> The hadoop version is 0.18.3 . Recently we got "out of space" issue. It's from "java.util.zip.ZipOutputStream".
> We found that /tmp is full and after cleaning /tmp the problem is solved.
> 
> However why hadoop needs to use /tmp? We had already configured hadoop tmp to a local disk in: hadoop-site.xml
> 
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value> ... some large local disk ... </value>
> </property>
> 
> 
> Could it because java.util.zip.ZipOutputStream uses /tmp even if we configured hadoop.tmp.dir to a large local disk?
> 
> The error log is here FYI:
> 
> java.io.IOException: No space left on device         
> at java.io.FileOutputStream.write(Native Method)        
>  at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)         
> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)         
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)         
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)         
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)         
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)         
> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)         
> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)         
> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)         
> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)         
> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)         
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)         
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)         
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)         
> at java.lang.reflect.Method.invoke(Method.java:597)         
> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)         
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)         
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)         
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)         
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)  
> Executing Hadoop job failure
> 
> 
> 
> 
>       


-- 
Oliver B. Fischer, Schönhauser Allee 64, 10437 Berlin
Tel. +49 30 44793251, Mobil: +49 178 7903538
Mail: o.b.fischer@swe-blog.net Blog: http://www.swe-blog.net


Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by James Cipar <jc...@andrew.cmu.edu>.
I would agree with removing it from the default build for now.

I only used thrift because that's what we were using for all of the
RPC at the time.  I'd rather that we just settle on one RPC to rule
them all, and I will change the code accordingly.


On Aug 28, 2009, at 3:04 PM, Steve Gao wrote:

> Thanks, Brian. Would you tell me what is the filename of the code  
> snippet?
>
> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>
> From: Brian Bockelman <bb...@cse.unl.edu>
> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
> To: common-user@hadoop.apache.org
> Date: Friday, August 28, 2009, 2:37 PM
>
> Actually, poking the code, it seems that the streaming package does  
> set this value:
>
>     String tmp = jobConf_.get("stream.tmpdir"); //, "/tmp/${user.name}/"
>
> Try setting stream.tmpdir to a different directory maybe?
>
> Brian
>
> On Aug 28, 2009, at 1:31 PM, Steve Gao wrote:
>
>> Thanks a lot, Brian. It seems to be a design flaw of hadoop that it
>> cannot manage (or pass in) the temp dir of "java.util.zip". Can we
>> create a JIRA ticket for this?
>>
>> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>>
>> From: Brian Bockelman <bb...@cse.unl.edu>
>> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>> To:
>> Cc: common-user@hadoop.apache.org
>> Date: Friday, August 28, 2009, 2:27 PM
>>
>> Hey Steve,
>>
>> Correct, java.util.zip.* does not necessarily respect hadoop  
>> settings.
>>
>> Try setting TMPDIR in the environment to your large local disk  
>> space.  It might respect that, if Java decides to act like a unix  
>> utility.
>>
>> http://en.wikipedia.org/wiki/TMPDIR
>>
>> Brian
>>
>> On Aug 28, 2009, at 1:19 PM, Steve Gao wrote:
>>
>>> Would someone give us a hint? Thanks.
>>> Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>>>
>>> The hadoop version is 0.18.3. Recently we got an "out of space" issue.
>>> It's from "java.util.zip.ZipOutputStream". We found that /tmp was full,
>>> and after cleaning /tmp the problem was solved.
>>>
>>> However, why does hadoop need to use /tmp? We had already configured the
>>> hadoop tmp dir to a large local disk in hadoop-site.xml:
>>>
>>> <property>
>>>     <name>hadoop.tmp.dir</name>
>>>     <value> ... some large local disk ... </value>
>>> </property>
>>>
>>> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
>>> configured hadoop.tmp.dir to a large local disk?
>>>
>>> The error log is here FYI:
>>>
>>> java.io.IOException: No space left on device
>>> at java.io.FileOutputStream.write(Native Method)
>>> at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)
>>> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)
>>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)
>>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)
>>> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)
>>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)
>>> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)
>>> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)
>>> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)
>>> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)
>>> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
>>> Executing Hadoop job failure
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by James Cipar <jc...@andrew.cmu.edu>.
Sorry about that last one, I replied to the wrong message.



On Aug 28, 2009, at 3:04 PM, Steve Gao wrote:

> Thanks, Brian. Would you tell me what is the filename of the code  
> snippet?
>
> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>
> From: Brian Bockelman <bb...@cse.unl.edu>
> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
> To: common-user@hadoop.apache.org
> Date: Friday, August 28, 2009, 2:37 PM
>
> Actually, poking the code, it seems that the streaming package does  
> set this value:
>
>     String tmp = jobConf_.get("stream.tmpdir"); //, "/tmp/${user.name}/"
>
> Try setting stream.tmpdir to a different directory maybe?
>
> Brian
>
> On Aug 28, 2009, at 1:31 PM, Steve Gao wrote:
>
>> Thanks a lot, Brian. It seems to be a design flaw of hadoop that it
>> cannot manage (or pass in) the temp dir of "java.util.zip". Can we
>> create a JIRA ticket for this?
>>
>> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>>
>> From: Brian Bockelman <bb...@cse.unl.edu>
>> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>> To:
>> Cc: common-user@hadoop.apache.org
>> Date: Friday, August 28, 2009, 2:27 PM
>>
>> Hey Steve,
>>
>> Correct, java.util.zip.* does not necessarily respect hadoop  
>> settings.
>>
>> Try setting TMPDIR in the environment to your large local disk  
>> space.  It might respect that, if Java decides to act like a unix  
>> utility.
>>
>> http://en.wikipedia.org/wiki/TMPDIR
>>
>> Brian
>>
>> On Aug 28, 2009, at 1:19 PM, Steve Gao wrote:
>>
>>> Would someone give us a hint? Thanks.
>>> Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>>>
>>> The hadoop version is 0.18.3. Recently we got an "out of space" issue.
>>> It's from "java.util.zip.ZipOutputStream". We found that /tmp was full,
>>> and after cleaning /tmp the problem was solved.
>>>
>>> However, why does hadoop need to use /tmp? We had already configured the
>>> hadoop tmp dir to a large local disk in hadoop-site.xml:
>>>
>>> <property>
>>>     <name>hadoop.tmp.dir</name>
>>>     <value> ... some large local disk ... </value>
>>> </property>
>>>
>>> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
>>> configured hadoop.tmp.dir to a large local disk?
>>>
>>> The error log is here FYI:
>>>
>>> java.io.IOException: No space left on device
>>> at java.io.FileOutputStream.write(Native Method)
>>> at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)
>>> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)
>>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)
>>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)
>>> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)
>>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)
>>> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)
>>> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)
>>> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)
>>> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)
>>> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
>>> Executing Hadoop job failure
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
I saw this in:

>> org.apache.hadoop.streaming.StreamJob.packageJobJar

Brian

On Aug 28, 2009, at 2:04 PM, Steve Gao wrote:

> Thanks, Brian. Would you tell me what is the filename of the code  
> snippet?
>
> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>
> From: Brian Bockelman <bb...@cse.unl.edu>
> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
> To: common-user@hadoop.apache.org
> Date: Friday, August 28, 2009, 2:37 PM
>
> Actually, poking the code, it seems that the streaming package does  
> set this value:
>
>     String tmp = jobConf_.get("stream.tmpdir"); //, "/tmp/${user.name}/"
>
> Try setting stream.tmpdir to a different directory maybe?
>
> Brian
>
> On Aug 28, 2009, at 1:31 PM, Steve Gao wrote:
>
>> Thanks a lot, Brian. It seems to be a design flaw of hadoop that it
>> cannot manage (or pass in) the temp dir of "java.util.zip". Can we
>> create a JIRA ticket for this?
>>
>> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>>
>> From: Brian Bockelman <bb...@cse.unl.edu>
>> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>> To:
>> Cc: common-user@hadoop.apache.org
>> Date: Friday, August 28, 2009, 2:27 PM
>>
>> Hey Steve,
>>
>> Correct, java.util.zip.* does not necessarily respect hadoop  
>> settings.
>>
>> Try setting TMPDIR in the environment to your large local disk  
>> space.  It might respect that, if Java decides to act like a unix  
>> utility.
>>
>> http://en.wikipedia.org/wiki/TMPDIR
>>
>> Brian
>>
>> On Aug 28, 2009, at 1:19 PM, Steve Gao wrote:
>>
>>> Would someone give us a hint? Thanks.
>>> Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>>>
>>> The hadoop version is 0.18.3. Recently we got an "out of space" issue.
>>> It's from "java.util.zip.ZipOutputStream". We found that /tmp was full,
>>> and after cleaning /tmp the problem was solved.
>>>
>>> However, why does hadoop need to use /tmp? We had already configured the
>>> hadoop tmp dir to a large local disk in hadoop-site.xml:
>>>
>>> <property>
>>>     <name>hadoop.tmp.dir</name>
>>>     <value> ... some large local disk ... </value>
>>> </property>
>>>
>>> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
>>> configured hadoop.tmp.dir to a large local disk?
>>>
>>> The error log is here FYI:
>>>
>>> java.io.IOException: No space left on device
>>> at java.io.FileOutputStream.write(Native Method)
>>> at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)
>>> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)
>>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)
>>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)
>>> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)
>>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)
>>> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)
>>> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)
>>> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)
>>> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)
>>> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
>>> Executing Hadoop job failure
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Steve Gao <st...@yahoo.com>.
Thanks, Brian. Would you tell me the filename of the code snippet?

--- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:

From: Brian Bockelman <bb...@cse.unl.edu>
Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
To: common-user@hadoop.apache.org
Date: Friday, August 28, 2009, 2:37 PM

Actually, poking the code, it seems that the streaming package does set this value:

    String tmp = jobConf_.get("stream.tmpdir"); //, "/tmp/${user.name}/"

Try setting stream.tmpdir to a different directory maybe?

Brian

On Aug 28, 2009, at 1:31 PM, Steve Gao wrote:

> Thanks a lot, Brian. It seems to be a design flaw of hadoop that it cannot manage (or pass in) the temp dir of "java.util.zip". Can we create a JIRA ticket for this?
> 
> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
> 
> From: Brian Bockelman <bb...@cse.unl.edu>
> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
> To:
> Cc: common-user@hadoop.apache.org
> Date: Friday, August 28, 2009, 2:27 PM
> 
> Hey Steve,
> 
> Correct, java.util.zip.* does not necessarily respect hadoop settings.
> 
> Try setting TMPDIR in the environment to your large local disk space.  It might respect that, if Java decides to act like a unix utility.
> 
> http://en.wikipedia.org/wiki/TMPDIR
> 
> Brian
> 
> On Aug 28, 2009, at 1:19 PM, Steve Gao wrote:
> 
>> Would someone give us a hint? Thanks.
>> Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>> 
>> The hadoop version is 0.18.3. Recently we got an "out of space" issue. It's
>> from "java.util.zip.ZipOutputStream". We found that /tmp was full, and
>> after cleaning /tmp the problem was solved.
>> 
>> However, why does hadoop need to use /tmp? We had already configured the
>> hadoop tmp dir to a large local disk in hadoop-site.xml:
>> 
>> <property>
>>    <name>hadoop.tmp.dir</name>
>>    <value> ... some large local disk ... </value>
>> </property>
>> 
>> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
>> configured hadoop.tmp.dir to a large local disk?
>> 
>> The error log is here FYI:
>> 
>> java.io.IOException: No space left on device
>> at java.io.FileOutputStream.write(Native Method)
>>   at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)
>> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)
>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)
>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)
>> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)
>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)
>> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)
>> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)
>> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)
>> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)
>> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
>> Executing Hadoop job failure
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 





Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Actually, poking the code, it seems that the streaming package does  
set this value:

     String tmp = jobConf_.get("stream.tmpdir"); //, "/tmp/${user.name}/"

Try setting stream.tmpdir to a different directory maybe?
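Since stream.tmpdir is read from the job conf, it should also be settable
site-wide in hadoop-site.xml, just like hadoop.tmp.dir (the path below is
only an example):

    <property>
      <name>stream.tmpdir</name>
      <value>/data1/tmp</value>
    </property>

or per job with the streaming option -jobconf stream.tmpdir=/data1/tmp,
if I recall the 0.18 streaming options correctly.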

Brian

On Aug 28, 2009, at 1:31 PM, Steve Gao wrote:

> Thanks a lot, Brian. It seems to be a design flaw of hadoop that it
> cannot manage (or pass in) the temp dir of "java.util.zip". Can we
> create a JIRA ticket for this?
>
> --- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>
> From: Brian Bockelman <bb...@cse.unl.edu>
> Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
> To:
> Cc: common-user@hadoop.apache.org
> Date: Friday, August 28, 2009, 2:27 PM
>
> Hey Steve,
>
> Correct, java.util.zip.* does not necessarily respect hadoop settings.
>
> Try setting TMPDIR in the environment to your large local disk  
> space.  It might respect that, if Java decides to act like a unix  
> utility.
>
> http://en.wikipedia.org/wiki/TMPDIR
>
> Brian
>
> On Aug 28, 2009, at 1:19 PM, Steve Gao wrote:
>
>> Would someone give us a hint? Thanks.
>> Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>>
>> The hadoop version is 0.18.3. Recently we got an "out of space" issue. It's
>> from "java.util.zip.ZipOutputStream". We found that /tmp was full, and
>> after cleaning /tmp the problem was solved.
>>
>> However, why does hadoop need to use /tmp? We had already configured the
>> hadoop tmp dir to a large local disk in hadoop-site.xml:
>>
>> <property>
>>    <name>hadoop.tmp.dir</name>
>>    <value> ... some large local disk ... </value>
>> </property>
>>
>> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
>> configured hadoop.tmp.dir to a large local disk?
>>
>> The error log is here FYI:
>>
>> java.io.IOException: No space left on device
>> at java.io.FileOutputStream.write(Native Method)
>> at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)
>> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)
>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)
>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)
>> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)
>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)
>> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)
>> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)
>> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)
>> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)
>> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
>> Executing Hadoop job failure
>>
>>
>>
>>
>>
>>
>>
>
>
>
>


Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Steve Gao <st...@yahoo.com>.
Thanks a lot, Brian. It seems to be a design flaw of hadoop that it cannot manage (or pass in) the temp dir of "java.util.zip". Can we create a JIRA ticket for this?

--- On Fri, 8/28/09, Brian Bockelman <bb...@cse.unl.edu> wrote:

From: Brian Bockelman <bb...@cse.unl.edu>
Subject: Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?
To: 
Cc: common-user@hadoop.apache.org
Date: Friday, August 28, 2009, 2:27 PM

Hey Steve,

Correct, java.util.zip.* does not necessarily respect hadoop settings.

Try setting TMPDIR in the environment to your large local disk space.  It might respect that, if Java decides to act like a unix utility.

http://en.wikipedia.org/wiki/TMPDIR

Brian

On Aug 28, 2009, at 1:19 PM, Steve Gao wrote:

> Would someone give us a hint? Thanks.
> Why does "java.util.zip.ZipOutputStream" need to use /tmp?
> 
> The hadoop version is 0.18.3. Recently we got an "out of space" issue. It's
> from "java.util.zip.ZipOutputStream". We found that /tmp was full, and after
> cleaning /tmp the problem was solved.
> 
> However, why does hadoop need to use /tmp? We had already configured the
> hadoop tmp dir to a large local disk in hadoop-site.xml:
> 
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value> ... some large local disk ... </value>
> </property>
> 
> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
> configured hadoop.tmp.dir to a large local disk?
> 
> The error log is here FYI:
> 
> java.io.IOException: No space left on device
> at java.io.FileOutputStream.write(Native Method)
>  at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)
> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)
> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)
> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)
> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)
> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)
> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
> Executing Hadoop job failure
> 
> 
> 
> 
> 
> 
> 





Re: [Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Steve,

Correct, java.util.zip.* does not necessarily respect hadoop settings.

Try setting TMPDIR in the environment to your large local disk space.   
It might respect that, if Java decides to act like a unix utility.

http://en.wikipedia.org/wiki/TMPDIR
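For example (the path is illustrative; note that the JVM itself usually
honors the java.io.tmpdir system property rather than TMPDIR, so setting
both is the safer bet):

    export TMPDIR=/data1/tmp
    export HADOOP_OPTS="-Djava.io.tmpdir=/data1/tmp $HADOOP_OPTS"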

Brian

On Aug 28, 2009, at 1:19 PM, Steve Gao wrote:

> Would someone give us a hint? Thanks.
> Why does "java.util.zip.ZipOutputStream" need to use /tmp?
>
> The hadoop version is 0.18.3. Recently we got an "out of space" issue. It's
> from "java.util.zip.ZipOutputStream". We found that /tmp was full, and after
> cleaning /tmp the problem was solved.
>
> However, why does hadoop need to use /tmp? We had already configured the
> hadoop tmp dir to a large local disk in hadoop-site.xml:
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value> ... some large local disk ... </value>
> </property>
>
> Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
> configured hadoop.tmp.dir to a large local disk?
>
> The error log is here FYI:
>
> java.io.IOException: No space left on device
> at java.io.FileOutputStream.write(Native Method)
> at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)
> at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)
> at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)
> at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)
> at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)
> at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)
> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
> Executing Hadoop job failure
>
>
>
>
>
>
>


Who are the gurus in Hive and/or Hbase?

Posted by Gopal Gandhi <go...@yahoo.com>.
We are inviting gurus or major contributors of Hive and/or Hbase (or anything
related to Hadoop) to give us presentations about the products. Would you name
a few? The gurus must be in the bay area.
Thanks.




[Help] Why does "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Steve Gao <st...@yahoo.com>.
Would someone give us a hint? Thanks.
Why does "java.util.zip.ZipOutputStream" need to use /tmp?

The hadoop version is 0.18.3. Recently we got an "out of space" issue. It's
from "java.util.zip.ZipOutputStream". We found that /tmp was full, and after
cleaning /tmp the problem was solved.

However, why does hadoop need to use /tmp? We had already configured the
hadoop tmp dir to a large local disk in hadoop-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value> ... some large local disk ... </value>
</property>

Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
configured hadoop.tmp.dir to a large local disk?

The error log is here FYI:

java.io.IOException: No space left on device         
at java.io.FileOutputStream.write(Native Method)        
 at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)         
at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)         
at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)         
at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)         
at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)         
at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)         
at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)         
at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)         
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)         
at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)         
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)         
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)         
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)         
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)         
at java.lang.reflect.Method.invoke(Method.java:597)         
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)         
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)         
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)         
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)         
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)  
Executing Hadoop job failure






Why "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Steve Gao <st...@yahoo.com>.

The hadoop version is 0.18.3. Recently we got an "out of space" issue. It's
from "java.util.zip.ZipOutputStream". We found that /tmp was full, and after
cleaning /tmp the problem was solved.

However, why does hadoop need to use /tmp? We had already configured the
hadoop tmp dir to a large local disk in hadoop-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value> ... some large local disk ... </value>
</property>

Could it be because java.util.zip.ZipOutputStream uses /tmp even if we
configured hadoop.tmp.dir to a large local disk?

The error log is here FYI:

java.io.IOException: No space left on device         
at java.io.FileOutputStream.write(Native Method)        
 at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)         
at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)         
at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)         
at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)         
at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)         
at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)         
at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)         
at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)         
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)         
at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)         
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)         
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)         
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)         
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)         
at java.lang.reflect.Method.invoke(Method.java:597)         
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)         
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)         
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)         
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)         
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)  
Executing Hadoop job failure





Why "java.util.zip.ZipOutputStream" need to use /tmp?

Posted by Steve Gao <st...@yahoo.com>.

The hadoop version is 0.18.3 . Recently we got "out of space" issue. It's from "java.util.zip.ZipOutputStream".
We found that /tmp is full and after cleaning /tmp the problem is solved.

However why hadoop needs to use /tmp? We had already configured hadoop tmp to a local disk in: hadoop-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value> ... some large local disk ... </value>
</property>


Could it because java.util.zip.ZipOutputStream uses /tmp even if we configured hadoop.tmp.dir to a large local disk?

The error log is here FYI:

java.io.IOException: No space left on device         
at java.io.FileOutputStream.write(Native Method)        
 at java.util.zip.ZipOutputStream.writeInt(ZipOutputStream.java:445)         
at java.util.zip.ZipOutputStream.writeEXT(ZipOutputStream.java:362)         
at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:220)         
at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:301)         
at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:146)         
at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:321)         
at org.apache.hadoop.streaming.JarBuilder.merge(JarBuilder.java:79)         
at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:628)         
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:843)         
at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:110)         
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)         
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)         
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)         
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)         
at java.lang.reflect.Method.invoke(Method.java:597)         
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)         
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)         
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)         
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)         
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)  
Executing Hadoop job failure




      

Re: How to speed up the copy phase?

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
On Aug 24, 2009, at 5:49 PM, Aaron Kimball wrote:

> If you've got 20 nodes, then you want to have 20-ish reduce tasks. Maybe
> 40 if you want it to run in two waves. (Assuming 1 core/node. Multiply by
> N for N cores...) As it is, each node has 500-ish map tasks that it has to
> read from and for each of these, it needs to generate 500 separate reduce
> task output files.  That's going to take Hadoop a long time to do.

Maps do not produce one output file per reduce, the entire map-output  
is in a single file.

Arun

Re: How to speed up the copy phase?

Posted by Aaron Kimball <aa...@cloudera.com>.
If you've got 20 nodes, then you want to have 20-ish reduce tasks. Maybe 40
if you want it to run in two waves. (Assuming 1 core/node. Multiply by N for
N cores...) As it is, each node has 500-ish map tasks that it has to read
from and for each of these, it needs to generate 500 separate reduce task
output files.  That's going to take Hadoop a long time to do. 10000 map
tasks is also a very large number of map tasks. Are you processing a lot of
little files? If so, try using MultiFileInputFormat or MultipleInputs to
group them together.

As is mentioned, also set mapred.reduce.parallel.copies to 20. (The default
of 5 is appropriate only for 1--5 nodes.)
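As a rough sketch with the old mapred API (the job class name and the exact
values here are illustrative):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);          // MyJob is hypothetical
    conf.setNumReduceTasks(40);                       // ~2 waves on 20 nodes
    conf.setInt("mapred.reduce.parallel.copies", 20); // up from the default 5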

- Aaron

On Mon, Aug 24, 2009 at 12:31 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:

> Maybe look at the mapred.reduce.parallel.copies property to speed it up...
> I don't see why the transfer speed would be configurable via params, and I
> think hadoop won't be messing with that.
>
> Thanks,
> Amogh
>
> -----Original Message-----
> From: yang song [mailto:hadoop.inifok@gmail.com]
> Sent: Monday, August 24, 2009 12:20 PM
> To: common-user@hadoop.apache.org
> Subject: How to speed up the copy phase?
>
> Hello, everyone
>
> When I submit a big job (e.g. maptasks:10000, reducetasks:500), I find that
> the copy phase lasts a very long time. From the WebUI, the message
> "reduce > copy (xxxx of 10000 at 0.01 MB/s) >" tells me the transfer speed
> is just 0.01 MB/s. Is this a normal value? How can I fix it?
>
> Thank you!
>
> P.S. The hadoop version is 0.19.1. The cluster has 20 nodes. The heap size
> of the JT is 6G while the others are default settings.
>

RE: How to speed up the copy phase?

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Maybe look at the mapred.reduce.parallel.copies property to speed it up... I don't see why the transfer speed would be configurable via params, and I think hadoop won't be messing with that.
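For example, in hadoop-site.xml (the value here is illustrative; the default
is 5):

    <property>
      <name>mapred.reduce.parallel.copies</name>
      <value>20</value>
    </property>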

Thanks,
Amogh

-----Original Message-----
From: yang song [mailto:hadoop.inifok@gmail.com] 
Sent: Monday, August 24, 2009 12:20 PM
To: common-user@hadoop.apache.org
Subject: How to speed up the copy phase?

Hello, everyone

When I submit a big job (e.g. maptasks:10000, reducetasks:500), I find that
the copy phase lasts a very long time. From the WebUI, the message
"reduce > copy (xxxx of 10000 at 0.01 MB/s) >" tells me the transfer speed
is just 0.01 MB/s. Is this a normal value? How can I fix it?

Thank you!

P.S. The hadoop version is 0.19.1. The cluster has 20 nodes. The heap size of
the JT is 6G while the others are default settings.

Re: How to speed up the copy phase?

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.
It is fairly straightforward: on completion of a successful fetch, the total
number of bytes fetched is divided by the total time taken till then.

Please look at the fetchOutputs method in ReduceTask.java, the portion of
code that handles successful copies.
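In other words, roughly this (a sketch, not the actual ReduceTask code; the
variable names are made up):

    // MB/s shown in the UI: bytes copied so far, divided by the time since
    // the reduce task started (which includes waiting for maps to finish).
    long elapsedSecs = (System.currentTimeMillis() - reduceStartTimeMs) / 1000;
    double mbPerSec = (bytesCopied / (1024.0 * 1024.0)) / Math.max(1, elapsedSecs);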

Jothi


On 8/25/09 8:23 AM, "bharath vissapragada"
<bh...@gmail.com> wrote:

> Jothi,
> 
> Do you have any idea how these speeds are calculated? I mean some
> mathematical expressions or stuff.
> 
> On Mon, Aug 24, 2009 at 12:54 PM, Jothi Padmanabhan
> <jo...@yahoo-inc.com>wrote:
> 
>> The transfer rate is a little misleading. The timer for this calculation
>> starts when the reducer itself starts and so includes the time spent by
>> the reducer waiting for maps to complete. So, the speed shown when
>> shuffling the first few maps might be totally misleading, it does not
>> necessarily reflect the network speed. You should be able to see more
>> reasonable numbers towards the end of shuffle.
>> 
>> Jothi
>> 
>> 
>> On 8/24/09 12:19 PM, "yang song" <ha...@gmail.com> wrote:
>> 
>>> Hello, everyone
>>>
>>> When I submit a big job (e.g. maptasks:10000, reducetasks:500), I find
>>> that the copy phase lasts a very long time. From the WebUI, the message
>>> "reduce > copy (xxxx of 10000 at 0.01 MB/s) >" tells me the transfer
>>> speed is just 0.01 MB/s. Is this a normal value? How can I fix it?
>>>
>>> Thank you!
>>>
>>> P.S. The hadoop version is 0.19.1. The cluster has 20 nodes. The heap
>>> size of the JT is 6G while the others are default settings.
>> 
>> 


Re: How to speed up the copy phase?

Posted by bharath vissapragada <bh...@gmail.com>.
Jothi,

Do you have any idea how these speeds are calculated? I mean some
mathematical expressions or stuff.

On Mon, Aug 24, 2009 at 12:54 PM, Jothi Padmanabhan
<jo...@yahoo-inc.com>wrote:

> The transfer rate is a little misleading. The timer for this calculation
> starts when the reducer itself starts and so includes the time spent by
> the reducer waiting for maps to complete. So, the speed shown when
> shuffling the first few maps might be totally misleading, it does not
> necessarily reflect the network speed. You should be able to see more
> reasonable numbers towards the end of shuffle.
>
> Jothi
>
>
> On 8/24/09 12:19 PM, "yang song" <ha...@gmail.com> wrote:
>
> > Hello, everyone
> >
> > When I submit a big job (e.g. maptasks:10000, reducetasks:500), I find
> > that the copy phase lasts a very long time. From the WebUI, the message
> > "reduce > copy (xxxx of 10000 at 0.01 MB/s) >" tells me the transfer
> > speed is just 0.01 MB/s. Is this a normal value? How can I fix it?
> >
> > Thank you!
> >
> > P.S. The hadoop version is 0.19.1. The cluster has 20 nodes. The heap
> > size of the JT is 6G while the others are default settings.
>
>

Re: How to speed up the copy phase?

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.
The transfer rate is a little misleading. The timer for this calculation
starts when the reducer itself starts, and so includes the time spent by the
reducer waiting for maps to complete. So, the speed shown while shuffling the
first few maps might be totally misleading; it does not necessarily reflect
the network speed. You should see more reasonable numbers towards the end of
the shuffle.
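For example, a reduce that waits 50 minutes for its maps and then pulls 30 MB
would show roughly 30 MB / 3000 s = 0.01 MB/s, even over a fast network.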

Jothi


On 8/24/09 12:19 PM, "yang song" <ha...@gmail.com> wrote:

> Hello, everyone
> 
> When I submit a big job (e.g. maptasks:10000, reducetasks:500), I find that
> the copy phase lasts a very long time. From the WebUI, the message
> "reduce > copy (xxxx of 10000 at 0.01 MB/s) >" tells me the transfer speed
> is just 0.01 MB/s. Is this a normal value? How can I fix it?
>
> Thank you!
>
> P.S. The hadoop version is 0.19.1. The cluster has 20 nodes. The heap size
> of the JT is 6G while the others are default settings.