
Posted to common-user@hadoop.apache.org by Kris Jirapinyo <KJ...@attensity.com> on 2010/08/15 19:34:48 UTC

distcp questions

Hi all,
   A few questions regarding distcp.  Note that we are trying to distcp from a "normal" unpatched Hadoop 0.20.1 cluster to a CDH3 Hadoop cluster, so we are starting distcp from the CDH3 cluster and using hftp for the source URL.
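
For readers following along, here is a minimal sketch of how such an invocation might be assembled. The hostnames, ports and paths are hypothetical placeholders, not values from this thread; 50070 and 8020 are only assumed defaults for the NameNode HTTP and RPC ports. The -i (ignore failures) and -m (maximum number of simultaneous copies) options are standard DistCp options; build_distcp_command is a helper invented for illustration.

```python
import shlex


def build_distcp_command(src_uri, dst_uri, max_maps=None, ignore_failures=False):
    """Assemble a 'hadoop distcp' command line as a list of arguments.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    cmd = ["hadoop", "distcp"]
    if ignore_failures:
        cmd.append("-i")                 # keep going when individual files fail
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]     # upper bound on simultaneous copies
    cmd += [src_uri, dst_uri]
    return cmd


# Hypothetical endpoints: read over hftp from the old cluster,
# write over hdfs on the new cluster where the job is launched.
cmd = build_distcp_command(
    "hftp://old-namenode.example.com:50070/master/201005",
    "hdfs://new-namenode.example.com:8020/master/201005",
    max_maps=8,
    ignore_failures=True,
)
print(shlex.join(cmd))
```

Note that -m only caps the number of map tasks; it does not say anything about which machines those tasks land on.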

1) Our new cluster has 25 machines but 100 mappers.  When distcp is triggered, it seems to allocate 4 mappers per machine.  Is this normal? The issue here is that if, say, distcp only needs 8 mappers, I would think that distcp would try to distribute those to different machines so that perhaps IO will not be saturated on one machine.  What I've been seeing is that for those 8 map tasks, 4 are assigned to one machine and 4 to another, as opposed to each of the 8 being assigned to a different machine.

2) Distcp cannot get the _logs directory.  I keep getting this error:

2010-08-15 02:26:19,179 INFO org.apache.hadoop.tools.DistCp: FAIL _logs/history/mi-prod-app01.ec2.biz360.com_1273881751016_job_201005141702_51820_hadoop_com.biz360.jobs.DateFilterMerge+%2Fmaster%2F201005%2Fyou : java.io.IOException: Server returned HTTP response code: 500 for URL: http://mi-prod-app05:50075/streamFile?filename=/master/201005/youtube/_logs/history/mi-prod-app01.ec2.biz360.com_1273881751016_job_201005141702_51820_hadoop_com.biz360.jobs.DateFilterMerge+%252Fmaster%252F201005%252Fyou&ugi=hadoop,hadoop
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
        at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410)
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537)
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Other than using the -i flag to "ignore" this, is there another workaround? I tried downloading that file to the local filesystem, and it works fine, so it's not that the data does not exist.  Is this in any way related to https://issues.apache.org/jira/browse/MAPREDUCE-968?
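
One thing worth checking in the failing URL is how the file name is encoded. The sketch below is only one possible reading of the log line above, not something confirmed in this thread: it assumes the server applies form-style decoding to the query string (where '+' becomes a space), which is an assumption about the streamFile servlet. It shows that encoding only the '%' characters reproduces the name seen in the URL, and that such a name does not decode back to the original.

```python
from urllib.parse import quote, unquote_plus

# Tail of the file name as it appears in the DistCp FAIL line:
# it contains a literal '+' and literal "%2F" sequences.
name = "DateFilterMerge+%2Fmaster%2F201005%2Fyou"

# Escaping only '%' (as %25) and leaving '+' alone reproduces
# what the failing URL in the log shows.
encoded = name.replace("%", "%25")
print(encoded)

# ASSUMPTION: the server decodes the query string form-style,
# so '+' is read as a space and the name no longer matches.
decoded = unquote_plus(encoded)
print(repr(decoded))
print(decoded == name)

# Escaping '+' as well (%2B) survives the same decoding step.
fully_encoded = quote(name, safe="")
print(fully_encoded)
print(unquote_plus(fully_encoded) == name)
```

If that assumption holds, the server would look up a file name containing a space instead of a '+', which could explain why the file can be downloaded by other means but not through hftp.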

Thanks!
Kris Jirapinyo
Software Engineer
Attensity
1400 Bridge Parkway Ste 202
Redwood City, CA 94065
www.attensity.com

Re: distcp questions

Posted by rosefinny111 <vl...@gmail.com>.

Hi,

I would think that distcp would try to distribute those to different
machines so that perhaps IO will not be saturated on one machine.

It's very nice.

regards,
phe9oxis,
http://www.guidebuddha.com
-- 
View this message in context: http://lucene.472066.n3.nabble.com/distcp-questions-tp1160133p1165550.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: distcp questions

Posted by Allen Wittenauer <aw...@linkedin.com>.

On Aug 15, 2010, at 10:34 AM, Kris Jirapinyo wrote:
> 1) Our new cluster has 25 machines but 100 mappers.  When distcp is triggered, it seems to allocate 4 mappers per machine.  Is this normal? The issue here is that if, say, distcp only needs 8 mappers, I would think that distcp would try to distribute those to different machines so that perhaps IO will not be saturated on one machine.  What I've been seeing is that for those 8 map tasks, 4 are assigned to one machine and 4 to another, as opposed to each of the 8 being assigned to a different machine.

I don't think distcp (or any other job, for that matter) can provide hints to the scheduler about how its tasks should be distributed, other than pointing to its input files.  So very likely, distcp's input files are on the nodes where the tasks are running.

You can always try to bump up the replication as part of distcp's parameters.
