Posted to dev@crunch.apache.org by "Chao Shi (JIRA)" <ji...@apache.org> on 2013/08/16 07:34:47 UTC

[jira] [Created] (CRUNCH-252) The driver program may copy a huge amount of data if the output path is not on the same filesystem as the working dir

Chao Shi created CRUNCH-252:
-------------------------------

             Summary: The driver program may copy a huge amount of data if the output path is not on the same filesystem as the working dir
                 Key: CRUNCH-252
                 URL: https://issues.apache.org/jira/browse/CRUNCH-252
             Project: Crunch
          Issue Type: Bug
            Reporter: Chao Shi


I encountered this problem when running a pipeline of MapReduce jobs on cluster A while wanting the final output to be stored on cluster B. I don't want to simply point the working dir at B, because all the intermediate output should stay on A.
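The stack trace shows the driver falling through FileTargetImpl.handleOutputs into FileUtil.copy, i.e. a byte-for-byte copy rather than a cheap rename. A rename is only possible when source and target live on the same filesystem, which Hadoop-style code typically decides by comparing the scheme and authority of the two URIs. Here is a minimal sketch of that check using only java.net.URI; the helper name sameFileSystem and the cluster names are hypothetical, not Crunch's actual API:

```java
import java.net.URI;

public class FsCheck {
    // Hypothetical helper: two paths are considered to be on the same
    // filesystem when their URIs share a scheme and authority
    // (e.g. hdfs://clusterA). Only then can output be moved by rename;
    // otherwise the bytes must be physically copied between clusters.
    static boolean sameFileSystem(URI a, URI b) {
        String schemeA = a.getScheme() == null ? "" : a.getScheme();
        String schemeB = b.getScheme() == null ? "" : b.getScheme();
        String authA = a.getAuthority() == null ? "" : a.getAuthority();
        String authB = b.getAuthority() == null ? "" : b.getAuthority();
        return schemeA.equalsIgnoreCase(schemeB) && authA.equalsIgnoreCase(authB);
    }

    public static void main(String[] args) {
        URI workDir = URI.create("hdfs://clusterA/tmp/crunch-12345/p1");
        URI output  = URI.create("hdfs://clusterB/user/results");
        // Different authorities (clusterA vs clusterB): rename is impossible,
        // so the driver falls back to copying every byte through itself,
        // which is exactly what the stack trace shows.
        System.out.println(sameFileSystem(workDir, output)); // prints false
    }
}
```

In the cross-cluster case described above this check fails, so the driver ends up streaming all output through a single thread on the submitting machine, which is the slow path captured in the stack trace.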

Here is the stack trace of the driver program while it is copying the output:

"Thread-15" prio=10 tid=0x00007faa90130800 nid=0x3e73 runnable [0x00007faa874ed000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.util.DataChecksum.update(DataChecksum.java:223)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:240)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        - locked <0x00000000c45d8128> (a org.apache.hadoop.hdfs.DFSClient$BlockReader)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1249)
        - locked <0x00000000c45d8128> (a org.apache.hadoop.hdfs.DFSClient$BlockReader)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1899)
        - locked <0x00000000c2ccb608> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1951)
        - locked <0x00000000c2ccb608> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:89)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:224)
        at org.apache.crunch.io.impl.FileTargetImpl.handleOutputs(FileTargetImpl.java:109)
        at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.handleMultiPaths(CrunchJobHooks.java:87)
        - locked <0x00000000c18e6fb8> (a org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook)
        at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.run(CrunchJobHooks.java:79)
        at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.checkRunningState(CrunchControlledJob.java:251)
        at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.checkState(CrunchControlledJob.java:261)
        - locked <0x00000000c18e6ff8> (a org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob)
        at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.checkRunningJobs(CrunchJobControl.java:170)
        - locked <0x00000000c18e7028> (a org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl)
        at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:221)
        at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:101)
        at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:52)
        at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:76)
        at java.lang.Thread.run(Thread.java:662)


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira