Posted to user@gobblin.apache.org by "Zhang, Xiuzhu(AWF)" <xi...@paypal.com> on 2017/12/14 11:58:03 UTC

Distcp

Hi friends,

I am running distcp between two Hadoop clusters to copy data. Unfortunately it fails, although the same job works within a single Hadoop cluster. Could you help me look into it? I would really appreciate a reply.

It looks like the file was copied to task-staging in HDFS and then the job failed. I have spent a lot of time on this and am quite puzzled.

******************** copyablefile origin path hdfs://10.176.0.184:8020/data/test.txt
******************** target path hdfs://10.176.3.115:8020/home/etn/programs/gobblin-dist/workdir/task-staging/job_etndistcp_1513251826306/task_etndistcp_1513251826306_0/attempt_local305940860_0001_m_000000_0/test.txt

Found configured writer builder as gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
2017-12-14 03:43:52 PST INFO  [ForkExecutor-0] gobblin.runtime.fork.Fork  452 - Wrapping writer gobblin.writer.PartitionedDataWriter@299fc14f
2017-12-14 03:43:52 PST WARN  [ForkExecutor-0] gobblin.writer.RetryWriter$1  95 - Caught exception. This may be retried.
java.lang.IllegalArgumentException: Wrong FS: hdfs://10.176.0.184:8020/data/test.txt, expected: hdfs://hadoop-master:8020
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:644)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:464)
        at gobblin.data.management.copy.writer.FileAwareInputStreamDataWriter.writeImpl(FileAwareInputStreamDataWriter.java:222)

Configuration:
job.name=etndistcp
job.group=etn
job.description=Distcpexample.......

source.filebased.fs.uri=hdfs://10.176.0.184:8020
source.filebased.data.directory=/data

source.class=gobblin.data.management.copy.CopySource
gobblin.dataset.profile.class=gobblin.data.management.copy.CopyableGlobDatasetFinder
gobblin.dataset.pattern=/data

extract.namespace=gobblin.data.management.copy.extractor
writer.builder.class=gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
data.publisher.type=gobblin.data.management.copy.publisher.CopyDataPublisher

writer.destination.type=HDFS

writer.fs.uri=hdfs://10.176.3.115:8020
data.publisher.final.dir=hdfs://10.176.3.115:8020/demo/etnwork/distcp_dest

Thanks,
Ethan

Re: Distcp

Posted by Zhixiong Chen <zh...@linkedin.com>.
Hi Ethan,

If we look at the log:
java.lang.IllegalArgumentException: Wrong FS: hdfs://10.176.0.184:8020/demo/etnwork/distcp_src/twitter9d99m1.avro, expected: hdfs://hadoop-master:8020
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:644)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:464)
        at gobblin.data.management.copy.writer.FileAwareInputStreamDataWriter.writeImpl(FileAwareInputStreamDataWriter.java:217)
        at gobblin.data.management.copy.writer.FileAwareInputStreamDataWriter.writeImpl(FileAwareInputStreamDataWriter.java:166)
        at

The actual cause of the exception is line 217 of FileAwareInputStreamDataWriter, where you added a print message:
217         System.out.println("******************** copyablefile origin path1 long "+FileSystem.get(new Configuration()).makeQualified(copyableFile.getOrigin().getPath()).toUri());

Here, your default file system is `hadoop-master` (10.176.3.115), but copyableFile carries its origin path from the source file system, 10.176.0.184. FileSystem.get(new Configuration()) returns the destination (default) file system, and makeQualified refuses a path that belongs to a different one. Hence the error message: Wrong FS: hdfs://10.176.0.184:8020/data/test.txt, expected: hdfs://hadoop-master:8020

If you remove line 217, the file should be copied to the destination successfully.
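If you do want to log a qualified origin path, it has to be qualified against the file system the path actually lives on (in Hadoop, typically via Path#getFileSystem(conf) rather than FileSystem.get(new Configuration())). The distinction can be sketched in plain Java, without a Hadoop dependency — a simplified model of qualification, not Hadoop's actual implementation:

```java
import java.net.URI;

public class QualifyDemo {
    // Sketch of path qualification: a path that already carries a scheme
    // and authority keeps them; only a bare path falls back to the default
    // file system. Qualifying a fully-qualified foreign path against the
    // default FS is exactly what triggers the "Wrong FS" check in Hadoop.
    static URI qualify(URI path, URI defaultFs) {
        if (path.getScheme() != null) {
            return path; // already fully qualified: keep its own authority
        }
        return defaultFs.resolve(path); // bare path: attach the default FS
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://hadoop-master:8020/");
        URI origin = URI.create("hdfs://10.176.0.184:8020/data/test.txt");
        // The origin keeps its source-cluster authority:
        System.out.println(qualify(origin, defaultFs));
        // A bare path is resolved against the default FS:
        System.out.println(qualify(URI.create("/data/test.txt"), defaultFs));
    }
}
```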

Zhixiong

RE: Distcp

Posted by "Zhang, Xiuzhu(AWF)" <xi...@paypal.com>.
Hi Zhixiong,

The version of gobblin is 0.10.0, from https://github.com/apache/incubator-gobblin/releases
The IP address of ‘hadoop-master’ is 10.176.3.115, which is the destination address; 10.176.0.184 is the source address.

I added the System.out.println statements below line 214 in FileAwareInputStreamDataWriter.java:
214       try {
215
216         System.out.println("******************** copyablefile origin path2 "+copyableFile.getOrigin().getPath());
217         System.out.println("******************** copyablefile origin path1 long "+FileSystem.get(new Configuration()).makeQualified(copyableFile.getOrigin().getPath()).toUri());
218         System.out.println("******************** target path  "+this.fs.makeQualified(writeAt).toUri());
219
220         StreamThrottler<GobblinScopeTypes> throttler =
221             this.taskBroker.getSharedResource(new StreamThrottler.Factory<GobblinScopeTypes>(), new EmptyKey());

Detailed log:
java.lang.IllegalArgumentException: Wrong FS: hdfs://10.176.0.184:8020/demo/etnwork/distcp_src/twitter9d99m1.avro, expected: hdfs://hadoop-master:8020
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:644)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:464)
        at gobblin.data.management.copy.writer.FileAwareInputStreamDataWriter.writeImpl(FileAwareInputStreamDataWriter.java:217)
        at gobblin.data.management.copy.writer.FileAwareInputStreamDataWriter.writeImpl(FileAwareInputStreamDataWriter.java:166)
        at gobblin.data.management.copy.writer.FileAwareInputStreamDataWriter.writeImpl(FileAwareInputStreamDataWriter.java:82)
        at gobblin.instrumented.writer.InstrumentedDataWriterBase.write(InstrumentedDataWriterBase.java:165)
        at gobblin.instrumented.writer.InstrumentedDataWriter.write(InstrumentedDataWriter.java:38)
        at gobblin.instrumented.writer.InstrumentedDataWriterDecorator.writeImpl(InstrumentedDataWriterDecorator.java:76)
        at gobblin.instrumented.writer.InstrumentedDataWriterDecorator.write(InstrumentedDataWriterDecorator.java:68)
        at gobblin.writer.PartitionedDataWriter.write(PartitionedDataWriter.java:127)
        at gobblin.writer.RetryWriter$2.call(RetryWriter.java:116)
        at gobblin.writer.RetryWriter$2.call(RetryWriter.java:113)
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
        at com.github.rholder.retry.Retryer.call(Retryer.java:160)
        at com.github.rholder.retry.Retryer$RetryerCallable.call(Retryer.java:318)
        at gobblin.writer.RetryWriter.callWithRetry(RetryWriter.java:140)
        at gobblin.writer.RetryWriter.write(RetryWriter.java:121)
        at gobblin.runtime.fork.Fork.processRecord(Fork.java:426)
        at gobblin.runtime.fork.AsynchronousFork.processRecord(AsynchronousFork.java:98)
        at gobblin.runtime.fork.AsynchronousFork.processRecords(AsynchronousFork.java:81)
        at gobblin.runtime.fork.Fork.run(Fork.java:180)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

It is thrown from https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java around line 758. Here the schemes (both hdfs) actually match, but the authorities (10.176.0.184:8020 vs hadoop-master:8020) do not, so checkPath throws:
    if (thisScheme.equalsIgnoreCase(thatScheme)) {// schemes match
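That comparison can be reproduced in plain Java — a simplified sketch of the scheme-and-authority check (the real FileSystem.checkPath also handles details like default ports and hostname canonicalization):

```java
import java.net.URI;

public class CheckPathDemo {
    // Simplified model of the test FileSystem.checkPath performs before
    // throwing IllegalArgumentException("Wrong FS: ...").
    static boolean sameFs(URI path, URI fs) {
        String thatScheme = path.getScheme();
        if (thatScheme == null) {
            return true; // a scheme-less (relative) path is accepted
        }
        if (!fs.getScheme().equalsIgnoreCase(thatScheme)) {
            return false; // schemes differ -> "Wrong FS"
        }
        // Schemes match; the authorities must match too. In this thread,
        // "10.176.0.184:8020" vs "hadoop-master:8020" differ textually,
        // which is what makes checkPath reject the path.
        String thisAuth = fs.getAuthority();
        String thatAuth = path.getAuthority();
        return thisAuth == null ? thatAuth == null
                                : thisAuth.equalsIgnoreCase(thatAuth);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://hadoop-master:8020");
        URI origin = URI.create("hdfs://10.176.0.184:8020/data/test.txt");
        // false -> Hadoop would throw IllegalArgumentException("Wrong FS")
        System.out.println(sameFs(origin, defaultFs));
    }
}
```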

Do you mean I should compile all of the source code and then just copy the jars into the gobblin lib directory? Do I need to change any other files?

Thanks,
Ethan


Re: Distcp

Posted by Zhixiong Chen <zh...@linkedin.com>.
Hi Ethan,

Can you provide us with the following information:
- The version of gobblin
- The IP address that corresponds to `hadoop-master`
- More log information after `at gobblin.data.management.copy.writer.FileAwareInputStreamDataWriter.writeImpl(FileAwareInputStreamDataWriter.java:222)`

Where do you print this info?

******************** copyablefile origin path hdfs://10.176.0.184:8020/data/test.txt

******************** target path hdfs://10.176.3.115:8020/home/etn/programs/gobblin-dist/workdir/task-staging/job_etndistcp_1513251826306/task_etndistcp_1513251826306_0/attempt_local305940860_0001_m_000000_0/test.txt

I saw you're still using gobblin version 0, which is not maintained any more. You might consider upgrading to the incubator version: https://github.com/apache/incubator-gobblin

Zhixiong
