Posted to common-dev@hadoop.apache.org by "Arun Suresh (JIRA)" <ji...@apache.org> on 2019/04/16 19:24:00 UTC

[jira] [Created] (HADOOP-16260) Allow Distcp to create a new tempTarget file per File

Arun Suresh created HADOOP-16260:
------------------------------------

             Summary: Allow Distcp to create a new tempTarget file per File
                 Key: HADOOP-16260
                 URL: https://issues.apache.org/jira/browse/HADOOP-16260
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 2.9.2
            Reporter: Arun Suresh


We use DistCp to copy entire HDFS clusters to GCS. In the process, we hit the following error:
{noformat}
INFO: Encountered status code 410 when accessing URL https://www.googleapis.com/upload/storage/v1/b/ap10data1/o?ifGenerationMatch=0&name=analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0&uploadType=resumable&upload_id=AEnB2Uq4mZeZxXgs2Mhx0uskNpZ4Cka8pT4aCcd7v6UC4TDQx-h0uEFWoPpdOO4pWEdmaKnhTjxVva5Ow4vXbTe6_JScIU5fsQSaIwNkF3D84DHjtuhKSCU. Delegating to response handler for possible retry.
Apr 14, 2019 5:53:17 AM com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation call
SEVERE: Exception not convertible into handled response
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 429,
  "errors" : [ {
    "domain" : "usageLimits",
    "message" : "The total number of changes to the object ap10data1/analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "The total number of changes to the object ap10data1/analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
 
{noformat}
Looking at the code, a DistCp mapper gets a list of files to copy from the source to the target filesystem and handles each file in its list sequentially: it first creates/overwrites a temp file (*.distcp.tmp.attempt_local1083459072_0001_m_000000_0*), then copies the source file into the temp file, and finally renames the temp file to the actual target file.
The temp file name (which contains the task attempt ID) is reused for all the files in the mapper's batch. GCS enforces a rate limit on the number of changes per second to any single object, so even though we are actually creating a new file each time and renaming it to the final target, GCS treats every create/rename of the temp file as a change to the same object.
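
For reference, here is a paraphrased sketch of the current naming logic (based on RetriableFileCopyCommand#getTmpFile; simplified, not the exact source):
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.tools.DistCpConstants;

// Paraphrased sketch of RetriableFileCopyCommand#getTmpFile: the temp
// name depends only on the task attempt ID, so every file in a mapper's
// batch reuses the SAME temp object name.
private static Path getTmpFile(Path target, Mapper.Context context) {
  Path targetWorkPath = new Path(context.getConfiguration()
      .get(DistCpConstants.CONF_LABEL_TARGET_WORK_PATH));
  Path root = target.equals(targetWorkPath)
      ? targetWorkPath.getParent() : targetWorkPath;
  return new Path(root, ".distcp.tmp." + context.getTaskAttemptID());
}
{code}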

It is possible to play around with the number of maps, split size, etc., but it is hard to derive suitable values for them from any given rate limit.

Thus, we propose adding a flag that allows the DistCp mapper to use a different temp file per file.
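
As a rough illustration (the flag name {{distcp.use.tmp.file.per.file}} and the per-file sequence counter below are hypothetical, purely to show the shape of the change):
{code:java}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: the flag name and sequence scheme are made up
// for illustration and are not existing DistCp options.
private static final AtomicLong TMP_FILE_SEQ = new AtomicLong();

private static Path getTmpFile(Path root, Mapper.Context context) {
  Configuration conf = context.getConfiguration();
  String name = ".distcp.tmp." + context.getTaskAttemptID();
  if (conf.getBoolean("distcp.use.tmp.file.per.file", false)) {
    // A per-file suffix makes each copy write to a distinct object, so
    // GCS per-object rate limits on the shared temp name no longer apply.
    name += "." + TMP_FILE_SEQ.getAndIncrement();
  }
  return new Path(root, name);
}
{code}
With the flag off, behavior stays exactly as today; with it on, each copied file gets its own temp object, and the subsequent rename to the real destination proceeds as before.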

Thoughts? (cc [~steve_l], [~benoyantony])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org