Posted to common-dev@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2019/04/18 17:23:00 UTC

[jira] [Resolved] (HADOOP-16260) Allow Distcp to create a new tempTarget file per File

     [ https://issues.apache.org/jira/browse/HADOOP-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-16260.
-------------------------------------
    Resolution: Won't Fix

> Allow Distcp to create a new tempTarget file per File
> -----------------------------------------------------
>
>                 Key: HADOOP-16260
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16260
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Arun Suresh
>            Priority: Major
>
> We use distcp to copy entire HDFS clusters to GCS.
>  In the process, we hit the following error:
> {noformat}
> INFO: Encountered status code 410 when accessing URL https://www.googleapis.com/upload/storage/v1/b/app/o?ifGenerationMatch=0&name=analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0&uploadType=resumable&upload_id=AEnB2Uq4mZeZxXgs2Mhx0uskNpZ4Cka8pT4aCcd7v6UC4TDQx-h0uEFWoPpdOO4pWEdmaKnhTjxVva5Ow4vXbTe6_JScIU5fsQSaIwNkF3D84DHjtuhKSCU. Delegating to response handler for possible retry.
> Apr 14, 2019 5:53:17 AM com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation call
> SEVERE: Exception not convertible into handled response
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
> {
>   "code" : 429,
>   "errors" : [ {
>     "domain" : "usageLimits",
>     "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
>     "reason" : "rateLimitExceeded"
>   } ],
>   "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
> }
>         at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
>         at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
>         at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
>         at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
>         at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
>         at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
>         at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>  
> {noformat}
> Looking at the code, a DistCp mapper gets a list of files to copy from the source to the target filesystem. The mapper handles each file in its list sequentially: it first creates/overwrites a temp file (*.distcp.tmp.attempt_local1083459072_0001_m_000000_0*), then copies the source file to the temp file, and finally renames the temp file to the actual target file.
>  The temp file name (which contains the task attempt ID) is reused for every file in the mapper's batch. GCS enforces a rate limit on the number of mutations per second to any single object, and even though we are actually creating a new file and renaming it to the final target each time, GCS treats all of these operations as changes to the same object.
> While it is possible to tune the number of maps, the split size, etc., it is hard to derive values for them from any given rate limit.
> Thus, we propose adding a flag that allows the DistCp mapper to use a different temp file PER source file.
> Thoughts? (cc/[~steve_l], [~benoyantony])
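For illustration, a minimal sketch of the proposed naming scheme (hypothetical helper class and method names, not actual DistCp code): keep the task-attempt prefix but append a per-file unique suffix, so each copy targets a distinct GCS object and the per-object mutation limit is never hit.

```java
import java.util.UUID;

// Hypothetical sketch of the per-file temp-name proposal; not actual DistCp code.
public class TempNameSketch {

    // Current behaviour: one temp name per task attempt, reused for every
    // file in the mapper's batch -- GCS sees repeated mutations of one object.
    static String taskScopedTempName(String attemptId) {
        return ".distcp.tmp." + attemptId;
    }

    // Proposed behaviour (behind a flag): add a per-file unique suffix so
    // each copy writes to a distinct object name.
    static String fileScopedTempName(String attemptId) {
        return ".distcp.tmp." + attemptId + "." + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String attempt = "attempt_local1083459072_0001_m_000000_0";
        System.out.println(taskScopedTempName(attempt));
        // Two calls yield distinct names, one per file copied:
        System.out.println(!fileScopedTempName(attempt)
                .equals(fileScopedTempName(attempt)));
    }
}
```

The suffix here is a random UUID purely for the sketch; an index or the source path's hash would work equally well, at the cost of leaving more temp objects to clean up if a task fails mid-batch.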



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org