Posted to issues@hive.apache.org by "Sergio Peña (JIRA)" <ji...@apache.org> on 2016/09/17 04:52:20 UTC
[jira] [Comment Edited] (HIVE-14776) Skip 'distcp' call when copying data from HDFS to S3

    [ https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15498177#comment-15498177 ] 

Sergio Peña edited comment on HIVE-14776 at 9/17/16 4:51 AM:
-------------------------------------------------------------

You're right, distcp does not use S3 as a temporary location. While debugging the code, I saw a '/user/hdfs/.Trash' directory created on S3 with data files being written to it, but after more investigation, I saw that they were copied by Hive when using INSERT OVERWRITE (old data being backed up).

Anyway, distcp is still slower than not using distcp at all. I have no idea why. I ran several tests with different file sizes (the times below are for copying a single file):

{noformat}
1G
S3 with distcp: 93s
S3 with no distcp: 37s

5G
S3 with distcp: 255s
S3 with no distcp: 147s
{noformat}
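
To put the timings above in proportion, here is a minimal sketch that computes how many times slower the distcp path was in each test. The class and method names are hypothetical and only the four timings come from the measurements above. It works out to roughly 2.5x for the 1G file and 1.7x for the 5G file.

```java
import java.util.Locale;

// Hypothetical helper, not part of Hive: compares the two copy timings.
final class DistcpTimings {

    // Returns how many times slower the distcp copy was than the direct copy.
    static double slowdown(double withDistcpSeconds, double withoutDistcpSeconds) {
        return withDistcpSeconds / withoutDistcpSeconds;
    }

    public static void main(String[] args) {
        // Timings taken from the test results quoted above.
        System.out.printf(Locale.ROOT, "1G slowdown: %.2fx%n", slowdown(93, 37));
        System.out.printf(Locale.ROOT, "5G slowdown: %.2fx%n", slowdown(255, 147));
    }
}
```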

INSERT ... SELECT statements are going to create several files depending on the MR jobs and HDFS block sizes, and they might be slower than the 5G case.

The S3A adapter should already manage multi-part uploads using the Amazon API. Perhaps this is why distcp and S3A do not work well together?



> Skip 'distcp' call when copying data from HDFS to S3
> ----------------------------------------------------
>
>                 Key: HIVE-14776
>                 URL: https://issues.apache.org/jira/browse/HIVE-14776
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-14776.1.patch, HIVE-14776.2.patch
>
>
> Hive uses 'distcp' to copy files in parallel between HDFS encryption zones when the file to copy is larger than the {{hive.exec.copyfile.maxsize}} threshold. This 'distcp' call is also executed when copying to S3, where it makes copies slower.
> We should not invoke distcp when copying to blobstore systems.
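
The decision described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the attached patches: the class, the method names, and the list of blobstore schemes are all hypothetical, and the threshold is passed in as a plain number rather than read from {{hive.exec.copyfile.maxsize}}.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the proposed behavior; not Hive's actual implementation.
final class CopyStrategy {

    // Assumed set of URI schemes treated as blobstores (illustrative only).
    private static final Set<String> BLOBSTORE_SCHEMES =
            new HashSet<>(Arrays.asList("s3", "s3a", "s3n"));

    // True when the destination URI uses one of the assumed blobstore schemes.
    static boolean isBlobstore(URI destination) {
        String scheme = destination.getScheme();
        return scheme != null
                && BLOBSTORE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    // distcp is used only for files above the threshold, and never for blobstores.
    static boolean shouldUseDistcp(long fileSizeBytes, long maxCopySizeBytes, URI destination) {
        if (isBlobstore(destination)) {
            return false; // skip distcp when copying to S3 and similar stores
        }
        return fileSizeBytes > maxCopySizeBytes;
    }

    public static void main(String[] args) {
        long threshold = 32L * 1024 * 1024;       // example threshold value
        long fileSize = 5L * 1024 * 1024 * 1024;  // a 5G file, as in the tests
        System.out.println(shouldUseDistcp(fileSize, threshold, URI.create("hdfs://nn/warehouse/t")));
        System.out.println(shouldUseDistcp(fileSize, threshold, URI.create("s3a://bucket/t")));
    }
}
```

With this shape, a large file copied between HDFS locations still goes through distcp, while the same file copied to an s3a:// destination takes the direct copy path.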


