Posted to issues@hive.apache.org by "Sahil Takiar (JIRA)" <ji...@apache.org> on 2016/11/03 21:12:59 UTC

[jira] [Commented] (HIVE-14269) Performance optimizations for data on S3

    [ https://issues.apache.org/jira/browse/HIVE-14269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634285#comment-15634285 ] 

Sahil Takiar commented on HIVE-14269:
-------------------------------------

For those of you following this work: the original approach in this JIRA was to improve the write performance of Hive-on-S3 by staging all data on HDFS and then uploading it to S3 at the end of the job. This approach was implemented in HIVE-14270, and after spending some time on a full-scale benchmark, a few issues were uncovered.

It turns out that when data is staged on HDFS, it is uploaded to S3 inside the {{MoveTask}}. The {{MoveTask}} does a serial upload of the data from HDFS to S3. This upload can be pretty inefficient since the current {{S3AOutputStream}} buffers all the data on the local filesystem before issuing the S3 upload request that actually sends the data to S3. This means all data uploaded to S3 must flow through HS2, which is not a scalable solution.
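
For illustration, here is a minimal sketch of what that serial, per-file copy through the HS2 process looks like, using the Hadoop FileSystem API (the paths are made up, and this is not the actual {{MoveTask}} code):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SerialHdfsToS3Copy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hypothetical staging and final locations.
    Path src = new Path("hdfs://namenode:8020/tmp/hive-staging/job-output");
    Path dst = new Path("s3a://my-bucket/warehouse/my_table/");

    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);

    // Files are copied one after another, and every byte read from HDFS is
    // buffered on the local disk of this process by S3AOutputStream before
    // the actual S3 upload request is issued.
    FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource */ false, conf);
  }
}
{code}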

The {{MoveTask}} has some logic to trigger a distcp job if the amount of data to be copied is large, but there are a number of issues with this approach. Hive has a few bugs in the logic that determines whether the distcp job is needed (see HIVE-14864). Distcp itself also has issues, since each map task renames files on S3 a few times. Furthermore, using distcp would only be useful if it were modified to implement a "direct write" approach (similar to what was suggested in HIVE-14271). After talking to a few members of the community (see HIVE-14271, HIVE-14535, and SPARK-10063), it seems that the direct-write approach has a large number of problems in production environments.
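
To give a rough idea of that decision, it boils down to a size / file-count threshold check before falling back to a serial copy; a simplified sketch is below (the property names and defaults are assumptions for illustration, not necessarily what Hive uses today):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DistCpDecisionSketch {
  // Returns true if the data to copy is large enough that launching a
  // distributed copy (distcp) is worth it instead of a serial copy.
  static boolean shouldUseDistCp(Configuration conf, FileSystem fs, Path src)
      throws Exception {
    // Property names and default values here are assumptions for illustration.
    long maxSize = conf.getLong("hive.exec.copyfile.maxsize", 32L * 1024 * 1024);
    long maxFiles = conf.getLong("hive.exec.copyfile.maxnumfiles", 1L);
    ContentSummary summary = fs.getContentSummary(src);
    return summary.getLength() > maxSize || summary.getFileCount() > maxFiles;
  }
}
{code}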

To summarize, if all the data from a query is staged on HDFS, there isn't a great way to get that data from HDFS to S3.

Instead, an alternative approach would be as follows:
* Modify Hive so that all intermediate MR jobs write to HDFS, but the last MR job writes to a scratch directory on S3
* HiveServer copies the data from the scratch directory on S3 to the final table location on S3
** These copies will just copy data from one S3 location to another S3 location, so the copy is done server-side, within S3
** HS2 just needs to issue COPY requests to S3 to get the data copied; this is essentially one HTTP request per object, so it should be pretty lightweight (see the sketch below)
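
To illustrate that second step, a server-side S3-to-S3 copy can be issued from HS2 with a single copy request per object; a minimal sketch using the AWS SDK for Java (the bucket name and keys are made up, and this is not meant as the final implementation):

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3ServerSideCopySketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Hypothetical scratch and final locations in the same bucket.
    String bucket = "my-warehouse-bucket";
    String scratchKey = "tmp/hive-scratch/query-1/000000_0";
    String finalKey = "warehouse/my_table/000000_0";

    // The bytes are copied entirely inside S3; HS2 only issues the HTTP
    // request and never streams the object's data through itself.
    s3.copyObject(bucket, scratchKey, bucket, finalKey);
  }
}
{code}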

The hope is that this revised approach will offer better scalability when writing to S3.

Any thoughts / comments / suggestions / feedback on this is greatly appreciated.

> Performance optimizations for data on S3
> ----------------------------------------
>
>                 Key: HIVE-14269
>                 URL: https://issues.apache.org/jira/browse/HIVE-14269
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 2.1.0
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>
> Working with tables that reside on Amazon S3 (or any other object store) has several performance impacts when reading or writing data, as well as consistency issues.
> This JIRA is an umbrella task to monitor all the performance improvements that can be done in Hive to work better with S3 data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)