Posted to issues@hive.apache.org by "Aaron Fabbri (JIRA)" <ji...@apache.org> on 2018/04/26 23:09:00 UTC

[jira] [Comment Edited] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

    [ https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455520#comment-16455520 ] 

Aaron Fabbri edited comment on HIVE-16295 at 4/26/18 11:08 PM:
---------------------------------------------------------------

This is a really cool prototype [~stakiar], thank you for doing this. I don't have much Hive knowledge but will try to spend some more time looking at the code.  I'm also happy to work w/ [~stevel@apache.org] on stabilizing the _SUCCESS file manifest (which enumerates the files committed) if that works for your dynamic partitioning problem.

edit: need more coffee.


was (Author: fabbri):
This is a really cool prototype [~stakiar], thank you for doing this. I don't have much Hive knowledge but will try to spend some more time looking at the code.  I'm also happy to work w/ [~stevel@apache.org] on stabilizing the _SUCCESS file manifest (which enumerates the uploaded-but-not-completed multipart uploads to S3) if that works for your dynamic partitioning problem.

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a {{NullOutputCommitter}} and its own commit logic, which is spread across {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with S3Guard and does a safe, coordinated commit of data on S3 inside individual tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}}, there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from task retries or speculative execution) should not step on each other


