Posted to issues@tajo.apache.org by "Jaehwa Jung (JIRA)" <ji...@apache.org> on 2016/03/16 11:43:33 UTC

[jira] [Updated] (TAJO-2087) Support DirectOutputCommitter for AWS S3 file system

     [ https://issues.apache.org/jira/browse/TAJO-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaehwa Jung updated TAJO-2087:
------------------------------
    Summary: Support DirectOutputCommitter for AWS S3 file system  (was: Implement DirectOutputCommitter)

> Support DirectOutputCommitter for AWS S3 file system
> ----------------------------------------------------
>
>                 Key: TAJO-2087
>                 URL: https://issues.apache.org/jira/browse/TAJO-2087
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: QueryMaster, S3
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>
> Currently, the Tajo output committer works as follows:
> * Each task writes its output to a temp directory.
> * {{FileTablespace::commitTable}} renames the first successful task's temp directory to the final destination.
> However, this approach can cause a {{FileNotFoundException}} because of S3's eventual consistency. To resolve this, we need to implement a DirectOutputCommitter.
> There are three possible ways to implement it.
> The first way is to change the naming scheme for the files Tajo creates. Instead of {{part-00000}}, we should use names like {{UUID_000000}}, where all files generated by a single INSERT INTO share the same prefix. The prefix consists of a UUID and the query id. This guarantees that a new INSERT INTO will not stomp on data produced by an earlier query. After a query finishes successfully, Tajo will delete all files that don't begin with the same UUID. Of course, when executing an INSERT INTO statement, Tajo never deletes existing files. But if the query fails or is killed, Tajo will delete all files that begin with the same UUID. I was inspired by Qubole's slides (http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
> The second way is to store the inserted file names and the existing file names in {{CatalogStore}} tables or in member variables of {{TaskAttemptContext}}. Before inserting files, Tajo would record the existing file names in some storage, and whenever a task attempt finishes, it would record the inserted file names. Tajo would then delete or keep files using the stored names according to the query's final status.
> The other way is to write the data to local disk first. This output committer works as follows:
> * Each task writes its output to local disk instead of S3 (in CTAS or INSERT statements).
> * The first successful task's temp directory is copied to S3.
> For reference, I was inspired by Netflix's slides on integrating Spark at petabyte scale (http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
> I wish to implement the DirectOutputCommitter using the first approach.
> Please feel free to comment if you have any questions/ideas.
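[Editor's note] The first approach described above (UUID-prefixed file names written directly to the final destination, with prefix-based cleanup on failure) can be sketched roughly as follows. This is an illustrative sketch only, not Tajo's actual API: the class name {{DirectOutputCommitterSketch}}, the use of {{java.nio.file}} in place of the Hadoop FileSystem API, and the prefix being a bare UUID (the issue proposes UUID plus query id) are all assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical sketch of the first approach from the issue: each query
// writes files named "<prefix>_<taskId>" directly into the final table
// directory, and cleanup on abort deletes only files with this query's
// prefix, so pre-existing files are never touched.
public class DirectOutputCommitterSketch {
    // In the proposal the prefix would combine a UUID and the query id;
    // a bare UUID is used here for simplicity.
    private final String queryPrefix = UUID.randomUUID().toString();

    // Each task writes to "<prefix>_<taskId>" in the final directory,
    // instead of "part-<taskId>" in a temp directory.
    public Path outputPathFor(Path tableDir, int taskId) {
        return tableDir.resolve(String.format("%s_%06d", queryPrefix, taskId));
    }

    // On failure or kill, delete only the files this query produced,
    // identified by the shared prefix.
    public void abort(Path tableDir) throws IOException {
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(tableDir, queryPrefix + "_*")) {
            for (Path f : files) {
                Files.delete(f);
            }
        }
    }
}
```

Because commit requires no rename, the eventually consistent S3 listing is never relied upon for correctness; only the abort path lists the directory, and a stray listing miss there leaves garbage rather than losing committed data.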



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)