Posted to issues@tajo.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/03/16 11:39:33 UTC
[jira] [Commented] (TAJO-2087) Implement DirectOutputCommitter

    [ https://issues.apache.org/jira/browse/TAJO-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197160#comment-15197160 ] 

ASF GitHub Bot commented on TAJO-2087:
--------------------------------------

GitHub user blrunner opened a pull request:

    https://github.com/apache/tajo/pull/979

    TAJO-2087: Implement DirectOutputCommitter

    Here is prototype code for ``DirectOutputCommitter``. This PR is not ready for review; it only shows my approach to implementing ``DirectOutputCommitter``. The current version works as follows:
    
    - Register the commit history in the catalog (TODO).
    - Each task writes its output data directly to the final location.
    - In the commit phase, delete existing files according to the query type, as follows: first, back up the existing files or directories to the staging directory, and then delete the backed-up files or directories.
    - Update the status of the commit history in the catalog (TODO).
    - If the query fails, QueryMaster will delete the committed files and update the status of the query history in the catalog (TODO).
    - When ``TajoMaster`` starts, it will check the status of the query histories in the catalog. If it finds a running query, it will delete the committed files and update the status of the query history (TODO).
    - Add unit test cases for failed queries (TODO).
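
    The commit and failure steps listed above can be illustrated with a minimal sketch. This is not the code in this PR: it uses ``java.nio.file`` on a local file system as a stand-in for the real storage layer, and the class name ``DirectCommitSketch``, its methods, and the assumption that the staging directory lives outside the final directory are all hypothetical. It only relies on the idea, taken from the commit log, that output file names are prefixed with the query id.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; names and layout are assumptions, not Tajo APIs.
class DirectCommitSketch {

    // List direct children of dir whose name does (or does not) start with prefix.
    static List<Path> children(Path dir, String prefix, boolean matching) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.filter(p -> p.getFileName().toString().startsWith(prefix) == matching)
                    .collect(Collectors.toList());
        }
    }

    // Delete a file or a whole directory tree, deepest entries first.
    static void deleteRecursively(Path path) throws IOException {
        if (!Files.exists(path)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(path)) {
            List<Path> paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path p : paths) {
                Files.delete(p);
            }
        }
    }

    // Commit phase: back up entries not written by this query, then drop the backup.
    // Assumes stagingDir is outside finalDir.
    static void commit(Path finalDir, Path stagingDir, String queryPrefix) throws IOException {
        Files.createDirectories(stagingDir);
        for (Path old : children(finalDir, queryPrefix, false)) {
            Files.move(old, stagingDir.resolve(old.getFileName()));
        }
        deleteRecursively(stagingDir);
    }

    // Failure path: remove only what this query wrote to the final location.
    static void rollback(Path finalDir, String queryPrefix) throws IOException {
        for (Path written : children(finalDir, queryPrefix, true)) {
            deleteRecursively(written);
        }
    }
}
```

    In this sketch, ``commit`` leaves only the files carrying the query prefix in the final directory, while ``rollback`` leaves only the files that existed before the query started.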

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrunner/tajo direct-output-committer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #979
    
----
commit 083bed51db1e68ed840961e2e169695dde60e116
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-24T02:08:08Z

    Add the list of output files and backup files to TaskAttemptContext

commit b39c8d1bcb153d53aae028577935499034bd4b6f
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-24T05:31:55Z

    Add outputFiles and backupFiles to Protocol Buffer

commit e3b26ea738ba33e1a6c8b8c856793f5a584eb861
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-24T05:48:02Z

    Add property for setting Direct Output Committer to TajoConf and SessionVars

commit 9efb4662957ff39ff215a3c829ece5e69d9ebe36
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-25T01:59:26Z

    Remove related property from SessionVars

commit 234f2829768f18fab7c7894aab2ccf7780ae3ffb
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-03-04T02:44:52Z

    Add temporary codes for testing

commit 7effec1fc663d246ffd3e25bfd4a98c803b22607
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-03-15T09:01:43Z

    Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into direct-output-committer

commit cb762766848c2af5d25e20ab552a2041c67924cc
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-03-15T09:30:36Z

    Prefix of output file name must be the id of query.

commit dce41c6be686916a346dc15a033bea39cc79550b
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-03-16T05:50:32Z

    Implement direct Output Committer to FileTablespace

commit 908ccd2b6c2ebbd602892b979c1ff41d7ed4a820
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-03-16T06:30:45Z

    Implement a method for renaming recursively directories

commit bd1e1b3f16e8b6263ef4e762b621a4ba2235aa34
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-03-16T06:43:56Z

    Remove proto modifications

commit 95e513a04bcfee10643ebe17b0e21074057f0be2
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-03-16T10:21:06Z

    Add session variable and add more unit test cases

----


> Implement DirectOutputCommitter
> -------------------------------
>
>                 Key: TAJO-2087
>                 URL: https://issues.apache.org/jira/browse/TAJO-2087
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: QueryMaster, S3
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>
> Currently, the Tajo output committer works as follows:
> * Each task writes its output to a temp directory.
> * {{FileTablespace::commitTable}} renames the first successful task's temp directory to the final destination.
> However, this approach can cause a {{FileNotFoundException}} because of the eventual consistency of S3. To resolve this, we need to implement DirectOutputCommitter.
> There are three possible ways to implement it.
> The first way is to change the naming scheme for the files Tajo creates. Instead of {{part-00000}}, we should use names like {{UUID_000000}}, where all files generated by a single INSERT INTO use the same prefix. The prefix consists of a UUID and the query id. This guarantees that a new INSERT INTO will not stomp on data produced by an earlier query. After the query finishes successfully, Tajo will delete all files that don't begin with the same UUID. Of course, when executing an INSERT INTO statement, Tajo never deletes existing files. But if the query fails or is killed, Tajo will delete all files that begin with the same UUID. I was inspired by Qubole's slides (http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
> The second way is to store the inserted file names and the existing file names in tables of {{CatalogStore}} or in member variables of {{TaskAttemptContext}}. Before inserting files, Tajo will store the existing file names in some storage. Whenever a task attempt finishes, Tajo will store the inserted file names in that storage. Tajo will then delete or keep files, using the stored file names, according to the final status of the query.
> The third way is to write the data to local disk. This output committer works as follows:
> * Each task writes its output to local disk instead of S3 (in a CTAS statement or an INSERT statement).
> * It copies the first successful task's temp directory to S3.
> For reference, I was inspired by the Netflix "Integrating Spark" slides (http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
> I wish to implement DirectOutputCommitter using the first way.
> Please feel free to comment if you have any questions/ideas.
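
The first approach in the quoted description (a shared per-query file-name prefix) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class ``PrefixNamingSketch``, its method names, the way the UUID and query id are joined, and the ``overwrite`` flag that separates a replacing query from a plain INSERT INTO are all hypothetical and are not taken from Tajo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.UUID;

// Hypothetical illustration only; none of these names are Tajo APIs.
class PrefixNamingSketch {

    // Assumed format: a random UUID joined with the query id.
    static String newPrefix(String queryId) {
        return UUID.randomUUID() + "-" + queryId;
    }

    // Builds a name such as "<prefix>_000003" instead of "part-00003".
    static String outputFileName(String prefix, int taskSeq) {
        return String.format(Locale.ROOT, "%s_%06d", prefix, taskSeq);
    }

    // On success: a replacing query drops every file that lacks its prefix;
    // a plain INSERT INTO (overwrite == false) never deletes existing files.
    static List<String> filesToDeleteOnSuccess(List<String> names, String prefix, boolean overwrite) {
        List<String> result = new ArrayList<>();
        if (!overwrite) {
            return result;
        }
        for (String name : names) {
            if (!name.startsWith(prefix + "_")) {
                result.add(name);
            }
        }
        return result;
    }

    // On failure or kill: drop only the files this query produced.
    static List<String> filesToDeleteOnFailure(List<String> names, String prefix) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (name.startsWith(prefix + "_")) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Because both cleanup decisions depend only on the prefix, no rename of a temp directory is needed, which is the step that the description says suffers from S3's eventual consistency.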



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)