Posted to issues@hive.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2017/12/01 01:13:01 UTC

[jira] [Commented] (HIVE-17361) Support LOAD DATA for transactional tables

    [ https://issues.apache.org/jira/browse/HIVE-17361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273778#comment-16273778 ] 

Alan Gates commented on HIVE-17361:
-----------------------------------

+1 based on discussion in review board.

> Support LOAD DATA for transactional tables
> ------------------------------------------
>
>                 Key: HIVE-17361
>                 URL: https://issues.apache.org/jira/browse/HIVE-17361
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>            Reporter: Wei Zheng
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-17361.07.patch, HIVE-17361.08.patch, HIVE-17361.09.patch, HIVE-17361.1.patch, HIVE-17361.10.patch, HIVE-17361.11.patch, HIVE-17361.12.patch, HIVE-17361.14.patch, HIVE-17361.16.patch, HIVE-17361.17.patch, HIVE-17361.19.patch, HIVE-17361.2.patch, HIVE-17361.20.patch, HIVE-17361.21.patch, HIVE-17361.23.patch, HIVE-17361.24.patch, HIVE-17361.25.patch, HIVE-17361.3.patch, HIVE-17361.4.patch
>
>
> LOAD DATA has not been supported on transactional tables since ACID was introduced. This gap between ACID tables and regular Hive tables needs to be filled.
> Current Documentation is under [DML Operations|https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations] and [Loading files into tables|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables]:
> \\
> * Load Data performs very limited validation of the data. In particular, it uses the input file name as-is, which may not be of the form 00000_0; this can break some read logic (and certainly will for Acid).
> * It does not check the schema of the file.  This may be a non-issue for Acid, which requires ORC; ORC is self-describing, so Schema Evolution may handle this seamlessly (assuming the schema is not too different).
> * It does check that the _InputFormat_s are compatible.
> * Bucketed (and thus sorted) tables don't support Load Data, but only if hive.strict.checks.bucketing=true (the default).  This restriction will be kept for Acid.
> * Load Data supports OVERWRITE clause
> * What happens to file permissions/ownership: rename vs copy differences
> \\
> The implementation will follow the same idea as in HIVE-14988 and use a base_N/ dir for OVERWRITE clause.
> \\
> How is minor compaction going to handle delta/base with original files?
> Since delta_8_8/_meta_data is created before files are moved, delta_8_8 becomes visible before it's populated.  Is that an issue?
> It's not, since txn 8 is not yet committed.
> h3. Implementation Notes/Limitations (patch 25)
> * bucketed/sorted tables are not supported
> * input file names must be of the form 00000_0/00000_0_copy_1; this is enforced (HIVE-18125)
> * Load Data creates a delta_x_x/ that contains new files
> * Load Data w/Overwrite creates a base_x/ that contains new files
> * A '_metadata_acid' file is placed in the target directory to indicate it requires special handling on read
> * The input files must be 'plain' ORC files, i.e. they must not contain Acid metadata columns, as would be the case if these files were copied from another Acid table.  In the latter case, the ROW_IDs embedded in the data may not make sense in the target table (if it's in a different cluster, for example).  Such files may also have a mix of committed and aborted data.
> ** this could be relaxed later by adding info to the _metadata_acid file to ignore existing ROW_IDs on read.
> * ROW_IDs are attached dynamically at read time and made permanent by compaction.  This is done the same way as the handling of files that were written to a table before it was converted to Acid.
> * Vectorization is supported
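
The file-name and directory conventions in the implementation notes above can be illustrated with a minimal Python sketch. This is not Hive's code: the regular expression is only an assumption inferred from the two example names (00000_0 and 00000_0_copy_1), the function names are hypothetical, and the unpadded directory names follow the delta_x_x/ and base_x/ forms used in the description (the names Hive actually writes on disk may be zero-padded).

```python
import re

# Assumed pattern, inferred from the examples 00000_0 and 00000_0_copy_1:
# digits, underscore, digits, then an optional _copy_N suffix.
INPUT_NAME = re.compile(r"[0-9]+_[0-9]+(_copy_[0-9]+)?")


def is_valid_input_name(name: str) -> bool:
    """Return True if the file name matches the assumed 00000_0 form."""
    return INPUT_NAME.fullmatch(name) is not None


def target_dir(write_id: int, overwrite: bool) -> str:
    """Name of the directory that receives the loaded files.

    Plain Load Data goes to delta_x_x/; Load Data with OVERWRITE goes to base_x/.
    Unpadded ids are used here to match the issue text (an assumption).
    """
    if overwrite:
        return f"base_{write_id}"
    return f"delta_{write_id}_{write_id}"
```

For example, under these assumptions a file named 00000_0_copy_1 loaded in txn 8 without OVERWRITE would land in delta_8_8/, while a file named part-00000.orc would be rejected by the name check.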


