Posted to issues@hive.apache.org by "Eugene Koifman (JIRA)" <ji...@apache.org> on 2017/06/19 01:50:01 UTC

[jira] [Comment Edited] (HIVE-16177) non Acid to acid conversion doesn't handle _copy_N files

    [ https://issues.apache.org/jira/browse/HIVE-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052903#comment-16052903 ] 

Eugene Koifman edited comment on HIVE-16177 at 6/19/17 1:49 AM:
----------------------------------------------------------------

The file list is sorted to make sure there is consistent ordering for both read and compact.
Compaction needs to process the whole list of files (for a bucket) and assign ROW_IDs consistently.
For read, OrcRawRecordReader just has a split from some file.  So I need to make sure to order the files the same way, so that the "offset" for the current file is computed the same way as for compaction.

Since Hive doesn't restrict the layout of files in a table very well, sorting is the most general way to do this.
For example, suppose some "feature" places bucket files in subdirectories - sorting the whole list of "original" files makes this work with any directory layout.

Same goes for when we allow non-bucketed tables - files can be anywhere and they need to be "numbered" consistently.  Sorting seems like the simplest way to do this.

Putting a Comparator in AcidUtils makes sense.
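A minimal standalone sketch of what such a comparator plus the offset computation could look like (the OriginalFile type, ORIGINAL_FILE_ORDER, and offsetFor are hypothetical illustration names, not actual AcidUtils code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OriginalFileOrder {
    // Hypothetical record of an "original" (pre-conversion) file and its row count.
    static class OriginalFile {
        final String path;
        final long rowCount;
        OriginalFile(String path, long rowCount) { this.path = path; this.rowCount = rowCount; }
    }

    // Lexicographic order on the full path, so files in subdirectories
    // still get one deterministic global ordering.
    static final Comparator<OriginalFile> ORIGINAL_FILE_ORDER =
        Comparator.comparing((OriginalFile f) -> f.path);

    // The ROW__ID offset for a file is the total row count of every file
    // that sorts before it; reader and compactor agree as long as both
    // use the same comparator.
    static long offsetFor(List<OriginalFile> files, String target) {
        List<OriginalFile> sorted = new ArrayList<>(files);
        sorted.sort(ORIGINAL_FILE_ORDER);
        long offset = 0;
        for (OriginalFile f : sorted) {
            if (f.path.equals(target)) return offset;
            offset += f.rowCount;
        }
        throw new IllegalArgumentException("unknown file: " + target);
    }

    public static void main(String[] args) {
        List<OriginalFile> files = List.of(
            new OriginalFile("000001_0_copy_1", 1),
            new OriginalFile("000001_0", 1));
        System.out.println(offsetFor(files, "000001_0"));        // prints 0
        System.out.println(offsetFor(files, "000001_0_copy_1")); // prints 1
    }
}
```

Note the input list above is deliberately unsorted: the comparator, not the listing order, decides which file contributes the first row ids.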

"totalSize" is probably because I run the tests on Mac.  Stats often differ on Mac.



> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>
>                 Key: HIVE-16177
>                 URL: https://issues.apache.org/jira/browse/HIVE-16177
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 0.14.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Blocker
>         Attachments: HIVE-16177.01.patch, HIVE-16177.02.patch, HIVE-16177.04.patch, HIVE-16177.07.patch, HIVE-16177.08.patch, HIVE-16177.09.patch, HIVE-16177.10.patch, HIVE-16177.11.patch, HIVE-16177.14.patch, HIVE-16177.15.patch
>
>
> {noformat}
> create table T(a int, b int) clustered by (a)  into 2 buckets stored as orc TBLPROPERTIES('transactional'='false')
> insert into T(a,b) values(1,2)
> insert into T(a,b) values(1,3)
> alter table T SET TBLPROPERTIES ('transactional'='true')
> {noformat}
>     //we should now have bucket files 000001_0 and 000001_0_copy_1
> but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be copy_N files and numbers rows in each bucket from 0 thus generating duplicate IDs
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid\":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> attached patch has a few changes to make Acid even recognize copy_N but this is just a pre-requisite.  The new UT demonstrates the issue.
> Furthermore,
> {noformat}
> alter table T compact 'major'
> select ROW__ID, INPUT__FILE__NAME, a, b from T order by b
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0}	file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001	1	2
> {noformat}
> HIVE-16177.04.patch has TestTxnCommands.testNonAcidToAcidConversion0() demonstrating this
> This is because the compactor doesn't handle copy_N files either (it skips them)
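The numbering bug in the quoted repro can be sketched standalone (BUCKET_FILES, perFileRowIds, and globalRowIds are hypothetical illustration names, not the actual OrcRawRecordMerger code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CopyNRowIds {
    // Per-file row counts for one bucket, as in the repro: after the two
    // inserts, 000001_0 and 000001_0_copy_1 each hold one row.
    static final Map<String, Integer> BUCKET_FILES = new LinkedHashMap<>();
    static {
        BUCKET_FILES.put("000001_0", 1);
        BUCKET_FILES.put("000001_0_copy_1", 1);
    }

    // Buggy numbering: each file restarts rowid at 0, so both rows of
    // bucket 1 collide on ROW__ID (transactionid=0, bucketid=1, rowid=0).
    static List<Integer> perFileRowIds() {
        List<Integer> ids = new ArrayList<>();
        for (int rows : BUCKET_FILES.values()) {
            for (int rowid = 0; rowid < rows; rowid++) ids.add(rowid);
        }
        return ids;
    }

    // Fixed numbering: offset each file by the rows of the files before it
    // in a consistent ordering, so every row gets a unique rowid.
    static List<Integer> globalRowIds() {
        List<Integer> ids = new ArrayList<>();
        int offset = 0;
        for (int rows : BUCKET_FILES.values()) {
            for (int rowid = 0; rowid < rows; rowid++) ids.add(offset + rowid);
            offset += rows;
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(perFileRowIds()); // prints [0, 0] -> duplicate ROW__IDs
        System.out.println(globalRowIds());  // prints [0, 1] -> unique ROW__IDs
    }
}
```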



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)