Posted to issues@hive.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2017/06/16 23:15:02 UTC
[jira] [Commented] (HIVE-16177) non Acid to acid conversion doesn't handle _copy_N files
[ https://issues.apache.org/jira/browse/HIVE-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052501#comment-16052501 ]
Owen O'Malley commented on HIVE-16177:
--------------------------------------
* You still have some trailing space issues.
* Why are you sorting the files?
* Your comment on HadoopShims.HdfsFileStatusWithId.compareTo() is pretty confusing. It needs more context on why you are sorting the files and why a particular sort order is required. Since there is only one implementation in the shims, the comparison doesn't seem to belong there. I'd suggest making a comparator in AcidUtils.
* Why is the totalSize going down so much in the test results?
I'm still going through the record merger change.
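To make the comparator suggestion concrete, here is a minimal sketch of what a standalone ordering for original bucket files could look like. It is not the patch's code and not an existing Hive API: the class name CopyFileOrder, the name pattern, and the choice to order by bucket and then by copy number are all assumptions for illustration, and it works on plain file-name strings rather than HdfsFileStatusWithId.

```java
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: orders original bucket files so that
// 000001_0 comes before 000001_0_copy_1, which comes before 000001_0_copy_2.
class CopyFileOrder {
    // Assumed name layout: <bucket>_<attempt>, optionally followed by _copy_<N>.
    private static final Pattern NAME =
        Pattern.compile("(\\d+)_(\\d+)(?:_copy_(\\d+))?");

    // Bucket number parsed from the leading digits of the file name.
    static int bucket(String name) {
        return Integer.parseInt(match(name).group(1));
    }

    // Copy number; a file without a _copy_N suffix is treated as copy 0.
    static int copyNumber(String name) {
        String copy = match(name).group(3);
        return copy == null ? 0 : Integer.parseInt(copy);
    }

    private static Matcher match(String name) {
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + name);
        }
        return m;
    }

    // Order by bucket, then by copy number, then by name as a stable tie-break.
    static final Comparator<String> ORDER = Comparator
        .comparingInt((String n) -> bucket(n))
        .thenComparingInt(n -> copyNumber(n))
        .thenComparing(Comparator.naturalOrder());
}
```

Sorting a list of names with files.sort(CopyFileOrder.ORDER) would then give a deterministic order within each bucket, and the comment on such a comparator is the natural place to state why that order is needed.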
> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>
> Key: HIVE-16177
> URL: https://issues.apache.org/jira/browse/HIVE-16177
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 0.14.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Priority: Blocker
> Attachments: HIVE-16177.01.patch, HIVE-16177.02.patch, HIVE-16177.04.patch, HIVE-16177.07.patch, HIVE-16177.08.patch, HIVE-16177.09.patch, HIVE-16177.10.patch, HIVE-16177.11.patch, HIVE-16177.14.patch, HIVE-16177.15.patch
>
>
> {noformat}
> create table T(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES('transactional'='false')
> insert into T(a,b) values(1,2)
> insert into T(a,b) values(1,3)
> alter table T SET TBLPROPERTIES ('transactional'='true')
> {noformat}
> //we should now have bucket files 000001_0 and 000001_0_copy_1
> but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be copy_N files and numbers rows in each bucket from 0, thus generating duplicate IDs
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> The attached patch has a few changes to make Acid even recognize copy_N files, but this is just a prerequisite. The new UT demonstrates the issue.
> Furthermore,
> {noformat}
> alter table T compact 'major'
> select ROW__ID, INPUT__FILE__NAME, a, b from T order by b
> {noformat}
> produces
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0} file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001 1 2
> {noformat}
> HIVE-16177.04.patch has TestTxnCommands.testNonAcidToAcidConversion0() demonstrating this
> This is because the compactor doesn't handle copy_N files either (it skips them).
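The duplicate row IDs in the quoted description can be reproduced in miniature. The sketch below is not Hive code: the class RowIdOffsets and its methods are hypothetical, and carrying a running offset across the files of a bucket is only one possible way to keep IDs unique, assuming the files are visited in a fixed order. It contrasts restarting at 0 for every file with continuing the numbering.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of Hive.
class RowIdOffsets {
    // First row ID per file when every file restarts numbering at 0,
    // which is the behaviour the report describes.
    static long[] perFileNumbering(long[] rowCounts) {
        return new long[rowCounts.length]; // all zeros
    }

    // First row ID per file when numbering continues across the files of
    // one bucket, visited in a fixed order (assumed fix strategy).
    static long[] runningOffsets(long[] rowCounts) {
        long[] first = new long[rowCounts.length];
        long next = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            first[i] = next;
            next += rowCounts[i];
        }
        return first;
    }

    public static void main(String[] args) {
        // One row in 000001_0 and one row in 000001_0_copy_1, as in the report.
        long[] rowCounts = {1, 1};
        System.out.println(Arrays.toString(perFileNumbering(rowCounts))); // [0, 0]
        System.out.println(Arrays.toString(runningOffsets(rowCounts)));   // [0, 1]
    }
}
```

With per-file numbering both rows get rowid 0, matching the duplicate output above. With a running offset the second file starts at 1, but only if every reader agrees on the order in which the files are visited.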