Posted to issues@hive.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2020/07/21 09:18:00 UTC
[jira] [Commented] (HIVE-23889) Empty bucket files are inserted with invalid schema after HIVE-21784

    [ https://issues.apache.org/jira/browse/HIVE-23889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161891#comment-17161891 ] 

László Bodor commented on HIVE-23889:
-------------------------------------

This has been solved as part of HIVE-22538:
https://github.com/apache/hive/commit/964f08ae733b037c6e58dfb4ed149ccad2d3ddc0#diff-bb969e858664d98848960a801fd58b5cR579

I'm closing this as a duplicate, but we can use this ticket to track the fix for the schema issue:
{code}
-            OrcFile.WriterOptions wo = OrcFile.writerOptions(this.options.getConfiguration())
-                .inspector(rowInspector)
-                .callback(new OrcRecordUpdater.KeyIndexBuilder("testEmpty"));
-            OrcFile.createWriter(path, wo).close();
+            OrcFile.createWriter(path, writerOptions).close();
{code}

> Empty bucket files are inserted with invalid schema after HIVE-21784
> --------------------------------------------------------------------
>
>                 Key: HIVE-23889
>                 URL: https://issues.apache.org/jira/browse/HIVE-23889
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>
> HIVE-21784 uses a new WriterOptions instead of the field in OrcRecordUpdater:
> https://github.com/apache/hive/commit/f62379ba279f41b843fcd5f3d4a107b6fcd04dec#diff-bb969e858664d98848960a801fd58b5cR580-R583
> As a result, in the scenario below the overwrite creates an empty bucket file. That is fine, as it was the intention of that patch, but the file is created with an invalid schema:
> {code}
> CREATE TABLE test_table (
>    cda_id             int,
>    cda_run_id         varchar(255),
>    cda_load_ts        timestamp,
>    global_party_id    string)
> PARTITIONED BY (
>    cda_date           int,
>    cda_job_name       varchar(12))
> CLUSTERED BY (cda_id) 
> INTO 2 BUCKETS
> STORED AS ORC;
> INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 , cda_job_name = 'core_base')
> SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id
> UNION ALL
> SELECT 2 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id;
> ALTER TABLE test_table ADD COLUMNS (group_id string) CASCADE ;
> INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 , cda_job_name = 'core_base')
> SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id, 'group_id' as group_id;
> {code}
> Because of HIVE-21784, the new empty bucket_00000 shows this schema in the ORC file dump:
> {code}
> Type: struct<_col0:int,_col1:varchar(255),_col2:timestamp,_col3:string,_col4:string>
> {code}
> instead of:
> {code}
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<cda_id:int,cda_run_id:varchar(255),cda_load_ts:timestamp,global_party_id:string,group_id:string>>
> {code}
> This could lead to problems later, when Hive tries to read the file during split generation.
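
The difference between the two dumps quoted above can be checked mechanically. The following is a minimal sketch, assuming only the two "Type:" strings from the description. `AcidSchemaCheck`, `topLevelFields` and `looksLikeAcidSchema` are hypothetical names used for illustration; they are not Hive or ORC APIs, and this is not how Hive itself validates files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive or ORC: it only inspects the
// struct<...> type string printed by an ORC file dump.
final class AcidSchemaCheck {

    // Top-level field names of the expected layout, copied from the dump in this ticket.
    static final List<String> ACID_FIELDS = Arrays.asList(
        "operation", "originalTransaction", "bucket",
        "rowId", "currentTransaction", "row");

    // Returns the top-level field names of a struct<...> type string.
    static List<String> topLevelFields(String type) {
        String t = type.trim();
        if (!t.startsWith("struct<") || !t.endsWith(">")) {
            throw new IllegalArgumentException("not a struct type: " + type);
        }
        String body = t.substring("struct<".length(), t.length() - 1);
        List<String> names = new ArrayList<>();
        int depth = 0; // nesting depth inside <...> or (...)
        int start = 0; // start index of the current field
        for (int i = 0; i <= body.length(); i++) {
            // Treat the end of the string as a final separator.
            char c = i < body.length() ? body.charAt(i) : ',';
            if (c == '<' || c == '(') {
                depth++;
            } else if (c == '>' || c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                String field = body.substring(start, i);
                int colon = field.indexOf(':');
                if (colon > 0) {
                    names.add(field.substring(0, colon).trim());
                }
                start = i + 1;
            }
        }
        return names;
    }

    // True if the top-level fields match the expected layout exactly, in order.
    static boolean looksLikeAcidSchema(String type) {
        return topLevelFields(type).equals(ACID_FIELDS);
    }

    public static void main(String[] args) {
        String flat = "struct<_col0:int,_col1:varchar(255),_col2:timestamp,"
            + "_col3:string,_col4:string>";
        String acid = "struct<operation:int,originalTransaction:bigint,bucket:int,"
            + "rowId:bigint,currentTransaction:bigint,row:struct<cda_id:int,"
            + "cda_run_id:varchar(255),cda_load_ts:timestamp,"
            + "global_party_id:string,group_id:string>>";
        System.out.println(looksLikeAcidSchema(flat)); // prints false
        System.out.println(looksLikeAcidSchema(acid)); // prints true
    }
}
```

The sketch compares only the top-level field names, which is enough to tell the flat `_col0.._col4` layout from the wrapped layout with the nested `row` struct.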


