Posted to issues@hive.apache.org by "Wenning Ding (Jira)" <ji...@apache.org> on 2019/10/31 07:39:00 UTC
[jira] [Updated] (HIVE-22438) Additional comma is added to projection column ids

     [ https://issues.apache.org/jira/browse/HIVE-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HIVE-22438:
--------------------------------
    Description: 
I ran into this issue when querying a Hudi dataset through Hive.

To query a Hudi-style table, Hudi implements its own InputFormat class and overrides the getRecordReader method. In this method, for its own reasons, Hudi manually adds several projection column ids and projection column names each time getRecordReader is called. A simplified version looks like this:

 
{code:java}
public RecordReader<NullWritable, ArrayWritable> getRecordReader(final InputSplit split, final JobConf job,
        final Reporter reporter) throws IOException {
    // Simplified: make sure the extra column is present in the projection names.
    if (!job.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).contains("col_a")) {
        job.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "col_a");
    }
    // Likewise for the projection ids.
    if (!job.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR).contains("1")) {
        job.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, "1");
    }
    return super.getRecordReader(split, job, reporter);
}
{code}
 

This causes a problem for COUNT(\*) or COUNT(1) queries. For these queries Hive doesn't need to read any column, so the projection column ids value starts out as an empty string.

Here is a log example to show the whole workflow.
{code:java}
[DEBUG] [TezChild] |split.TezGroupedSplitsInputFormat|: Init record reader for index 0 of 2
[INFO] [TezChild] |realtime.HoodieParquetRealtimeInputFormat|: Before adding Hoodie columns, Projections : Ids :
[INFO] [TezChild] |hadoop.HoodieParquetInputFormat|: After adding Hoodie columns, Projections :col_a Ids :1
[DEBUG] [TezChild] |split.TezGroupedSplitsInputFormat|: Init record reader for index 1 of 2
[INFO] [TezChild] |realtime.HoodieParquetRealtimeInputFormat|: Before adding Hoodie columns, Projections :col_a Ids :,1
{code}
As we can see, on the second call the projection ids value has become ",1", and that extra leading comma causes exceptions in the code that runs afterwards.
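
The log suggests the comma appears between the two calls, when the empty id list of the COUNT query is joined to the "1" left in the JobConf by the first call. Below is a minimal, self-contained sketch of that pattern. It assumes a join of this shape and is not Hive's actual code; the class ProjectionIdsSketch and the methods naiveJoin, guardedJoin, and parseIds are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionIdsSketch {

    // Hypothetical naive join: puts a comma in front of the old value
    // even when the new id string is empty.
    public static String naiveJoin(String newIds, String oldIds) {
        if (oldIds == null || oldIds.isEmpty()) {
            return newIds;
        }
        return newIds + "," + oldIds;
    }

    // Guarded join: skips whichever side is empty, so no stray comma appears.
    public static String guardedJoin(String newIds, String oldIds) {
        if (newIds == null || newIds.isEmpty()) {
            return oldIds == null ? "" : oldIds;
        }
        if (oldIds == null || oldIds.isEmpty()) {
            return newIds;
        }
        return newIds + "," + oldIds;
    }

    // Strict parser for a comma-separated id string; an empty token
    // makes Integer.parseInt throw NumberFormatException.
    public static List<Integer> parseIds(String ids) {
        List<Integer> result = new ArrayList<>();
        if (ids.isEmpty()) {
            return result;
        }
        for (String token : ids.split(",")) {
            result.add(Integer.parseInt(token));
        }
        return result;
    }

    public static void main(String[] args) {
        // COUNT(*) contributes no ids; "1" is what the first call left behind.
        String broken = naiveJoin("", "1");
        String fixed = guardedJoin("", "1");
        System.out.println("naive:   \"" + broken + "\"");
        System.out.println("guarded: \"" + fixed + "\"");
        System.out.println("parsed guarded: " + parseIds(fixed));
        try {
            parseIds(broken);
        } catch (NumberFormatException e) {
            System.out.println("parsing the naive result fails: " + e.getMessage());
        }
    }
}
```

In this sketch the naive join reproduces the ",1" value from the log, while the guarded join keeps the value as "1". Whether the real fix belongs in the join or in the parser is a separate question that the sketch does not settle.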

 

> Additional comma is added to projection column ids
> --------------------------------------------------
>
>                 Key: HIVE-22438
>                 URL: https://issues.apache.org/jira/browse/HIVE-22438
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Wenning Ding
>            Assignee: Wenning Ding
>            Priority: Major


