You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "George Wood (Jira)" <ji...@apache.org> on 2021/01/16 00:47:00 UTC
[jira] [Updated] (HIVE-24647) Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures

     [ https://issues.apache.org/jira/browse/HIVE-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Wood updated HIVE-24647:
-------------------------------
    Description: 
A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:
# 1. The query {{SELECT}}s non-primitive type columns. I have tested this with {{ARRAY}} and {{RECORD}} types. 
# 2. The query involves {{JOIN}}ing of two or more tables.
# 3. The tables are external, and they use the same storage type. This may occur with internal tables as well, but I am unable to test this in the Hive instance I have access to. The storage types I have tested this with are Parquet and Avro. If all tables use Parquet or all use Avro, the error is raised. If there is a mixture of storage types, the query runs successfully.

I tried to dig through code to find and fix the issue myself, but unfortunately I did not have a complete enough understanding of Hive and its dependencies to pinpoint the issue.

Further examples and explanation - given the following two table definitions:
{code:sql}
CREATE EXTERNAL TABLE experimental.join_test (
     int_col BIGINT,
     array_col ARRAY <BIGINT>,
     record_col STRUCT <a: BIGINT, b: BIGINT>
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", "long"]},
            {"name": "array_col",
             "type": ["null", {"items": ["null", "long"], "type": "array"}]},
            {"name": "record_col",
             "type": ["null",
                      {"fields": [{"name": "a", "type": ["null", "long"]},
                                  {"name": "b", "type": ["null", "long"]}],
                       "name": "record_col_0",
                       "type": "record"}]}],
 "name": "Root",
 "type": "record"}'
)
{code}

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test2 (
     int_col2 BIGINT,
     str_col STRING
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test2/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", "long"]},
            {"name": "str_col", "type": ["null", "string"]}],
 "name": "Root",
 "type": "record"}'
)
{code}

I can successfully query both of these tables, running the following queries:

{code:sql}
SELECT * FROM experimental.join_test;

1	[1,2]	{"a":1,"b":2}
2	[3,4]	{"a":3,"b":4}
3	[5,6]	{"a":5,"b":6}
{code}
{code:sql}
SELECT * FROM experimental.join_test2;

1	should-appear-after-joining
2	should-appear-after-joining
3	should-appear-after-joining
4	shouldnt-appear-after-joining
{code}

I can also join the tables together, so long as I do not select the complex type columns:
{code:sql}
SELECT int_col, int_col2, str_col
FROM experimental.join_test AS jt
JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;

3	3	should-appear-after-joining
2	2	should-appear-after-joining
1	1	should-appear-after-joining
{code}

But as soon as complex columns are introduced to the {{SELECT}} clause, an error arises.
{code:sql}
SELECT *
FROM experimental.george_test AS jt
JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
{code}
I will attach the full stack trace as a file - it is rather large. The lowest-level error message given is slightly different depending on which complex type is involved in raising the error.
* If it is an {{ARRAY}} type, the error message is {{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
* * If it is a {{RECORD}} type, the error message is {{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
If the tables are of different types - for example, if the table {{join_test}} was stored as Parquet instead of Avro - the query executes without issue.
From what I can tell by looking at the stack trace, Hive is attempting to run this query as a vectorized query, leading to the error.Some further points of support for this idea:
# 1. Vectorization is not supported in queries that involved complex type columns, which lines up with the fact that queries only fail when complex type columns are included in the query.
# 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}} allows these queries to execute and finish normally.

  was:
A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:
# 1. The query `SELECT`s non-primitive type columns. I have tested this with `ARRAY` and `RECORD` types. 
# 2. The query involves `JOIN`ing of two or more tables.
# 3. The tables are external, and they use the same storage type. This may occur with internal tables as well, but I am unable to test this in the Hive instance I have access to. The storage types I have tested this with are Parquet and Avro. If all tables use Parquet or all use Avro, the error is raised. If there is a mixture of storage types, the query runs successfully.

Given the following two table definitions:

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test (
     int_col BIGINT,
     array_col ARRAY <BIGINT>,
     record_col STRUCT <a: BIGINT, b: BIGINT>
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", "long"]},
            {"name": "array_col",
             "type": ["null", {"items": ["null", "long"], "type": "array"}]},
            {"name": "record_col",
             "type": ["null",
                      {"fields": [{"name": "a", "type": ["null", "long"]},
                                  {"name": "b", "type": ["null", "long"]}],
                       "name": "record_col_0",
                       "type": "record"}]}],
 "name": "Root",
 "type": "record"}'
)
{code}

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test2 (
     int_col2 BIGINT,
     str_col STRING
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test2/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", "long"]},
            {"name": "str_col", "type": ["null", "string"]}],
 "name": "Root",
 "type": "record"}'
)
{code}

I can successfully query both of these tables, running the following queries:

{code:sql}
SELECT * FROM experimental.join_test;

1	[1,2]	{"a":1,"b":2}
2	[3,4]	{"a":3,"b":4}
3	[5,6]	{"a":5,"b":6}
{code}
{code:sql}
SELECT * FROM experimental.join_test2;

1	should-appear-after-joining
2	should-appear-after-joining
3	should-appear-after-joining
4	shouldnt-appear-after-joining
{code}

I can also join the tables together, so long as I do not select the complex type columns:
{code:sql}
SELECT int_col, int_col2, str_col
FROM experimental.join_test AS jt
JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;

3	3	should-appear-after-joining
2	2	should-appear-after-joining
1	1	should-appear-after-joining
{code}

But as soon as complex columns are introduced to the `SELECT` clause, an error arises.
{code:sql}
SELECT *
FROM experimental.george_test AS jt
JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
{code}
I will attach the full stack trace as a file - it is rather large. The lowest-level error message given is slightly different depending on which complex type is involved in raising the error.* If it is an `ARRAY` type, the error message is `org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`* If it is a `RECORD` type, the error message is `org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`
If the tables are of different types - for example, if the table `join_test` was stored as Parquet instead of Avro - the query executes without issue.
From what I can tell by looking at the stack trace, Hive is attempting to run this query as a vectorized query, leading to the error.Some further points of support for this idea:1. Vectorization is not supported in queries that involved complex type columns, which lines up with the fact that queries only fail when complex type columns are included in the query.2. Explicitly setting `hive.vectorized.execution.enabled` to `'false'` allows these queries to execute and finish normally.


> Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-24647
>                 URL: https://issues.apache.org/jira/browse/HIVE-24647
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, hpl/sql, Query Processor, Vectorization
>    Affects Versions: 3.1.2
>         Environment: Hadoop and Hive instances running via Amazon EMR.
> Queries run via `hive` CLI.
> OS: Amazon Linux 2
>            Reporter: George Wood
>            Priority: Major
>         Attachments: hive_complex_join_err.txt
>
>
> A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:
> # 1. The query {{SELECT}}s non-primitive type columns. I have tested this with {{ARRAY}} and {{RECORD}} types. 
> # 2. The query involves {{JOIN}}ing of two or more tables.
> # 3. The tables are external, and they use the same storage type. This may occur with internal tables as well, but I am unable to test this in the Hive instance I have access to. The storage types I have tested this with are Parquet and Avro. If all tables use Parquet or all use Avro, the error is raised. If there is a mixture of storage types, the query runs successfully.
> I tried to dig through code to find and fix the issue myself, but unfortunately I did not have a complete enough understanding of Hive and its dependencies to pinpoint the issue.
> Further examples and explanation - given the following two table definitions:
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test (
>      int_col BIGINT,
>      array_col ARRAY <BIGINT>,
>      record_col STRUCT <a: BIGINT, b: BIGINT>
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test/'
> TBLPROPERTIES (
>   'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", "long"]},
>             {"name": "array_col",
>              "type": ["null", {"items": ["null", "long"], "type": "array"}]},
>             {"name": "record_col",
>              "type": ["null",
>                       {"fields": [{"name": "a", "type": ["null", "long"]},
>                                   {"name": "b", "type": ["null", "long"]}],
>                        "name": "record_col_0",
>                        "type": "record"}]}],
>  "name": "Root",
>  "type": "record"}'
> )
> {code}
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test2 (
>      int_col2 BIGINT,
>      str_col STRING
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test2/'
> TBLPROPERTIES (
>   'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", "long"]},
>             {"name": "str_col", "type": ["null", "string"]}],
>  "name": "Root",
>  "type": "record"}'
> )
> {code}
> I can successfully query both of these tables, running the following queries:
> {code:sql}
> SELECT * FROM experimental.join_test;
> 1	[1,2]	{"a":1,"b":2}
> 2	[3,4]	{"a":3,"b":4}
> 3	[5,6]	{"a":5,"b":6}
> {code}
> {code:sql}
> SELECT * FROM experimental.join_test2;
> 1	should-appear-after-joining
> 2	should-appear-after-joining
> 3	should-appear-after-joining
> 4	shouldnt-appear-after-joining
> {code}
> I can also join the tables together, so long as I do not select the complex type columns:
> {code:sql}
> SELECT int_col, int_col2, str_col
> FROM experimental.join_test AS jt
> JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> 3	3	should-appear-after-joining
> 2	2	should-appear-after-joining
> 1	1	should-appear-after-joining
> {code}
> But as soon as complex columns are introduced to the {{SELECT}} clause, an error arises.
> {code:sql}
> SELECT *
> FROM experimental.george_test AS jt
> JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> {code}
> I will attach the full stack trace as a file - it is rather large. The lowest-level error message given is slightly different depending on which complex type is involved in raising the error.
> * If it is an {{ARRAY}} type, the error message is {{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> * * If it is a {{RECORD}} type, the error message is {{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> If the tables are of different types - for example, if the table {{join_test}} was stored as Parquet instead of Avro - the query executes without issue.
> From what I can tell by looking at the stack trace, Hive is attempting to run this query as a vectorized query, leading to the error.Some further points of support for this idea:
> # 1. Vectorization is not supported in queries that involved complex type columns, which lines up with the fact that queries only fail when complex type columns are included in the query.
> # 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}} allows these queries to execute and finish normally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)