You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2021/05/09 08:25:00 UTC

[jira] [Updated] (IMPALA-7471) Impala crashes or returns incorrect results when querying parquet nested types

     [ https://issues.apache.org/jira/browse/IMPALA-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Quanlong Huang updated IMPALA-7471:
-----------------------------------
    Target Version: Impala 4.1  (was: Impala 4.0)

> Impala crashes or returns incorrect results when querying parquet nested types
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-7471
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7471
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Csaba Ringhofer
>            Priority: Critical
>              Labels: complextype, correctness, crash, parquet
>         Attachments: test_users_131786401297925138_0.parquet
>
>
> From http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-bug-with-nested-arrays-of-structures-where-some-of/m-p/78507/highlight/false#M4779
> {quote}We found a case where Impala returns incorrect values from simple query. Our data contains nested array of structures and structures contains other structures.
> We generated minimal sample data allowing to reproduce the issue.
>  
> SQL to create a table:
> {quote}
> {code}
> CREATE TABLE plat_test.test_users (
>   id INT,
>   name STRING,   
>   devices ARRAY<
>     STRUCT<
>       id:STRING,
>       device_info:STRUCT<
>         model:STRING
>       >
>     >
>   >
> )
> STORED AS PARQUET
> {code}
> {quote}
> Please put attached parquet file to the location of the table and refresh the table.
> In sample data we have 2 users, one with 2 devices, second one with 3. Some of the devices.device_info.model fields are NULL.
>  
> When I issue a query:
> {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u,
> u.devices d;
> {code}
>  {quote}
> I'm expecting to get 5 records in results, but getting only one1.png
> If I change query to:
>  {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u
> LEFT OUTER JOIN u.devices d;
>  {code}
> {quote}
> I'm getting two records in the results, but still not as it should be.
> We found some workaround to this problem. If we add to the result columns device.id we will get all records from parquet file:
> {quote}
> {code}
> SELECT u.name, d.id, d.device_info.model as model
> FROM test_users u
> , u.devices d
>  {code}
> {quote}
> And result is 3.png
>  
> But we can't rely on this workaround, because we don't need device.id in all queries and Impala optimizes it, and as a result we are getting unpredicted results.
>  
> I tested Hive query on this table and it returns expected results:
> {quote}
> {code}
> SELECT u.name, d.device_info.model
> FROM test_users u
> lateral view outer inline (u.devices) d;
>  {code}
> {quote}
> results:
> 4.png
> Please advice if it's a problem in Impala engine or we did some mistake in our query.
>  
> Best regards,
> Come2Play team.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org