Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/20 13:46:44 UTC

[GitHub] [hudi] Limess opened a new issue #3834: [SUPPORT] - Athena query fails

Limess opened a new issue #3834:
URL: https://github.com/apache/hudi/issues/3834


   **Describe the problem you faced**
   
   Querying the snapshot table (suffix `_rt`) with Amazon Athena fails when the schema contains nested fields.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a table with a column `entity_salience` of the following type: `array<struct<salience:double,salience_rank:bigint,wiki_title:string>>`
   2. Attempt to query the table with Athena
   
   **Environment Description**
   
   EMR 6.4.0
   
   Athena workgroup V2 (experienced on 2021/10/20)
   
   * Hudi version :
   
   0.9.0
   0.8.0-amzn1
   
   * Spark version :
   
   3.1.2
   
   * Hive version :
   
   Hive 3.1.2
   
   * Hadoop version :
   
   Amazon 3.2.1
   
   * Storage (HDFS/S3/GCS..) :
   
   S3
   
   * Running on Docker? (yes/no) :
   
   no
   
   **Additional context**
   
   We have several columns which produce this issue, the schemas are as follows:
   
   * `array<struct<offset:bigint,overlapping:boolean,position:string,rule_based_entity:boolean,sentiment:struct<compound:double,neg:double,neu:double,pos:double>,signal_type:string,surface_form:string,wiki_title:string>>`
   * `array<struct<id:string,score:string>>`
   				
   
   The pattern isn't obvious from the column schemas alone; for example, a column with this schema has no issues:
   
   `array<struct<end:bigint,start:bigint,text:string>>`
   
   **Stacktrace**
   
   ```
   HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://prod-signal-hudi-experiment-datalake/hudi/documents_datalake_from_parquet_merge_on_read_upsert_v2/story_published_date=2020-01-30/cf99fa1e-a678-4dd7-a36e-72e57d50a936-0_16-34-337_20211019174019.parquet (offset=33554432, length=33554432) using org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat: Can't redefine: array
   This query ran against the "pipeline_reprocessing_hudi_experiment" database, unless qualified by the query. Please post the error message on our forum  or contact customer support  with Query Id: f1c60df8-e018-4210-962c-2cbb21aaa18c
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-947885486


   @Limess thanks for raising the AWS support case. As an open-source project, we're not able to investigate a fully managed service like Athena. If you get a chance to reproduce it with open-source Trino, then it's possible to investigate. I'll leave this issue open for some time; please follow up here if you get any updates from AWS support.
   
   cc @codope 





[GitHub] [hudi] nsivabalan commented on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-961617240


   CC @codope 





[GitHub] [hudi] gudladona commented on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
gudladona commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-984822133


   CC @codope @vinothchandar 





[GitHub] [hudi] Limess commented on issue #3834: [SUPPORT] - Athena query fails

Posted by GitBox <gi...@apache.org>.
Limess commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-947828305


   This looks like this issue: https://github.com/apache/hudi/issues/2657





[GitHub] [hudi] Limess edited a comment on issue #3834: [SUPPORT] - Athena query fails

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-947828305


   This looks like this issue: https://github.com/apache/hudi/issues/2657.
   
   Is this expected to be a problem with the writer or with the reader, i.e. Athena?





[GitHub] [hudi] nsivabalan commented on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-997307677


   Closing the GitHub issue, as it has been root-caused to the parquet upgrade. Feel free to follow the Jira for updates. We are looking to get the parquet upgrade for 0.11.0.
   Thanks for reporting. If you feel the issue is not related to the parquet lib, feel free to re-open.





[GitHub] [hudi] Limess edited a comment on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-948727351


   I was able to reproduce this using Presto (i.e. the former PrestoDB, not Trino) on EMR:
   
   EMR 6.4.0
   Presto 0.254.1
   
   ```
   presto:pipeline_reprocessing_hudi_experiment> SELECT * FROM pipeline_reprocessing_hudi_experiment.documents_datalake_from_merge_on_read_upsert_v2_rt limit 1;
   
   Query 20211021_152409_00026_asfmp, FAILED, 1 node
   Splits: 119 total, 0 done (0.00%)
   0:02 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20211021_152409_00026_asfmp failed: Can't redefine: array
   ```





[GitHub] [hudi] xushiyan commented on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-986164208


   Another parquet upgrade request. Tracking it in https://issues.apache.org/jira/browse/HUDI-2811





[GitHub] [hudi] Limess edited a comment on issue #3834: [SUPPORT] - Athena query fails

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-947828305


   This looks like this issue: https://github.com/apache/hudi/issues/2657.
   
   Is this expected to be a problem with the writer or with the reader, i.e. Athena?
   
   I've raised an issue with AWS support around this also.





[GitHub] [hudi] Limess edited a comment on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-948727351


   I was able to reproduce this using Presto (i.e. the former PrestoDB, not Trino) on EMR:
   
   EMR 6.4.0
   Presto 0.254.1
   
   ```
   presto-cli --catalog hive --schema pipeline_reprocessing_hudi_experiment
   
   presto:pipeline_reprocessing_hudi_experiment> SELECT * FROM pipeline_reprocessing_hudi_experiment.documents_datalake_from_merge_on_read_upsert_v2_rt limit 1;
   
   Query 20211021_152409_00026_asfmp, FAILED, 1 node
   Splits: 119 total, 0 done (0.00%)
   0:02 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20211021_152409_00026_asfmp failed: Can't redefine: array
   ```





[GitHub] [hudi] Limess commented on issue #3834: [SUPPORT] - Athena query fails

Posted by GitBox <gi...@apache.org>.
Limess commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-947823109


   After trying to narrow this down, we're beginning to suspect it occurs whenever two columns of type `array` appear anywhere in the schema: we have two `array` columns which we can each load and query on their own without issues, but when both are present we run into the above issue/stacktrace.
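
The suspicion above can be turned into a quick check over a table's column types. This is a minimal Python sketch, not part of Hudi or Athena: the helper names and the column names other than `entity_salience` are hypothetical, and it only flags tables with two or more `array<struct<...>>` columns, which is the pattern suspected here rather than a confirmed rule.

```python
def is_array_of_struct(hive_type):
    """Return True for Hive types of the form array<struct<...>>."""
    t = hive_type.strip().lower()
    return t.startswith("array<struct<") and t.endswith(">>")


def array_struct_columns(columns):
    """List the columns (name -> Hive type string) that are arrays of structs."""
    return [name for name, hive_type in columns.items()
            if is_array_of_struct(hive_type)]


def may_hit_redefine_error(columns):
    """Flag the suspected trigger: two or more array<struct<...>> columns."""
    return len(array_struct_columns(columns)) >= 2


# Column types taken from this issue; "scores" and "spans" are made-up names.
both = {
    "entity_salience": "array<struct<salience:double,salience_rank:bigint,wiki_title:string>>",
    "scores": "array<struct<id:string,score:string>>",
}
single = {"spans": "array<struct<end:bigint,start:bigint,text:string>>"}

print(array_struct_columns(both), may_hit_redefine_error(both))
print(array_struct_columns(single), may_hit_redefine_error(single))
```

A check like this only narrows down which tables are worth testing; whether a query actually fails still depends on the reader.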





[GitHub] [hudi] Limess edited a comment on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-948727351


   I was able to reproduce this using Presto (i.e. the former PrestoDB, not Trino) on EMR:
   
   EMR 6.4.0
   Presto 0.254.1
   
   ```
   presto-cli --catalog hive --schema pipeline_reprocessing_hudi_experiment
   
   presto:pipeline_reprocessing_hudi_experiment> SELECT * FROM documents_datalake_from_merge_on_read_upsert_v2_rt limit 1;
   
   Query 20211021_152409_00026_asfmp, FAILED, 1 node
   Splits: 119 total, 0 done (0.00%)
   0:02 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20211021_152409_00026_asfmp failed: Can't redefine: array
   ```





[GitHub] [hudi] nsivabalan edited a comment on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-997307677


   Closing the GitHub issue, as it has been root-caused to the parquet upgrade. Feel free to follow the Jira for updates. We are looking to get the parquet upgrade for 1.11.0.
   Thanks for reporting. If you feel the issue is not related to the parquet lib, feel free to re-open.





[GitHub] [hudi] nsivabalan closed issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3834:
URL: https://github.com/apache/hudi/issues/3834


   





[GitHub] [hudi] gudladona commented on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
gudladona commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-984238436


   I think this is fixed in https://github.com/apache/parquet-mr/pull/560. Upgrading parquet-avro to >=1.11.0 should address this issue.
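
For intuition about the `Can't redefine: array` message, here is a toy Python model. It assumes (this is not confirmed in the thread) that the reader turns each array's element struct into a named record called `array`, and that the schema parser refuses to define the same name twice; `qualify=True` stands in for a fix that gives each nested record a distinct name. None of these classes or functions are real parquet-avro or Avro APIs, and `scores` is a made-up column name.

```python
class SchemaParseError(Exception):
    """Raised when a named type is defined a second time."""


class Names:
    """Toy table of named record types, keyed by record name."""

    def __init__(self):
        self._defined = {}

    def define(self, name, fields):
        # Reject any second definition under an already-used name.
        if name in self._defined:
            raise SchemaParseError(f"Can't redefine: {name}")
        self._defined[name] = fields

    def defined(self):
        return sorted(self._defined)


def convert(columns, qualify):
    """Register one record per array column; optionally qualify its name."""
    names = Names()
    for column, element_fields in columns.items():
        record_name = f"{column}.array" if qualify else "array"
        names.define(record_name, element_fields)
    return names.defined()


columns = {
    "entity_salience": ["salience", "salience_rank", "wiki_title"],
    "scores": ["id", "score"],  # hypothetical column name
}

try:
    convert(columns, qualify=False)
except SchemaParseError as err:
    print(err)  # the second array column collides on the name "array"

print(convert(columns, qualify=True))
```

In this model a single array column never collides, which matches the observation that each column can be queried on its own.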





[GitHub] [hudi] Limess edited a comment on issue #3834: [SUPPORT] - Athena query fails

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-947828305


   This looks like this issue: https://github.com/apache/hudi/issues/2657.
   
   Is this expected to be a problem with the writer or with the reader, i.e. Athena? We've tried both Hudi 0.9.0 and 0.8.0, which to me suggests it's purely on the reader side.
   
   I've raised an issue with AWS support around this also.





[GitHub] [hudi] Limess edited a comment on issue #3834: [SUPPORT] - Athena query fails

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-947828305


   This looks like this issue: https://github.com/apache/hudi/issues/2657.
   
   Is this expected to be a problem with the writer or with the reader, i.e. Athena? We've tried both Hudi 0.9.0 and 0.8.0, which to me suggests it's purely on the reader side; the suggestion is that this couldn't be reproduced on later Hudi versions.
   
   I've raised an issue with AWS support around this also.





[GitHub] [hudi] Limess commented on issue #3834: [SUPPORT] - AWS Athena snapshot query fails if there are two or more record array fields in a MoR table

Posted by GitBox <gi...@apache.org>.
Limess commented on issue #3834:
URL: https://github.com/apache/hudi/issues/3834#issuecomment-948727351


   I was able to reproduce this using Presto on EMR
   
   EMR 6.4.0
   Presto 0.254.1
   
   ```
   presto:pipeline_reprocessing_hudi_experiment> SELECT * FROM pipeline_reprocessing_hudi_experiment.documents_datalake_from_merge_on_read_upsert_v2_rt limit 1;
   
   Query 20211021_152409_00026_asfmp, FAILED, 1 node
   Splits: 119 total, 0 done (0.00%)
   0:02 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20211021_152409_00026_asfmp failed: Can't redefine: array
   ```

