Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/19 08:53:19 UTC

[GitHub] [hudi] qianchutao opened a new issue, #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

qianchutao opened a new issue, #6142:
URL: https://github.com/apache/hudi/issues/6142

   **version:**
   hudi: aws-emr 0.9
   presto: aws-emr Presto 0.261
   
   **description:**
   When I use Presto to query a Hudi RO table, an exception is thrown if the queried fields include `_hoodie_is_deleted`. This problem does not occur when I query RT tables, and no exception is thrown if the RO table does not contain `_hoodie_is_deleted`.
   
   **exception description:**
   Caused by: com.facebook.presto.spi.PrestoException: The column _hoodie_is_deleted of table ods.ods_e_ads_sp_product_report_ro is declared as type boolean, but the Parquet file (s3://xxxxxxxxx/855934a7-23ec-4d38-875e-166a2ed1c942-0_0-910-930_20220719083454.parquet) declares the column as type BINARY
   	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.getParquetType(ParquetPageSourceFactory.java:399)
   	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.getColumnType(ParquetPageSourceFactory.java:519)
   	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.lambda$createParquetPageSource$2(ParquetPageSourceFactory.java:246)
   	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
   	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
   	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
   	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
   	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
   	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
   	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
   	at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:546)
   	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:250)
   	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:177)
   	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:403)
   	at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:184)
   	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:64)
   	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:78)
   	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:262)
   	at com.facebook.presto.operator.Driver.processInternal(Driver.java:418)
   	at com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301)
   	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722)
   	at com.facebook.presto.operator.Driver.processFor(Driver.java:294)
   	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1078)
   	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
   	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:599)
   	at com.facebook.presto.$gen.Presto_0_261_amzn_0____20220711_082906_1.run(Unknown Source)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:750)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leobiscassi commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1203995256

   > > Hey @codope, I'd like to add this to the docs, could I tackle this task?
   > 
   > Sure. Please go ahead.
   
   Nice, I'm going to submit this today 👍🏽




[GitHub] [hudi] codope closed issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
codope closed issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception
URL: https://github.com/apache/hudi/issues/6142




[GitHub] [hudi] leobiscassi commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1198147081

   Hi Hudi community, I'm experiencing a similar issue: for some tables in my data lake, we get the following error when trying to query them:
   
   [16777224] Query failed (#20220727_185609_00434_4n5pr): The column my_column_name_here of table my_tablename_here is declared as type string, but the Parquet file (s3a://bucket/prefix/befb27ee-ee21-4791-95bb-d8aeb521aff9-0_15-22-5118_20220629223504.parquet) declares the column as type INT32 com.facebook.presto.spi.PrestoException: The column my_column_name_here of table my_tablename_here is declared as type string, but the Parquet file (s3a://bucket/prefix/befb27ee-ee21-4791-95bb-d8aeb521aff9-0_15-22-5118_20220629223504.parquet) declares the column as type INT32
   
   **My environment**
   hudi: amzn 0.10.1 / amzn 0.11.0 on EMR
   presto: 0.267 / 0.272 on EMR
   
   What I've done trying to fix it until now:
   
   - Tested with more than one Hudi version (0.10.1 and 0.11.0)
   - Copied the jar `hudi-presto-bundle.jar` from EMR to the Presto installation
   - Followed [this](https://stackoverflow.com/questions/60183579/presto-fails-with-type-mismatch-errors) Stack Overflow thread and tried setting `hive.parquet.use-column-names=true` in the `hive.properties` file on EMR
   
   None of this worked. Does anyone know how to deal with it, or whether it is a bug in the integration?
   
   
   




[GitHub] [hudi] leobiscassi commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1214212714

   @codope sorry for the delay, [here](https://github.com/apache/hudi/pull/6391) is the pull request, could you take a look?




[GitHub] [hudi] codope commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
codope commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1203842663

   > Hey @codope, I'd like to add this to the docs, could I tackle this task?
   
   Sure. Please go ahead.




[GitHub] [hudi] qianchutao commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
qianchutao commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1203445830

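   > Hey @qianchutao, I was able to fix this on my side and maybe the solution can help you too. Basically, this error happens because of a mismatch in column order between the schema declared inside the Parquet files and the table schema DDL on Athena / Presto. This normally works on Athena because Athena maps columns by name by default [1], whereas Presto maps them by column index by default [2]. So when you have schema evolution, or for some reason the column order doesn't match between the Parquet files and the table schema, this starts to happen. It is not related to Hudi itself.
   > 
   > To fix this, add the config `hive.parquet.use-column-names=true` under the EMR config tab or at start-up time; this will update the config files and restart the Presto cluster. If you want to do this on a running cluster, you need to do it on the master and worker nodes and restart Presto; otherwise the config won't take effect.
   > 
   > Let me know if this helps 😁
   > 
   > [1] https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html [2] https://stackoverflow.com/questions/60183579/presto-fails-with-type-mismatch-errors
   
   Thanks bro, I'll try it.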




[GitHub] [hudi] leobiscassi commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1199808432

   Hey @qianchutao, I was able to fix this on my side and maybe the solution can help you too.
   Basically, this error happens because of a mismatch in column order between the schema declared inside the Parquet files and the table schema DDL on Athena / Presto. This normally works on Athena because Athena maps columns by name by default [1], whereas Presto maps them by column index by default [2]. So when you have schema evolution, or for some reason the column order doesn't match between the Parquet files and the table schema, this starts to happen. It is not related to Hudi itself.
   
   To fix this, add the config `hive.parquet.use-column-names=true` under the EMR config tab or at start-up time; this will update the config files and restart the Presto cluster. If you want to do this on a running cluster, you'll need to do it on the master and worker nodes and restart Presto; otherwise the config won't take effect.
   
   Let me know if this helps 😁 
   
   [1] https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
   [2] https://stackoverflow.com/questions/60183579/presto-fails-with-type-mismatch-errors
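
The index-versus-name mapping described in the comment above can be illustrated with a small simulation. This is only a sketch, not Presto's actual resolution code: the table and file schemas are hypothetical, and simplified type names are used on both sides instead of Parquet physical types such as BINARY or INT32.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, simplified type name)

# Hypothetical table DDL as registered in the metastore.
TABLE: List[Column] = [
    ("id", "bigint"),
    ("_hoodie_is_deleted", "boolean"),
    ("name", "string"),
]

# Hypothetical Parquet file whose columns were written in a different order.
FILE_COLS: List[Column] = [
    ("id", "bigint"),
    ("name", "string"),
    ("_hoodie_is_deleted", "boolean"),
]


def resolve_by_index(table: List[Column], file_cols: List[Column]) -> Dict[str, str]:
    """Pair each table column with the file column at the same position."""
    return {
        name: file_cols[i][1]
        for i, (name, _) in enumerate(table)
        if i < len(file_cols)
    }


def resolve_by_name(table: List[Column], file_cols: List[Column]) -> Dict[str, str]:
    """Pair each table column with the file column that has the same name."""
    file_types = dict(file_cols)
    return {name: file_types[name] for name, _ in table if name in file_types}


def mismatches(table: List[Column], resolved: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (column, declared type, file type) for every conflicting column."""
    return [
        (name, declared, resolved[name])
        for name, declared in table
        if name in resolved and resolved[name] != declared
    ]


by_index = mismatches(TABLE, resolve_by_index(TABLE, FILE_COLS))
by_name = mismatches(TABLE, resolve_by_name(TABLE, FILE_COLS))
print("by index:", by_index)
print("by name: ", by_name)
```

With positional mapping, the two reordered columns are compared against each other's types, which is the shape of the "declared as type boolean, but the Parquet file declares the column as type ..." error. With name-based mapping, the same file resolves without conflicts.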




[GitHub] [hudi] leobiscassi commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1203040823

   Hey @codope, I'd like to add this to the docs, could I tackle this task?




[GitHub] [hudi] codope commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception

Posted by GitBox <gi...@apache.org>.
codope commented on issue #6142:
URL: https://github.com/apache/hudi/issues/6142#issuecomment-1202043112

   `hive.parquet.use-column-names=true` is the right workaround. Hudi does not enforce this by default, so one needs to set this config in case of a type mismatch. This should go in the Hudi docs. HUDI-4522 to track.
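
For EMR users, one way to apply this setting at cluster start-up is through a configuration classification. The sketch below only builds and prints the JSON. The helper name is hypothetical, and the classification name `presto-connector-hive` is an assumption that should be checked against the documentation for the EMR release in use. On a running cluster, the equivalent is adding the property line to `hive.properties` on every node and restarting Presto, as described earlier in the thread.

```python
import json
from typing import Dict, List


def presto_hive_classification(properties: Dict[str, str]) -> List[dict]:
    """Build an EMR-style configuration list for Presto's Hive connector.

    "presto-connector-hive" is assumed to be the classification that maps
    to Presto's hive.properties file; verify it for your EMR release.
    """
    return [
        {
            "Classification": "presto-connector-hive",
            "Properties": dict(properties),
        }
    ]


# The workaround from this thread: map Parquet columns by name, not by index.
config = presto_hive_classification({"hive.parquet.use-column-names": "true"})
print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into the cluster's software configuration when launching, so that the property is written to the connector config on all nodes before Presto starts.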

