Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/19 14:21:57 UTC

[GitHub] [hudi] noobarcitect opened a new issue #2461: All records are present in athena query result on glue crawled Hudi tables

noobarcitect opened a new issue #2461:
URL: https://github.com/apache/hudi/issues/2461


   We are in the POC stage of implementing Apache Hudi in our existing AWS data lake and pipeline, and there is one issue we are stuck on:
   1. We inserted a record into a Hudi table in COW mode, and then ran an upsert updating that initially inserted record.
   2. This Hudi table is then crawled by an AWS Glue crawler.
   3. When we query the table from Athena, we get all 3 records, but what we want in the Athena query result is only the latest record.
   4. One reason we came across is that the Glue crawler registers the Hudi files as plain parquet, with `MapredParquetInputFormat` as the input format rather than `HoodieParquetInputFormat`.
   
   Q: Will there be support in Glue crawlers to identify `HoodieParquetInputFormat` as the input format?
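   For context on step 3: a Hudi snapshot query is supposed to return only the latest version of each record key, which is exactly the filtering `HoodieParquetInputFormat` performs; reading the raw parquet files with the generic input format surfaces every file version. A rough Python sketch of that snapshot semantics, using hypothetical records with Hudi's commit-time metadata column:
   
   ```python
   # Illustrative only: mimic Hudi snapshot-read semantics (latest version per key).
   # Records carry Hudi's _hoodie_* metadata columns; the values are hypothetical.
   rows = [
       {"_hoodie_record_key": "id1", "_hoodie_commit_time": "20210119101500", "amount": 10},
       {"_hoodie_record_key": "id1", "_hoodie_commit_time": "20210119113000", "amount": 25},
   ]
   
   def snapshot_view(records):
       """Keep only the row with the greatest commit time for each record key."""
       latest = {}
       for r in records:
           key = r["_hoodie_record_key"]
           if key not in latest or r["_hoodie_commit_time"] > latest[key]["_hoodie_commit_time"]:
               latest[key] = r
       return list(latest.values())
   
   print(snapshot_view(rows))  # only the upserted version (amount=25) survives
   ```
   
   A generic parquet read would return both rows above; the registered input format is what makes the engine apply this latest-version filter.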
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vrtrepp commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

Posted by GitBox <gi...@apache.org>.
vrtrepp commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-764444438


   Hi @rubenssoto,
   That is how we are planning to do it, but it will involve adding a few more steps to the pipeline. However, our current architecture is based on running Glue crawlers, and removing them would mean changing many pipelines again, which is at least a month's work.
   
   What I was curious about: is there any support for this that Hudi is going to add in the future?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-852791741


   @vrtrepp @noobarcitect Closing this issue since the proposed solution is straightforward. If you still need help making your glue connector work, please feel free to re-open. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] rubenssoto commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-763159279


   Hello, how are you?
   
   With Hudi it is not necessary to run Glue crawlers; Hudi can sync directly to Glue:
   
   
   'hoodie.datasource.write.hive_style_partitioning': 'true',
     'hoodie.datasource.hive_sync.enable': 'true',
     'hoodie.datasource.hive_sync.table': tableName,
     'hoodie.datasource.hive_sync.database': 'raw_courier_api',
     'hoodie.datasource.hive_sync.partition_fields': 'created_date_brt',
     'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
     'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://localhost:10000'
   
   For example, with these options Hudi creates my table and adds my partitions; in place of localhost you have to put your Hive server.
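   In context, these hive-sync options are passed together with the usual write options to the Spark datasource. A minimal sketch of the combined options dict (the table name, record key, and precombine field here are hypothetical placeholders, not from the thread):
   
   ```python
   # Sketch: assembling Hudi write + hive-sync options for a Spark datasource write.
   # "orders", "order_id", and "updated_at" are hypothetical placeholders.
   table_name = "orders"
   
   hudi_options = {
       # Write-side options
       "hoodie.table.name": table_name,
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.recordkey.field": "order_id",
       "hoodie.datasource.write.precombine.field": "updated_at",
       "hoodie.datasource.write.partitionpath.field": "created_date_brt",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       # Hive/Glue sync options, as in the comment above
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.table": table_name,
       "hoodie.datasource.hive_sync.database": "raw_courier_api",
       "hoodie.datasource.hive_sync.partition_fields": "created_date_brt",
       "hoodie.datasource.hive_sync.partition_extractor_class":
           "org.apache.hudi.hive.MultiPartKeysValueExtractor",
       "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://localhost:10000",
   }
   
   # With a SparkSession and a DataFrame `df`, the write would look like:
   # df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
   ```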


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-771407719


   @vrtrepp @noobarcitect Are you able to use the hive-sync tool to resolve your issue ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-765188290


   @vrtrepp you could potentially modify your existing glue step to run the hive-sync tool as an additional step? At a high level, without `HoodieParquetInputFormat` as the registered input format, it's impossible for Athena to do the filtering necessary to give you just the latest records.
   Just saw this today: https://aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/ , not sure if it's directly helpful for you
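   For reference, the standalone hive-sync step mentioned above can be run with the script shipped with Hudi. The invocation below is an illustrative sketch only; the script location, flags, and all paths/credentials/names are placeholders, so check them against your Hudi version's HiveSyncTool options:
   
   ```shell
   # Illustrative invocation of Hudi's standalone hive-sync tool.
   # Script path, flags, and all values are placeholders; verify against
   # the HiveSyncTool options of your Hudi version.
   ./hudi-hive-sync/run_sync_tool.sh \
     --jdbc-url jdbc:hive2://localhost:10000 \
     --user hive --pass '' \
     --base-path s3://my-bucket/hudi/raw_courier_api/orders \
     --database raw_courier_api \
     --table orders \
     --partitioned-by created_date_brt \
     --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor
   ```
   
   Running this after each write registers the table with the correct input format, so Athena can apply the snapshot filtering without a Glue crawler in the loop.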


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash closed issue #2461: All records are present in athena query result on glue crawled Hudi tables

Posted by GitBox <gi...@apache.org>.
n3nash closed issue #2461:
URL: https://github.com/apache/hudi/issues/2461


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org