Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/28 06:42:19 UTC

[GitHub] [hudi] stevenayers opened a new issue, #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

stevenayers opened a new issue, #5455:
URL: https://github.com/apache/hudi/issues/5455

   Hi All,
   
   I'm currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0 (soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).
   
   In Iceberg, you are able to do the following to query the Glue catalog:
   ```python
   df = glueContext.create_dynamic_frame.from_options(
           connection_type="marketplace.spark",
           connection_options={
               "path": "my_catalog.my_glue_database.my_iceberg_table",
               "connectionName": "Iceberg Connector for Glue 3.0",
           },
           transformation_ctx="IcebergDyF",
       ).toDF()
   ```
   
   I'd like to do something similar with Hudi:
   ```python
   df = glueContext.create_dynamic_frame.from_options(
           connection_type="marketplace.spark",
           connection_options= {
               "className": "org.apache.hudi",
               "hoodie.table.name": "my_hudi_table",
               "hoodie.consistency.check.enabled": "true",
               "hoodie.datasource.hive_sync.use_jdbc": "false",
               "hoodie.datasource.hive_sync.database": "my_glue_database",
               "hoodie.datasource.hive_sync.table":  "my_hudi_table",
               "hoodie.datasource.hive_sync.enable": "true",
               "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
               "hoodie.datasource.hive_sync.partition_fields": partition_key
           },
           transformation_ctx="HudiDyF",
       )
   ```
   
   Meaning we don't need to grab the S3 path of our data from boto3 every time, like so:
   ```python
   import boto3

   client = boto3.client('glue')
   response = client.get_table(
       DatabaseName='my_glue_database',
       Name='my_hudi_table'
   )  # <<----- don't want this
   targetPath = response['Table']['StorageDescriptor']['Location']  # <<----- or this
   df = glueContext.create_dynamic_frame.from_options(
           connection_type="marketplace.spark",
           connection_options= {
               "className": "org.apache.hudi",
               "path": targetPath,  # <<----- or this
               "hoodie.table.name": "my_hudi_table",
               "hoodie.consistency.check.enabled": "true",
               "hoodie.datasource.hive_sync.use_jdbc": "false",
               "hoodie.datasource.hive_sync.database": "my_glue_database",
               "hoodie.datasource.hive_sync.table":  "my_hudi_table",
               "hoodie.datasource.hive_sync.enable": "true",
               "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
               "hoodie.datasource.hive_sync.partition_fields": partition_key
           },
           transformation_ctx="HudiDyF",
       )
   # OR
   sourceTableDF = spark.read.format('hudi').load(targetPath)
   ```
   
   Is there any way to do this? Very new to Hudi, so if my configuration settings are wrong and this is possible, please let me know!
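
If the explicit lookup has to stay for now, it can at least be confined to one helper. The following is a minimal sketch, not part of the original post: the function names `table_location` and `hudi_read_options` are hypothetical, and the `hive_sync` options are left out on the assumption that they only matter when writing. The client is passed in as a parameter so the helper can be exercised without AWS access.

```python
from typing import Any, Dict


def table_location(glue_client: Any, database: str, table: str) -> str:
    """Return the storage location the Glue catalog records for a table."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    return response["Table"]["StorageDescriptor"]["Location"]


def hudi_read_options(glue_client: Any, database: str, table: str) -> Dict[str, str]:
    """Build connection options for a Hudi read, resolving "path" from the catalog."""
    return {
        "className": "org.apache.hudi",
        "path": table_location(glue_client, database, table),
        "hoodie.table.name": table,
    }
```

In a Glue job this would be called as `hudi_read_options(boto3.client('glue'), 'my_glue_database', 'my_hudi_table')` and the result passed as `connection_options`.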
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #5455:
URL: https://github.com/apache/hudi/issues/5455#issuecomment-1116086600

   @bhasudha: Do we need to add an FAQ entry for this? I'll let you make the call.


[GitHub] [hudi] stevenayers commented on issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

stevenayers commented on issue #5455:
URL: https://github.com/apache/hudi/issues/5455#issuecomment-1114609813

   Thanks @rkkalluri. So while I'm able to run the following on the S3 location the catalog points to:
   ```python
   from boto3 import client
   
   conn = client('s3')
   conn.head_bucket(Bucket='olympus-dev-data-refined')
   conn.list_objects(Bucket='olympus-dev-data-refined')
   ```
   and receive a 200 response, when I run a version of your command above:
   ```python
   input_dyf = glueContext.create_dynamic_frame.from_catalog(
       database="db",
       table_name="tble",
       additional_options={"useS3ListImplementation": True, "groupFiles": "inPartition"},
   )
   ```
   I get:
   ```
   Py4JJavaError: An error occurred while calling o255.getDynamicFrame.
   : com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 31JC4RTFEXPTFTFH; S3 Extended Request ID: qq+KRFeagbenCG9CEikohlGLPTDyftUz+MqoiMmw0XV6KlmBDXQmWxja4dE8tXvl5/sLF4nWhgQ=; Proxy: null), S3 Extended Request ID: qq+KRFeagbenCG9CEikohlGLPTDyftUz+MqoiMmw0XV6KlmBDXQmWxja4dE8tXvl5/sLF4nWhgQ=
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1403)
   ```
   
   A few comments:
   * The role has full S3 access
   * Lake Formation is configured correctly
   * I had to remove boundedSize and the bookmark as bookmarking is turned off for now.
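
A bucket-level `head_bucket` or `list_objects` succeeding may not prove that the objects under the table's own location are readable. One way to narrow this down is to test the exact prefix the catalog points to. The helper below is a minimal sketch with a hypothetical name (`split_s3_uri`); it only splits a catalog location into bucket and key prefix and makes no AWS calls itself.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an S3 location such as s3://bucket/some/prefix/ into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a", "s3n") or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Drop the leading slash so the result can be used directly as a key prefix.
    return parsed.netloc, parsed.path.lstrip("/")
```

The resulting pair could then be passed to `conn.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)` to check access to that specific prefix rather than to the bucket as a whole.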


[GitHub] [hudi] rkkalluri commented on issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

rkkalluri commented on issue #5455:
URL: https://github.com/apache/hudi/issues/5455#issuecomment-1114302556

   @stevenayers you should be able to use the Glue catalog to load a Hudi table like any other Hive external table.
   
   See if you can adapt the example below to your needs.
   
   # Read dataframe from source
   input_dyf = glueContext.create_dynamic_frame.from_catalog(
       database=src_database,
       table_name=src_table_name,
       push_down_predicate=f"(sdwh_update_year = '{start_date[:4]}' and sdwh_update_month = '{start_date[5:7]}' and sdwh_update_day = '{start_date[8:10]}')",
       transformation_ctx="datasource0",
       additional_options={"useS3ListImplementation": True, "groupFiles": "inPartition", "boundedSize": "6516192768"},
   )
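
The push-down predicate above slices `start_date` by character position, which assumes a `YYYY-MM-DD` string. As a minimal sketch (the helper name `partition_predicate` is hypothetical; the partition column names are taken from the snippet above), the same string can be built from a parsed date so that a malformed input fails early instead of producing a silently wrong predicate:

```python
from datetime import date


def partition_predicate(day: date) -> str:
    """Build the year/month/day partition predicate for a single day."""
    return (
        f"(sdwh_update_year = '{day:%Y}' "
        f"and sdwh_update_month = '{day:%m}' "
        f"and sdwh_update_day = '{day:%d}')"
    )


# date.fromisoformat raises ValueError on input that is not YYYY-MM-DD.
predicate = partition_predicate(date.fromisoformat("2022-04-28"))
print(predicate)
# (sdwh_update_year = '2022' and sdwh_update_month = '04' and sdwh_update_day = '28')
```

The result would be passed as `push_down_predicate=predicate` in the `from_catalog` call.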


[GitHub] [hudi] stevenayers commented on issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

stevenayers commented on issue #5455:
URL: https://github.com/apache/hudi/issues/5455#issuecomment-1114744684

   @rkkalluri that looks like it worked perfectly, thank you!


[GitHub] [hudi] yihua commented on issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

yihua commented on issue #5455:
URL: https://github.com/apache/hudi/issues/5455#issuecomment-1112725379

   @umehrot2 could you shed light on this as well?


[GitHub] [hudi] yihua commented on issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

yihua commented on issue #5455:
URL: https://github.com/apache/hudi/issues/5455#issuecomment-1112340895

   @nsivabalan @rmahindra123 do you know how to do this with the Glue Catalog?


[GitHub] [hudi] yihua commented on issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

yihua commented on issue #5455:
URL: https://github.com/apache/hudi/issues/5455#issuecomment-1115159007

   @rkkalluri Thanks for the help! Closing this issue. @stevenayers feel free to reopen this or file a new issue if you face more problems.


[GitHub] [hudi] yihua closed issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path

Posted by GitBox <gi...@apache.org>.

yihua closed issue #5455: [SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path
URL: https://github.com/apache/hudi/issues/5455

