Posted to commits@hudi.apache.org by "soumilshah1995 (via GitHub)" <gi...@apache.org> on 2023/02/07 14:17:40 UTC

[GitHub] [hudi] soumilshah1995 opened a new issue, #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue

soumilshah1995 opened a new issue, #7879:
URL: https://github.com/apache/hudi/issues/7879

   ### Hello! We were using the AWS Marketplace connector, and this morning, while I was preparing some Hudi labs, this error started to show up.
   
   # Code 
   ```
   try:
       import sys
       from awsglue.transforms import *
       from awsglue.utils import getResolvedOptions
       from pyspark.context import SparkContext
       from awsglue.context import GlueContext
       from awsglue.job import Job
       from pyspark.sql.session import SparkSession
       from awsglue.dynamicframe import DynamicFrame
       from pyspark.sql.functions import *
       from pyspark.sql.types import *
       from datetime import datetime
       import boto3
       from functools import reduce
       import uuid
       from faker import Faker
   except Exception as e:
       # fail loudly instead of silently swallowing import errors
       print(f"import error: {e}")
       raise
   
   spark = SparkSession.builder.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('spark.sql.hive.convertMetastoreParquet', 'false') \
       .config('spark.sql.legacy.pathOptionBehavior.enabled', 'true') \
       .getOrCreate()
   
   sc = spark.sparkContext
   glueContext = GlueContext(sc)
   job = Job(glueContext)
   logger = glueContext.get_logger()
   db_name = "hudidb"
   table_name = "sample"
   
   recordkey = 'emp_id'
   path = "s3://soumilshah-hudi-demos/tmp/"
   groupSize = "1048576"
   method = 'upsert'
   table_type = "COPY_ON_WRITE"
   
   connection_options = {
       "path": path,
       "connectionName": "hudi-connection",
   
       "hoodie.datasource.write.storage.type": table_type,
       'className': 'org.apache.hudi',
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': recordkey,
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.datasource.write.operation': method,
   
       'hoodie.datasource.hive_sync.enable': 'true',
       "hoodie.datasource.hive_sync.mode": "hms",
       'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
       'hoodie.datasource.hive_sync.database': db_name,
       'hoodie.datasource.hive_sync.table': table_name,
       'hoodie.datasource.hive_sync.use_jdbc': 'false',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
   }
   
   faker = Faker()
   
   
   class DataGenerator(object):
   
       @staticmethod
       def get_data():
           return [
               (
                   uuid.uuid4().__str__(),
                   faker.name(),
                   faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                   faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                   str(faker.random_int(min=10000, max=150000)),
                   str(faker.random_int(min=18, max=60)),
                   str(faker.random_int(min=0, max=100000)),
                   str(faker.unix_time()),
                   faker.email(),
                   faker.credit_card_number(card_type='amex')
               ) for x in range(20)
           ]
   
   
   data = DataGenerator.get_data()
   
   columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts", "email", "credit_card"]
   spark_df = spark.createDataFrame(data=data, schema=columns)
   
   glueContext.write_dynamic_frame.from_options(
       frame=DynamicFrame.fromDF(spark_df, glueContext, "glue_df"),
       connection_type="marketplace.spark",
       connection_options=connection_options,
       transformation_ctx="glue_df",
   )
   job.commit()
   
   ```
   
   
   ### Error Message 
   ```
   Py4JJavaError: An error occurred while calling o111.pyWriteDynamicFrame.
   : org.apache.hudi.hive.HoodieHiveSyncException: Got runtime exception when hive syncing
   	at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:83)
   	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:539)
   	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:595)
   	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2$adapted(HoodieSparkSqlWriter.scala:591)
   	at scala.collection.mutable.HashSet.foreach(HashSet.scala:77)
   	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:591)
   	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:665)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:286)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
   	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
   	at com.amazonaws.services.glue.marketplace.connector.SparkCustomDataSink.writeDynamicFrame(CustomDataSink.scala:45)
   	at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:71)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to create HiveMetaStoreClient
   	at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:92)
   	at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:78)
   	... 48 more
   Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: Insufficient Lake Formation permission(s) on default (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 02e6bfa7-f5c0-4f18-b223-112bb28bf480; Proxy: null))
   	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:239)
   	at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402)
   	at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315)
   	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291)
   	at org.apache.hudi.hive.ddl.HMSDDLExecutor.<init>(HMSDDLExecutor.java:68)
   	at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:76)
   	... 49 more
   Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: Insufficient Lake Formation permission(s) on default (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 02e6bfa7-f5c0-4f18-b223-112bb28bf480; Proxy: null))
   	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3991)
   	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251)
   	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234)
   	... 55 more
   Caused by: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: Insufficient Lake Formation permission(s) on default (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 02e6bfa7-f5c0-4f18-b223-112bb28bf480; Proxy: null))
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.doesDefaultDBExist(AWSCatalogMetastoreClient.java:244)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:152)
   	at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:20)
   	at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:507)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988)
   	... 57 more
   ```
   
   ### Connector Version 
   ![image](https://user-images.githubusercontent.com/39345855/217269660-de9b1c6c-efd2-4bf6-8b8a-7ec96c5149d0.png)
   
   
   #### Note: I have run these labs before and everything was fine until this morning, when they started to throw a Hive sync error.
   
   
   



[GitHub] [hudi] soumilshah1995 commented on issue #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #7879:
URL: https://github.com/apache/hudi/issues/7879#issuecomment-1533820177

   Hey @juanAmayaRamirez,
   
   just use Glue 4.0 and pass these parameters and it will be fixed:
   
   
   ```
   """
   --additional-python-modules  | faker==11.3.0
   --conf  |  spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.sql.legacy.pathOptionBehavior.enabled=true --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
   --datalake-formats | hudi
   """
   ```
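   
   For context, on Glue 4.0 with native Hudi support (`--datalake-formats hudi`) you can drop the marketplace connector entirely and write through the plain Spark datasource. Below is a minimal sketch, assuming placeholder bucket/database/table names (not values from this issue) and that the job role has the necessary Glue and Lake Formation permissions:
   
   ```
   # Minimal sketch of the equivalent write on Glue 4.0 with native Hudi support
   # (--datalake-formats hudi), i.e. without the marketplace connector. Bucket,
   # database, and table names below are placeholders.
   from pyspark.sql import SparkSession
   
   spark = (
       SparkSession.builder
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .config("spark.sql.hive.convertMetastoreParquet", "false")
       .getOrCreate()
   )
   
   hudi_options = {
       "hoodie.table.name": "sample",
       "hoodie.datasource.write.recordkey.field": "emp_id",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.mode": "hms",  # Glue Data Catalog acts as the metastore
       "hoodie.datasource.hive_sync.database": "hudidb",
       "hoodie.datasource.hive_sync.table": "sample",
       "hoodie.datasource.hive_sync.use_jdbc": "false",
   }
   
   df = spark.createDataFrame([("1", "John", "1675777060")], ["emp_id", "employee_name", "ts"])
   df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/tmp/")
   ```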
   
   



[GitHub] [hudi] juanAmayaRamirez commented on issue #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue

Posted by "juanAmayaRamirez (via GitHub)" <gi...@apache.org>.
juanAmayaRamirez commented on issue #7879:
URL: https://github.com/apache/hudi/issues/7879#issuecomment-1533818465

   Hi @soumilshah1995, just here to ask what the issue was.
   I am having a similar issue with Lake Formation that I can't figure out when trying to read a Hudi table from the Data Catalog. Could this be related? If not, do you have any suggestions?
   
   Glue config:
   Glue 4.0
   Job Parameters:
   --datalake-formats hudi
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false
   
   Code:
   ```
   import sys
   from awsglue.utils import getResolvedOptions
   from awsglue.context import GlueContext
   from awsglue.job import Job
   from pyspark.sql.session import SparkSession
   
   # resolve JOB_NAME so job.init() below has its argument
   args = getResolvedOptions(sys.argv, ['JOB_NAME'])
   
   spark = SparkSession.builder.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')\
       .config('spark.sql.hive.convertMetastoreParquet', 'false')\
       .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")\
       .config("spark.sql.avro.datetimeRebaseModeInWrite", "CORRECTED")\
       .getOrCreate()
   
   glueContext = GlueContext(spark.sparkContext)
   job = Job(glueContext)
   job.init(args['JOB_NAME'], args)
   logger = glueContext.get_logger()
   
   dataFrame = glueContext.create_dynamic_frame_from_catalog(
       database = "my_db",
       table_name = "my_table"
   )
   ```
   Error:
   ```
   2023-05-03 21:18:58,045 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
     File "/tmp/read hudi without connector.py", line 37, in <module>
       dataFrame = glueContext.create_dynamic_frame_from_catalog(
     File "/opt/amazon/lib/python3.7/site-packages/awsglue/context.py", line 188, in create_dynamic_frame_from_catalog
       return source.getFrame(**kwargs)
     File "/opt/amazon/lib/python3.7/site-packages/awsglue/data_source.py", line 37, in getFrame
       jframe = self._jsource.getDynamicFrame()
     File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
       return_value = get_return_value(
     File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
       return f(*a, **kw)
     File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
       raise Py4JJavaError(
   py4j.protocol.Py4JJavaError: An error occurred while calling o108.getDynamicFrame.
   : java.lang.UnsupportedOperationException: Reads and writes using Lake Formation permissions are not supported for hudi tables.
   	at com.amazonaws.services.glue.GlueUtility$.checkDataLakeFormatAndLakeFormation(GlueUtility.scala:54)
   	at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:760)
   	at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:102)
   	at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:102)
   	at com.amazonaws.services.glue.AbstractSparkSQLDataSource.getDynamicFrame(DataSource.scala:726)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   	at java.lang.Thread.run(Thread.java:750)
   ```
   ![image](https://user-images.githubusercontent.com/97113713/236059664-8e4fc3ac-58ab-45a8-b6ba-8e648998e52f.png)
   



[GitHub] [hudi] soumilshah1995 commented on issue #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #7879:
URL: https://github.com/apache/hudi/issues/7879#issuecomment-1420852529

   Closing the issue; the root cause was a Lake Formation permission problem (the AccessDeniedException above shows insufficient Lake Formation permissions on the `default` database), not a Hudi bug.
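   
   For anyone hitting the same AccessDeniedException, here is a minimal sketch of granting the missing permission with boto3; this is an illustration, not the exact fix applied here, and the account ID and role name are placeholders:
   
   ```
   # Hedged sketch: grant the Glue job's IAM role Lake Formation DESCRIBE on the
   # `default` database (the resource named in the AccessDeniedException above).
   # The account ID and role name are placeholders.
   import boto3
   
   lf = boto3.client("lakeformation")
   lf.grant_permissions(
       Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/MyGlueJobRole"},
       Resource={"Database": {"Name": "default"}},
       Permissions=["DESCRIBE"],  # Hive sync only needs to verify the database exists
   )
   ```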



[GitHub] [hudi] juanAmayaRamirez commented on issue #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue

Posted by "juanAmayaRamirez (via GitHub)" <gi...@apache.org>.
juanAmayaRamirez commented on issue #7879:
URL: https://github.com/apache/hudi/issues/7879#issuecomment-1533834470

   Thanks for the quick response! (love your videos BTW)
   but sorry to say I am still getting the same error:
   
   `An error occurred while calling o110.getDynamicFrame. Reads and writes using Lake Formation permissions are not supported for hudi tables.`
   
   I was able to read the table using Spark directly, like:
   `dataFrame = spark.read.format("hudi").load("s3://bucket/path/to/my_table/")`
   
   but NOT with glueContext using a table already in the Glue Data Catalog:
   ```
   dataFrame = glueContext.create_dynamic_frame_from_catalog(
       database = "my_db",
       table_name = "my_table"
   )
   ```
   
   According to AWS docs: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html both should work fine.
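   
   Until that is resolved, one workaround sketch, assuming the table's S3 path is known (path and frame name below are placeholders), is to read the Hudi table directly with Spark, which worked above, and wrap the result in a DynamicFrame:
   
   ```
   # Workaround sketch: bypass create_dynamic_frame_from_catalog, which trips the
   # Lake Formation check, by reading the Hudi table path directly with Spark and
   # converting the DataFrame to a DynamicFrame.
   from awsglue.context import GlueContext
   from awsglue.dynamicframe import DynamicFrame
   from pyspark.context import SparkContext
   
   glueContext = GlueContext(SparkContext.getOrCreate())
   spark = glueContext.spark_session
   
   df = spark.read.format("hudi").load("s3://bucket/path/to/my_table/")
   dyf = DynamicFrame.fromDF(df, glueContext, "my_table_dyf")
   ```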



[GitHub] [hudi] soumilshah1995 closed issue #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 closed issue #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue
URL: https://github.com/apache/hudi/issues/7879



[GitHub] [hudi] soumilshah1995 commented on issue #7879: [Bug] Hudi AWS Connector Throws Error on Hive Sync with Glue

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #7879:
URL: https://github.com/apache/hudi/issues/7879#issuecomment-1533835424

   @juanAmayaRamirez
   
   Let's hop on a call; here is the link:
   
   https://meet.google.com/gam-wsca-hxi


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org