Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/04 05:31:24 UTC

[GitHub] [hudi] a0x opened a new issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

a0x opened a new issue #4442:
URL: https://github.com/apache/hudi/issues/4442


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start a PySpark interactive session with the command: 
   ```
   pyspark --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```
   2. Run a Spark SQL query on any kind of table (Hudi or not), e.g.: 
   ```
   spark.sql('select * from somedb.non_hudi_table')
   spark.sql('select * from somedb.hudi_table')
   ```
   
   **Expected behavior**
   
   When I run a `select` query on a non-Hudi table in Spark with the Hudi dependencies loaded, I should get the correct dataframe containing the data I selected.
   
   When querying a Hudi table, it should return a dataframe with the real data I selected as well as the Hudi-specific columns.
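   
   For example, a minimal check I would expect to pass once the session starts cleanly (a sketch using the placeholder table name from the steps above; the `_hoodie_*` meta columns are the Hudi-specific columns mentioned):
   ```python
   # Sketch using the placeholder table name from the reproduction steps above.
   df = spark.sql('select * from somedb.hudi_table')
   df.show(5)
   
   # Hudi tables also expose meta columns such as _hoodie_commit_time and _hoodie_record_key.
   print([c for c in df.columns if c.startswith('_hoodie_')])
   ```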
   
   **Environment Description**
   
   * Hudi version : 0.10.0 (replacing the 0.8.0 bundled in EMR)
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   * **About Hudi**
   
   This occurs in AWS EMR 6.4.0, in which the bundled Hudi 0.8 has been replaced with Hudi 0.10.
   
   The replacement action is as follows:
   
   1. Download the packages
   ```
   hudi-spark3-bundle_2.12-0.10.0.jar 
   hudi-hadoop-mr-bundle-0.10.0.jar  
   hudi-utilities-bundle_2.12-0.10.0.jar  
   hudi-hive-sync-bundle-0.10.0.jar 
   hudi-presto-bundle-0.10.0.jar  
   hudi-timeline-server-bundle-0.10.0.jar  
   hudi-cli-0.10.0.jar  
   hudi-client-common-0.10.0.jar  
   hudi-common-0.10.0.jar  
   hudi-hadoop-mr-0.10.0.jar  
   hudi-hive-sync-0.10.0.jar  
   hudi-spark3_2.12-0.10.0.jar 
   hudi-spark-client-0.10.0.jar 
   hudi-spark-common_2.12-0.10.0.jar 
   hudi-sync-common-0.10.0.jar 
   hudi-timeline-service-0.10.0.jar  
   hudi-utilities_2.12-0.10.0.jar  
   hudi-utilities-bundle_2.12-0.10.0.jar  
   ```
   2. Replace Hudi 0.8
   Replace the Hudi 0.8 jars (the same packages as listed above, but in their 0.8 versions) in `/usr/lib/hudi/` with the 0.10.0 packages downloaded above.
   3. Spark SQL can now be tried with Hudi:
   ```
   spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```
   
   * **About Catalog**
   In this case, I am using **AWS Glue** as the catalog.
   
   
   * **About Spark**
   This occurs **ONLY in PySpark**.
   In a Spark Scala interactive session, Spark SQL queries such as `select`, `update`, and `delete` with Hudi work just fine, as presented in the documentation.
   
   * **About Hudi 0.8**
   If I use Hudi 0.8 with the same Hadoop, Spark, and Hive versions mentioned above (also on EMR), Spark SQL executes correctly for normal tables when entering a pyspark session.
   
   **Stacktrace**
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
       return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o67.sql.
   : java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;
   	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:39)
   	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:29)
   	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:118)
   	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
   	at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1734)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1454)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1369)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
   	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
   	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
   	at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:10640)
   	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10607)
   	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10596)
   	at com.amazonaws.services.glue.AWSGlueClient.executeGetDatabase(AWSGlueClient.java:4466)
   	at com.amazonaws.services.glue.AWSGlueClient.getDatabase(AWSGlueClient.java:4435)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.doesDefaultDBExist(AWSCatalogMetastoreClient.java:238)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:151)
   	at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:20)
   	at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:507)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988)
   	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251)
   	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234)
   	at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402)
   	at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315)
   	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:257)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:384)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:249)
   	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
   	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:135)
   	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:125)
   	at org.apache.spark.sql.internal.SharedState.isDatabaseExistent$1(SharedState.scala:169)
   	at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:201)
   	at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:153)
   	at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:52)
   	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:99)
   	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:99)
   	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupGlobalTempView(SessionCatalog.scala:870)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveTempViews$$lookupTempView(Analyzer.scala:916)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveTempViews$$lookupAndResolveTempView(Analyzer.scala:930)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:875)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:873)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
   	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:75)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$2(AnalysisHelper.scala:87)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:388)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:424)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:256)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:422)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:370)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:87)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.apply(Analyzer.scala:873)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:1112)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:1077)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:220)
   	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
   	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
   	at scala.collection.immutable.List.foldLeft(List.scala:89)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:217)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$6(RuleExecutor.scala:290)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor$RuleExecutionContext$.withContext(RuleExecutor.scala:333)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5(RuleExecutor.scala:290)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5$adapted(RuleExecutor.scala:280)
   	at scala.collection.immutable.List.foreach(List.scala:392)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:280)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:192)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
   	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:163)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:163)
   	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
   	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
   	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
   	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
   	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   





[GitHub] [hudi] a0x commented on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x commented on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1004485623


   @xushiyan Thanks for your reply.
   
   Do you mean not replacing the Hudi 0.8.0 bundled in EMR, and instead starting the Spark session with Hudi 0.10.0 from another, separate directory?
   
   To be honest I think it's not a good idea.
   
   When I dug into the error, I realized the problem was inside the AWS Java SDK bundled with the EMR Spark library.
   





[GitHub] [hudi] kazdy commented on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
kazdy commented on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1003734711


   I have the same issue when running Hudi on EMR.
   This issue seems to have the same root cause as [#4474](https://github.com/apache/hudi/issues/4474).
   The solution is to shade and relocate the aws dependencies introduced in hudi-aws:
   > For our internal hudi version, we shade aws dependencies, you can add new relocation and build a new bundle package:
   > 
   > For example, to shade aws dependencies in spark, add following codes in **packaging/hudi-spark-bundle/pom.xml**
   > 
   > ```
   > <!-- line 185-->
   > <relocation>
   >  <pattern>com.amazonaws.</pattern>
   >  <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
   > </relocation>
   > ```
   
   @xushiyan should this relocation be added to the official hudi release to avoid such conflicts?
   





[GitHub] [hudi] a0x commented on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x commented on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1005363447


   Finally I fixed this problem by removing the aws deps in `packaging/hudi-spark-bundle/pom.xml` and recompiling the bundle myself.
   
   ```xml
   <!-- line 106, NEED TO REMOVE -->
   <include>com.amazonaws:dynamodb-lock-client</include>
   <include>com.amazonaws:aws-java-sdk-dynamodb</include>
   <include>com.amazonaws:aws-java-sdk-core</include>
   ```
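   
   Rebuilding just the Spark bundle module can then be done roughly like this (a sketch using plain Maven options; the exact Spark 3 / Scala 2.12 profile flags should be taken from the Hudi 0.10.0 build instructions):
   ```
   mvn clean package -DskipTests -pl packaging/hudi-spark-bundle -am
   ```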





[GitHub] [hudi] xushiyan commented on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1001116347


   @a0x can you try putting the jars in a different directory like /home/hadoop, and avoid changing /usr/lib/hudi?
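   
   Something along these lines should keep the stock /usr/lib/hudi jars untouched (a sketch; adjust the jar path to wherever you copy the 0.10.0 bundle):
   ```
   pyspark --jars /home/hadoop/hudi-spark3-bundle_2.12-0.10.0.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```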





[GitHub] [hudi] a0x edited a comment on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x edited a comment on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1004486507


   > I have the same issue when running hudi on emr. This issue seems to have the same root cause as in this one: #4474 . The solution is to shade and relocate aws dependencies introduced in hudi-aws:
   > 
   > > For our internal hudi version, we shade aws dependencies, you can add new relocation and build a new bundle package:
   > > For example, to shade aws dependencies in spark, add following codes in **packaging/hudi-spark-bundle/pom.xml**
   > > ```
   > > <!-- line 185-->
   > > <relocation>
   > >  <pattern>com.amazonaws.</pattern>
   > >  <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
   > > </relocation>
   > > ```
   > 
   > @xushiyan should this relocation be added to the official hudi release to avoid such conflicts?
   
   @kazdy Thank you! This should work.
   
   But shall we shade all aws deps in Spark? I'm worried about side effects, but let me give it a try before replying in that issue.





[GitHub] [hudi] a0x edited a comment on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x edited a comment on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1005363447


   Finally I fixed this problem by removing the aws deps in `packaging/hudi-spark-bundle/pom.xml` and recompiling the bundle myself.
   
   ```xml
   <!-- line 106, keep it as comment -->
   <!-- <include>com.amazonaws:dynamodb-lock-client</include> -->
   <!-- <include>com.amazonaws:aws-java-sdk-dynamodb</include> -->
   <!-- <include>com.amazonaws:aws-java-sdk-core</include> -->
   ```





[GitHub] [hudi] a0x commented on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x commented on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1004486507


   > I have the same issue when running hudi on emr. This issue seems to have the same root cause as in this one: #4474 . The solution is to shade and relocate aws dependencies introduced in hudi-aws:
   > 
   > > For our internal hudi version, we shade aws dependencies, you can add new relocation and build a new bundle package:
   > > For example, to shade aws dependencies in spark, add following codes in **packaging/hudi-spark-bundle/pom.xml**
   > > ```
   > > <!-- line 185-->
   > > <relocation>
   > >  <pattern>com.amazonaws.</pattern>
   > >  <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
   > > </relocation>
   > > ```
   > 
   > @xushiyan should this relocation be added to the official hudi release to avoid such conflicts?
   
   Thank you! This should work.
   
   But shall we shade all aws deps in Spark? I'm worried about side effects, but let me give it a try before replying in that issue.





[GitHub] [hudi] a0x commented on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x commented on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1004539542


   @kazdy I did recompile the Hudi packages with the mentioned config, yet the error remains.
   
   This is an interesting problem, because everything works fine in `spark-shell`, yet the problem occurs **only in PySpark**.
   
   So I think the library conflict is hidden in the difference between `spark-shell` and `pyspark`.
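   
   One way to narrow it down (a hypothetical diagnostic sketch, run from the failing PySpark session via py4j) is to ask the JVM which jar the conflicting AWS SDK class is actually loaded from:
   ```python
   # Hypothetical diagnostic: locate the jar that provides the class from the stacktrace.
   jvm = spark.sparkContext._jvm
   klass = jvm.java.lang.Class.forName("com.amazonaws.transform.JsonUnmarshallerContext")
   src = klass.getProtectionDomain().getCodeSource()
   print(src.getLocation() if src is not None else "loaded from the bootstrap classpath")
   ```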





[GitHub] [hudi] a0x edited a comment on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x edited a comment on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1004486507


   > I have the same issue when running hudi on emr. This issue seems to have the same root cause as in this one: #4474 . The solution is to shade and relocate aws dependencies introduced in hudi-aws:
   > 
   > > For our internal hudi version, we shade aws dependencies, you can add new relocation and build a new bundle package:
   > > For example, to shade aws dependencies in spark, add following codes in **packaging/hudi-spark-bundle/pom.xml**
   > > ```
   > > <!-- line 185-->
   > > <relocation>
   > >  <pattern>com.amazonaws.</pattern>
   > >  <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
   > > </relocation>
   > > ```
   > 
   > @xushiyan should this relocation be added to the official hudi release to avoid such conflicts?
   
   @kazdy Thank you! This should work.
   
   But shall we shade all aws deps in Spark? I'm worried about side effects, but let me give it a try before replying in #4474 





[GitHub] [hudi] kazdy edited a comment on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
kazdy edited a comment on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1003734711


   I have the same issue when running Hudi on EMR.
   This issue seems to have the same root cause as #4474.
   The solution is to shade and relocate the aws dependencies introduced in hudi-aws:
   > For our internal hudi version, we shade aws dependencies, you can add new relocation and build a new bundle package:
   > 
   > For example, to shade aws dependencies in spark, add following codes in **packaging/hudi-spark-bundle/pom.xml**
   > 
   > ```
   > <!-- line 185-->
   > <relocation>
   >  <pattern>com.amazonaws.</pattern>
   >  <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
   > </relocation>
   > ```
   
   @xushiyan should this relocation be added to the official hudi release to avoid such conflicts?
   





[GitHub] [hudi] a0x closed issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

Posted by GitBox <gi...@apache.org>.
a0x closed issue #4442:
URL: https://github.com/apache/hudi/issues/4442


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


