Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/03 11:56:45 UTC

[GitHub] [hudi] parisni opened a new issue, #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

parisni opened a new issue, #5488:
URL: https://github.com/apache/hudi/issues/5488

   hudi 0.11.0
   spark 3.2.1
   
   when hive_sync is enabled, `spark.read.table("table_name")` raises `pyspark.sql.utils.AnalysisException: Table does not support reads`.
   The error doesn't occur when `--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'` is not set.
   
   
   ```python
   pyspark   --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
   
   sc.setLogLevel("WARN")
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
       dataGen.generateInserts(10)
   )
   from pyspark.sql.functions import expr
   
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 10)).withColumn(
       "part", expr("'foo'")
   )
   tableName = "test_hudi_pyspark"
   basePath = f"/tmp/{tableName}"
   
   hudi_options = {
       "hoodie.table.name": tableName,
       "hoodie.datasource.write.recordkey.field": "uuid",
       "hoodie.datasource.write.partitionpath.field": "part",
       "hoodie.datasource.write.table.name": tableName,
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.upsert.shuffle.parallelism": 2,
       "hoodie.insert.shuffle.parallelism": 2,
       "hoodie.datasource.hive_sync.database": "default",
       "hoodie.datasource.hive_sync.table": tableName,
       "hoodie.datasource.hive_sync.mode": "hms",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.partition_fields": "part",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   }
   (df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath))
   spark.read.format("hudi").load(basePath).count()
   spark.table("default.test_hudi_pyspark").count()
   
   ERROR: pyspark.sql.utils.AnalysisException: Table does not support reads: default.test_hudi_pyspark
   ```
   
   
   I debugged it a bit: the Hudi catalog's `loadTable` delegates to `super.loadTable`, which is not aware of Hudi?
   
   ```scala
     override def loadTable(ident: Identifier): Table = {
       try {
         super.loadTable(ident) match {
           case v1: V1Table if sparkAdapter.isHoodieTable(v1.catalogTable) =>
             HoodieInternalV2Table(
               spark,
               v1.catalogTable.location.toString,
               catalogTable = Some(v1.catalogTable),
               tableIdentifier = Some(ident.toString))
           case o => o // this case is used
         }
       } catch {
         case e: Exception =>
           throw e
       }
     }
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

Posted by GitBox <gi...@apache.org>.
leesf commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1146421860

   Closing the issue, @parisni please reopen if you have new problems.




[GitHub] [hudi] parisni commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

parisni commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1125402574

   My point is that the Hoodie catalog breaks a feature, so I need to turn it off in Spark 3.2.




[GitHub] [hudi] nsivabalan commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

nsivabalan commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1124408937

   @bhasudha : can we add this info to the 0.11 release notes? I don't see it here: https://hudi.apache.org/releases/release-0.11.0
   i.e., for Spark 3.2, users have to set the Hoodie catalog.
   




[GitHub] [hudi] leesf commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

leesf commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1132950771

   > yeah sorry this was internal code leading to the same result:
   > `SparkSession.builder().config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog").config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension").getOrCreate()`
   > you can test my code snippet in OP and reproduce the error on your side

   I tried in my local env and it shows the same result as xushiyan pasted.




[GitHub] [hudi] nsivabalan commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

nsivabalan commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1126695845

   @parisni : can you confirm the above suggested solution works for you. feel free to close out the issue. we have a PR to fix the documentation. https://github.com/apache/hudi/pull/5584
   




[GitHub] [hudi] XuQianJin-Stars commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

XuQianJin-Stars commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1132473092

   @parisni I'll see if I can reproduce this error later.




[GitHub] [hudi] parisni commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

parisni commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1132600345

   yeah sorry this was internal code leading to the same result:
   
   
   > SparkSession.builder().config("spark.sql.catalog.spark_catalog",
   > "org.apache.spark.sql.hudi.catalog.HoodieCatalog").config("spark.sql.extensions",
   > "org.apache.spark.sql.hudi.HoodieSparkSessionExtension").getOrCreate()
   
   you can test my code snippet in OP and reproduce the error on your side
   
   




[GitHub] [hudi] leesf closed issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

leesf closed issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used 
URL: https://github.com/apache/hudi/issues/5488




[GitHub] [hudi] leesf commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

leesf commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1126577288

   @parisni would you please add the conf `--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'` as well? This should solve your problem.




[GitHub] [hudi] xushiyan commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

xushiyan commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1128741232

   > The conf `--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'` is missing in the [Spark Guide](https://hudi.apache.org/docs/quick-start-guide) for scala and python
   
   @leesf @parisni The examples in the guide (for Spark 3.2) do include it. If I get it right, @parisni did you mean you ran into the issue *with* this config set, and the problem went away *without* it?




[GitHub] [hudi] xushiyan commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

xushiyan commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1132639401

   @parisni I was able to reproduce the error, and saw that `spark.sql.extensions` is indeed the missing config.
   
   - without `spark.sql.extensions`
   
   ```shell
   ./bin/pyspark \
   --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
   --conf 'spark.sql.warehouse.dir=hdfs://localhost:8020/user/hive/warehouse'
   ```
   ```python
   >>> spark.table("default.test_hudi_pyspark").count()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/hadoop/spark-3.2.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py", line 680, in count
       return int(self._jdf.count())
     File "/home/hadoop/spark-3.2.1-bin-hadoop3.2/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
     File "/home/hadoop/spark-3.2.1-bin-hadoop3.2/python/pyspark/sql/utils.py", line 117, in deco
       raise converted from None
   pyspark.sql.utils.AnalysisException: Table does not support reads: default.test_hudi_pyspark
   ```
   
   - with `spark.sql.extensions`
   
   ```shell
   ./bin/pyspark \
   --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
   --conf 'spark.sql.warehouse.dir=hdfs://localhost:8020/user/hive/warehouse' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```
   ```python
   >>> spark.table("default.test_hudi_pyspark").count()
   10
   ```




[GitHub] [hudi] nsivabalan commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

nsivabalan commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1124408544

   yes, setting the Hoodie catalog is necessary if you are using Spark 3.2; otherwise it's not required. We added a note in our quick start guide as well: https://hudi.apache.org/docs/quick-start-guide
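   For reference, a full launch command combining both required settings might look like the sketch below (assembled from the configs already quoted in this thread; the bundle version matches the one used in this issue, adjust it to your Spark build):

   ```shell
   # Launch pyspark with both the Hudi catalog and the Hudi session extension.
   # Per this thread, the extension is what lets the catalog serve reads on
   # Hive-synced Hudi tables; setting the catalog alone triggers
   # "Table does not support reads".
   pyspark \
     --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```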
   




[GitHub] [hudi] leesf commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

leesf commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1126579180

   The conf `--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'` is missing in the [Spark Guide](https://hudi.apache.org/docs/quick-start-guide) for scala and python




[GitHub] [hudi] leesf commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

leesf commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1132552145

   
   > nope, sadly adding bellow configs don't solve the issue
   > 
   > ```
   >     sparkConf.set(
   >         "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog");
   >     sparkConf.set("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension");
   > ```
   
   @parisni I think you have not set the config correctly, please use the following code:
   ```
   SparkSession.builder().config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog").config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension").getOrCreate()
   ```

   or open a new spark shell with this command:

   pyspark --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'




[GitHub] [hudi] parisni commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used

parisni commented on issue #5488:
URL: https://github.com/apache/hudi/issues/5488#issuecomment-1131715213

   nope, sadly adding the below configs doesn't solve the issue
   
   ```
       sparkConf.set(
           "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog");
       sparkConf.set("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension");
   
   ```

