Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/23 15:52:22 UTC

[GitHub] [hudi] fisser001 opened a new issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

fisser001 opened a new issue #4887:
URL: https://github.com/apache/hudi/issues/4887


   **Describe the problem you faced**
   
   We see unexpected behaviour with partitioned Hudi tables when we query them with Impala: the queries return no data.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. We write data with Hudi and Spark to HDFS with the following config:
   `// Imports assumed for this snippet; they were not part of the original report.
   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types._
   import org.apache.hudi.DataSourceWriteOptions._
   
   // Schema: record key (id), payload, two partition columns, precombine field.
   val inputSchema = StructType(
     List(
       StructField("id", StringType, false),
       StructField("attribute", StringType, false),
       StructField("p_year", IntegerType, false),
       StructField("p_month", IntegerType, false),
       StructField("sequence", IntegerType, false)
     )
   )
   
   val initialData = Seq[Row](
     Row("1", "abc", 2019, 1, 1),
     Row("2", "def", 2018, 2, 2),
     Row("3", "ghi", 2018, 3, 3)
   )
   
   // spark is the active SparkSession (for example, the one provided by spark-shell).
   val initialDataFrame = spark.createDataFrame(spark.sparkContext.parallelize(initialData), inputSchema)
   
   // Write a Hive-style partitioned table and sync it to the Hive Metastore.
   initialDataFrame.write.format("hudi")
     .option(TABLE_NAME.key(), "test")
     .option(RECORDKEY_FIELD.key(), "id")
     .option(PRECOMBINE_FIELD.key(), "sequence")
     .option("hoodie.table.name", "test")
     .option(OPERATION.key(), "insert_overwrite")
     .option(PARTITIONPATH_FIELD.key(), "p_year,p_month")
     .option(KEYGENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.ComplexKeyGenerator")
     .option(HIVE_STYLE_PARTITIONING.key(), "true")
     .option(HIVE_SYNC_ENABLED.key(), true)
     .option(HIVE_SYNC_MODE.key(), "HMS")
     .option(HIVE_DATABASE.key(), "db_abc_raw")
     .option(HIVE_TABLE.key(), "test")
     .option(HIVE_CREATE_MANAGED_TABLE.key(), false)
     .mode("append")
     .save("hdfs:///datalake/abc/raw/abc2/abc3/abc4")`
   
   2. After the code has finished, the data is written to HDFS and a Hudi table is created in the Hive Metastore.
   3. The data can now be read with Spark and also with Hive.
   4. However, when we try to read the data with Impala, no data is shown.
   5. So we execute the following query to recover the partitions. Result: "Partitions have been recovered.":
   `ALTER TABLE db_abc_raw.test RECOVER PARTITIONS;`
   6. When we execute the following query:
   `show partitions db_abc_raw.test;`
   Result: (Please see attachment)
   7. We were able to query the Hudi table with Hive (Tez) without problems. The data is displayed.
   8. We were also able to read the data with Spark and Hudi (`.format("hudi")`) without problems.
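
For comparison with the `show partitions` output in step 6, the partition directories that step 1 should produce can be sketched as follows. This is only an illustration, assuming that Hive-style partitioning with the two partition fields yields one `field=value` directory level per field, in the order given to `PARTITIONPATH_FIELD`. The names `SampleRow` and `ExpectedPartitions` are made up for this sketch and are not part of Hudi.

```scala
// Hypothetical helper type: only the columns that matter for the partition path.
final case class SampleRow(id: String, pYear: Int, pMonth: Int)

object ExpectedPartitions {
  // Partition fields in the order passed to PARTITIONPATH_FIELD in step 1.
  val partitionFields: Seq[String] = Seq("p_year", "p_month")

  // The three rows written in step 1.
  val rows: Seq[SampleRow] = Seq(
    SampleRow("1", 2019, 1),
    SampleRow("2", 2018, 2),
    SampleRow("3", 2018, 3)
  )

  // Assumption: Hive-style partitioning gives "field=value" segments joined by "/".
  def partitionPath(row: SampleRow): String = {
    val values = Map("p_year" -> row.pYear.toString, "p_month" -> row.pMonth.toString)
    partitionFields.map(f => s"$f=${values(f)}").mkString("/")
  }

  def main(args: Array[String]): Unit = {
    val basePath = "hdfs:///datalake/abc/raw/abc2/abc3/abc4"
    // Print one expected directory per distinct partition.
    rows.map(partitionPath).distinct.sorted.foreach(p => println(s"$basePath/$p"))
  }
}
```

If Impala lists these three partitions but reports no files for them, the problem is more likely in the file listing than in partition discovery.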
   
   **Expected behavior**
   
   It should be possible to query the table with impala and data should be displayed.
   
   **Environment Description**
   
   * Hudi version :
   0.10.0 + 0.10.1
   
   * Spark version :
   3.1.1
   
   * Hive version :
   3.1.3000
   
   * Hadoop version :
   3.1.1
   
   * Storage (HDFS/S3/GCS..) :
   HDFS
   
   * Running on Docker? (yes/no) :
   no
   
   **Additional context**
   
   - Could be connected with https://github.com/apache/hudi/issues/4830 ?
   
   **Stacktrace**
   
   No stacktrace available.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] fisser001 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

fisser001 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1050642447


   Hi @garyli1019, it is an external table. We tried two different approaches. First, we created the external Impala table manually with a DDL script. Second, we let Hudi create the external table via the "HIVE_SYNC_ENABLED" option. Both approaches end in the same result. And yes, we are using the Hive Metastore; Impala gets its table information from it.



[GitHub] [hudi] nsivabalan commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1050387645


   @garyli1019: Can you help us here? We don't have any experience with Impala.



[GitHub] [hudi] fisser001 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

fisser001 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1066530031


   @garyli1019 It is an external table. We tried both: creating the table as proposed in the docs, and using Hudi Hive sync. Both end in the same result. The table gets created, but Impala cannot show any data.



[GitHub] [hudi] garyli1019 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1054979776


   @fisser001 I didn't try the HIVE_SYNC way myself, so I am not sure whether it would work, but the DDL script should work. Is it possible to find any stack trace or log from the Impala frontend? Possibly the file listing has some issues.



[GitHub] [hudi] fisser001 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

fisser001 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1056483661


   Hi @garyli1019: I have run the queries again and checked the logs. There are no errors visible in the logs. The query simply does not find any data, and Impala does not seem to treat this as an error. I have exported the query plan, which is the only thing I can offer. For comparison, I have also created an external partitioned table that is stored as plain Parquet without Hudi, and I attach its query plan too. The two plans differ in particular in how the partitions are recognized (see "00:SCAN HDFS"). Hope this helps.
   
   [query_profile_partioned_hudi.txt](https://github.com/apache/hudi/files/8167636/query_profile_partioned_hudi.txt)
   [query_profile_partioned_parquet.txt](https://github.com/apache/hudi/files/8167637/query_profile_partioned_parquet.txt)



[GitHub] [hudi] garyli1019 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1066498891


   @fisser001 Hi, do you mean your Hudi table is not an external table? Was the Hudi table created as described in these docs: https://hudi.apache.org/docs/querying_data#impala-34-or-later



[GitHub] [hudi] garyli1019 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1050616343


   Hi, is the Hudi table stored as an Impala external table, or are you using the Hive Metastore as Impala's catalog? @fisser001



[GitHub] [hudi] nsivabalan commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1061353183


   @garyli1019: Can you please follow up when you get a chance?



[GitHub] [hudi] fisser001 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

fisser001 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1066743261


   @garyli1019 We also tried this. However, there is still no data. Maybe it would be a good idea to upgrade the Hudi version used in the Impala project?



[GitHub] [hudi] garyli1019 commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1066544282


   @fisser001 Impala actually doesn't support Hudi Hive sync. That is why we need to create the external table, recover the partitions, and refresh the table manually. Tables created by Hive sync use HoodieHiveInputFormat, but Impala reads HUDI_PARQUET as regular Parquet. Those two are totally different, and it could be problematic to use Hive and Impala for the same table.
   Would you try this: create an external Impala table pointing to the Hudi HDFS path, then run Impala queries to recover the partitions and refresh the table. If this still doesn't work, we probably need some help from Impala support.
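
A rough sketch of the three statements this manual route involves, written as a small Scala helper that only builds the statement text. The `CREATE EXTERNAL TABLE ... LIKE PARQUET ... STORED AS HUDIPARQUET` form follows the Hudi docs linked earlier in this thread. The object name, the table name `db_abc_raw.test_impala`, and the data-file path are placeholders, not values from the report. The optional `PARTITIONED BY` clause is an assumption for the partitioned case; whether it conflicts with partition columns already present in the data file schema has not been verified here.

```scala
object ImpalaHudiStatements {
  // Builds the CREATE statement; partitionColumns holds (name, type) pairs and may be empty.
  def createTable(
      table: String,
      sampleParquetFile: String,
      location: String,
      partitionColumns: Seq[(String, String)]
  ): String = {
    val partitionClause: Seq[String] =
      if (partitionColumns.isEmpty) Seq.empty
      else Seq(partitionColumns.map { case (n, t) => s"$n $t" }.mkString("PARTITIONED BY (", ", ", ")"))

    (Seq(s"CREATE EXTERNAL TABLE $table", s"LIKE PARQUET '$sampleParquetFile'") ++
      partitionClause ++
      Seq("STORED AS HUDIPARQUET", s"LOCATION '$location'")).mkString("\n")
  }

  // Statements to run after the table exists, in this order.
  def recoverPartitions(table: String): String = s"ALTER TABLE $table RECOVER PARTITIONS"
  def refresh(table: String): String = s"REFRESH $table"

  def main(args: Array[String]): Unit = {
    val table = "db_abc_raw.test_impala" // hypothetical table name
    val statements = Seq(
      createTable(
        table,
        "/path/to/one/data-file.parquet", // placeholder: any Parquet data file of the table
        "hdfs:///datalake/abc/raw/abc2/abc3/abc4",
        Seq("p_year" -> "INT", "p_month" -> "INT")
      ),
      recoverPartitions(table),
      refresh(table)
    )
    statements.foreach(s => println(s + ";\n"))
  }
}
```

The printed statements are meant to be run from an Impala client such as impala-shell, against a table that is separate from the one created by Hive sync.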

