Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/06 19:12:07 UTC

[GitHub] [hudi] rishabhbandi opened a new issue, #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

rishabhbandi opened a new issue, #6055:
URL: https://github.com/apache/hudi/issues/6055

   **Describe the problem you faced**
   
   **Scenario #1:**
   
   1) We created a dataframe (**targetDf**) and wrote it to a GCS bucket location (for example, **locA**) using the statement below:
   targetDf.write.format("org.apache.hudi").options(hudiWriteConf).mode(SaveMode.Overwrite).save(locA)
   
   2) We then created an external Hudi table on locA. Let's call it **ext_hudi_tbl_on_locA**.
   
   3) Next, we have a dataframe containing the records whose columns are to be updated. Let's call it **updDf**.
   
   4) We created a Spark table on top of **updDf** in the Spark session. Let's call it **upd_spark_tbl**.
   
   5) We then ran the MERGE command through spark.sql() on **ext_hudi_tbl_on_locA** using **upd_spark_tbl**. The statement finishes without any error, but it does not update any record.
   
   NOTE: We checked that there is no data issue: joining **ext_hudi_tbl_on_locA** and **upd_spark_tbl** works and returns the joined rows.
   
   
   **Scenario #2**
   
   1) We created a managed Hudi table. Let's call it **int_hudi_tbl**.
   
   2) We inserted the data from **targetDf** into this Hudi table through spark.sql().
   
   3) Next, we have a dataframe containing the records whose columns are to be updated. Let's call it **updDf**.
   
   4) We created a Spark table on top of **updDf** in the Spark session. Let's call it **upd_spark_tbl**.
   
   5) We then ran the MERGE command through spark.sql() on **int_hudi_tbl** using **upd_spark_tbl**. The statement finishes without any error, and this time it updates the data.
   
   
   CONCLUSION
   Scenario #1: no error is thrown and the update does not work. Scenario #2: no error is thrown and the update works.
   
   Please advise why it is not working in Scenario #1.
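
For readers following along, the statements involved in Scenario #1 can be sketched as below. This is a minimal sketch, not the reporter's actual code: the table names come from the description above and the key columns come from the `hoodie.datasource.write.recordkey.field` value posted elsewhere in this thread, while the location `gs://bucket/locA` and the updated column `order_status` are hypothetical placeholders. The helpers only build SQL strings, so they run without a Spark session.

```python
from typing import Sequence


def create_external_table_sql(table: str, location: str) -> str:
    """DDL that registers an existing Hudi dataset at `location` as a table."""
    return f"CREATE TABLE {table} USING hudi LOCATION '{location}'"


def merge_sql(target: str, source: str,
              keys: Sequence[str], update_cols: Sequence[str]) -> str:
    """MERGE statement that updates only `update_cols` on rows matching `keys`."""
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause}"
    )


# Hypothetical location and update column; table names are from the issue text.
ddl = create_external_table_sql("ext_hudi_tbl_on_locA", "gs://bucket/locA")
stmt = merge_sql(
    "ext_hudi_tbl_on_locA",
    "upd_spark_tbl",
    keys=["data_src_cd", "sales_order_num"],
    update_cols=["order_status"],
)
print(ddl)
print(stmt)
```

In a live session each string would be passed to `spark.sql(...)`. Whether a partial `SET` list is honored on an external table is the open question of this issue, so the sketch shows only the shape of the statements.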
   
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 2.4.8
   
   * Hive version :2.3.7
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : no
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1229348797

   @rishabhbandi: do you mind sharing a reproducible script? It would help us investigate faster.




[GitHub] [hudi] rishabhbandi commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
rishabhbandi commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1179026939

   @hassan-ammar the command below is used to start the spark shell:
   spark-shell --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar,/edge_data/code/svcordrdats/pipeline-resources/hudi-support-jars/hudi-spark-bundle_2.12-0.11.0.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=512m --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.sql.catalogImplementation=hive'
   
   You can save the Hudi config mentioned in my Jira as a hudiConf.conf file and pass that conf file to the options method.
   




[GitHub] [hudi] hassan-ammar commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
hassan-ammar commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1177771338

   @rishabhbandi can you please share the correct config to set the table path?
   
   I am trying your scenario #2 (merging through spark.sql with a managed Hudi table) and getting this error:
   An error occurred while calling o89.sql. Hoodie table not found in path file:/tmp/spark-warehouse/[table_name]/.hoodie
   




[GitHub] [hudi] hassan-ammar commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
hassan-ammar commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1177903224

   Logging off for today. @rishabhbandi It would be really great if you could share how to set the configs. I have tried the following:
    `spark = SparkSession.builder.config('hoodie.base.path','s3://[bucket path]/')`
   `.config('BASE_PATH.key','s3://[bucket path]/')`
   Also tried:
   `spark.sql("set hoodie.base.path=s3://[bucket path]/[table_name]/")`
   
   




[GitHub] [hudi] hassan-ammar commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
hassan-ammar commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1184125659

   For Scenario 1, _hoodie_commit_time is updated for the rows that satisfy the merge criteria, but the other column values are not updated.
   For Scenario 2, I am still getting the "Hoodie table not found" error.
   
   I am using AWS Glue along with the Hudi connector for Glue.
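
The Scenario 1 symptom described above (only `_hoodie_commit_time` changes) can be checked by comparing a row before and after the merge. The following is a minimal sketch with made-up row values and a hypothetical `order_status` column; in practice the two dicts would come from rows collected from the table before and after running the MERGE.

```python
def changed_columns(before: dict, after: dict) -> set:
    """Return the names of columns whose value differs between two row snapshots."""
    all_columns = before.keys() | after.keys()
    return {col for col in all_columns if before.get(col) != after.get(col)}


# Illustrative values only: same record read before and after the MERGE.
before = {
    "sales_order_num": "SO-1",
    "order_status": "OPEN",
    "_hoodie_commit_time": "20220701000000",
}
after = {
    "sales_order_num": "SO-1",
    "order_status": "OPEN",
    "_hoodie_commit_time": "20220706000000",
}

symptom = changed_columns(before, after)
print(symptom)
```

If the merge had applied the update, `order_status` would appear in the result alongside the commit time.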




[GitHub] [hudi] rishabhbandi commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
rishabhbandi commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1303007672

   Hi Team, we created a separate custom Java class to perform the partial update.
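
The thread does not show the custom class, so the following is not the reporter's implementation. It is only a sketch of what "partial update" usually means for a record merge, under two assumptions: null (`None`) incoming fields mean "keep the existing value", and an incoming record that is older by the precombine field is ignored. The field names in the example are hypothetical apart from `src_modfd_ts`, which is the precombine field from the posted config.

```python
def partial_update(existing: dict, incoming: dict, precombine_field: str) -> dict:
    """Overlay the non-null fields of `incoming` on top of `existing`.

    The incoming record is ignored when its precombine value is older than
    the existing record's value.
    """
    old_ts = existing.get(precombine_field)
    new_ts = incoming.get(precombine_field)
    if old_ts is not None and new_ts is not None and new_ts < old_ts:
        return dict(existing)

    merged = dict(existing)
    for field, value in incoming.items():
        if value is not None:  # None means "column not supplied in this update"
            merged[field] = value
    return merged


existing = {"sales_order_num": "SO-1", "order_status": "OPEN", "qty": 5, "src_modfd_ts": 1}
incoming = {"sales_order_num": "SO-1", "order_status": "SHIPPED", "qty": None, "src_modfd_ts": 2}
merged = partial_update(existing, incoming, "src_modfd_ts")
print(merged)
```

Here `qty` keeps its existing value because the incoming record leaves it unset, while `order_status` and `src_modfd_ts` take the incoming values.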




[GitHub] [hudi] rishabhbandi commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
rishabhbandi commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1177121640

   **Hudi Config**
   "hoodie.datasource.write.recordkey.field" = "data_src_cd,sales_order_num"
   "hoodie.datasource.write.partitionpath.field" = "op_cmpny_cd,order_placed_dt"
   "hoodie.datasource.write.precombine.field" = "src_modfd_ts"
   "hoodie.datasource.write.operation" = "upsert"
   "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
   "hoodie.table.name" = "ww_mb_dl_secure.sales_order"
   "hoodie.datasource.write.keygenerator.class" = "org.apache.hudi.keygen.ComplexKeyGenerator"
   "hoodie.datasource.write.hive_style_partitioning" = "true"
   "hoodie.datasource.hive_sync.support_timestamp" = "true"
   "hoodie.cleaner.commits.retained" = 2
   "hoodie.datasource.query.type" = "snapshot"
   
   **Spark Shell** 
   spark-shell --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar,/edge_data/code/svcordrdats/pipeline-resources/hudi-support-jars/hudi-spark-bundle_2.12-0.11.0.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=512m --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.sql.catalogImplementation=hive'
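
For anyone trying to reproduce this, the config lines above can be turned into a plain options map. This is a minimal sketch that assumes the file contains exactly the `"key" = value` lines shown in this comment; `parse_hudi_conf` is a hypothetical helper, not a Hudi or Spark API, and it keeps every value as a string.

```python
def parse_hudi_conf(text: str) -> dict:
    """Parse lines of the form `"key" = "value"` into a dict of strings."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not key = value
        key, value = line.split("=", 1)
        conf[key.strip().strip('"')] = value.strip().strip('"')
    return conf


sample = '''
"hoodie.datasource.write.recordkey.field" = "data_src_cd,sales_order_num"
"hoodie.datasource.write.precombine.field" = "src_modfd_ts"
"hoodie.datasource.write.operation" = "upsert"
"hoodie.cleaner.commits.retained" = 2
'''

hudi_conf = parse_hudi_conf(sample)
print(hudi_conf)
```

In PySpark the resulting dict could then be passed to the writer as `df.write.format("hudi").options(**hudi_conf)`; the reporter's own setup uses Scala in spark-shell, so treat this as an equivalent rather than the original code.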




[GitHub] [hudi] fengjian428 commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1189751745

   @voonhous 




[GitHub] [hudi] nsivabalan commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1236382839

   @rishabhbandi : gentle ping. 




[GitHub] [hudi] rishabhbandi commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
rishabhbandi commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1177119908

   Hudi Config




[GitHub] [hudi] rishabhbandi commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
rishabhbandi commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1177835873

   @hassan-ammar can we have one working session if possible?




[GitHub] [hudi] nsivabalan commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1302881510

   Hey @rishabhbandi @hassan-ammar: were you folks able to resolve the issue? Did any fix go into Hudi in this regard?
   Can you help me understand whether the issue still persists?
    




[GitHub] [hudi] yihua commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1176915832

   @rishabhbandi could you provide the Hudi configs you use to write and update the tables?
   
   @YannByron @xiarixiaoyao @XuQianJin-Stars could any of you help check if there is a problem?




[GitHub] [hudi] nsivabalan commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1209924314

   @rishabhbandi: can you respond to the clarifications when you get a chance, please?




[GitHub] [hudi] rishabhbandi closed issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
rishabhbandi closed issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table
URL: https://github.com/apache/hudi/issues/6055




[GitHub] [hudi] voonhous commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
voonhous commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1189966009

   @rishabhbandi I don't quite understand the steps between:
   
   ```txt
   1)created a dataframe(targetDf) and using the below statement to write it in GCS Bucket location (for ex - locA)
   targetDF.write.format(org.apache.hudi).options(hudiWriteConf).mode(SaveMode.Overwrite).save(locA)
   
   2)then we are creating an external hudi table on locA. lets call it ext_hudi_tbl_on_locA
   ```
   
   and 
   
   ```txt
   1)we create an managed hudi table. lets call int_hudi_tbl
   
   2)we insert data from targetDf into the above hudi table. using spark.sql() way.
   ```
   
   Can you please provide a code example instead? Thanks.




[GitHub] [hudi] xushiyan commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1188603783

   cc @fengjian428 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hassan-ammar commented on issue #6055: Hudi Partial Update not working by using MERGE statement on Hudi External Table

Posted by GitBox <gi...@apache.org>.
hassan-ammar commented on issue #6055:
URL: https://github.com/apache/hudi/issues/6055#issuecomment-1177861912

   @rishabhbandi  we can talk now

