Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/15 02:56:50 UTC

[GitHub] [hudi] uvplearn opened a new issue, #5869: [SUPPORT]

uvplearn opened a new issue, #5869:
URL: https://github.com/apache/hudi/issues/5869

   
   **Description**
   With GLOBAL_BLOOM, the Hudi MOR table ends up with duplicate values across different partitions, and the value in the same partition is not updated.
   
   **Steps to reproduce this behavior**
   **Step 1**
   I created a Hudi table with the following input data and properties:

   ```python
   hudi_options = {
       'hoodie.table.name': 'my_hudi_table',
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'creation_date',
       'hoodie.datasource.write.precombine.field': 'last_update_time',
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.bloom.index.update.partition.path': 'true',
       'hoodie.index.type': 'GLOBAL_BLOOM',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.database': 'pfg_silver_fantasy',
       'hoodie.datasource.hive_sync.table': 'hudi_test1',
       'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
       'hoodie.datasource.hive_sync.support_timestamp': 'true',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
   }

   # Create a DataFrame
   inputDF = spark.createDataFrame(
       [
           ("100", "2015-01-01", "1", "a"),
           ("101", "2015-01-01", "1", "a"),
       ],
       ["id", "creation_date", "last_update_time", "new_col"]
   )

   # Write the DataFrame as a Hudi dataset
   inputDF.write \
       .format('org.apache.hudi') \
       .options(**hudi_options) \
       .mode('overwrite') \
       .save('s3://<loc>/hudi_test1')
   ```
                 
   **Output after step 1 in the _rt table:**

   | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | last_update_time | new_col | creation_date |
   |---|---|---|---|---|---|---|---|---|
   | 20220615024525 | 20220615024525_0_1 | id:101 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 101 | 1 | a | 2015-01-01 |
   | 20220615024525 | 20220615024525_0_2 | id:100 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 100 | 1 | a | 2015-01-01 |
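
   For reference, a minimal sketch of how the `_rt` output above can be read back. It assumes the Hive-synced table names follow Hudi's default MOR suffixing (`hudi_test1_ro` / `hudi_test1_rt`) under the configured database; adjust to your catalog.

   ```python
   # Read the Hive-synced MOR views (assumes default _ro / _rt suffixes from Hive sync).
   rt_df = spark.sql("SELECT * FROM pfg_silver_fantasy.hudi_test1_rt")  # real-time (snapshot) view
   ro_df = spark.sql("SELECT * FROM pfg_silver_fantasy.hudi_test1_ro")  # read-optimized view
   rt_df.show(truncate=False)
   ```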
   
   
   **Step 2: Upserting**

   ```python
   inputDF = spark.createDataFrame(
       [
           ("100", "2015-01-02", "2", "b"),
           ("101", "2015-01-01", "2", "b")
       ],
       ["id", "creation_date", "last_update_time", "new_col"]
   )

   inputDF.write \
       .format('org.apache.hudi') \
       .options(**hudi_options) \
       .option('hoodie.datasource.write.operation', 'upsert') \
       .mode('append') \
       .save('s3://<loc>/hudi_test2')
   ```
   
   **Output after step 2 in the _rt table:**

   | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | last_update_time | new_col | creation_date |
   |---|---|---|---|---|---|---|---|---|
   | 20220615024525 | 20220615024525_0_1 | id:101 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 101 | 1 | a | 2015-01-01 |
   | 20220615024525 | 20220615024525_0_2 | id:100 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 100 | 1 | a | 2015-01-01 |
   | 20220615024626 | 20220615024626_1_3 | id:100 | creation_date=2015-01-02 | 6c1dbd2d-5db5-4c65-b180-f1d9561cf637-0_1-92-39217_20220615024626.parquet | 100 | 2 | b | 2015-01-02 |
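
   To make the duplicate visible programmatically, the snapshot can be grouped by record key; a hedged sketch (the load path is illustrative and should point at the table's base path, and older Hudi releases such as 0.7.0 may need a partition glob like `basePath + "/*"`):

   ```python
   # Any record key appearing more than once across partitions indicates a duplicate.
   df = spark.read.format("hudi").load("s3://<loc>/hudi_test1")
   df.groupBy("_hoodie_record_key") \
     .count() \
     .filter("count > 1") \
     .show(truncate=False)
   ```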
   
   
   
   
   
   **Expected behavior**
   
   The table should not contain duplicate values for the same record key, and the record remaining in the same partition (id 101) should be updated with the new values.
   
   **Environment Description**
   
   * Hudi version : hudi-spark-bundle_2.11-0.7.0-amzn-1.jar
   
* Spark version : 2.4.7-amzn-1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan closed issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM
URL: https://github.com/apache/hudi/issues/5869




[GitHub] [hudi] nsivabalan commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1216250137

   If the issue is resolved, can you please close the GitHub issue?




[GitHub] [hudi] codope commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1524786956

   Reopening. Fix is in progress - https://github.com/apache/hudi/pull/8490




[GitHub] [hudi] nsivabalan commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1289933598

   Closing this due to no activity. Feel free to open a new issue if you are having any more issues; we can definitely look into it deeply.
   




[GitHub] [hudi] nsivabalan closed issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM
URL: https://github.com/apache/hudi/issues/5869




[GitHub] [hudi] nsivabalan commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1527003611

   From what I can glean from the description, it looks like the query is a read-optimized (RO) query and update-partition-path is set to true. So, with the second commit, the delete record went to a log file in partition creation_date=2015-01-01, while the new insert for the same record key (100) went to the new partition creation_date=2015-01-02. Hence an RO query, which only reads base files, will return duplicates. If you trigger compaction, this should be resolved; this is a known limitation of RO queries.
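
   For example, inline compaction can be enabled on the writer so the log file containing the delete is merged into the base file after a few delta commits; a minimal sketch reusing the options dict from the report (the `hoodie.compact.inline.max.delta.commits` value is illustrative):

   ```python
   # Enable inline compaction so the delete in creation_date=2015-01-01 is applied to the base
   # file; after compaction the RO view no longer returns the stale copy of key 100.
   hudi_options.update({
       'hoodie.compact.inline': 'true',
       'hoodie.compact.inline.max.delta.commits': '1',  # compact after every delta commit (illustrative)
   })
   ```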
   
   
   Also, if you prefer not to update the partition path, e.g. if you wish to retain the record with record key 100 in partition 2015-01-01 itself, you should set `hoodie.bloom.index.update.partition.path` to `false`.
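
   A minimal sketch of that variant, again reusing the options dict from the report:

   ```python
   # Keep record key 100 in its original partition (2015-01-01): the update is applied
   # in place instead of moving the record to creation_date=2015-01-02.
   hudi_options['hoodie.bloom.index.update.partition.path'] = 'false'
   ```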
   
   




[GitHub] [hudi] codope commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
codope commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1158964146

   Also, if you query using `spark.read.format("hudi")`, do you see the right records? I am trying to figure out whether this is an issue with Hive _rt tables only.
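
   For reference, a sketch of the Spark DataSource reads being suggested; the query-type option below is assumed to be available in the Hudi version in use, and the path placeholder mirrors the report:

   ```python
   # Snapshot query: merges base and log files, so it should show the latest values.
   snap_df = spark.read.format("hudi") \
       .option("hoodie.datasource.query.type", "snapshot") \
       .load("s3://<loc>/hudi_test1")

   # Read-optimized query: base files only, so it can lag behind until compaction runs.
   ro_df = spark.read.format("hudi") \
       .option("hoodie.datasource.query.type", "read_optimized") \
       .load("s3://<loc>/hudi_test1")
   ```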




[GitHub] [hudi] codope commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
codope commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1158963015

   @uvplearn I am unable to reproduce the issue. I am using the latest master of Hudi (with 0.7.0 I'm facing some other issue in my setup). Meanwhile, I would suggest trying out the latest Hudi (0.11.0), or if you're on AWS you could use the 0.10.1-amzn-0 version of Hudi.
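
   A hedged sketch of pulling a newer bundle into a PySpark session; the artifact coordinates are an assumption and should be checked against Maven Central for your Spark/Scala versions:

   ```python
   from pyspark.sql import SparkSession

   # Illustrative coordinates only (Spark 2.4 / Scala 2.11 bundle of Hudi 0.11.0).
   spark = SparkSession.builder \
       .appName("hudi-upgrade-test") \
       .config("spark.jars.packages", "org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0") \
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
       .getOrCreate()
   ```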




[GitHub] [hudi] nsivabalan commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1216249331

   Yeah, I remember we had a fix around this in 0.10.1 or some release. Can you try out 0.11 or later versions and let us know what you see?
   




[GitHub] [hudi] nsivabalan commented on issue #5869: [SUPPORT] There are duplicate values in HUDI MOR table for different partition and not updating values in same partition for GLOBAL_BLOOM

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #5869:
URL: https://github.com/apache/hudi/issues/5869#issuecomment-1527002010

   Not reproducible w/ https://github.com/apache/hudi/pull/8490
   
   ```
   
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.spark.sql.{DataFrame, Row, SparkSession}
   import scala.collection.mutable
   
   val tableName = "hudi5869"
   val spark = SparkSession.builder.enableHiveSupport.getOrCreate
   
   val basePath = "/tmp/hudi5869/"
   
   import spark.implicits._
   // spark-shell
   
   
   val hudiOptions = mutable.Map(
         "hoodie.table.name" -> tableName,
         "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
         "hoodie.datasource.write.operation" -> "upsert",
         "hoodie.datasource.write.recordkey.field" -> "id",
         "hoodie.datasource.write.precombine.field" -> "last_update_time",
         "hoodie.datasource.write.partitionpath.field" -> "creation_date",
         "hoodie.index.type" -> "GLOBAL_BLOOM",
         "hoodie.bloom.index.update.partition.path" -> "true",
         "hoodie.compact.inline" -> "true",
         "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator"
       )
   
   val df = Seq(
         ("100", "2015-01-01", "1","a"),
         ("101", "2015-01-01", "1","a")
       ).toDF("id", "creation_date", "last_update_time", "new_col")
   
   
    df.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   spark.read.format("hudi").load(basePath).show(false)
   
   
   val df1 = Seq(
         ("100", "2015-01-02", "2","b"),
         ("101", "2015-01-01", "2","b")
       ).toDF("id", "creation_date", "last_update_time", "new_col")
   
   
    df1.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   spark.read.format("hudi").load(basePath).show(false)
   
   
   ```
   
   Output:
   ```
   scala> spark.read.format("hudi").load(basePath).show(false)
   +-------------------+---------------------+------------------+----------------------+---------------------------------------------------------------------------+---+-------------+----------------+-------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                          |id |creation_date|last_update_time|new_col|
   +-------------------+---------------------+------------------+----------------------+---------------------------------------------------------------------------+---+-------------+----------------+-------+
   |20230427220827516  |20230427220827516_0_1|id:101            |2015-01-01            |f183954a-9d23-4192-a1ed-8efc25e4e77f-0                                     |101|2015-01-01   |2               |b      |
   |20230427220827516  |20230427220827516_1_0|id:100            |2015-01-02            |b395a368-8e9a-46c8-8660-c78cfd53d06f-0_1-275-1770_20230427220827516.parquet|100|2015-01-02   |2               |b      |
   +-------------------+---------------------+------------------+----------------------+---------------------------------------------------------------------------+---+-------------+----------------+-------+
   
   ```

