You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/05 06:03:13 UTC

[GitHub] [hudi] FredMkl opened a new issue, #6591: [SUPPORT]Duplicate records in MOR

FredMkl opened a new issue, #6591:
URL: https://github.com/apache/hudi/issues/6591

   **Describe the problem you faced**
   We use MOR table, we found that when updating an existing set of rows to another partition will result in both a)generate a parquet file b)an update written to a log file. This brings duplicate records
   
   **To Reproduce**
   ```
   //action1: spark-dataframe write
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.spark.sql.{DataFrame, Row, SparkSession}
   import scala.collection.mutable
   
   val tableName = "f_schedule_test"
   val basePath = "oss://nbadatalake-poc/fred/warehouse/dw/f_schedule_test"
   val spark = SparkSession.builder.enableHiveSupport.getOrCreate
   
   import spark.implicits._
   // spark-shell
   
   val df = Seq(
         ("1", "10001", "2022-08-30","2022-08-30 12:00:00.000","2022-08-30"),
         ("2", "10002", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("3", "10003", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("4", "10004", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("5", "10005", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("6", "10006", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30")
       ).toDF("game_schedule_id", "game_id", "game_date_cn", "insert_date", "dt")
   
       // df.show()
   val hudiOptions = mutable.Map(
         "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
         "hoodie.datasource.write.operation" -> "upsert",
         "hoodie.datasource.write.recordkey.field" -> "game_schedule_id",
         "hoodie.datasource.write.precombine.field" -> "insert_date",
         "hoodie.datasource.write.partitionpath.field" -> "dt",
         "hoodie.index.type" -> "GLOBAL_BLOOM",
         "hoodie.compact.inline" -> "true",
         "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator"
       )
   
   //step1: insert  --no issue
   df.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   
   //step2: move part data to another partition --no issue
   val df1 = spark.sql("select * from dw.f_schedule_test where dt = '2022-08-30'").withColumn("dt",lit("2022-08-31")).limit(3)
   df1.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   
   //step3: move back  --duplicate occurs
   //Updating an existing set of rows will result in either a) a companion log/delta file for an existing base parquet file generated from a previous compaction or b) an update written to a log/delta file in case no compaction ever happened for it.
   val df2 = spark.sql("select * from dw.f_schedule_test where dt = '2022-08-31'").withColumn("dt",lit("2022-08-30")).limit(3)
   df2.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   ```
   **Checking scripts:**
   ```
   select * from dw.f_schedule_test where game_schedule_id = 1;
   
   select _hoodie_file_name,count(*) as co from dw.f_schedule_test group by _hoodie_file_name;
   ```
   
   **results:**
   ![截屏2022-09-05 13 36 10](https://user-images.githubusercontent.com/110440662/188367978-138fdce1-24d8-4cfa-ae81-82467f9cde09.png)
   ![截屏2022-09-05 13 36 19](https://user-images.githubusercontent.com/110440662/188368061-c7846de6-2490-475f-99ec-d240c5307033.png)
   
   
   **Expected behavior**
   
   Duplicate records should not occur
   
   **Environment Description**
   
   * Hudi version :0.10.1
   
   * Spark version :3.2.1
   
   * Hive version :3.1.2
   
   * Hadoop version :3.2.1
   
   * Storage (HDFS/S3/GCS..) :OSS
   
   * Running on Docker? (yes/no) :no
   [hoodie.zip](https://github.com/apache/hudi/files/9487065/hoodie.zip)
   
   
   **Stacktrace**
   
   ```Pls check logs as attached
   [hoodie.zip](https://github.com/apache/hudi/files/9487078/hoodie.zip)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1246366467

   atleast I will file a follow up jira so that we don't miss this. thanks for filing the issue. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] ad1happy2go commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1525265398

   JIRA created to fix the issue - https://issues.apache.org/jira/browse/HUDI-6146


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] FredMkl closed issue #6591: [SUPPORT]Duplicate records in MOR

Posted by GitBox <gi...@apache.org>.
FredMkl closed issue #6591: [SUPPORT]Duplicate records in MOR
URL: https://github.com/apache/hudi/issues/6591


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] ad1happy2go commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1525255395

   Issue still exists in master. 
   
   Reproducible script - 
   ```
   //action1: spark-dataframe write
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.spark.sql.{DataFrame, Row, SparkSession}
   import scala.collection.mutable
   
   val tableName = "issue_6591_try6"
   val basePath = "file:///tmp/issue_6591_try6"
   val spark = SparkSession.builder.enableHiveSupport.getOrCreate
   
   import spark.implicits._
   // spark-shell
   
   val df = Seq(
         ("1", "10001", "2022-08-30","2022-08-30 12:00:00.000","2022-08-30"),
         ("2", "10002", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("3", "10003", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("4", "10004", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("5", "10005", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30"),
         ("6", "10006", "2022-08-31","2022-08-30 12:00:00.000","2022-08-30")
       ).toDF("game_schedule_id", "game_id", "game_date_cn", "insert_date", "dt")
   df.createOrReplaceTempView("f_schedule_test")
       // df.show()
   val hudiOptions = mutable.Map(
         "hoodie.table.name" -> tableName,
         "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
         "hoodie.datasource.write.operation" -> "upsert",
         "hoodie.datasource.write.recordkey.field" -> "game_schedule_id",
         "hoodie.datasource.write.precombine.field" -> "insert_date",
         "hoodie.datasource.write.partitionpath.field" -> "dt",
         "hoodie.index.type" -> "GLOBAL_BLOOM",
         "hoodie.compact.inline" -> "true",
         "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator"
       )
   
   //step1: insert  --no issue
   df.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   spark.read.format("hudi").load(basePath).show(false)
   spark.read.format("hudi").load(basePath).createOrReplaceTempView("f_schedule_test")
   
   //step2: move part data to another partition --no issue
   val df1 = spark.sql("select * from f_schedule_test where dt = '2022-08-30'").withColumn("dt",lit("2022-08-31")).limit(3)
   df1.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   spark.read.format("hudi").load(basePath).show(false)
   
   //step3: move back  --duplicate occurs
   //Updating an existing set of rows will result in either a) a companion log/delta file for an existing base parquet file generated from a previous compaction or b) an update written to a log/delta file in case no compaction ever happened for it.
   spark.read.format("hudi").load(basePath).createOrReplaceTempView("f_schedule_test")
   val df2 = spark.sql("select * from f_schedule_test where dt = '2022-08-31'").withColumn("dt",lit("2022-08-30")).limit(3)
   df2.write.format("hudi").
           options(hudiOptions).
           mode(Append).
           save(basePath)
   spark.read.format("hudi").load(basePath).show(false)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1524756517

   Reopening to validate against master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan closed issue #6591: [SUPPORT]Duplicate records in MOR

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed issue #6591: [SUPPORT]Duplicate records in MOR
URL: https://github.com/apache/hudi/issues/6591


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] FredMkl commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by GitBox <gi...@apache.org>.
FredMkl commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1246449719

   > 
   Thank you for reply, actually we found that even if a compaction kicks in, the duplication data is still there.
   We tried the settings below, then the duplicate disappear, unfortunately this is not we want.
     hoodie.compact.inline.trigger.strategy = 'NUM_COMMITS',
     hoodie.compact.inline.max.delta.commits = '1'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1526991688

   Its already fixed w/ this patch https://github.com/apache/hudi/pull/8490
   
   ```
   scala> spark.read.format("hudi").load(basePath).show(false)
   +-------------------+---------------------+------------------+----------------------+---------------------------------------------------------------------------+----------------+-------+------------+-----------------------+----------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                          |game_schedule_id|game_id|game_date_cn|insert_date            |dt        |
   +-------------------+---------------------+------------------+----------------------+---------------------------------------------------------------------------+----------------+-------+------------+-----------------------+----------+
   |20230427215728276  |20230427215728276_0_3|game_schedule_id:5|2022-08-30            |bd4d1121-57bc-4103-91a0-5541a474ef9e-0_0-28-369_20230427215728276.parquet  |5               |10005  |2022-08-31  |2022-08-30 12:00:00.000|2022-08-30|
   |20230427215728276  |20230427215728276_0_4|game_schedule_id:6|2022-08-30            |bd4d1121-57bc-4103-91a0-5541a474ef9e-0_0-28-369_20230427215728276.parquet  |6               |10006  |2022-08-31  |2022-08-30 12:00:00.000|2022-08-30|
   |20230427215728276  |20230427215728276_0_5|game_schedule_id:1|2022-08-30            |bd4d1121-57bc-4103-91a0-5541a474ef9e-0_0-28-369_20230427215728276.parquet  |1               |10001  |2022-08-30  |2022-08-30 12:00:00.000|2022-08-30|
   |20230427215753406  |20230427215753406_1_0|game_schedule_id:2|2022-08-30            |484347af-e681-4b1e-ad99-e7c1cd9adeea-0_1-150-1051_20230427215753406.parquet|2               |10002  |2022-08-31  |2022-08-30 12:00:00.000|2022-08-30|
   |20230427215753406  |20230427215753406_1_1|game_schedule_id:3|2022-08-30            |484347af-e681-4b1e-ad99-e7c1cd9adeea-0_1-150-1051_20230427215753406.parquet|3               |10003  |2022-08-31  |2022-08-30 12:00:00.000|2022-08-30|
   |20230427215753406  |20230427215753406_1_2|game_schedule_id:4|2022-08-30            |484347af-e681-4b1e-ad99-e7c1cd9adeea-0_1-150-1051_20230427215753406.parquet|4               |10004  |2022-08-31  |2022-08-30 12:00:00.000|2022-08-30|
   +-------------------+---------------------+------------------+----------------------+---------------------------------------------------------------------------+----------------+-------+------------+-----------------------+----------+
   
   
   scala> spark.read.format("hudi").load(basePath).count
   res9: Long = 6
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1247367471

   yes, I get it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #6591: [SUPPORT]Duplicate records in MOR

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1246261284

   yes, this is a known limitation we have w/ MOR table. If a compaction kicks in, you may not see the update in older partitions/ where it was deleted. it will be an issue until then. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org