Posted to commits@hudi.apache.org by "xiagupqin (via GitHub)" <gi...@apache.org> on 2023/03/20 03:43:26 UTC

[GitHub] [hudi] xiagupqin opened a new issue, #8236: [SUPPORT]Duplicate data in MOR table Hudi

xiagupqin opened a new issue, #8236:
URL: https://github.com/apache/hudi/issues/8236

   **Describe the problem you faced**
   We run a Spark Structured Streaming application that reads a Kafka stream, processes the data, and stores it in Hudi.
   We started seeing duplicate records in our Hudi dataset.
   Below are our Hudi configs:
   
   `DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
   DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "id",
   DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt",
   DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "ts",
   HoodieCompactionConfig.INLINE_COMPACT.key() -> "true",
   `
   We only use upsert in our code:
   `dataframe.write.format("org.apache.hudi")
         .option(HoodieWriteConfig.TABLE_NAME, hudiTableName)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .options(hudiOptions).mode(SaveMode.Append)
         .save(s3Location)`
   
   **Environment Description**
   
   * Hudi version :0.12.0
   
   * Spark version :3.3.0
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :s3
   
   * Running on Docker? (yes/no) :no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] xiagupqin commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "xiagupqin (via GitHub)" <gi...@apache.org>.
xiagupqin commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1475576852

   Please help me look into this.




[GitHub] [hudi] xiagupqin commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "xiagupqin (via GitHub)" <gi...@apache.org>.
xiagupqin commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1484312854

   @nsivabalan 




[GitHub] [hudi] ad1happy2go commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1503813179

   @xiagupqin Can you please let us know whether you still see this issue after applying the fix or disabling the metadata table?




[GitHub] [hudi] xiagupqin commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "xiagupqin (via GitHub)" <gi...@apache.org>.
xiagupqin commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1475576693

   @nsivabalan these are the duplicate records:
   [query-hive-10092.csv](https://github.com/apache/hudi/files/11013511/query-hive-10092.csv)
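   
   One way to confirm duplicates like these from Spark is to group a snapshot read by Hudi's `_hoodie_partition_path` and `_hoodie_record_key` meta columns and keep the groups with more than one row. This is a minimal sketch, not something posted in the issue: the object name `FindDuplicateKeys` and the base-path argument are hypothetical, and it assumes a non-global index, so the same `id` in two different `dt` partitions is not counted as a duplicate.
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}
   import org.apache.spark.sql.functions.{col, count}
   
   // Hypothetical helper, not part of Hudi.
   object FindDuplicateKeys {
   
     // Returns one row per (partition path, record key) pair that occurs more than once.
     def duplicateKeys(df: DataFrame): DataFrame =
       df.groupBy(col("_hoodie_partition_path"), col("_hoodie_record_key"))
         .agg(count("*").as("copies"))
         .filter(col("copies") > 1)
   
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("find-duplicate-keys").getOrCreate()
       // Assumption: args(0) is the table base path, e.g. the s3Location used by the writer.
       val snapshot = spark.read.format("hudi").load(args(0))
       duplicateKeys(snapshot).show(20, truncate = false)
       spark.stop()
     }
   }
   ```
   
   If this returns no rows while a Hive query still shows repeated keys, that would suggest the duplication happens on the read path rather than in the written data.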
   




[GitHub] [hudi] xiagupqin commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "xiagupqin (via GitHub)" <gi...@apache.org>.
xiagupqin commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1484312750

   Hello nsivabalan, thank you very much for your advice. Can you tell me how to disable the metadata table?




[GitHub] [hudi] ad1happy2go commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1486225521

   @xiagupqin You can disable the metadata table by setting the property `hoodie.metadata.enable` to `false`.
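   
   As a rough sketch of how that could look with the writer from the issue description: the `hudiOptions` map below is a hypothetical stand-in for the reporter's options, written with plain string keys, and `optionsWithoutMetadata` is an illustrative name.
   
   ```scala
   // Hypothetical stand-in for the hudiOptions map from the issue description.
   val hudiOptions: Map[String, String] = Map(
     "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
     "hoodie.datasource.write.recordkey.field" -> "id",
     "hoodie.datasource.write.partitionpath.field" -> "dt",
     "hoodie.datasource.write.precombine.field" -> "ts",
     "hoodie.compact.inline" -> "true"
   )
   
   // Add the property suggested above so the writer does not use the metadata table.
   val optionsWithoutMetadata: Map[String, String] =
     hudiOptions + ("hoodie.metadata.enable" -> "false")
   ```
   
   The resulting map can then be passed to the existing write call through `.options(optionsWithoutMetadata)` in place of `.options(hudiOptions)`.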




[GitHub] [hudi] codope commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1488072678

   @xiagupqin Can you confirm that there are no duplicates after disabling metadata? Most likely, https://github.com/apache/hudi/pull/8079 fixes this issue too.




[GitHub] [hudi] xiagupqin commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "xiagupqin (via GitHub)" <gi...@apache.org>.
xiagupqin commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1475579357

   @nsivabalan 




[GitHub] [hudi] xiagupqin commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "xiagupqin (via GitHub)" <gi...@apache.org>.
xiagupqin commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1486824378

   > @xiagupqin You can disable the metadata table by setting the property `hoodie.metadata.enable` to `false`.
   
   Thanks, I will try it.




[GitHub] [hudi] codope commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1527119084

   I could not reproduce this with the master branch. My script:
   ```
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{col, from_json,to_json,struct}
   import org.apache.spark.sql.types.{IntegerType, StringType, LongType, StructType}
   import java.time.LocalDateTime
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.hudi.config.HoodieCompactionConfig
   import org.apache.spark.sql.streaming.OutputMode
   
   val dataStreamReader = spark.
         readStream.
         format("kafka").
         option("kafka.bootstrap.servers", "localhost:9092").
         option("subscribe", "impressions").
         option("startingOffsets", "earliest"). // also tried with "latest"
         option("maxOffsetsPerTrigger", 2000). // also tried with 1000, 5000
         option("failOnDataLoss", false)
   
    val schema = new StructType().
         add("impresssiontime",LongType).
         add("impressionid",StringType).
         add("userid",StringType).
         add("adid",StringType)
   
    val df = dataStreamReader.load().
    selectExpr(
           "topic as kafka_topic",
           "CAST(partition AS STRING) kafka_partition",
           "cast(timestamp as String) kafka_timestamp",
           "CAST(offset AS STRING) kafka_offset",
           "CAST(key AS STRING) kafka_key",
           "CAST(value AS STRING) kafka_value",
           "current_timestamp() current_time").
           selectExpr(
           "kafka_topic",
           "concat(kafka_partition,'-',kafka_offset) kafka_partition_offset",
           "kafka_offset",
           "kafka_timestamp",
           "kafka_key",
           "kafka_value",
           "substr(current_time,1,10) partition_date").
           select(
           col("kafka_topic"),
           col("kafka_partition_offset"),
           col("kafka_offset"),
           col("kafka_timestamp"),
           col("kafka_key"),
           col("kafka_value"),
           from_json(col("kafka_value"), schema).as("data"),
           col("partition_date")).
           select(
           "kafka_topic",
           "kafka_partition_offset",
           "kafka_offset",
           "kafka_timestamp",
           "kafka_key",
           "kafka_value",
           "data.impresssiontime",
           "data.impressionid",
           "data.userid",
           "data.adid",
           "partition_date")
   
   
   val writer = df.
       writeStream.format("org.apache.hudi").
         option(TABLE_TYPE.key, "MERGE_ON_READ").
         option(TBL_NAME.key, "mor_table").
         option(PRECOMBINE_FIELD.key, "impresssiontime").
         option(RECORDKEY_FIELD.key, "impressionid").
         option(PARTITIONPATH_FIELD.key, "userid").
         option(HIVE_SYNC_ENABLED.key, true).
         option(HIVE_STYLE_PARTITIONING.key, true).
         option(HoodieCompactionConfig.INLINE_COMPACT.key, true).
         option(STREAMING_RETRY_CNT.key, 0).
         option(OPERATION.key, UPSERT_OPERATION_OPT_VAL).
         option("hoodie.datasource.hive_sync.database", "default").
         option("hoodie.datasource.hive_sync.table", "mor_table").
         option("hoodie.datasource.hive_sync.username", "hive").
         option("hoodie.datasource.hive_sync.password","hive").
         option("hoodie.datasource.hive_sync.use_jdbc",true).
         option("hoodie.datasource.hive_sync.jdbcurl","jdbc:hive2://hiveserver:10000").
         option("checkpointLocation", "/tmp/hudi_streaming_kafka/checkpoint/").
         option("path", "/tmp/hudi_streaming_kafka/mor_table").
         outputMode(OutputMode.Append()).
         start()
   
   writer.awaitTermination()
   ```
   This does not seem related to spark streaming. There was a bug in the implementation of `HoodieMergeOnReadTableInputFormat` which is fixed by https://github.com/apache/hudi/commit/fe43e6f85d6d98db43b4fdc144654a07db032a28
   
   You can upgrade to the latest release of Hudi.




[GitHub] [hudi] codope closed issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #8236: [SUPPORT]Duplicate data in MOR table Hudi
URL: https://github.com/apache/hudi/issues/8236




[GitHub] [hudi] nsivabalan commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1481924602

   There could be two reasons:
   1. You may need to set hoodie.datasource.write.streaming.ignore.failed.batch = false (the default value is false).
   2. We found a corner case with the metadata table and timeline server that may have resulted in duplicates. Can you disable the metadata table and give it a try?
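   
   For the first point, a minimal sketch of setting the option on a streaming writer. `StreamingWriteSettings` and `failOnBadBatch` are hypothetical names; only the config key comes from the comment above. Setting the value explicitly avoids relying on the default, which may differ between Hudi releases.
   
   ```scala
   import org.apache.spark.sql.Row
   import org.apache.spark.sql.streaming.DataStreamWriter
   
   // Hypothetical helper, not part of Hudi.
   object StreamingWriteSettings {
     val IgnoreFailedBatchKey = "hoodie.datasource.write.streaming.ignore.failed.batch"
   
     // Returns the writer with the option set so that failed micro-batches are not ignored.
     def failOnBadBatch(writer: DataStreamWriter[Row]): DataStreamWriter[Row] =
       writer.option(IgnoreFailedBatchKey, "false")
   }
   ```
   
   It could be applied as `StreamingWriteSettings.failOnBadBatch(df.writeStream.format("org.apache.hudi"))` before the remaining options and `start()`.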
   




[GitHub] [hudi] xiagupqin commented on issue #8236: [SUPPORT]Duplicate data in MOR table Hudi

Posted by "xiagupqin (via GitHub)" <gi...@apache.org>.
xiagupqin commented on issue #8236:
URL: https://github.com/apache/hudi/issues/8236#issuecomment-1484328482

   @nsivabalan I can't find the config hoodie.datasource.write.streaming.ignore.failed.batch = false. Can you tell me where to set it?

