Posted to commits@hudi.apache.org by "whight (via GitHub)" <gi...@apache.org> on 2023/04/13 11:27:04 UTC

[GitHub] [hudi] whight opened a new issue, #8451: [SUPPORT] Insert write operation pre combined problem

whight opened a new issue, #8451:
URL: https://github.com/apache/hudi/issues/8451

   **Describe the problem you faced**
   
   I use Spark Structured Streaming to import Kafka data into a Hudi table, and the Kafka messages contain many records with the same id. The write operation is INSERT, which means pre-combining should not be applied, but I found that many rows in the table were upserted: only a few of the rows with duplicate keys are kept in the table. Why?
   
   **Expected behavior**
   Every row with a duplicate key should be stored in the table.
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.2.1
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : Aliyun OSS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   This is the Spark writer:
   ```scala
   dataSet.writeStream
     .format("org.apache.hudi")
     // Copy-on-write table written with the INSERT operation (no pre-combine expected)
     .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
     .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
     // Record key, pre-combine field and partition path
     .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "id")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key(), "timestamp")
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "day")
     // Hive sync settings
     .option(DataSourceWriteOptions.HIVE_URL.key(), "")
     .option(DataSourceWriteOptions.HIVE_DATABASE.key(), dbName)
     .option(DataSourceWriteOptions.HIVE_TABLE.key(), tableName)
     .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS.key(), "day")
     .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED.key(), "true")
     .option("hoodie.table.name", tableName)
     // Shuffle parallelism
     .option("hoodie.bulkinsert.shuffle.parallelism", "6")
     .option("hoodie.insert.shuffle.parallelism", "6")
     .option("hoodie.upsert.shuffle.parallelism", "6")
     .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.ComplexKeyGenerator")
     .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key(), "true")
     .option("path", "")
     .option("checkpointLocation", "")
     .start()
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] ad1happy2go commented on issue #8451: [SUPPORT] Insert write operation pre combined problem

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8451:
URL: https://github.com/apache/hudi/issues/8451#issuecomment-1508801895

   @whight Can you try setting hoodie.merge.allow.duplicate.on.inserts to true?
   I was able to reproduce your issue, and it was fixed with the above setting.
   
   Alternatively, you can use bulk insert, which is faster but does not do small-file handling; you can schedule a separate clustering job for that.
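   
   For illustration, a minimal sketch of adding that option to the writer from the issue description (import and option key spelled out; the rest of the writer is unchanged):
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   
   dataSet.writeStream
     .format("org.apache.hudi")
     // Same INSERT operation as in the issue
     .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
     // Keep duplicate record keys when inserts are merged into existing small files
     .option("hoodie.merge.allow.duplicate.on.inserts", "true")
     // ... remaining options as in the original writer ...
     .start()
   ```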




[GitHub] [hudi] ad1happy2go commented on issue #8451: [SUPPORT] Insert write operation pre combined problem

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8451:
URL: https://github.com/apache/hudi/issues/8451#issuecomment-1511233030

   @whight We can't say it is directly a bug: it happens because of small-file handling, which merges incoming inserts into existing small files. The documentation says that insert can leave duplicates, but it does not guarantee that all duplicates are kept.
   
   Created a JIRA to make keeping duplicates the default behaviour - https://issues.apache.org/jira/browse/HUDI-6089
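   
   A quick way to check how many record keys actually kept duplicates after a commit (sketch; "<table base path>" is a placeholder, and "id" is the record key field from the issue):
   
   ```scala
   // Sketch: count record keys that appear more than once in the Hudi table.
   val df = spark.read.format("org.apache.hudi").load("<table base path>")
   df.groupBy("id")
     .count()
     .filter("count > 1")
     .show()
   ```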
   
   




[GitHub] [hudi] whight closed issue #8451: [SUPPORT] Insert write operation pre combined problem

Posted by "whight (via GitHub)" <gi...@apache.org>.
whight closed issue #8451: [SUPPORT] Insert write operation pre combined problem
URL: https://github.com/apache/hudi/issues/8451




[GitHub] [hudi] whight commented on issue #8451: [SUPPORT] Insert write operation pre combined problem

Posted by "whight (via GitHub)" <gi...@apache.org>.
whight commented on issue #8451:
URL: https://github.com/apache/hudi/issues/8451#issuecomment-1511137106

   @ad1happy2go 
   I found that the setting's default value is false, yet there are still a number of duplicate rows in the table. Is this a bug?
   
   I viewed the FAQ page; the section "How does Hudi handle duplicate record keys in an input" reads as follows:
   
   > For an insert or bulk_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the dataset would also contain duplicates. If you don't want duplicate records either issue an upsert or consider specifying option to de-duplicate input in either [datasource](https://hudi.apache.org/docs/configurations.html#INSERT_DROP_DUPS_OPT_KEY) or [deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229).
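   
   For reference, the datasource option mentioned there is hoodie.datasource.write.insert.drop.duplicates; a sketch of setting it (it drops duplicate keys from the input before the insert, the opposite of what I want here):
   
   ```scala
   dataSet.writeStream
     .format("org.apache.hudi")
     .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
     // De-duplicate the incoming batch on the record key before inserting
     .option("hoodie.datasource.write.insert.drop.duplicates", "true")
     // ... remaining options as in the original writer ...
     .start()
   ```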
   
   I suggest adding a tip about hoodie.merge.allow.duplicate.on.inserts to that section.
   
   
   > @whight Can you try setting hoodie.merge.allow.duplicate.on.inserts to true? I was able to reproduce your issue, and it was fixed with the above setting.
   > 
   > Alternatively, you can use bulk insert, which is faster but does not do small-file handling; you can schedule a separate clustering job for that.
   
   

