Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/29 00:49:37 UTC

[GitHub] [hudi] jiegzhan opened a new issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

jiegzhan opened a new issue #2051:
URL: https://github.com/apache/hudi/issues/2051


   **Step 1: I created a hudi table and inserted 4 records with this query:**
   ```
   val hudiConfig = Map[String, String](
     "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
     "hoodie.datasource.write.recordkey.field" -> "word",
     "hoodie.datasource.write.precombine.field" -> "word",
     "hoodie.datasource.write.partitionpath.field" -> "",
     "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
     "hoodie.datasource.write.hive_style_partitioning" -> "true",
     "hoodie.datasource.hive_sync.database" -> "default",
     "hoodie.datasource.hive_sync.table" -> tableName,
     "hoodie.datasource.hive_sync.username" -> "hadoop",
     "hoodie.datasource.hive_sync.partition_fields" -> "",
     "hoodie.datasource.hive_sync.enable" -> "true",
     "hoodie.datasource.hive_sync.partition_extractor_class" -> classOf[NonPartitionedExtractor].getCanonicalName,
     "hoodie.combine.before.insert" -> "false",
     "hoodie.combine.before.upsert" -> "false",
     "hoodie.parquet.max.file.size" -> "2000000000",
     "hoodie.parquet.block.size" -> "2000000000",
     "hoodie.parquet.small.file.limit" -> "512000000",
     "hoodie.cleaner.commits.retained" -> "1"
   )
   
   val insertRecords = Seq(
     ("java", 3),
     ("java", 56),
     ("python", 19),
     ("scala", -28)
   ).toDF("word", "number")
   
   insertRecords.
     write.format("org.apache.hudi").
     option(HoodieWriteConfig.TABLE_NAME, tableName).
     option("hoodie.datasource.write.operation", "insert").
     options(hudiConfig).
     mode(SaveMode.Append).
     save(tablePath)
   ```
   
   Result looks like this:
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+------+------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|  word|number|
   +-------------------+--------------------+------------------+----------------------+--------------------+------+------+
   |     20200829004101|  20200829004101_0_1|              java|                      |fd94def5-1ecd-453...|  java|     3|
   |     20200829004101|  20200829004101_0_2|              java|                      |fd94def5-1ecd-453...|  java|    56|
   |     20200829004101|  20200829004101_0_3|            python|                      |fd94def5-1ecd-453...|python|    19|
   |     20200829004101|  20200829004101_0_4|             scala|                      |fd94def5-1ecd-453...| scala|   -28|
   +-------------------+--------------------+------------------+----------------------+--------------------+------+------+
   ```
   
   **Step 2: I tried to insert a new record with this query:**
   ```
   val insertNewRecords = Seq(("java", 999999)).toDF("word", "number")
   
   insertNewRecords.
     write.format("org.apache.hudi").
     option(HoodieWriteConfig.TABLE_NAME, tableName).
     option("hoodie.datasource.write.operation", "insert").
     options(hudiConfig).
     mode(SaveMode.Append).
     save(tablePath)
   ```
   Result looks like this:
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+------+------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|  word|number|
   +-------------------+--------------------+------------------+----------------------+--------------------+------+------+
   |     20200829004331|  20200829004331_0_5|              java|                      |fd94def5-1ecd-453...|  java|999999|
   |     20200829004331|  20200829004331_0_6|              java|                      |fd94def5-1ecd-453...|  java|999999|
   |     20200829004101|  20200829004101_0_3|            python|                      |fd94def5-1ecd-453...|python|    19|
   |     20200829004101|  20200829004101_0_4|             scala|                      |fd94def5-1ecd-453...| scala|   -28|
   +-------------------+--------------------+------------------+----------------------+--------------------+------+------+
   ```
   
   How come the `insert` operation didn't insert a new row, but instead updated existing records? That is what the `upsert` operation is supposed to do.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar closed issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

bvaradar closed issue #2051:
URL: https://github.com/apache/hudi/issues/2051


   





[GitHub] [hudi] leesf commented on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

leesf commented on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-683213967


   @jiegzhan hi, it is because Hudi handles small files automatically. After the first insert, the file size is less than 512000000, so it is a small file, and Hudi treats new records as updates to that existing small file. If you set the small file size limit to 0, you will see that the new records won't update the old records. However, I think we should insert the record regardless of file size when using the insert operation.
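   A minimal sketch of that override, assuming the `hudiConfig` map from the original report (only the changed key is shown; the value `"0"` is the workaround suggested here, not a recommended production setting):
   ```scala
   // Hedged sketch: setting the small-file limit to 0 disables Hudi's
   // small-file handling, so an insert writes a brand-new base file instead
   // of being routed into an existing small file.
   val insertOnlyOverrides = Map[String, String](
     "hoodie.parquet.small.file.limit" -> "0" // original report used 512000000
   )
   ```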





[GitHub] [hudi] bvaradar commented on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

bvaradar commented on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-691777749


   Closing this as we have a jira. 





[GitHub] [hudi] bvaradar commented on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

bvaradar commented on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-683241932


   @jiegzhan : Record key uniqueness is a fundamental constraint in Hudi. Even for inserts, the expectation is that primary key uniqueness is maintained. The only reason step (1) allowed duplicates is that "hoodie.combine.before.insert=false"; if that property is turned on, deduplication would have eliminated the duplicate. For an insert-only use case, you can add a UUID column as the record key to preserve uniqueness.
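   One hedged sketch of that suggestion, in plain Scala without the Spark plumbing from the original report (the variable names here are illustrative; in the actual job you would point "hoodie.datasource.write.recordkey.field" at the new UUID column):
   ```scala
   import java.util.UUID

   // Attach a freshly generated UUID to each row so the record key is
   // unique even when the payload values (e.g. "java") repeat.
   val payload = Seq(("java", 3), ("java", 56), ("python", 19))
   val keyedRows = payload.map { case (word, number) =>
     (UUID.randomUUID().toString, word, number)
   }

   // Every record key is distinct, so inserts can never collapse into updates.
   assert(keyedRows.map(_._1).distinct.size == keyedRows.size)
   ```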





[GitHub] [hudi] jiegzhan commented on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

jiegzhan commented on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-683224661


   insert is insert and upsert is upsert; if they get mixed up, the behavior changes and the underlying data can no longer be trusted.
   
   Are you planning to insert a new row into 1) an existing small file or 2) a new small file? Option 1) is better, since S3 won't accumulate a lot of small files after many insert queries.
   





[GitHub] [hudi] leesf edited a comment on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

leesf edited a comment on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-683213967


   @jiegzhan hi, it is because Hudi handles small files automatically. After the first insert, the file size is less than 512000000, so it is a small file, and Hudi treats new records as updates to that existing small file. If you set the small file size limit to 0, you will see that the new records won't update the old records. However, I think we should insert the record regardless of file size when using the insert operation. I created a jira ticket to track this: https://issues.apache.org/jira/projects/HUDI/issues/HUDI-1234





[GitHub] [hudi] jiegzhan edited a comment on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

jiegzhan edited a comment on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-683224661


   insert is insert and upsert is upsert; if they get mixed up, the behavior changes and the underlying data can no longer be trusted.
   
   Are you planning to insert a new row into 1) an existing small file or 2) a new small file? Option 1) is better, since S3 won't accumulate a lot of small files after many insert queries.
   





[GitHub] [hudi] leesf commented on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

leesf commented on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-683280244


   @bvaradar hi, I think the point @jiegzhan raised is reasonable: for the insert operation, we should not update existing records. Right now the behavior differs depending on the small file limit. When it is set to 0, new inserts do not update the old records and are written into a new file; but when it is set to another value such as 128M, new inserts may update old records residing in a small file picked up by the UpsertPartitioner.
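   A simplified, hypothetical sketch of the routing decision described above (this is not Hudi's actual UpsertPartitioner code; the types and names are invented for illustration):
   ```scala
   // Each existing base file in the partition, with its current size.
   case class FileSlice(fileId: String, sizeBytes: Long)

   // With a nonzero small-file limit, an insert is bin-packed into the first
   // file under the limit (where it merges by record key); with a limit of 0,
   // no file qualifies and the insert opens a new file instead.
   def routeInsert(existing: Seq[FileSlice], smallFileLimit: Long): Option[FileSlice] =
     existing.find(_.sizeBytes < smallFileLimit)

   val files = Seq(FileSlice("fd94def5", 1024L))
   assert(routeInsert(files, 512000000L).contains(files.head)) // reuses the small file
   assert(routeInsert(files, 0L).isEmpty)                      // writes a new file
   ```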





[GitHub] [hudi] bvaradar commented on issue #2051: [SUPPORT] insert operation didn't insert a new record, instead it updated existing records in my no-primary-key table

bvaradar commented on issue #2051:
URL: https://github.com/apache/hudi/issues/2051#issuecomment-683825842


   @leesf  @jiegzhan : This sounds fair to me. I have opened a jira to track this.
   
   https://issues.apache.org/jira/browse/HUDI-1257
   
   

