Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/03 07:07:45 UTC

[GitHub] [incubator-hudi] venkee14 opened a new issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

venkee14 opened a new issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482
 
 
   I am following https://cwiki.apache.org/confluence/display/HUDI/2020/01/15/Delete+support+in+Hudi to get "Deletion with HoodieDeltaStreamer" working for my existing dataset in Hudi.
   My initial dataset was written without the "_hoodie_is_deleted" key. I am now trying to upsert with this key set on all incoming records. My code:
   <code>
   Dataset<Row> deletedRows = dataframe.filter(dataframe.col(this.deleteKey).equalTo(this.deleteValue));
   Dataset<Row> remainingRows = dataframe.filter(dataframe.col(this.deleteKey).notEqual(this.deleteValue));
   deletedRows = deletedRows.withColumn("_hoodie_is_deleted", lit(true));
   remainingRows = remainingRows.withColumn("_hoodie_is_deleted", lit(false));
   dataframe = deletedRows.union(remainingRows);
   </code>
   I have noticed that the upsert runs fine when the record to be deleted is the only record in the parquet file, but when the file contains other records it fails with the error below:
   Null-value for required field: _hoodie_is_deleted
   	at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
   
   I would appreciate any help here.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Load initial dataset without _hoodie_is_deleted in the schema
   2. Pick a record from a parquet file that contains multiple records
   3. Delete this record by upserting it with _hoodie_is_deleted : true; the flag is passed on all incoming upserts.
   4. The upsert throws "Null-value for required field: _hoodie_is_deleted"
   
   It works when the record to be deleted is the only record in the parquet file.
   
   **Expected behavior**
   
   Only the single flagged record should be deleted from the parquet file, all other records should remain, and the upsert should not throw "Null-value for required field: _hoodie_is_deleted"
   
   **Environment Description**
   
   * Hudi version : 0.5.1
   
   * Spark version : 2.2
   
   * EMR Version: emr-5.28
   
   * Hive version : NA
   
   * Hadoop version : NA
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   StackTrace : 
   
   Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
   	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
   	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:204)
   	... 32 more
   Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
   	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
   	... 33 more
   Caused by: java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
   	at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
   	at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
   	at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
   	at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
   	at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
   	at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:296)
   	at org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:434)
   	at org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:424)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	... 3 more
   )
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] venkee14 commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
venkee14 commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608504894
 
 
   @nsivabalan : My new schema looks like this:
   
   20/04/02 06:04:08 INFO DeltaSync: Registering Schema :[{"type":"record","name":"hoodie_source","namespace":"hoodie.source","fields": ........
   
   {"name":"updatedby_user","type":["string","null"]},{"name":"_hoodie_is_deleted","type":"boolean"},{"name":"partition_date","type":["string","null"]}]}]
   
   Let me know if you need the complete schema definition.

----------------------------------------------------------------

[GitHub] [incubator-hudi] nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-609498911
 
 
   @venkee14 : can you try setting a default value for the new field?
   {
     "name" : "_hoodie_is_deleted",
     "type" : "boolean",
     "default" : false
   }
   Let me know if this works.
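
One way to read the suggestion above: when an existing parquet file is rewritten, records stored before the flag existed carry no value for it, so a non-nullable field without a default has nothing to write. The sketch below is a plain-Java toy model of that idea only. It does not use Avro, Parquet, or Hudi APIs, the names `DefaultFieldSketch`, `Field`, and `rewrite` are hypothetical, and whether Hudi 0.5.1 actually applies the schema default on this code path is not confirmed anywhere in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of rewriting an old record under a newer schema (not Avro or Hudi code). */
public class DefaultFieldSketch {

    /** Hypothetical field descriptor: a name plus an optional default (null means no default). */
    static final class Field {
        final String name;
        final Object defaultValue;

        Field(String name, Object defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }
    }

    /** Copies oldRecord into the shape of newSchema, filling missing fields from defaults. */
    static Map<String, Object> rewrite(Map<String, Object> oldRecord, Field[] newSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field f : newSchema) {
            Object value = oldRecord.containsKey(f.name) ? oldRecord.get(f.name) : f.defaultValue;
            if (value == null) {
                // Same wording as the message in the reported stack trace.
                throw new RuntimeException("Null-value for required field: " + f.name);
            }
            out.put(f.name, value);
        }
        return out;
    }

    public static void main(String[] args) {
        // A record written before the delete flag was added to the schema.
        Map<String, Object> existing = new LinkedHashMap<>();
        existing.put("updatedby_user", "alice");

        Field[] withDefault = {
            new Field("updatedby_user", null),
            new Field("_hoodie_is_deleted", Boolean.FALSE)
        };
        // The missing flag is filled from the default, so the rewrite succeeds.
        System.out.println(rewrite(existing, withDefault));
    }
}
```

In this model, dropping the default from `_hoodie_is_deleted` makes `rewrite` throw for any record that predates the flag, which matches the symptom that only files containing other, older records fail.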
   

----------------------------------------------------------------

[GitHub] [incubator-hudi] nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608466335
 
 
   @venkee14 : may I know what the new schema looks like?

----------------------------------------------------------------

[GitHub] [incubator-hudi] nsivabalan edited a comment on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608466335
 
 
   @venkee14 : may I know what the new schema looks like? Did you update the schema explicitly?

----------------------------------------------------------------

[GitHub] [hudi] vinothchandar closed issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
vinothchandar closed issue #1482:
URL: https://github.com/apache/hudi/issues/1482


   


----------------------------------------------------------------



[GitHub] [incubator-hudi] nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-615844051
 
 
   @venkee14 : did the change suggested above work?

----------------------------------------------------------------

[GitHub] [incubator-hudi] bvaradar commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608448949
 
 
   @nsivabalan : can you take a look at this issue?

----------------------------------------------------------------

[GitHub] [incubator-hudi] venkee14 commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
venkee14 commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608505242
 
 
   > @venkee14 : may I know how does the new schema look like? Did you update the schema explicitly
   
   No, I did not change the schema explicitly

----------------------------------------------------------------

[GitHub] [incubator-hudi] nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1482:
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-626100680


   gentle ping. 


----------------------------------------------------------------