Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/18 11:57:50 UTC

[GitHub] [hudi] Limess opened a new issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Limess opened a new issue #4031:
URL: https://github.com/apache/hudi/issues/4031


   **Describe the problem you faced**
   
   It is currently possible to set `_hoodie_is_deleted` to a non-null, non-`true` value (for example, a string).
   
   In this scenario, the column is written to the target table. At this point it's included in the schema, which seems to be undesirable in all cases.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a hudi table
   2. Upsert a record with `_hoodie_is_deleted="some-string"`
   3. Observe that the record is written to the underlying Hudi table
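
The steps above might be sketched in PySpark roughly as follows. This is a minimal sketch, not the exact job that triggered the issue: the table name, base path, and the `id`, `ts`, and `part` columns are hypothetical, and only `_hoodie_is_deleted` and the string value come from the steps above. It assumes a Spark session with the Hudi bundle on the classpath.

```python
# Hypothetical table name and field names, chosen only for illustration.
hudi_options = {
    "hoodie.table.name": "repro_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.operation": "upsert",
}

# One record whose delete marker is a string instead of a boolean.
columns = ["id", "ts", "part", "_hoodie_is_deleted"]
rows = [("1", 1, "a", "some-string")]


def upsert(spark, base_path):
    """Upsert `rows` into a Hudi table at `base_path`.

    `spark` is an existing SparkSession with the Hudi bundle available;
    `base_path` is a hypothetical location such as an S3 prefix.
    """
    df = spark.createDataFrame(rows, columns)
    df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Calling `upsert(spark, base_path)` and then reading the table back is one way to check whether the record and the string-typed column were persisted.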
   
   **Expected behavior**
   
   The string value should be treated as truthy and the record should be ignored/deleted in the target table.
   
   I can't see any scenario where you would want to populate this column.
   
   **Environment Description**
   
   EMR 6.4.0
   
   * Hudi version : 0.9.0
   * Spark version : 3.1.2
   * Hive version : Hive 3.1.2
   * Hadoop version : Amazon 3.2.1
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : no
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4031:
URL: https://github.com/apache/hudi/issues/4031#issuecomment-974820517


   Let me know if I understand your question correctly.
   - You are seeing a behavior where, when "_hoodie_is_deleted" is set to null or false, Hudi persists this column on storage. And you are asking why we need to do this and why not just drop the column altogether?
   
   
   I guess it's easier to have the same schema across the incoming dataset and what's in storage. Also, since this is a boolean column and only non-deleted entries are persisted on storage, it will compress nicely. So I don't think dropping the column would give us much benefit.
   
   Let me know if this makes sense. Happy to discuss more if you need more clarification or have suggestions to improve this further.
   
   





[GitHub] [hudi] Limess commented on issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Posted by GitBox <gi...@apache.org>.
Limess commented on issue #4031:
URL: https://github.com/apache/hudi/issues/4031#issuecomment-976248107


   > You are seeing a behavior where, when "_hoodie_is_deleted" is set to null or false, Hudi persists this column on storage. And you are asking why we need to do this and why not just drop the column altogether?
   
   Yes, that's largely the question. We assumed it would be dropped, as the deleted records are not persisted and it's otherwise redundant, and there already seem to be code paths to drop redundant columns (e.g. `hoodie.datasource.write.drop.partition.columns`).
   
   We were also caught out when we used a string value by mistake. This ended up being written to the end datastore, which then broke our schema in a seemingly non-recoverable way (as it was written to the table, and now we had a schema type change which wasn't obviously compatible). 
   
   I'd suggest:
   * Possibly dropping the column (though, as you say, that may have little benefit). If not, documenting the behaviour somewhere. Alternatively, always including the column, along with the other Hudi metadata fields which are already prepended to the written schema.
   * If the column is not a boolean:
   	* Failing hard, as this column is essentially "reserved" for Hudi
   	* Taking `IS NOT NULL` as truthy
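
The two options for a non-boolean value could look roughly like the sketch below. It is plain Python over dictionary records for illustration only, not Hudi's actual code path; the function names `is_delete_strict` and `is_delete_lenient` are hypothetical.

```python
from typing import Any, Mapping

DELETE_FIELD = "_hoodie_is_deleted"


def is_delete_strict(record: Mapping[str, Any]) -> bool:
    """Option 1: fail hard when the marker is neither null nor a boolean."""
    value = record.get(DELETE_FIELD)
    if value is None:
        return False
    if not isinstance(value, bool):
        raise TypeError(
            f"{DELETE_FIELD} must be a boolean or null, got {type(value).__name__}"
        )
    return value


def is_delete_lenient(record: Mapping[str, Any]) -> bool:
    """Option 2: keep boolean semantics, treat any other non-null value as a delete."""
    value = record.get(DELETE_FIELD)
    if isinstance(value, bool):
        return value
    return value is not None
```

With the record from the reproduction steps, the strict variant would reject `"some-string"` before anything is written, while the lenient variant would treat the record as a delete. Either way the string would never reach the table schema.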








[GitHub] [hudi] Limess commented on issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Posted by GitBox <gi...@apache.org>.
Limess commented on issue #4031:
URL: https://github.com/apache/hudi/issues/4031#issuecomment-972847534


   After actually getting `_hoodie_is_deleted` to work, we noticed it's always written to the target table schema - should this be the case? In my mind it should be dropped by Hudi, as it makes no sense for it to exist as `null` or `false` on every row in the Hudi table.





[GitHub] [hudi] nsivabalan commented on issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4031:
URL: https://github.com/apache/hudi/issues/4031#issuecomment-993916668


   Got it. It would be nice to add the validation at a minimum. I have filed a tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-3018). Closing this GitHub issue out for now. Feel free to add more to the tracking Jira.
   





[GitHub] [hudi] Limess edited a comment on issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #4031:
URL: https://github.com/apache/hudi/issues/4031#issuecomment-972847534


   After actually getting `_hoodie_is_deleted` to work correctly, we noticed it's always still written to the target table schema (as a boolean, with the default value being `null` when we don't want to delete) - should this be the case? In my mind it should be dropped by hoodie as it makes no sense for it to exist as `null` or `false` on every row in the Hudi table.





[GitHub] [hudi] Limess edited a comment on issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Posted by GitBox <gi...@apache.org>.
Limess edited a comment on issue #4031:
URL: https://github.com/apache/hudi/issues/4031#issuecomment-976248107


   > You are seeing a behavior where, when "_hoodie_is_deleted" is set to null or false, Hudi persists this column on storage. And you are asking why we need to do this and why not just drop the column altogether?
   
   Yes, that's largely the question. We assumed it would be dropped, as the deleted records are not persisted and it's otherwise redundant, and there already seem to be code paths to drop redundant columns (e.g. `hoodie.datasource.write.drop.partition.columns`).
   
   We were also caught out when we used a string value by mistake. This ended up being written to the end datastore, which then broke our schema in a seemingly non-recoverable way (as it was written to the table, and now we had a schema type change from `string`->`boolean` when trying to write the correct value which wasn't obviously compatible). 
   
   I'd suggest:
   * Possibly dropping the column (though, as you say, that may have little benefit). If not, documenting the behaviour somewhere. Alternatively, always including the column, along with the other Hudi metadata fields which are already prepended to the written schema.
   * If the column is not a boolean:
   	* Failing hard, as this column is essentially "reserved" for Hudi
   	* Taking `IS NOT NULL` as truthy





[GitHub] [hudi] nsivabalan closed issue #4031: [SUPPORT] _hoodie_is_deleted should work with any truthy value

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4031:
URL: https://github.com/apache/hudi/issues/4031


   

