Posted to commits@hudi.apache.org by "Pramod Biligiri (Jira)" <ji...@apache.org> on 2022/10/07 13:33:00 UTC

[jira] [Updated] (HUDI-4994) DatahubSyncTool does not correctly re-ingest soft-deleted entities

     [ https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pramod Biligiri updated HUDI-4994:
----------------------------------
    Description: 
Datahub has a notion of soft deletes: the entity still exists in the database, with its status set to removed:true. Such an entity can be re-ingested later with new properties, overwriting the older version. The current implementation in DatahubSyncTool does not handle this scenario: it does not reset the status flag to removed:false during ingest, so the re-ingested entity does not surface in the Datahub UI at all.

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]

  was:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request through Datahub's RestEmitter client, it can be assumed that the entity is no longer considered deleted and needs to be discoverable in the Datahub UI from then on.

To achieve that, the "status" metadata aspect of the entity must be explicitly set to "{'removed':false}". This handles the case where the entity was soft-deleted in the past. Adding "removed:false" to the "status" aspect has no impact on newly created entities or on hard-deleted entities (of which no trace remains anyway).
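
As a rough illustration of the idea above (not the actual DatahubSyncTool code), the sketch below appends a "status" upsert carrying removed:false to whatever aspects are being synced for an entity. The helper names, the dictionary layout (modelled loosely on a Datahub metadata change proposal), and the example URN are assumptions made for this sketch only.

```python
import json


def status_proposal(entity_urn, entity_type="dataset"):
    """Build a proposal-like dict that upserts the "status" aspect as not removed.

    The field names are an assumed, simplified layout for illustration;
    they are not taken from DatahubSyncTool.
    """
    return {
        "entityType": entity_type,
        "entityUrn": entity_urn,
        "changeType": "UPSERT",
        "aspectName": "status",
        "aspect": {
            "contentType": "application/json",
            # Serialized aspect body: clears any earlier soft delete.
            "value": json.dumps({"removed": False}),
        },
    }


def proposals_for_sync(entity_urn, property_proposals):
    """Return the proposals to emit: the entity's own aspects plus the status reset."""
    return list(property_proposals) + [status_proposal(entity_urn)]


if __name__ == "__main__":
    # Hypothetical URN, used only to exercise the helpers.
    urn = "urn:li:dataset:(urn:li:dataPlatform:hudi,db.tbl,PROD)"
    for proposal in proposals_for_sync(urn, [{"aspectName": "datasetProperties"}]):
        print(proposal["aspectName"])
```

Emitting the status reset on every sync keeps the logic stateless: the tool does not need to look up whether the entity was previously soft-deleted, which matches the observation that the extra aspect is harmless for new or hard-deleted entities.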

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default

        Summary: DatahubSyncTool does not correctly re-ingest soft-deleted entities  (was: DatahubSyncTool should set "removed" status of an entity to false when updating it)

> DatahubSyncTool does not correctly re-ingest soft-deleted entities
> ------------------------------------------------------------------
>
>                 Key: HUDI-4994
>                 URL: https://issues.apache.org/jira/browse/HUDI-4994
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: meta-sync
>            Reporter: Pramod Biligiri
>            Priority: Major
>              Labels: pull-request-available
>
> Datahub has a notion of soft deletes: the entity still exists in the database, with its status set to removed:true. Such an entity can be re-ingested later with new properties, overwriting the older version. The current implementation in DatahubSyncTool does not handle this scenario: it does not reset the status flag to removed:false during ingest, so the re-ingested entity does not surface in the Datahub UI at all.
> Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)