Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/13 15:59:45 UTC

[GitHub] [hudi] the-other-tim-brown commented on pull request #7640: [HUDI-5514] Add in support for a keyless workflow

the-other-tim-brown commented on PR #7640:
URL: https://github.com/apache/hudi/pull/7640#issuecomment-1382053922

   > Hi @the-other-tim-brown, I'm interested in this functionality and have some questions. If I understand correctly, the UUID will be the same for the same set of values in the columns it's based on?
   > 
   > So this generator can't be used for generating a surrogate key (a standard practice in data warehousing), since the key is derived from the data? My understanding of the keyless model is that the record key is a surrogate key that's globally unique.
   > 
   > I'm wondering if there's something that prevents creating globally unique ids via the key generator interface (maybe virtual keys support)? At the same time, in the context of this PR, what's the place of [UuidKeyGenerator](https://github.com/apache/hudi/blob/41a9986a7641f3232b1edd2a737fd4b7aa430dbf/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/UuidKeyGenerator.scala)? Could it be used to generate surrogate keys that are globally unique?
   
   Yes, that's correct: the keys are not guaranteed to be unique here. The problem with random UUIDs for us was that we were using DeltaStreamer, and whenever the DAG was retriggered, the same data was regenerated with new random UUIDs. That could cause the records to be written to different file groups, leading to duplicate or lost data because of some internals of how Hudi works. @nsivabalan had some similar thoughts around other approaches, can you chime in here?
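
To illustrate the trade-off, here is a minimal sketch of data-derived versus random keys. It is not the code in this PR and not Hudi's API: `deterministic_key`, `random_key`, the `NAMESPACE` value, and the sample record are all hypothetical names made up for the example, and it uses Python's standard `uuid` module rather than anything from Hudi.

```python
import uuid

# Hypothetical namespace for this illustration only; Hudi does not define it.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example.invalid/keyless-demo")


def deterministic_key(record, columns):
    """Derive a name-based (version 5) UUID from the selected column values.

    The same values always produce the same key, so a retried run
    regenerates identical keys for identical input.
    """
    # Join name=value pairs in a fixed column order with a separator that is
    # unlikely to appear in the data.
    payload = "\x1f".join(f"{col}={record[col]!r}" for col in columns)
    return str(uuid.uuid5(NAMESPACE, payload))


def random_key(record):
    """Generate a random (version 4) UUID that ignores the record contents."""
    return str(uuid.uuid4())


columns = ["user", "ts", "amount"]
record = {"user": "alice", "ts": 1000, "amount": 12.5}

# A retry over the same input yields the same data-derived key...
first_run = deterministic_key(record, columns)
retry_run = deterministic_key(record, columns)

# ...but a different random key each time.
first_random = random_key(record)
retry_random = random_key(record)

# Two distinct records with identical values collide on the derived key.
twin = dict(record)
twin_key = deterministic_key(twin, columns)
```

In this sketch `first_run` and `retry_run` match, which is the property we wanted for retriggered runs, while `twin_key` also matches, which is why the derived key cannot serve as a globally unique surrogate key.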

