Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2020/01/07 08:33:17 UTC

Re: [DISCUSS] Simplification of terminologies

Howdy all,

I have written up a first full version of
https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture,
based on the style set by Nick on the cWiki.
Please contribute to making this more elaborate.

Special thanks to Nick for introducing a great way to fully leverage the
power of the wiki, and for the patience in working together on seeding this.

Moving on to cleaning up the code classes and docs.

On Wed, Nov 13, 2019 at 9:24 PM Vinoth Chandar <vi...@apache.org> wrote:

> Will review the POC in cwiki.  +1
>
> Based on this feedback, I will proceed with the changes. Thanks all!
>
>
>
> On Tue, Nov 12, 2019 at 10:47 PM Semantic Beeng <ni...@semanticbeeng.com>
> wrote:
>
>> @vc, I think of it as elaborating the #ubiquitouslanguage in DDD.
>> See private email with references to a small POC in wiki and decide how
>> to proceed.
>>
>> On November 12, 2019 at 10:04 PM Vinoth Chandar < vinoth@apache.org>
>> wrote:
>>
>>
>> Thanks everyone for the feedback. Looks like we are in general agreement.
>>
>> I am inclined to just do 1 & 2 and leave COPY_ON_WRITE as is, based on the
>> great points Ethan and Shiyan raised. Makes sense.
>> I will still wait 1-2 days before closing this thread.
>>
>> @semanticbeeing That's a great idea. Is it more like a technical glossary of
>> sorts? Maybe let's start a different DISCUSS thread on that specific topic,
>> so everyone can chime in and give that proposal more attention?
>>
>>
>>
>>
>>
>> On Tue, Nov 12, 2019 at 2:44 PM Y. Ethan Guo < guoyihua@uber.com.invalid>
>> wrote:
>>
>> +1 on [1] and [2].
>>
>> For [3], I have similar doubts as Shiyan.
>>
>> For the naming, I can understand the original intent of the analogy for COW,
>> which is to make another "copy" of the columnar/parquet file upon
>> modification/update of the records in the file. From the system design
>> point of view, it's easy to understand. I'm okay with the renaming to
>> "MERGE_ON_WRITE" since it's probably straightforward for users at first
>> glance.
>>
>> In terms of the concept, COW and MOR are listed as storage/table types.
>> From my understanding, they represent different tradeoffs between read and
>> write performance for Hudi tables, and within MOR there
>> are different tradeoffs, e.g., lazy merge on read or periodic compaction
>> and cleaning pipelined along ingestion. It looks like these could be
>> controlled through configs, e.g., "disable_merge_on_write",
>> "compaction_frequency", etc., instead of fixing the storage type, to
>> control the tradeoff that a user would like to make. Requirements may
>> change, so a user could switch between COW and MOR by tuning the configs. We
>> don't have to make such changes now, but I'm wondering if this is something
>> worth considering in future releases.
>>
>> - Ethan
>>
>> On Tue, Nov 12, 2019 at 8:43 AM nishith agarwal < n3.nash29@gmail.com>
>> wrote:
>>
>> +1 on the first two, don't feel strongly about (3).
>>
>> Thanks,
>> Nishith
>>
>> On Tue, Nov 12, 2019 at 5:03 AM leesf < leesf0315@gmail.com> wrote:
>>
>> [1] +1. `views` indeed confused me a lot.
>> [2] +1. `snapshot` is more reasonable.
>> [3] I don't feel strongly about renaming it. The current name `COPY_ON_WRITE`
>> is reasonable, considering the cost of renaming and the behavior: a new
>> version of the parquet file is created and appears to be copied from the old
>> version.
>>
>> Best,
>> Leesf
>>
>> Balaji Varadarajan < vbalaji@apache.org> wrote on Tue, Nov 12, 2019 at 3:55 PM:
>>
>> Agree with all 3 changes. The naming now looks more consistent than
>> earlier. +1 on them.
>>
>> Depending on whether we are renaming Input formats for (1) and (2) - this
>> could require some migration steps for
>>
>> Balaji.V
>>
>> On Mon, Nov 11, 2019 at 7:38 PM vino yang < yanghua1127@gmail.com> wrote:
>>
>> Hi Vinoth,
>>
>> Thanks for bringing these proposals.
>>
>> +1 on all three. Especially, big +1 on the third renaming proposal.
>>
>> When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The word
>> "copy" easily misleads users, and makes them compare it with the
>> `CopyOnWriteArrayList` data structure provided by the JDK and think of
>> file systems.
>>
>> Best,
>> Vino
>>
>> Bhavani Sudha < bhavanisudhas@gmail.com> wrote on Tue, Nov 12, 2019 at 9:05 AM:
>>
>> +1 on all three rename proposals. I think this would make the concepts
>> super easy to follow for new users.
>>
>> If changing [3] seems to be a stretch, we should definitely do [1] & [2] at
>> the least, IMO. I will be glad to help out on the renames to whatever extent
>> possible, should the Hudi community be inclined to pursue this.
>>
>> Thanks,
>> Sudha
>>
>> On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar < vinoth@apache.org> wrote:
>>
>> Hello all,
>>
>> I wanted to raise an important topic with the community: whether we
>> should rename some of our terminologies in code/docs to be more
>> user-friendly and understandable.
>>
>> Let me also provide some context for each, since I am probably guilty of
>> introducing most of them in the first place :).
>>
>> *1. Rename "views" to "query" :* Instead of saying incremental view or
>> read-optimized view, talk about them as "incremental query" and
>> "read-optimized query". The term "view" is very technical, and what I was
>> trying to convey was that we ingest/store the data once and expose views on
>> top. But new users (at least half a dozen of them, to me) tend to confuse this
>> with views/materialized views found in databases. We almost always talk
>> about views in terms of the expected behavior of a query on the view. I
>> am proposing to just call these different query types, since it's a more
>> universally accepted terminology and IMO clearer.
>>
>> *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
>> Read-Optimized view only for MOR storage :* This one is probably the
>> trickiest. Hudi was always designed with MOR in mind, even as we were
>> working on COW storage, and consequently we named the pure parquet backed
>> view Read-Optimized, hoping to name the parquet + avro based view
>> Write-Optimized. However, we opted to name it Realtime to emphasize the
>> data freshness aspect. In retrospect, the views should not have been named
>> after their performance characteristics, but rather after the classes of
>> queries done on them and the guarantees for those (point #1 above). Moreover,
>> once we have parquet embedded into the log format, the tradeoffs may not be
>> the same anyway.
>>
>> So combining with the renaming proposed in #1, we would end up with the
>> following:
>>
>> Copy-On-Write:
>> [Old] Read-Optimized View => [New] Snapshot Query
>> [Old] Incremental View => [New] Incremental Query
>>
>> Merge-On-Read:
>> [Old] Realtime View => [New] Snapshot Query
>> [Old] Incremental View => [New] Incremental Query
>> [Old] Read-Optimized View => [New] Read-Optimized Query (since it is always
>> read optimized compared to the Snapshot query, at the cost of staler data)
>>
>> Both changes #1 & #2 could be simpler changes to just code references, docs
>> and configs. We can support both strings for some time and deprecate the old
>> ones eventually, since queries are stateless.
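
A minimal sketch of how both the old and new strings from proposals #1 and #2 might be accepted during such a deprecation window. The class, enum, and method names here are hypothetical and chosen only for illustration; they are not Hudi's actual API or config keys. The mapping itself follows the old/new table in the quoted mail.

```java
import java.util.Locale;

// Hypothetical helper: resolves an old ("view") or new ("query") name
// to the new query type, following the mapping proposed in this thread.
final class QueryTypeAliases {
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }
    enum QueryType { SNAPSHOT, INCREMENTAL, READ_OPTIMIZED }

    static QueryType resolve(TableType tableType, String name) {
        // Normalize so "Read-Optimized" and "read_optimized" are treated alike.
        String key = name.trim().toLowerCase(Locale.ROOT).replace('-', '_');
        switch (key) {
            case "snapshot":
                return QueryType.SNAPSHOT;
            case "incremental":
                return QueryType.INCREMENTAL;
            case "realtime":
                // Old name; only Merge-On-Read had a Realtime view.
                if (tableType == TableType.MERGE_ON_READ) {
                    return QueryType.SNAPSHOT;
                }
                throw new IllegalArgumentException(
                    "realtime is not defined for " + tableType);
            case "read_optimized":
                // On Copy-On-Write the old Read-Optimized view becomes Snapshot;
                // on Merge-On-Read it stays Read-Optimized.
                return tableType == TableType.COPY_ON_WRITE
                    ? QueryType.SNAPSHOT
                    : QueryType.READ_OPTIMIZED;
            default:
                throw new IllegalArgumentException("Unknown query type: " + name);
        }
    }
}
```

Because the lookup is a pure function of the table type and the supplied string, accepting the old names needs no migration of stored state, which is consistent with the point that queries are stateless.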
>>
>> *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated since the
>> design was very similar to
>> https://en.wikipedia.org/wiki/Copy-on-write
>> filesystems & snapshotting, and we once hoped to push some of this logic into
>> the storage itself, all in vain. But the name stuck, even though once we had
>> MERGE_ON_READ the focus was often on merge costs etc., which the name
>> COPY_ON_WRITE does not convey directly. I don't feel very strongly about this,
>> and there is also a cost to changing it, since it's persisted inside
>> hoodie.properties, and we will support both strings internally in code for
>> backwards compatibility anyway.
>>
>> Naming something is very hard (yes, try :)). I believe these changes will
>> make the project simpler to understand for everyone out there. We also have
>> tons of new people here, so I am also happy to let go, if it's already clear
>> :)
>>
>> Please use the bullet number when you share your feedback so we know what
>> the discussion is about.
>>
>> Thanks
>> Vinoth
>>
>>