Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2019/10/02 15:33:01 UTC

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

Based on some conversations I had with Flink folks, including Hudi's very
own mentor Thomas, it seems future-proof to look into supporting the Flink
streaming APIs. The batch APIs, IIUC, will move towards converging with the
streaming APIs, which matches Hudi's model anyway.

From Hudi's perspective, the following is the major work we need to do to
pave the way (it took some time to distill it down).
Most issues with supporting Flink come from our extensive use of Spark
caching in two places:

a) HoodieBloomIndex : To achieve performance and scale, we cache the input
RDD for indexing. We make two passes over the input, and without the caching
Spark would recompute the RDD, which of course won't scale. (A sketch of the
two-pass pattern follows below.)
b) WorkloadProfile : After indexing, we compute the number of inserts and
updates so that we can size files and lay out data in the right order, etc.
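
To make (a) concrete, here is a minimal sketch of the two-pass pattern, with
illustrative types only (the real DAG lives in HoodieBloomIndex):

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;
    import scala.Tuple2;

    public class TwoPassSketch {
      // input: (recordKey, partitionPath) pairs for the incoming batch
      public static JavaRDD<Tuple2<String, String>> tag(
          JavaRDD<Tuple2<String, String>> input) {
        // Cache once; both passes below re-traverse the same RDD.
        input.persist(StorageLevel.MEMORY_AND_DISK_SER());

        // Pass 1: which partitions are touched (drives bloom filter loading).
        List<String> partitions =
            input.map(Tuple2::_2).distinct().collect();

        // Pass 2: check keys against candidate files (bloom checks elided).
        // Without persist(), Spark recomputes 'input' from source here.
        return input.filter(kv -> partitions.contains(kv._2()));
      }
    }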

We need projects to:

a) Build a new index, based on something like the HashIndex proposal here:
https://github.com/apache/incubator-hudi/wiki/HashMap-Index . I have more
ideas that I plan to dump into a HIP. But if someone can drive this, happy
to partner. (A rough sketch of the lookup contract follows below.)
b) Look into an efficient way of knowing the number of inserts and updates
(and which file groups they touch) across partitions. I have only early
ideas. Need some help here as well coming to a solution.
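
Just to seed discussion on (a), a rough shape of the hash-index idea from the
wiki link above (all names hypothetical; a ConcurrentHashMap stands in for
whatever durable key-to-file mapping we end up with):

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    public class HashIndexSketch {
      private final Map<String, String> keyToFileId = new ConcurrentHashMap<>();

      // tagLocation: one point lookup per record key; no second pass over
      // the input, hence no Spark-style caching requirement.
      public Optional<String> tagLocation(String recordKey) {
        return Optional.ofNullable(keyToFileId.get(recordKey));
      }

      // After a write, remember where each key landed.
      public void update(String recordKey, String fileId) {
        keyToFileId.put(recordKey, fileId);
      }
    }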

Both a & b need some groundwork/cleanup. IIUC Balaji is already working
on some of it.

If we can have volunteers for each of these areas, we can get underway.



On Thu, Sep 26, 2019 at 10:13 AM Vinoth Chandar <vi...@apache.org> wrote:

> I'd kind of expect it's not as fast today. But from the keynote at Flink
> Forward it seemed like it's about to get a lot better, and we can definitely
> help by bringing Hudi goodness to Flink users.
>
> Others, what do you think? I think this is a good discussion to have
> upfront. For e.g., the needs of streaming performance may require a very
> different design..
>
> On Thu, Sep 26, 2019 at 9:56 AM Taher Koitawala <ta...@gmail.com>
> wrote:
>
>> Hi Vinoth,
>>       IMHO we should stick to Spark for micro-batching, for 2 reasons. 1:
>> Ease of use. 2: Performance. Flink batch is not as fast as Spark. Also, the
>> rich library of functions and the ease of integration that Spark has with
>> Hive etc. is not there in Flink batch.
>>
>>
>> Regards,
>> Taher Koitawala
>>
>>
>>
>> On Thu, Sep 26, 2019, 10:11 PM Vinoth Chandar <vi...@apache.org> wrote:
>>
>> > How mature is Flink batch? I am wondering if we can go with the Flink
>> > Batch APIs; Hudi itself will provide micro-batching semantics (incremental
>> > pull, upserts)?
>> > For true streaming performance, I am not sure even cloud stores are
>> > ready, since none of them support append()s.
>> > IMHO a mini/micro-batch model makes sense for the kind of space Hudi
>> > solves.. We can keep working towards making this better.
>> >
>> > I have some thoughts on indexing, a new way to index that complements
>> > BloomIndex and overcomes some of the Spark caching needs.
>> > Will write it down over the weekend and share it..
>> >
>> > P.S: I am also wondering if we should generalize HIPs into an RFC
>> > process [1]; that way it does not have to be a real technical proposal,
>> > and we can still have structured conversations on the future roadmap etc.
>> > and evolve the shape together..
>> > (Maybe this deserves its own separate DISCUSS thread)
>> >
>> >
>> > [1] - https://en.wikipedia.org/wiki/Request_for_Comments
>> >
>> >
>> > On Wed, Sep 25, 2019 at 9:07 PM Taher Koitawala <ta...@gmail.com>
>> > wrote:
>> >
>> > > Hi Vino,
>> > >          Agree with your suggestion. We all know that even though Flink
>> > > is streaming, we can control how files get rolled out through
>> > > checkpointing configurations. With a bad config, small files get rolled
>> > > out; with a good config, files are properly sized.
>> > >
>> > >      Also, I understand the concern about reading files with Flink and
>> > > the performance related to it (I have faced it before). So how about we
>> > > build our own functions which can read and write efficiently, not as a
>> > > source function but as an operator! What I mean is, let's use the
>> > > Akka-based semantics of passing messages between these operators and
>> > > read what is required.
>> > >
>> > > Have one custom stream operator that takes requests for reading files
>> > > (again, that is not a source function, it is an operator); that operator
>> > > reads the file and passes it downstream in a parallel manner. (Maybe the
>> > > AsyncIO extension can be a better call here; a rough sketch follows
>> > > below.) Let me know your thoughts on this, i.e. if we choose this,
>> > > everything will be written on the core DataStream API.
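>> > >
>> > > A rough sketch of that AsyncIO idea (purely illustrative, not existing
>> > > Hudi code; readLines() here just stands in for whatever parquet/log
>> > > reading we actually need):
>> > >
>> > >   import java.io.IOException;
>> > >   import java.io.UncheckedIOException;
>> > >   import java.nio.file.Files;
>> > >   import java.nio.file.Paths;
>> > >   import java.util.Collection;
>> > >   import java.util.concurrent.CompletableFuture;
>> > >   import org.apache.flink.streaming.api.functions.async.ResultFuture;
>> > >   import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
>> > >
>> > >   // Operator that reads a file per incoming request, off the main thread.
>> > >   public class FileReadFunction extends RichAsyncFunction<String, String> {
>> > >     @Override
>> > >     public void asyncInvoke(String path, ResultFuture<String> out) {
>> > >       CompletableFuture.supplyAsync(() -> readLines(path))
>> > >           .thenAccept(out::complete);
>> > >     }
>> > >
>> > >     private static Collection<String> readLines(String path) {
>> > >       try {
>> > >         return Files.readAllLines(Paths.get(path));
>> > >       } catch (IOException e) {
>> > >         throw new UncheckedIOException(e);
>> > >       }
>> > >     }
>> > >   }
>> > >
>> > >   // Wiring, with bounded in-flight reads (fragment):
>> > >   // AsyncDataStream.unorderedWait(paths, new FileReadFunction(),
>> > >   //     30, TimeUnit.SECONDS, 100);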
>> > >
>> > > As far as the Flink batch and stream talk goes, I guess as a community
>> > > we have already agreed that for batch the Spark engine is good enough
>> > > and streams will be powered by Flink, whereas Beam will do both. By the
>> > > time Flink's unification of batch and stream is complete, our code will
>> > > be batch compatible with minimal changes.
>> > >
>> > > Further, we need to plan what part of the Flink API we will use: should
>> > > we stick to the core DataStream API, or should we merge it with the
>> > > Table API (I think we should)? That way we get the append sink, retract
>> > > sink, temporal tables, and the capability to use the new Blink table
>> > > planner. It would also, to some extent, lift our work of reading files,
>> > > since the Table API I believe is good at doing those things and comes
>> > > with in-built readers and writers. So basically, if we use that, we
>> > > could also do in-stream joins of the stream source and the Hudi files to
>> > > recompute upserts etc. AFAIK the Flink Table API is also getting a
>> > > .cache()-like functionality now. (Saw conversations about it in the
>> > > mailing list.)
>> > >
>> > > So I think we really need to start planning such things to move ahead.
>> > > Other than that, I'm 100% with Vino that we cannot read files any other
>> > > way in Flink; it would be really bad to do that.
>> > >
>> > > Regards,
>> > > Taher Koitawala
>> > >
>> > > On Thu, Sep 26, 2019, 8:47 AM vino yang <ya...@gmail.com>
>> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > A simple example: in the Hudi project, you can find many code snippets
>> > > > like `spark.read().format().load()`. The load method can take any
>> > > > path, especially DFS paths.
>> > > >
>> > > > Whereas if we only want to use Flink streaming, there is no good way
>> > > > to read HDFS now.
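>> > > >
>> > > > (For contrast, a hypothetical Flink-side equivalent of that snippet,
>> > > > using the batch DataSet API; the path is made up. This illustrates the
>> > > > point that the convenient read path today is batch, not streaming:)
>> > > >
>> > > >   import org.apache.flink.api.java.ExecutionEnvironment;
>> > > >
>> > > >   public class BatchReadExample {
>> > > >     public static void main(String[] args) throws Exception {
>> > > >       ExecutionEnvironment env =
>> > > >           ExecutionEnvironment.getExecutionEnvironment();
>> > > >       // DataSet API read; no equally convenient streaming counterpart
>> > > >       env.readTextFile("hdfs:///some/path").first(10).print();
>> > > >     }
>> > > >   }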
>> > > >
>> > > > In addition, we also need to consider other capability differences
>> > > > between Flink and Spark. You should know the Spark API
>> > > > (non-Structured-Streaming mode) can support both streaming
>> > > > (micro-batch) and batch. However, Flink distinguishes them with two
>> > > > different APIs, which have different feature sets.
>> > > >
>> > > >
>> > > >
>> > > > On 09/25/2019 13:15, Semantic Beeng <ni...@semanticbeeng.com> wrote:
>> > > >
>> > > > Hi Vino,
>> > > >
>> > > > Would you be so kind as to start a wiki page to discuss this deep
>> > > > understanding of the functionality and design of Hudi?
>> > > >
>> > > > There you can put git links (https://github.com/ben-gibson/GitLink
>> > > > for IntelliJ) and design knowledge so we can discuss in context.
>> > > >
>> > > > I am exploring the approach from this retweet
>> > > > https://twitter.com/semanticbeeng/status/1176241250967666689?s=20
>> > > > and need this understanding you have.
>> > > >
>> > > > "difficult to ignore Flink Batch API to match some features provide
>> by
>> > > > Hudi now" - can you please post there some gitlinks to this?
>> > > >
>> > > > Thanks
>> > > >
>> > > > Nick
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On September 24, 2019 at 10:22 PM vino yang <ya...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > Hi Taher,
>> > > >
>> > > > As I mentioned in the previous mail, things may not be too easy with
>> > > > just the Flink state API.
>> > > >
>> > > > Copied here: "Hudi can connect with many different Source/Sinks. Some
>> > > > file-based reads are not appropriate for Flink Streaming."
>> > > >
>> > > > Although unifying batch and streaming is Flink's goal, it is difficult
>> > > > to ignore the Flink Batch API for matching some features provided by
>> > > > Hudi now.
>> > > >
>> > > > The example you provided is at the application layer, about usage. So
>> > > > my suggestion is to be patient; it needs time to produce a detailed
>> > > > design.
>> > > >
>> > > > Best,
>> > > > Vino
>> > > >
>> > > >
>> > > >
>> > > > On 09/24/2019 22:38, Taher Koitawala <ta...@gmail.com> wrote:
>> > > > Hi All,
>> > > >              Sample code to see how record tagging will be handled in
>> > > > Flink is posted at [1]. The main class to run it is MockHudi.java,
>> > > > with a sample path for checkpointing.
>> > > >
>> > > > As of now, this is just a sample to show how we should be caching in
>> > > > Flink state with bare-minimum configs.
>> > > >
>> > > >
>> > > > In my experience, I have cached around 10s of TBs in Flink RocksDB
>> > > > state with the right configs. So I'm sure it should work here as well.
>> > > >
>> > > > 1:
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
>> > > >
>> > > > Regards,
>> > > > Taher Koitawala
>> > > >
>> > > >
>> > > > On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar <vi...@apache.org>
>> > wrote:
>> > > >
>> > > > > It won't be much different from the HBaseIndex we have today. Would
>> > > > > like to always have an option like BloomIndex that does not need any
>> > > > > external dependencies.
>> > > > > The moment you bring an external data store in, someone becomes a
>> > > > > DBA. :)
>> > > > >
>> > > > > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng <
>> > nick@semanticbeeng.com
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > @vc can you see how Apache Crail could be used to implement this
>> > > > > > at scale, but also in a way that abstracts over both Spark and
>> > > > > > Flink?
>> > > > > >
>> > > > > > "Crail Store implements a hierarchical namespace across a
>> cluster
>> > of
>> > > > RDMA
>> > > > > > interconnected storage resources such as DRAM or flash"
>> > > > > >
>> > > > > > https://crail.incubator.apache.org/overview/
>> > > > > >
>> > > > > > + 2 cents
>> > > > > >
>> https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
>> > > > > >
>> > > > > > Cheers
>> > > > > >
>> > > > > > Nick
>> > > > > >
>> > > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar <
>> vinoth@apache.org
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > >
>> > > > > > It could be much larger. :) imagine billions of keys, each 32
>> > > > > > bytes, mapped to another 32 bytes (1 billion keys x 64 bytes is
>> > > > > > already ~64 GB before any overhead).
>> > > > > >
>> > > > > > The advantage of the current bloom index is that it's effectively
>> > > > > > stored with the data itself, and this reduces complexity in terms
>> > > > > > of keeping the index and data consistent, etc.
>> > > > > >
>> > > > > > One orthogonal idea from a long time ago that moves indexing out
>> > > > > > of data storage and is generalizable:
>> > > > > >
>> > > > > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
>> > > > > >
>> > > > > > If someone here knows Flink well and can implement some standalone
>> > > > > > Flink code to mimic the tagLocation() functionality and share it
>> > > > > > with the group, that would be great. Let's worry about performance
>> > > > > > once we have a Flink DAG. I think this is the most critical and
>> > > > > > tricky piece in supporting Flink. (A strawman sketch follows
>> > > > > > below.)
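>> > > > > >
>> > > > > > (Strawman of what that standalone code could look like -- purely
>> > > > > > illustrative, not working Hudi code. TaggedRecord is a made-up
>> > > > > > POJO of (key, payload, fileId-or-null); keyed Flink state stands
>> > > > > > in for what HoodieBloomIndex derives from bloom filters.)
>> > > > > >
>> > > > > >   import org.apache.flink.api.common.state.ValueState;
>> > > > > >   import org.apache.flink.api.common.state.ValueStateDescriptor;
>> > > > > >   import org.apache.flink.configuration.Configuration;
>> > > > > >   import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
>> > > > > >   import org.apache.flink.util.Collector;
>> > > > > >
>> > > > > >   // made-up record type standing in for a Hudi record
>> > > > > >   class TaggedRecord { String key; String payload; String fileId; }
>> > > > > >
>> > > > > >   public class TagLocationFn
>> > > > > >       extends KeyedProcessFunction<String, TaggedRecord, TaggedRecord> {
>> > > > > >     // fileId this record key was last written to, or null if unseen
>> > > > > >     private transient ValueState<String> fileId;
>> > > > > >
>> > > > > >     @Override
>> > > > > >     public void open(Configuration conf) {
>> > > > > >       fileId = getRuntimeContext().getState(
>> > > > > >           new ValueStateDescriptor<>("fileId", String.class));
>> > > > > >     }
>> > > > > >
>> > > > > >     @Override
>> > > > > >     public void processElement(TaggedRecord rec, Context ctx,
>> > > > > >         Collector<TaggedRecord> out) throws Exception {
>> > > > > >       rec.fileId = fileId.value(); // null => insert, else => update
>> > > > > >       out.collect(rec);
>> > > > > >     }
>> > > > > >   }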
>> > > > > >
>> > > > > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <
>> > vinay18.patil@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > Hi Taher,
>> > > > > >
>> > > > > > I agree with this; if the state is becoming too large, we should
>> > > > > > have the option of storing it in an external state backend like
>> > > > > > the file system or RocksDB.
>> > > > > >
>> > > > > > @Vinoth Chandar <vi...@apache.org> can the state of
>> > > > > > HoodieBloomIndex go beyond 10-15 GB?
>> > > > > >
>> > > > > > Regards,
>> > > > > > Vinay Patil
>> > > > > >
>> > > > > > >
>> > > > > >
>> > > > > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <
>> > taherk77@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > >> Hey guys, any thoughts on the above idea? To handle
>> > > > > > >> HoodieBloomIndex with HeapState, RocksDBState and FsState, but
>> > > > > > >> on Spark.
>> > > > > > >>
>> > > > > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <
>> > > taherk77@gmail.com>
>> > > >
>> > > > > > >> wrote:
>> > > > > > >>
>> > > > > > >> > Hi Vinoth,
>> > > > > > >> > Having seen the doc and code, I understand that
>> > > > > > >> > HoodieBloomIndex mainly caches the key and partition path.
>> > > > > > >> > Can we address it the way Flink does? Like, have HeapState
>> > > > > > >> > where the user chooses to cache the index on heap,
>> > > > > > >> > RocksDBState where indexes are written to RocksDB, and
>> > > > > > >> > finally FsState where indexes can be written to HDFS, S3, or
>> > > > > > >> > Azure FS. And on top, we can do an index time-to-live. (A
>> > > > > > >> > sketch follows below.)
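>> > > > > > >> >
>> > > > > > >> > (Sketch of that idea using Flink's own state TTL support -- a
>> > > > > > >> > hypothetical fragment, not Hudi code; the descriptor name and
>> > > > > > >> > checkpoint path are made up:)
>> > > > > > >> >
>> > > > > > >> >   import org.apache.flink.api.common.state.StateTtlConfig;
>> > > > > > >> >   import org.apache.flink.api.common.state.ValueStateDescriptor;
>> > > > > > >> >   import org.apache.flink.api.common.time.Time;
>> > > > > > >> >   import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>> > > > > > >> >   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>> > > > > > >> >
>> > > > > > >> >   public class IndexStateConfig {
>> > > > > > >> >     public static void configure(StreamExecutionEnvironment env)
>> > > > > > >> >         throws Exception {
>> > > > > > >> >       ValueStateDescriptor<String> desc =
>> > > > > > >> >           new ValueStateDescriptor<>("hoodie-index", String.class);
>> > > > > > >> >       // Index entries expire 7 days after create/update.
>> > > > > > >> >       desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(7))
>> > > > > > >> >           .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
>> > > > > > >> >           .cleanupFullSnapshot()
>> > > > > > >> >           .build());
>> > > > > > >> >       // Heap vs RocksDB (fs-backed) is then a backend choice:
>> > > > > > >> >       env.setStateBackend(
>> > > > > > >> >           new RocksDBStateBackend("hdfs:///flink/checkpoints"));
>> > > > > > >> >     }
>> > > > > > >> >   }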
>> > > > > > >> >
>> > > > > > >> > Regards,
>> > > > > > >> > Taher Koitawala
>> > > > > > >> >
>> > > > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <
>> > > > vinoth@apache.org>
>> > > > > > >> wrote:
>> > > > > > >> >
>> > > > > > >> >> I still feel the key thing here is reimplementing
>> > > > > > >> >> HoodieBloomIndex without needing Spark caching.
>> > > > > > >> >>
>> > > > > > >> >>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
>> > > > > > >> >> documents the Spark DAG in detail.
>> > > > > > >> >>
>> > > > > > >> >> If everyone feels it's best for me to scope the work out,
>> > > > > > >> >> then happy to do it!
>> > > > > > >> >>
>> > > > > > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <
>> > > > > taherk77@gmail.com
>> > > > > > >
>> > > > > > >> >> wrote:
>> > > > > > >> >>
>> > > > > > >> >> > Guys, I think we are slowing down on this again. We need
>> > > > > > >> >> > to start planning small tasks towards this. VC, please can
>> > > > > > >> >> > you help fast-track this?
>> > > > > > >> >> >
>> > > > > > >> >> > Regards,
>> > > > > > >> >> > Taher Koitawala
>> > > > > > >> >> >
>> > > > > > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <
>> > > > vinoth@apache.org
>> > > > > >
>> > > > > > >> >> wrote:
>> > > > > > >> >> >
>> > > > > > >> >> > > Look forward to the analysis. A key class to read would
>> > > > > > >> >> > > be HoodieBloomIndex, which uses a lot of Spark caching
>> > > > > > >> >> > > and shuffles.
>> > > > > > >> >> > >
>> > > > > > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <
>> > > > > yanghua1127@gmail.com
>> > > > > > >
>> > > > > > >> >> wrote:
>> > > > > > >> >> > >
>> > > > > > >> >> > > > >> Currently Spark Streaming micro-batching fits well
>> > > > > > >> >> > > > with Hudi, since it amortizes the cost of indexing,
>> > > > > > >> >> > > > workload profiling etc. 1 Spark micro-batch = 1 Hudi
>> > > > > > >> >> > > > commit.
>> > > > > > >> >> > > > With the per-record model in Flink, I am not sure how
>> > > > > > >> >> > > > useful it will be to support Hudi.. e.g., 1 input
>> > > > > > >> >> > > > record cannot be 1 Hudi commit; it will be
>> > > > > > >> >> > > > inefficient..
>> > > > > > >> >> > > >
>> > > > > > >> >> > > > Yes, if 1 input record = 1 Hudi commit, it would be
>> > > > > > >> >> > > > inefficient. As for Flink streaming, we can also
>> > > > > > >> >> > > > implement the "batch" and "micro-batch" model when
>> > > > > > >> >> > > > processing data. For example (see the sketch below):
>> > > > > > >> >> > > >
>> > > > > > >> >> > > > - aggregation: use the flexible window mechanism;
>> > > > > > >> >> > > > - non-aggregation: use the Flink state API to cache a
>> > > > > > >> >> > > >   batch of data
>> > > > > > >> >> > > >
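>> > > > > > >> >> > > > (Hypothetical fragment of the window idea: buffer
>> > > > > > >> >> > > > records into one-minute panes, one pane roughly = one
>> > > > > > >> >> > > > Hudi commit; UpsertBatchFn is a made-up window
>> > > > > > >> >> > > > function:)
>> > > > > > >> >> > > >
>> > > > > > >> >> > > >   records
>> > > > > > >> >> > > >       .keyBy(r -> r.partitionPath)
>> > > > > > >> >> > > >       .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
>> > > > > > >> >> > > >       .process(new UpsertBatchFn());
>> > > > > > >> >> > > >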
>> > > > > > >> >> > > >
>> > > > > > >> >> > > > >> On first focusing on decoupling of Spark and Hudi
>> > > > > > >> >> > > > alone, yes, a full summary of how Spark is being used
>> > > > > > >> >> > > > in a wiki page is a good start IMO. We can then hash
>> > > > > > >> >> > > > out what can be generalized and what cannot be and
>> > > > > > >> >> > > > needs to be left in hudi-client-spark vs
>> > > > > > >> >> > > > hudi-client-core.
>> > > > > > >> >> > > >
>> > > > > > >> >> > > > agree
>> > > > > > >> >> > > >
>> > > > > > >> >> > > > On Wed, Aug 14, 2019 at 8:35 AM, Vinoth Chandar
>> > > > > > >> >> > > > <vi...@apache.org> wrote:
>> > > > > > >> >> > > >
>> > > > > > >> >> > > > > >> We should only stick to Flink Streaming.
>> > > > > > >> >> > > > > >> Furthermore, if there is a requirement for batch,
>> > > > > > >> >> > > > > >> then users should use Spark, or we will anyway
>> > > > > > >> >> > > > > >> have a Beam integration coming up.
>> > > > > > >> >> > > > >
>> > > > > > >> >> > > > > Currently Spark Streaming micro-batching fits well
>> > > > > > >> >> > > > > with Hudi, since it amortizes the cost of indexing,
>> > > > > > >> >> > > > > workload profiling etc. 1 Spark micro-batch = 1 Hudi
>> > > > > > >> >> > > > > commit.
>> > > > > > >> >> > > > > With the per-record model in Flink, I am not sure
>> > > > > > >> >> > > > > how useful it will be to support Hudi.. e.g., 1
>> > > > > > >> >> > > > > input record cannot be 1 Hudi commit; it will be
>> > > > > > >> >> > > > > inefficient..
>> > > > > > >> >> > > > >
>> > > > > > >> >> > > > > On first focusing on decoupling of Spark and Hudi
>> > > > > > >> >> > > > > alone, yes, a full summary of how Spark is being
>> > > > > > >> >> > > > > used in a wiki page is a good start IMO. We can then
>> > > > > > >> >> > > > > hash out what can be generalized and what cannot be
>> > > > > > >> >> > > > > and needs to be left in hudi-client-spark vs
>> > > > > > >> >> > > > > hudi-client-core.
>> > > > > > >> >> > > > >
>> > > > > > >> >> > > > >
>> > > > > > >> >> > > > >
>> > > > > > >> >> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <
>> > > > > > >> yanghua1127@gmail.com>
>> > > > > > >> >> > > wrote:
>> > > > > > >> >> > > > >
>> > > > > > >> >> > > > > > Hi Nick and Taher,
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > I just want to answer Nishith's question,
>> > > > > > >> >> > > > > > referencing his old description here:
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > > You can do a parallel investigation while we
>> > > > > > >> >> > > > > > > are deciding on the module structure. You could
>> > > > > > >> >> > > > > > > be looking at all the patterns in Hudi's Spark
>> > > > > > >> >> > > > > > > API usage (RDD/DataSource/SparkContext) and see
>> > > > > > >> >> > > > > > > if such support can be achieved in theory with
>> > > > > > >> >> > > > > > > Flink. If not, what is the workaround?
>> > > > > > >> >> > > > > > > Documenting such patterns would be valuable when
>> > > > > > >> >> > > > > > > multiple engineers are working on it. E.g., Hudi
>> > > > > > >> >> > > > > > > relies on (a) custom partitioning logic for
>> > > > > > >> >> > > > > > > upserts, (b) caching RDDs to avoid reruns of
>> > > > > > >> >> > > > > > > costly stages, (c) a Spark upsert task knowing
>> > > > > > >> >> > > > > > > its Spark partition/task/attempt ids
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > And just like the title of this thread says, we
>> > > > > > >> >> > > > > > are going to try to decouple Hudi and Spark. That
>> > > > > > >> >> > > > > > means we can run the whole of Hudi without
>> > > > > > >> >> > > > > > depending on Spark. So we need to analyze all the
>> > > > > > >> >> > > > > > usage of Spark in Hudi.
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > Here we are not discussing the integration of
>> Hudi
>> > > and
>> > > > > > Flink
>> > > > > > >> in
>> > > > > > >> >> the
>> > > > > > >> >> > > > > > application layer. Instead, I want Hudi to be
>> > > > decoupled
>> > > > > > from
>> > > > > > >> >> Spark
>> > > > > > >> >> > > and
>> > > > > > >> >> > > > > > allow other engines (such as Flink) to replace
>> > Spark.
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > It can be divided into long-term and short-term
>> > > > > > >> >> > > > > > goals, as Nishith stated in a recent email.
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > I mentioned the Flink Batch API here because Hudi
>> > > > > > >> >> > > > > > can connect with many different sources/sinks.
>> > > > > > >> >> > > > > > Some file-based reads are not appropriate for
>> > > > > > >> >> > > > > > Flink Streaming.
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > Therefore, this is a comprehensive survey of the
>> > use
>> > > > of
>> > > > > > >> Spark in
>> > > > > > >> >> > > Hudi.
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > Best,
>> > > > > > >> >> > > > > > Vino
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > On Tue, Aug 13, 2019 at 5:43 PM, taher koitawala
>> > > > > > >> >> > > > > > <ta...@gmail.com> wrote:
>> > > > > > >> >> > > > > >
>> > > > > > >> >> > > > > > > Hi Vino,
>> > > > > > >> >> > > > > > > According to what I've seen, Hudi has a lot of
>> > > > > > >> >> > > > > > > Spark components flowing through it, like
>> > > > > > >> >> > > > > > > TaskContexts, JavaSparkContexts etc. The main
>> > > > > > >> >> > > > > > > classes I guess we should focus upon are
>> > > > > > >> >> > > > > > > HoodieTable and the Hoodie write clients.
>> > > > > > >> >> > > > > > >
>> > > > > > >> >> > > > > > > Also, Vino, I don't think we should be providing
>> > > > > > >> >> > > > > > > a Flink DataSet implementation. We should only
>> > > > > > >> >> > > > > > > stick to Flink Streaming. Furthermore, if there
>> > > > > > >> >> > > > > > > is a requirement for batch, then users should
>> > > > > > >> >> > > > > > > use Spark, or we will anyway have a Beam
>> > > > > > >> >> > > > > > > integration coming up.
>> > > > > > >> >> > > > > > >
>> > > > > > >> >> > > > > > > As for caching, how about we write our own
>> > > > > > >> >> > > > > > > stateful Flink function and use
>> > > > > > >> >> > > > > > > RocksDBStateBackend with some state TTL.
>> > > > > > >> >> > > > > > >
>> > > > > > >> >> > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang <
>> > > > > > >> >> yanghua1127@gmail.com>
>> > > > > > >> >> > > > wrote:
>> > > > > > >> >> > > > > > >
>> > > > > > >> >> > > > > > > > Hi all,
>> > > > > > >> >> > > > > > > >
>> > > > > > >> >> > > > > > > > After doing some research, let me share my
>> > > > > information:
>> > > > > > >> >> > > > > > > >
>> > > > > > >> >> > > > > > > >
>> > > > > > >> >> > > > > > > > - Limitation of computing engine
>> capabilities:
>> > > > Hudi
>> > > > > > >> uses
>> > > > > > >> >> > > Spark's
>> > > > > > >> >> > > > > > > > RDD#persist, and Flink currently has no API
>> to
>> > > > cache
>> > > > > > >> >> > datasets.
>> > > > > > >> >> > > > > Maybe
>> > > > > > >> >> > > > > > > we
>> > > > > > >> >> > > > > > > > can
>> > > > > > >> >> > > > > > > > only choose to use external storage or do
>> not
>> > use
>> > > > > > >> cache?
>> > > > > > >> >> For
>> > > > > > >> >> > > the
>> > > > > > >> >> > > > > use
>> > > > > > >> >> > > > > > > of
>> > > > > > >> >> > > > > > > > other APIs, the two currently offer almost
>> > > > equivalent
>> > > > > > >> >> > > > > capabilities.
>> > > > > > >> >> > > > > > > > - The abstraction of the computing engine is
>> > > > > > >> different:
>> > > > > > >> >> > > > > Considering
>> > > > > > >> >> > > > > > > the
>> > > > > > >> >> > > > > > > > different usage scenarios of the computing
>> > engine
>> > > > in
>> > > > > > >> >> Hudi,
>> > > > > > >> >> > > Flink
>> > > > > > >> >> > > > > has
>> > > > > > >> >> > > > > > > not
>> > > > > > >> >> > > > > > > > yet implemented stream batch unification,
>> so we
>> > > > may
>> > > > > > >> use
>> > > > > > >> >> both
>> > > > > > >> >> > > > > Flink's
>> > > > > > >> >> > > > > > > > DataSet API (batch processing) and
>> DataStream
>> > API
>> > > > > > >> (stream
>> > > > > > >> >> > > > > > processing).
>> > > > > > >> >> > > > > > > >
>> > > > > > >> >> > > > > > > > Best,
>> > > > > > >> >> > > > > > > > Vino
>> > > > > > >> >> > > > > > > >
>> > > > > > >> >> > > > > > > > On Thu, Aug 8, 2019 at 12:57 AM, nishith
>> > > > > > >> >> > > > > > > > agarwal <n3...@gmail.com> wrote:
>> > > > > > >> >> > > > > > > >
>> > > > > > >> >> > > > > > > > > Nick,
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > You bring up a good point about the
>> > > > > > >> >> > > > > > > > > non-trivial programming model differences
>> > > > > > >> >> > > > > > > > > between these technologies. From a
>> > > > > > >> >> > > > > > > > > theoretical perspective, I'd say considering
>> > > > > > >> >> > > > > > > > > a higher-level abstraction makes sense. I
>> > > > > > >> >> > > > > > > > > think we have to decouple some objectives
>> > > > > > >> >> > > > > > > > > and concerns here.
>> > > > > > concerns
>> > > > > > >> >> here.
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > a) The immediate desire is to have Hudi be
>> > able
>> > > > to
>> > > > > > run
>> > > > > > >> on
>> > > > > > >> >> a
>> > > > > > >> >> > > Flink
>> > > > > > >> >> > > > > (or
>> > > > > > >> >> > > > > > > > > non-spark) engine. This naturally begs the
>> > > > question
>> > > > > > of
>> > > > > > >> >> > > decoupling
>> > > > > > >> >> > > > > > Hudi
>> > > > > > >> >> > > > > > > > > concepts from direct Spark dependencies.
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > b) If we do want to initiate the above
>> > effort,
>> > > > > would
>> > > > > > it
>> > > > > > >> >> make
>> > > > > > >> >> > > > sense
>> > > > > > >> >> > > > > to
>> > > > > > >> >> > > > > > > > just
>> > > > > > >> >> > > > > > > > > have a higher level abstraction, building
>> on
>> > > > other
>> > > > > > >> >> > technologies
>> > > > > > >> >> > > > > like
>> > > > > > >> >> > > > > > > beam
>> > > > > > >> >> > > > > > > > > (euphoria etc) and provide single, clean
>> > API's
>> > > > that
>> > > > > > >> may be
>> > > > > > >> >> > more
>> > > > > > >> >> > > > > > > > > maintainable from a code perspective. But
>> at
>> > > the
>> > > > > same
>> > > > > > >> time
>> > > > > > >> >> > this
>> > > > > > >> >> > > > > will
>> > > > > > >> >> > > > > > > > > introduce challenges on how to maintain
>> > > > efficiency
>> > > > > > and
>> > > > > > >> >> > > optimized
>> > > > > > >> >> > > > > > > runtime
>> > > > > > >> >> > > > > > > > > dags for Hudi (since the code would move
>> away
>> > > > from
>> > > > > > >> point
>> > > > > > >> >> > > > > integrations
>> > > > > > >> >> > > > > > > and
>> > > > > > >> >> > > > > > > > > whenever this happens, tuning natively for
>> > > > specific
>> > > > > > >> >> engines
>> > > > > > >> >> > > > becomes
>> > > > > > >> >> > > > > > > more
>> > > > > > >> >> > > > > > > > > and more difficult).
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > My general opinion is that, as the
>> community
>> > > > grows
>> > > > > > over
>> > > > > > >> >> time
>> > > > > > >> >> > > with
>> > > > > > >> >> > > > > > more
>> > > > > > >> >> > > > > > > > > folks having an in-depth understanding of
>> > Hudi,
>> > > > > going
>> > > > > > >> from
>> > > > > > >> >> > > > > > > current_state
>> > > > > > >> >> > > > > > > > ->
>> > > > > > >> >> > > > > > > > > (a) -> (b) might be the most reliable and
>> > > > adoptable
>> > > > > > >> path
>> > > > > > >> >> for
>> > > > > > >> >> > > this
>> > > > > > >> >> > > > > > > > project.
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > Thanks,
>> > > > > > >> >> > > > > > > > > Nishith
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic
>> > Beeng <
>> > > > > > >> >> > > > > > nick@semanticbeeng.com>
>> > > > > > >> >> > > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > There are some non-trivial differences in
>> > > > > > >> >> > > > > > > > > programming model and runtime semantics
>> > > > > > >> >> > > > > > > > > between Beam, Spark and Flink.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > Nishith, Vino - thoughts?
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > Does it make sense to consider a
>> > > > > > >> >> > > > > > > > > higher-level abstraction / DSL instead of
>> > > > > > >> >> > > > > > > > > maintaining different code with the same
>> > > > > > >> >> > > > > > > > > functionality but different programming
>> > > > > > >> >> > > > > > > > > models?
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> https://beam.apache.org/documentation/sdks/java/euphoria/
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Nick
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On August 6, 2019 at 4:04 PM nishith
>> > agarwal
>> > > <
>> > > > > > >> >> > > > > n3.nash29@gmail.com>
>> > > > > > >> >> > > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > +1 for Approach 1 Point integration with
>> > each
>> > > > > > >> framework.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Pros for point integration
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > - The Hudi community is already familiar
>> > > > > > >> >> > > > > > > > > with Spark and Spark-based actions/shuffles
>> > > > > > >> >> > > > > > > > > etc. Since both modules can be decoupled,
>> > > > > > >> >> > > > > > > > > this enables us to have a steady release of
>> > > > > > >> >> > > > > > > > > Hudi for 1 execution engine (Spark) while we
>> > > > > > >> >> > > > > > > > > hone our skills and iterate on making the
>> > > > > > >> >> > > > > > > > > Flink DAG optimized and performant with the
>> > > > > > >> >> > > > > > > > > right configuration.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > - This might be a stepping stone towards
>> > > > > > >> >> > > > > > > > > rewriting the entire code base to be
>> > > > > > >> >> > > > > > > > > agnostic of Spark/Flink. This approach will
>> > > > > > >> >> > > > > > > > > help us fix tests and intricacies, and help
>> > > > > > >> >> > > > > > > > > make the code base ready for a larger
>> > > > > > >> >> > > > > > > > > rework.
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > - Seems like the easiest way to add Flink
>> > > > > > >> >> > > > > > > > > support.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Cons
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > - More code paths to maintain and reason
>> > > > > > >> >> > > > > > > > > about, since the Spark and Flink
>> > > > > > >> >> > > > > > > > > integrations will naturally diverge over
>> > > > > > >> >> > > > > > > > > time.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > Theoretically, I do like the idea of being
>> > > > > > >> >> > > > > > > > > able to run the Hudi DAG on Beam more than
>> > > > > > >> >> > > > > > > > > point integrations, since there is one
>> > > > > > >> >> > > > > > > > > API/logic to reason about. But practically,
>> > > > > > >> >> > > > > > > > > that may not be the right direction.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Pros
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > - Less cognitive burden in maintaining,
>> > > > > > >> >> > > > > > > > > evolving and releasing the project, with one
>> > > > > > >> >> > > > > > > > > API to reason with.
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > - Theoretically, going forward, assuming
>> > > > > > >> >> > > > > > > > > Beam is adopted as a standard programming
>> > > > > > >> >> > > > > > > > > paradigm for stream/batch, this would enable
>> > > > > > >> >> > > > > > > > > consumers to leverage the power of Hudi more
>> > > > > > >> >> > > > > > > > > easily.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Cons
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > - Massive rewrite of the code base.
>> > > > > > >> >> > > > > > > > > Additionally, since we would have moved away
>> > > > > > >> >> > > > > > > > > from directly using Spark APIs, there is a
>> > > > > > >> >> > > > > > > > > bigger risk of regression. We would have to
>> > > > > > >> >> > > > > > > > > be very thorough with all the intricacies
>> > > > > > >> >> > > > > > > > > and ensure the same stability of new
>> > > > > > >> >> > > > > > > > > releases.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > - Managing future features (which may be
>> > > > > > >> >> > > > > > > > > very Spark-driven) will either clash, or
>> > > > > > >> >> > > > > > > > > pause, or will need to be reworked.
>> > > > > > >> >> > > > > > > > >
>> > > > > > >> >> > > > > > > > > - Tuning jobs for Spark/Flink-type execution
>> > > > > > >> >> > > > > > > > > frameworks individually might be difficult,
>> > > > > > >> >> > > > > > > > > and will get more difficult over time as the
>> > > > > > >> >> > > > > > > > > project evolves, where some Beam
>> > > > > > >> >> > > > > > > > > integrations with Spark/Flink may not work
>> > > > > > >> >> > > > > > > > > as expected.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > - Also, as pointed out above, we probably
>> > > > > > >> >> > > > > > > > > need to support the hoodie-spark module as
>> > > > > > >> >> > > > > > > > > first-class.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Thank,
>> > > > > > >> >> > > > > > > > > > Nishith
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher
>> > > koitawala
>> > > > <
>> > > > > > >> >> > > > > taherk77@gmail.com
>> > > > > > >> >> > > > > > >
>> > > > > > >> >> > > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Hi Vinoth,
>> > > > > > >> >> > > > > > > > > > Are there some tasks I can take up to ramp
>> > > > > > >> >> > > > > > > > > > up on the code? Want to get more used to
>> > > > > > >> >> > > > > > > > > > the code and understand the existing
>> > > > > > >> >> > > > > > > > > > implementation better.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Thanks,
>> > > > > > >> >> > > > > > > > > > Taher Koitawala
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth
>> > Chandar
>> > > <
>> > > > > > >> >> > > > vinoth@apache.org>
>> > > > > > >> >> > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Let's see if others have any thoughts as
>> > > > > > >> >> > > > > > > > > > well. We can plan to fix the approach by
>> > > > > > >> >> > > > > > > > > > EOW.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino
>> yang <
>> > > > > > >> >> > > > yanghua1127@gmail.com>
>> > > > > > >> >> > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Hi guys,
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Also, +1 for Approach 1 like Taher.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > If we can do a comprehensive analysis of
>> > > > > > >> >> > > > > > > > > > this model and come up with means to
>> > > > > > >> >> > > > > > > > > > refactor this cleanly, this would be
>> > > > > > >> >> > > > > > > > > > promising.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Yes, when we get the conclusion, we
>> could
>> > > > start
>> > > > > > this
>> > > > > > >> >> work.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Best,
>> > > > > > >> >> > > > > > > > > > Vino
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 12:28 AM, taher
>> > > > > > >> >> > > > > > > > > > koitawala <ta...@gmail.com> wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > +1 for Approach 1, point integration with
>> > > > > > >> >> > > > > > > > > > each framework.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Approach 2 has a problem, as you said:
>> > > > > > >> >> > > > > > > > > > "Developers need to think about
>> > > > > > >> >> > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
>> > > > > > >> >> > > > > > > > > > So in the end, this may not be the panacea
>> > > > > > >> >> > > > > > > > > > that it seems to be."
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > We have seen various pipelines in the Beam
>> > > > > > >> >> > > > > > > > > > DAG being expressed differently than we
>> > > > > > >> >> > > > > > > > > > had them in our original use case. And
>> > > > > > >> >> > > > > > > > > > also, switching between the Spark and
>> > > > > > >> >> > > > > > > > > > Flink runners in Beam has various impacts
>> > > > > > >> >> > > > > > > > > > on the pipelines, like some features
>> > > > > > >> >> > > > > > > > > > available in Flink not being available on
>> > > > > > >> >> > > > > > > > > > the Spark runner, etc.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Refer to this compatibility matrix ->
>> https://beam.apache.org/documentation/runners/capability-matrix/
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Hence my vote is on Approach 1: let's
>> > > > > > >> >> > > > > > > > > > decouple and build the abstraction for
>> > > > > > >> >> > > > > > > > > > each framework. That is a much better
>> > > > > > >> >> > > > > > > > > > option. We will also have more control
>> > > > > > >> >> > > > > > > > > > over each framework's implementation.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth
>> > Chandar <
>> > > > > > >> >> > > vinoth@apache.org
>> > > > > > >> >> > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Would like to highlight that there are two
>> > > > > > >> >> > > > > > > > > > distinct approaches here, with different
>> > > > > > >> >> > > > > > > > > > tradeoffs. Think of this as my braindump,
>> > > > > > >> >> > > > > > > > > > as I have been thinking about this quite a
>> > > > > > >> >> > > > > > > > > > bit in the past.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > *Approach 1: Point integration with each framework*
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > We may need a pure client module named,
>> > > > > > >> >> > > > > > > > > > for example, hoodie-client-core (common);
>> > > > > > >> >> > > > > > > > > > a strawman of the seam follows below.
>> > > > > > >> >> > > > > > > > > > >> Then we could have: hoodie-client-spark,
>> > > > > > >> >> > > > > > > > > > hoodie-client-flink and hoodie-client-beam
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > (+) This is the safest to do IMO, since we
>> > > > > > >> >> > > > > > > > > > can isolate the current Spark execution
>> > > > > > >> >> > > > > > > > > > (hoodie-spark, hoodie-client-spark) from
>> > > > > > >> >> > > > > > > > > > the changes for Flink, while it stabilizes
>> > > > > > >> >> > > > > > > > > > over a few releases.
>> > > > > > >> >> > > > > > > > > > (-) Downside is that the utilities need to
>> > > > > > >> >> > > > > > > > > > be redone: hoodie-utilities-spark,
>> > > > > > >> >> > > > > > > > > > hoodie-utilities-flink and
>> > > > > > >> >> > > > > > > > > > hoodie-utilities-core? hoodie-cli?
>> > > > > > >> >> > > > > > > > > >
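>> > > > > > >> >> > > > > > > > > > (To make hoodie-client-core concrete, a
>> > > > > > >> >> > > > > > > > > > strawman of the engine seam -- every name
>> > > > > > >> >> > > > > > > > > > here is invented for discussion, nothing
>> > > > > > >> >> > > > > > > > > > in the codebase yet:)
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >   // made-up wrapper over JavaRDD / DataStream
>> > > > > > >> >> > > > > > > > > >   interface HoodieData {}
>> > > > > > >> >> > > > > > > > > >   // made-up per-record transform
>> > > > > > >> >> > > > > > > > > >   interface RecordFunction {}
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >   // hoodie-client-core: engine-agnostic contract;
>> > > > > > >> >> > > > > > > > > >   // hoodie-client-spark / -flink each implement it.
>> > > > > > >> >> > > > > > > > > >   public interface EngineContext {
>> > > > > > >> >> > > > > > > > > >     HoodieData mapRecords(HoodieData in, RecordFunction fn);
>> > > > > > >> >> > > > > > > > > >     void persist(HoodieData in); // Spark: RDD#persist; Flink: state
>> > > > > > >> >> > > > > > > > > >     long count(HoodieData in);   // feeds WorkloadProfile-style stats
>> > > > > > >> >> > > > > > > > > >   }
>> > > > > > >> >> > > > > > > > > >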
>> > > > > > >> >> > > > > > > > > > If we can do a comprehensive analysis of
>> > > > > > >> >> > > > > > > > > > this model and come up with means to
>> > > > > > >> >> > > > > > > > > > refactor this cleanly, this would be
>> > > > > > >> >> > > > > > > > > > promising.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > *Approach 2: Beam as the compute abstraction*
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Another, more drastic approach is to
>> > > > > > >> >> > > > > > > > > > remove Spark as the compute abstraction
>> > > > > > >> >> > > > > > > > > > for writing data and replace it with Beam.
>> > > > > > >> >> > > > > > > > > > (+) All of the code remains more or less
>> > > > > > >> >> > > > > > > > > > similar, and there is one compute API to
>> > > > > > >> >> > > > > > > > > > reason about.
>> > > > > > >> >> > > > > > > > > > (-) The (very big) assumption here is that
>> > > > > > >> >> > > > > > > > > > we are able to tune the Spark runtime the
>> > > > > > >> >> > > > > > > > > > same way using Beam: custom partitioners,
>> > > > > > >> >> > > > > > > > > > support for all RDD operations we invoke,
>> > > > > > >> >> > > > > > > > > > caching, etc.
>> > > > > > >> >> > > > > > > > > > (-) It will be a massive rewrite, and
>> > > > > > >> >> > > > > > > > > > testing such a large rewrite would also be
>> > > > > > >> >> > > > > > > > > > really challenging, since we need to pay
>> > > > > > >> >> > > > > > > > > > attention to all the intricate details to
>> > > > > > >> >> > > > > > > > > > ensure today's Spark users experience no
>> > > > > > >> >> > > > > > > > > > regressions/side-effects.
>> > > > > > >> >> > > > > > > > > > (-) Note that we still probably need to
>> > > > > > >> >> > > > > > > > > > support the hoodie-spark module, and maybe
>> > > > > > >> >> > > > > > > > > > a first-class such integration with Flink,
>> > > > > > >> >> > > > > > > > > > for native Flink/Spark pipeline authoring.
>> > > > > > >> >> > > > > > > > > > Users of, say, DeltaStreamer need to pass
>> > > > > > >> >> > > > > > > > > > in Spark or Flink configs anyway..
>> > > > > > >> >> > > > > > > > > > Developers need to think about
>> > > > > > >> >> > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
>> > > > > > >> >> > > > > > > > > > So in the end, this may not be the panacea
>> > > > > > >> >> > > > > > > > > > that it seems to be.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > One goal for the HIP is to get us all, as
>> > > > > > >> >> > > > > > > > > > a community, to agree which one to pick,
>> > > > > > >> >> > > > > > > > > > with sufficient investigation, testing and
>> > > > > > >> >> > > > > > > > > > benchmarking..
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino
>> yang <
>> > > > > > >> >> > > > yanghua1127@gmail.com>
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > +1 for both Beam and Flink
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > The first step here is probably to draw
>> > > > > > >> >> > > > > > > > > > out the current hierarchy and figure out
>> > > > > > >> >> > > > > > > > > > what the abstraction points are..
>> > > > > > >> >> > > > > > > > > > In my opinion, the runtime (Spark, Flink)
>> > > > > > >> >> > > > > > > > > > should be done at the hoodie-client level
>> > > > > > >> >> > > > > > > > > > and just used by hoodie-utilities
>> > > > > > >> >> > > > > > > > > > seamlessly..
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > +1 for Vinoth's opinion, it should be the
>> > > > > > >> >> > > > > > > > > > first step.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > No matter which computing framework we
>> > > > > > >> >> > > > > > > > > > hope Hudi will integrate with, we need to
>> > > > > > >> >> > > > > > > > > > decouple the Hudi client and Spark.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > We may need a pure client module named
>> for
>> > > > > example
>> > > > > > >> >> > > > > > > > > > hoodie-client-core(common)
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Then we could have: hoodie-client-spark,
>> > > > > > >> >> > hoodie-client-flink
>> > > > > > >> >> > > > and
>> > > > > > >> >> > > > > > > > > > hoodie-client-beam
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Sun, Aug 4, 2019 at 10:45 AM, Suneel
>> > > > > > >> >> > > > > > > > > > Marthi <sm...@apache.org> wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > +1 for Beam -- agree with Semantic
>> Beeng's
>> > > > > > analysis.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher
>> > > > koitawala <
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > taherk77@gmail.com>
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > So the way to go about this is to file a
>> > > > > > >> >> > > > > > > > > > HIP, chalk out all the classes, and start
>> > > > > > >> >> > > > > > > > > > moving towards a pure client.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > Secondly, should we want to try Beam?
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > I think there is too much going on here
>> > > > > > >> >> > > > > > > > > > and I'm not able to follow. If we want to
>> > > > > > >> >> > > > > > > > > > try out Beam all along, I don't think it
>> > > > > > >> >> > > > > > > > > > makes sense to do anything on Flink then.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic
>> > Beeng <
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > nick@semanticbeeng.com>
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > wrote:
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >> +1 My money is on this approach.
>> > > > > > >> >> > > > > > > > > > >>
>> > > > > > >> >> > > > > > > > > > >> The existing abstractions from Beam
>> seem
>> > > > > enough
>> > > > > > >> for
>> > > > > > >> >> the
>> > > > > > >> >> > > use
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > cases
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > as I
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > imagine them.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >> Flink also has "dynamic table", "table
>> > > > > > >> >> > > > > > > > > > source" and "table sink", which seem very
>> > > > > > >> >> > > > > > > > > > useful abstractions where Hudi might fit
>> > > > > > >> >> > > > > > > > > > nicely.
>> > > might
>> > > > > fit
>> > > > > > >> >> nicely.
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >>
>> > > > > > >> >> > > > > > > > > > >>
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>> > > > > > >> >> > > > > > > > > >
>> > > > > > >> >> > > > > > > > > > >>
>> > > > > > >> >> > > > > > > > > > >> Attached a screen shot.
>> > > > > > >> >> > > > > > > > > > >>
>> > > > > > >> >> > > > > > > > > > >> This seems to fit with the original
>> > > premise
>> > > > of
>> > > > > > >> Hudi
>> > > > > > >> >> as
>> > > > > > >> >> > > well.
>> > > > > > >> >> > > > > > > > > > >>
>> > > > > > >> >> >
>>
>