Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2019/05/06 18:25:14 UTC

Joining 3 tables incrementally

Reposting a discussion from Slack as an FYI.

"Jaimin [3:10 AM]
Hi
We have a use case with a set of MOR base tables and flattened entities
built on top of them. For example, we have order, customer, and seller
tables, and a flattened entity built by joining these 3 tables.
To build the flattened entity, I think we need to fetch changes from each
of these tables incrementally (incremental pull) and join them with the
rest of the complete tables. So there will be n joins (equal to the number
of tables involved in the flattened entity). Is there a more efficient way
to do this? Also, for the join, will we need our own Spark job, or does
Hudi provide these capabilities as well?
Our data can also contain deletes; I am using the empty payload
implementation to delete data. I tried this out with sample data: deleting
data from a base table, compacting, and then using incremental pull to
fetch the changes, but I didn't see the deletes as part of the incremental
pull. Am I missing something?
Thanks "

and my response

"
you can pull the 3 tables and join them in a custom Spark job; that should
be fine. (yes, you need your own Spark job.. the DeltaStreamer tool
supports transforms, but limits itself to 1 table pulled incrementally)..
What Nishith is alluding to is being able to "safely" align the windows
between the 3 tables, which needs more business context to determine.. For
e.g, if you are joining the 3 tables based on order_id, then you need to be
sure that the order shows up in the customer/seller/order tables within the
same time range you are pulling for..

@Jaimin This is such an interesting topic.. I will start a thread on the
mailing list. Please join and we can continue there, so others can also
jump in.. https://hudi.apache.org/community.html
"

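To make the above concrete, a rough sketch of such a custom Spark job in
Scala might look as follows. It assumes a SparkSession named spark; the
paths, key fields, and checkpoint value are hypothetical, and the
incremental-query option keys and the HoodieDataSourceHelpers call should
be verified against your Hudi version:

import org.apache.hadoop.fs.FileSystem
import org.apache.hudi.HoodieDataSourceHelpers

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Begin instant persisted by the previous run (example value).
val lastCheckpoint = "20190506182514"

// One way to "safely" align windows: cap every pull at the smallest
// "latest commit" across the three tables, so no table is read past a
// point the others haven't reached. Commit instants are fixed-width
// timestamp strings, so the lexicographic min is also the earliest.
val tablePaths =
  Seq("/data/hudi/orders", "/data/hudi/customers", "/data/hudi/sellers")
val commonEnd =
  tablePaths.map(p => HoodieDataSourceHelpers.latestCommit(fs, p)).min

// Pull only the orders that changed in (lastCheckpoint, commonEnd]...
val ordersDelta = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", lastCheckpoint)
  .option("hoodie.datasource.read.end.instanttime", commonEnd)
  .load("/data/hudi/orders")

// ...and join them against complete (read-optimized) views of the others.
val customers = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("/data/hudi/customers")
val sellers = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("/data/hudi/sellers")

// The result is the delta to upsert into the flattened table.
val flattenedDelta = ordersDelta
  .join(customers, "customer_id")
  .join(sellers, "seller_id")
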
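On the delete question above: with the empty payload approach, issuing the
deletes might look roughly like this (again only a sketch; the table name,
fields, and paths are made up, and the payload class and option keys should
be checked against your Hudi release). Note this shows how deletes are
issued, not why they were missing from the incremental pull, which the
thread leaves open:

import org.apache.spark.sql.SaveMode

// Records whose keys should be deleted (the filter is just an example).
val toDelete = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("/data/hudi/orders")
  .filter("order_status = 'CANCELLED'")

toDelete.write.format("org.apache.hudi")
  .option("hoodie.table.name", "orders")
  .option("hoodie.datasource.write.recordkey.field", "order_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  // Upserting with an empty payload marks the record as deleted once the
  // merge/compaction runs.
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.EmptyHoodieRecordPayload")
  .mode(SaveMode.Append)
  .save("/data/hudi/orders")
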
Re: Joining 3 tables incrementally

Posted by Vinoth Chandar <vi...@apache.org>.
Interesting.. you captured the pitfalls I was alluding to nicely.
IIUC, you are doing an incremental pull per table, each joined against the
complete tables, to reconcile. It should work.

On Tue, May 7, 2019 at 12:06 AM Jaimin Shah <sh...@gmail.com>
wrote:

> Hi
>
> Thanks for the quick response.
> As we discussed, we will pull changes incrementally and join with the MOR
> read-optimized views. For example, order will be pulled incrementally and
> joined with the read-optimized views of seller and customer; then seller
> is pulled incrementally and joined with order and customer, and the same
> process is applied for customer as well.
>
> Regarding "safely" aligning windows, I don't think we need to bother with
> this, as the data will be corrected while processing a subsequent batch.
> For example, if an insert into seller is not yet reflected and the
> corresponding order arrives first, that order will miss the seller data
> in the first batch; but in the next batch the insert into seller will be
> processed and joined with customer and orders, so it will be handled. We
> are fine with eventual consistency of the data. Please correct me if I am
> missing some points.

Re: Joining 3 tables incrementally

Posted by Jaimin Shah <sh...@gmail.com>.
Hi

Thanks for the quick response.
As we discussed, we will pull changes incrementally and join with the MOR
read-optimized views. For example, order will be pulled incrementally and
joined with the read-optimized views of seller and customer; then seller is
pulled incrementally and joined with order and customer, and the same
process is applied for customer as well.

Regarding "safely" aligning windows, I don't think we need to bother with
this, as the data will be corrected while processing a subsequent batch.
For example, if an insert into seller is not yet reflected and the
corresponding order arrives first, that order will miss the seller data in
the first batch; but in the next batch the insert into seller will be
processed and joined with customer and orders, so it will be handled. We
are fine with eventual consistency of the data. Please correct me if I am
missing some points.
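
A sketch of this per-table pattern is below. The paths and key fields are
hypothetical, dropMeta, pullIncremental, and snapshot are helper names made
up for this example, and it assumes the flattened key is order_id; the
Hudi option keys are the same (to-be-verified) ones as in the earlier
sketches:

import org.apache.spark.sql.DataFrame

// Hudi adds _hoodie_* meta columns to every read; drop them so the three
// joined frames line up for unionByName.
def dropMeta(df: DataFrame): DataFrame =
  df.drop(df.columns.filter(_.startsWith("_hoodie_")): _*)

def pullIncremental(path: String, since: String): DataFrame =
  dropMeta(spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", since)
    .load(path))

def snapshot(path: String): DataFrame =
  dropMeta(spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(path))

val since = "20190506182514" // checkpointed begin instant (example value)
val (orders, customers, sellers) =
  ("/data/hudi/orders", "/data/hudi/customers", "/data/hudi/sellers")

// One direction per table: its delta joined with full views of the others.
val fromOrders = pullIncremental(orders, since)
  .join(snapshot(customers), "customer_id")
  .join(snapshot(sellers), "seller_id")
val fromCustomers = snapshot(orders)
  .join(pullIncremental(customers, since), "customer_id")
  .join(snapshot(sellers), "seller_id")
val fromSellers = snapshot(orders)
  .join(snapshot(customers), "customer_id")
  .join(pullIncremental(sellers, since), "seller_id")

// The same flattened row can surface through more than one direction in a
// batch, so de-duplicate on the flattened key before upserting downstream.
// Late-arriving join partners are picked up by a later batch, which is the
// eventual consistency described above.
val flattenedDelta = fromOrders
  .unionByName(fromCustomers)
  .unionByName(fromSellers)
  .dropDuplicates("order_id")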
