Posted to dev@hudi.apache.org by Joaquim S <jo...@gmail.com> on 2020/03/26 20:42:23 UTC

Get all deletes after a specific commit time

Folks,

I am looking at DMS integration and have a question.

It is clear that the incremental queries only show incrementals ( :) ). I
need to extract the deletes too, starting from a specific commit (ideally
with Spark SQL or Hive).

What am I missing? Is there another way to query the timeline to get the
records that were deleted after a specific commit?

Thank you!

Re: Get all deletes after a specific commit time

Posted by Joaquim S <jo...@gmail.com>.
Will take my workaround back. Your workaround is cleaner. Thanks again.

Joaquim S <jo...@gmail.com> wrote on Thursday, 26/03/2020 at 17:09:

> Looking at the definition of soft deletes, as the key in the record is
> still available, I can work with that for downstream consumption for now.
> For hard deletes, I will follow the thread.

Re: Get all deletes after a specific commit time

Posted by Joaquim S <jo...@gmail.com>.
Looking at the definition of soft deletes, as the key in the record is
still available, I can work with that for downstream consumption for now.
For hard deletes, I will follow the thread.
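
A minimal sketch of picking soft deletes out of an incremental batch, in
plain Python rather than Spark. It assumes a soft delete is written as an
upsert that keeps the record key and nulls out the remaining payload
columns. The helper name `is_soft_deleted`, the `retained_fields` parameter,
and the sample columns (`id`, `name`, `amount`) are hypothetical; only the
`_hoodie_` prefix of the metadata columns comes from Hudi itself.

```python
from typing import Dict, FrozenSet, List, Optional

Record = Dict[str, Optional[object]]


def is_soft_deleted(row: Record, retained_fields: FrozenSet[str]) -> bool:
    """Return True when every payload column of the row is None.

    Columns starting with "_hoodie_" are Hudi metadata and are ignored.
    `retained_fields` names the columns a soft delete keeps populated
    (assumed here: the record key, plus any partition field).
    """
    payload = [
        value
        for column, value in row.items()
        if not column.startswith("_hoodie_") and column not in retained_fields
    ]
    # A row with no payload columns at all is not treated as a delete.
    return bool(payload) and all(value is None for value in payload)


# Hypothetical rows as they might come back from an incremental query.
incremental_batch: List[Record] = [
    {"_hoodie_record_key": "a", "_hoodie_commit_time": "20200326170000",
     "id": "a", "name": None, "amount": None},
    {"_hoodie_record_key": "b", "_hoodie_commit_time": "20200326170000",
     "id": "b", "name": "widget", "amount": 5},
]

soft_deleted_keys = [
    row["_hoodie_record_key"]
    for row in incremental_batch
    if is_soft_deleted(row, retained_fields=frozenset({"id"}))
]
```

Hard deletes never reach this check, because the record is absent from the
incremental stream altogether.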

Joaquim S <jo...@gmail.com> wrote on Thursday, 26/03/2020 at 17:04:

> Thank you, Vinoth. Yup, will consider this option. If I can capture the
> soft deletes, I may make a change on my side to only use soft deletes for
> now (will see how much impact it creates; still figuring out Hudi...).

Re: Get all deletes after a specific commit time

Posted by Joaquim S <jo...@gmail.com>.
Thank you, Vinoth. Yup, will consider this option. If I can capture the soft
deletes, I may make a change on my side to only use soft deletes for now
(will see how much impact it creates; still figuring out Hudi...).

Vinoth Chandar <vi...@apache.org> wrote on Thursday, 26/03/2020 at 17:00:

> As a workaround, I am wondering if you can do this for now.
> Let's say you want the records deleted from commit c1 to now.
>
> - Do a snapshot query at commit c1
> - Do a snapshot query at latest commit
> - Diff the two
>
> It will be inefficient, since it has to scan all the data twice, but may
> work functionally?

Re: Get all deletes after a specific commit time

Posted by Vinoth Chandar <vi...@apache.org>.
As a workaround, I am wondering if you can do this for now.
Let's say you want the records deleted from commit c1 to now.

- Do a snapshot query at commit c1
- Do a snapshot query at latest commit
- Diff the two

It will be inefficient, since it has to scan all the data twice, but may
work functionally?
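
The three steps above can be sketched roughly as follows. This is a minimal
illustration in plain Python rather than Spark, under stated assumptions:
both snapshots are already loaded as lists of rows, rows are matched on
Hudi's `_hoodie_record_key` metadata column, and the helper name
`deleted_between` and the sample data are hypothetical. In Spark the same
diff would typically be expressed as a left anti join on the record key.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]

KEY_FIELD = "_hoodie_record_key"  # Hudi's record-key metadata column


def deleted_between(
    snapshot_at_c1: Iterable[Record],
    snapshot_latest: Iterable[Record],
    key_field: str = KEY_FIELD,
) -> List[Record]:
    """Return rows whose key exists in the c1 snapshot but not in the latest."""
    # Collect the keys that are still present in the latest snapshot.
    latest_keys = {row[key_field] for row in snapshot_latest}
    # Anything in the older snapshot without a surviving key was deleted.
    return [row for row in snapshot_at_c1 if row[key_field] not in latest_keys]


# Hypothetical snapshot at commit c1.
snapshot_c1 = [
    {"_hoodie_record_key": "a", "value": 1},
    {"_hoodie_record_key": "b", "value": 2},
    {"_hoodie_record_key": "c", "value": 3},
]
# Hypothetical snapshot at the latest commit: "b" is gone, "d" is new.
snapshot_now = [
    {"_hoodie_record_key": "a", "value": 10},
    {"_hoodie_record_key": "c", "value": 3},
    {"_hoodie_record_key": "d", "value": 4},
]

deleted = deleted_between(snapshot_c1, snapshot_now)
deleted_keys = [row[KEY_FIELD] for row in deleted]
```

Two caveats with this sketch: a key that was deleted and then re-inserted
after c1 is not reported, and soft-deleted records still carry their key in
the latest snapshot, so only hard deletes show up in the diff.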

On Thu, Mar 26, 2020 at 1:47 PM Vinoth Chandar <vi...@apache.org> wrote:

> Currently, soft deletes will show up in the incremental stream, while hard
> deletes will not.
>
> We are debating how to add this feature, since it has come up a few times
> recently.
>
> Maybe this can be a good discussion thread for that? :)

Re: Get all deletes after a specific commit time

Posted by Vinoth Chandar <vi...@apache.org>.
Currently, soft deletes will show up in the incremental stream, while hard
deletes will not.

We are debating how to add this feature, since it has come up a few times
recently.

Maybe this can be a good discussion thread for that? :)

On Thu, Mar 26, 2020 at 1:42 PM Joaquim S <jo...@gmail.com> wrote:

> Folks,
>
> I am looking at DMS integration and have a question.
>
> It is clear that the incremental queries only show incrementals ( :) ). I
> need to extract the deletes too, starting from a specific commit (ideally
> with Spark SQL or Hive).
>
> What am I missing? Is there another way to query the timeline to get the
> records that were deleted after a specific commit?
>
> Thank you!
>