Posted to dev@hudi.apache.org by Pratyaksh Sharma <pr...@gmail.com> on 2020/04/13 13:12:24 UTC

Manual deletion of a parquet file

Hi,

From my experience so far of working with Hudi, I understand that Hudi is
not designed to handle concurrent writes from 2 different sources, for
example 2 instances of HoodieDeltaStreamer simultaneously running and
writing to the same dataset. I have seen that such a case can result in
duplicate writes in the case of inserts. Also, once duplicates are written,
you are not sure which file a subsequent update will go to, since the record
is already present in 2 different parquet files. Please correct me if I am
wrong.
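
To illustrate, duplicates like these can be surfaced via Hudi's metadata
columns along the following lines (a minimal spark-shell sketch; the base
path, path glob and partition value are placeholders, and the datasource
name / glob depth may differ across Hudi versions and partition layouts):

    import org.apache.spark.sql.functions._

    // Hypothetical table location; adjust the glob to the partition depth.
    val basePath = "hdfs:///data/my_hudi_table"
    val df = spark.read.format("org.apache.hudi").load(s"$basePath/*/*/*")

    // Record keys that appear in more than one parquet file of a partition
    // are the duplicates described above.
    df.filter(col("_hoodie_partition_path") === "2020/04/13")
      .groupBy("_hoodie_record_key")
      .agg(countDistinct("_hoodie_file_name").as("files"), count(lit(1)).as("rows"))
      .filter(col("files") > 1)
      .show(20, false)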

Having experienced this in a few Hudi datasets, I now want to delete one of
the parquet files which contains duplicates in some partition of a COW type
Hudi dataset. I want to know whether deleting a parquet file manually can
have any repercussions, and if so, what the side effects of doing so might
be.

Any leads will be highly appreciated.

Re: Manual deletion of a parquet file

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Awesome.

https://issues.apache.org/jira/browse/HUDI-796 tracks this.

On Wed, Apr 15, 2020 at 3:17 AM Vinoth Chandar <vi...@apache.org> wrote:

> Okay makes sense.. I think we can raise a PR for the tool, integrated into
> the CLI..
> Then everyone can weigh in more as well ?
>
> Thanks for taking this up, Pratyaksh!
>
> On Tue, Apr 14, 2020 at 2:58 AM Pratyaksh Sharma <pr...@gmail.com>
> wrote:
>
> > Hi Vinoth,
> >
> > Thank you for your guidance.
> >
> > I went through the code for RepairsCommand in Hudi-cli package which
> > internally calls DedupeSparkJob.scala. The logic therein basically marks
> > the file as bad based on the commit time of records. However in my case
> > even the commit time is same for duplicates. The only thing that varies is
> > `_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this class will
> > not help me.
> >
> > IIUC the logic in DedupeSparkJob can only work when duplicates were created
> > due to INSERT operation. If we have UPDATE coming in for some duplicate
> > record, then both the files where that record is present will have the same
> > commit time henceforth. Such cases cannot be dealt with by considering
> > `_hoodie_commit_time` which is the same as I am experiencing.
> >
> > I have written one script to solve my use case. It is a no brainer where I
> > simply delete the duplicate keys and rewrite the file. Wanted to check if
> > it would add any value to our code base and if I should raise a PR for the
> > same. If the community agrees, then we can work together to further improve
> > it and make it generic enough.
> >
> > On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <vi...@apache.org>
> > wrote:
> >
> > > Hi Pratyaksh,
> > >
> > > Your understanding is correct. There is a duplicate fix tool in the cli (I
> > > wrote this a while ago for cow, but did use it in production few times for
> > > situations like these). Check that out? IIRC it will keep the both the
> > > commits and its files, but simply get rid of the duplicate records and
> > > replace parquet files in place.
> > >
> > > >> Also once duplicates are written, you
> > > are not sure of which file the update will go to next since the record is
> > > already present in 2 different parquet files.
> > >
> > > IIRC bloom index will tag both files and both will be updated.
> > >
> > > Table could show many side effects depending on when exactly the race
> > > happened.
> > >
> > > - the second commit may have rolled back the first inflight commit and
> > > mistaking it for a failed write. In this case, some data may also be
> > > missing. In this case though i expect first commit to actually fail since
> > > files got deleted midway into writing.
> > > - if both of them indeed succeeded, then then its just the duplicates
> > >
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <pratyaksh13@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > From my experience so far of working with Hudi, I understand that Hudi is
> > > > not designed to handle concurrent writes from 2 different sources for
> > > > example 2 instances of HoodieDeltaStreamer are simultaneously running and
> > > > writing to the same dataset. I have experienced such a case can result in
> > > > duplicate writes in case of inserts. Also once duplicates are written, you
> > > > are not sure of which file the update will go to next since the record is
> > > > already present in 2 different parquet files. Please correct me if I am
> > > > wrong.
> > > >
> > > > Having experienced this in few Hudi datasets, I now want to delete one of
> > > > the parquet files which contains duplicates in some partition of a COW
> > > > type Hudi dataset. I want to know if deleting a parquet file manually can
> > > > have any repercussions? If yes, what all can be the side effects of doing
> > > > the same?
> > > >
> > > > Any leads will be highly appreciated.
> > > >
> > >
> >
>

Re: Manual deletion of a parquet file

Posted by Vinoth Chandar <vi...@apache.org>.
Okay, makes sense. I think we can raise a PR for the tool, integrated into
the CLI. Then everyone can weigh in more as well?

Thanks for taking this up, Pratyaksh!

On Tue, Apr 14, 2020 at 2:58 AM Pratyaksh Sharma <pr...@gmail.com>
wrote:

> Hi Vinoth,
>
> Thank you for your guidance.
>
> I went through the code for RepairsCommand in Hudi-cli package which
> internally calls DedupeSparkJob.scala. The logic therein basically marks
> the file as bad based on the commit time of records. However in my case
> even the commit time is same for duplicates. The only thing that varies is
> `_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this class will
> not help me.
>
> IIUC the logic in DedupeSparkJob can only work when duplicates were created
> due to INSERT operation. If we have UPDATE coming in for some duplicate
> record, then both the files where that record is present will have the same
> commit time henceforth. Such cases cannot be dealt with by considering
> `_hoodie_commit_time` which is the same as I am experiencing.
>
> I have written one script to solve my use case. It is a no brainer where I
> simply delete the duplicate keys and rewrite the file. Wanted to check if
> it would add any value to our code base and if I should raise a PR for the
> same. If the community agrees, then we can work together to further improve
> it and make it generic enough.
>
> On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi Pratyaksh,
> >
> > Your understanding is correct. There is a duplicate fix tool in the cli (I
> > wrote this a while ago for cow, but did use it in production few times for
> > situations like these). Check that out? IIRC it will keep the both the
> > commits and its files, but simply get rid of the duplicate records and
> > replace parquet files in place.
> >
> > >> Also once duplicates are written, you
> > are not sure of which file the update will go to next since the record is
> > already present in 2 different parquet files.
> >
> > IIRC bloom index will tag both files and both will be updated.
> >
> > Table could show many side effects depending on when exactly the race
> > happened.
> >
> > - the second commit may have rolled back the first inflight commit and
> > mistaking it for a failed write. In this case, some data may also be
> > missing. In this case though i expect first commit to actually fail since
> > files got deleted midway into writing.
> > - if both of them indeed succeeded, then then its just the duplicates
> >
> >
> > Thanks
> > Vinoth
> >
> >
> >
> >
> >
> > On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <pr...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > From my experience so far of working with Hudi, I understand that Hudi is
> > > not designed to handle concurrent writes from 2 different sources for
> > > example 2 instances of HoodieDeltaStreamer are simultaneously running and
> > > writing to the same dataset. I have experienced such a case can result in
> > > duplicate writes in case of inserts. Also once duplicates are written, you
> > > are not sure of which file the update will go to next since the record is
> > > already present in 2 different parquet files. Please correct me if I am
> > > wrong.
> > >
> > > Having experienced this in few Hudi datasets, I now want to delete one of
> > > the parquet files which contains duplicates in some partition of a COW
> > > type Hudi dataset. I want to know if deleting a parquet file manually can
> > > have any repercussions? If yes, what all can be the side effects of doing
> > > the same?
> > >
> > > Any leads will be highly appreciated.
> > >
> >
>

Re: Manual deletion of a parquet file

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Vinoth,

Thank you for your guidance.

I went through the code for RepairsCommand in the hudi-cli package, which
internally calls DedupeSparkJob.scala. The logic therein basically marks a
file as bad based on the commit time of its records. However, in my case
even the commit time is the same for the duplicates. The only fields that
differ are `_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this
class will not help me.
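
To illustrate the pattern (a spark-shell sketch with a made-up partition
path and record key, not taken from the actual dataset):

    import org.apache.spark.sql.functions.col

    // Reads the partition's parquet files directly; with multiple file
    // slices present you would additionally restrict this to the latest
    // slices.
    val part = spark.read.parquet("hdfs:///data/my_hudi_table/2020/04/13")

    // Both copies of a duplicated key show the same _hoodie_commit_time but
    // different _hoodie_commit_seqno and _hoodie_file_name values.
    part.filter(col("_hoodie_record_key") === "some_duplicated_key")
        .select("_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_file_name")
        .show(false)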

IIUC, the logic in DedupeSparkJob can only work when duplicates were created
by an INSERT operation. If an UPDATE later comes in for a duplicated record,
then both files containing that record will carry the same commit time from
that point on. Such cases cannot be handled by comparing
`_hoodie_commit_time`, which is exactly what I am running into.

I have written a script to solve my use case. It is straightforward: I
simply drop the duplicate keys and rewrite the file. I wanted to check
whether it would add any value to our code base and whether I should raise a
PR for it. If the community agrees, we can work together to further improve
it and make it more generic.
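
For context, the gist of the script is roughly the following (a simplified
sketch, not the actual code: the paths are placeholders, it assumes the
partition has been cleaned down to one file slice per file group, and the
real version copies the repaired files back under the original file names):

    // Keep exactly one row per record key; which physical copy survives
    // does not matter because the duplicated payloads are identical.
    val dupPartition = "hdfs:///data/my_hudi_table/2020/04/13"
    val repairedOut  = "hdfs:///tmp/repaired/2020/04/13"

    val deduped = spark.read.parquet(dupPartition)
      .dropDuplicates("_hoodie_record_key")

    // Write the repaired data to a staging location, sanity check the row
    // counts, then replace the original parquet files with these.
    deduped.write.mode("overwrite").parquet(repairedOut)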

On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Pratyaksh,
>
> Your understanding is correct. There is a duplicate fix tool in the cli (I
> wrote this a while ago for cow, but did use it in production few times for
> situations like these). Check that out? IIRC it will keep the both the
> commits and its files, but simply get rid of the duplicate records and
> replace parquet files in place.
>
> >> Also once duplicates are written, you
> are not sure of which file the update will go to next since the record is
> already present in 2 different parquet files.
>
> IIRC bloom index will tag both files and both will be updated.
>
> Table could show many side effects depending on when exactly the race
> happened.
>
> - the second commit may have rolled back the first inflight commit and
> mistaking it for a failed write. In this case, some data may also be
> missing. In this case though i expect first commit to actually fail since
> files got deleted midway into writing.
> - if both of them indeed succeeded, then then its just the duplicates
>
>
> Thanks
> Vinoth
>
>
>
>
>
> On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <pr...@gmail.com>
> wrote:
>
> > Hi,
> >
> > From my experience so far of working with Hudi, I understand that Hudi is
> > not designed to handle concurrent writes from 2 different sources for
> > example 2 instances of HoodieDeltaStreamer are simultaneously running and
> > writing to the same dataset. I have experienced such a case can result in
> > duplicate writes in case of inserts. Also once duplicates are written, you
> > are not sure of which file the update will go to next since the record is
> > already present in 2 different parquet files. Please correct me if I am
> > wrong.
> >
> > Having experienced this in few Hudi datasets, I now want to delete one of
> > the parquet files which contains duplicates in some partition of a COW type
> > Hudi dataset. I want to know if deleting a parquet file manually can have
> > any repercussions? If yes, what all can be the side effects of doing the
> > same?
> >
> > Any leads will be highly appreciated.
> >
>

Re: Manual deletion of a parquet file

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Pratyaksh,

Your understanding is correct. There is a duplicate fix tool in the CLI (I
wrote it a while ago for COW, but did use it in production a few times for
situations like these). Check that out? IIRC it will keep both the commits
and their files, but simply get rid of the duplicate records and replace the
parquet files in place.

>> Also once duplicates are written, you
are not sure of which file the update will go to next since the record is
already present in 2 different parquet files.

IIRC the bloom index will tag both files, and both will be updated.

The table could show several side effects, depending on when exactly the
race happened:

- the second commit may have rolled back the first inflight commit,
mistaking it for a failed write. In this case, some data may also be
missing, though I would expect the first commit to actually fail since its
files got deleted midway through writing.
- if both of them indeed succeeded, then it's just the duplicates.
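
One way to tell which of these cases you hit is to look at the instants
under `.hoodie` (a sketch using the plain Hadoop FileSystem API; the base
path is a placeholder, and instant file extensions can vary a bit across
Hudi versions):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical base path of the affected table.
    val timelinePath = new Path("hdfs:///data/my_hudi_table/.hoodie")
    val fs = FileSystem.get(timelinePath.toUri, new Configuration())

    // Two ".commit" instants around the incident means both writes
    // completed (duplicates only); a ".rollback" instant points at the
    // first scenario above.
    fs.listStatus(timelinePath)
      .map(_.getPath.getName)
      .filter(n => n.endsWith(".commit") || n.endsWith(".inflight") || n.endsWith(".rollback"))
      .sorted
      .foreach(println)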


Thanks
Vinoth





On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <pr...@gmail.com>
wrote:

> Hi,
>
> From my experience so far of working with Hudi, I understand that Hudi is
> not designed to handle concurrent writes from 2 different sources for
> example 2 instances of HoodieDeltaStreamer are simultaneously running and
> writing to the same dataset. I have experienced such a case can result in
> duplicate writes in case of inserts. Also once duplicates are written, you
> are not sure of which file the update will go to next since the record is
> already present in 2 different parquet files. Please correct me if I am
> wrong.
>
> Having experienced this in few Hudi datasets, I now want to delete one of
> the parquet files which contains duplicates in some partition of a COW type
> Hudi dataset. I want to know if deleting a parquet file manually can have
> any repercussions? If yes, what all can be the side effects of doing the
> same?
>
> Any leads will be highly appreciated.
>