Posted to dev@hudi.apache.org by Pratyaksh Sharma <pr...@gmail.com> on 2019/11/19 09:38:54 UTC

Small clarification in Hoodie Cleaner flow

Hi,

The getDeletePaths() method in the cleaner flow makes the following
assumption for the KEEP_LATEST_COMMITS policy:

/**
 * Selects the versions for file for cleaning, such that it
 * <p>
 * - Leaves the latest version of the file untouched
 * - For older versions,
 *   - It leaves all the commits untouched which has occurred in last
 *     <code>config.getCleanerCommitsRetained()</code> commits
 *   - It leaves ONE commit before this window. We assume that the
 *     max(query execution time) ==
 *     commit_batch_time * config.getCleanerCommitsRetained().
 *     This is 12 hours by default. This is essential to leave the file
 *     used by the query that's running for the max time.
 * <p>
 * This provides the effect of having lookback into all changes that
 * happened in the last X commits. (eg: if you retain 24 commits, and
 * commit batch time is 30 mins, then you have 12 hrs of lookback)
 * <p>
 * This policy is the default.
 */
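
The arithmetic in the comment above can be checked with a short sketch. This
is only an illustration of the stated formula (lookback = commit_batch_time *
commits retained); the function name lookback_window is hypothetical and is
not part of Hudi.

```python
from datetime import timedelta


def lookback_window(commits_retained: int, commit_batch_time: timedelta) -> timedelta:
    """Approximate lookback implied by a count-based retention setting.

    Hypothetical helper (not a Hudi API). It assumes commits arrive at a
    steady interval of commit_batch_time, as the code comment does.
    """
    return commits_retained * commit_batch_time


# Example from the code comment: 24 commits at 30-minute intervals.
print(lookback_window(24, timedelta(minutes=30)))  # 12:00:00

# Same cadence with 10 commits retained.
print(lookback_window(10, timedelta(minutes=30)))  # 5:00:00

# 10 commits retained when a commit lands roughly every 8 minutes.
print(lookback_window(10, timedelta(minutes=8)))   # 1:20:00
```

With a fixed commit count, the lookback shrinks as commits become more
frequent, which is why the interval between commits matters when choosing
the count.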

I want to understand the term commit_batch_time in this assumption, and the
assumption as a whole. As I understand it, the term refers to the end-to-end
time taken by one iteration of DeltaSync (only about 7-8 minutes in my
case). If that is correct, this time will vary with the size of the incoming
RDD, so the assumed maximum query execution time is effectively a variable.
In that case, what is a safe value for
<code>config.getCleanerCommitsRetained()</code>?

Basically, I want to set <code>config.getCleanerCommitsRetained()</code>
properly for my Hudi instance, which is why I am trying to understand the
assumption. Its default value is 10; I want to know whether it can be
reduced further without causing any query to fail.

Please help me with this.

Regards
Pratyaksh

Re: Small clarification in Hoodie Cleaner flow

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Thank you for the clarification, Balaji. Now I understand it properly. :)

On Tue, Nov 19, 2019 at 11:17 PM Balaji Varadarajan <vb...@apache.org>
wrote:

> I updated the FAQ section to set the defaults correctly and added more
> information related to this:
>
> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo
>
> The cleaner retention configuration is count-based (the number of commits
> to retain), with the assumption that users provide a conservative number.
> Historically, ingestion ran at a fixed cadence (e.g., every 30 mins), with
> an ingestion run normally taking less than 30 mins. Under that model, it
> was simpler to express the configuration as a count of commits that
> approximates the retention time.
>
> With delta-streamer continuous mode, the next ingestion run can be
> scheduled immediately after the previous run completes. I think it would
> make sense to introduce time-based retention. I have created a newbie
> ticket for this: https://jira.apache.org/jira/browse/HUDI-349
>
> Pratyaksh, in sum, if the defaults are too low, use a conservative number
> based on the number of ingestion runs you see in your setup. The default
> referenced in the code comments needs to change (from 24 to 10):
> https://jira.apache.org/jira/browse/HUDI-350
>
> Thanks,
> Balaji.V

Re: Small clarification in Hoodie Cleaner flow

Posted by Balaji Varadarajan <vb...@apache.org>.
I updated the FAQ section to set the defaults correctly and added more
information related to this:
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo

The cleaner retention configuration is count-based (the number of commits
to retain), with the assumption that users provide a conservative number.
Historically, ingestion ran at a fixed cadence (e.g., every 30 mins), with
an ingestion run normally taking less than 30 mins. Under that model, it was
simpler to express the configuration as a count of commits that
approximates the retention time.

With delta-streamer continuous mode, the next ingestion run can be scheduled
immediately after the previous run completes. I think it would make sense to
introduce time-based retention. I have created a newbie ticket for this:
https://jira.apache.org/jira/browse/HUDI-349

Pratyaksh, in sum, if the defaults are too low, use a conservative number
based on the number of ingestion runs you see in your setup. The default
referenced in the code comments needs to change (from 24 to 10):
https://jira.apache.org/jira/browse/HUDI-350
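
One way to turn "a conservative number based on the number of ingestion
runs" into a figure is sketched below. It is a rough sizing aid, not a Hudi
API: the function name, its parameters, and the one-commit safety margin are
all assumptions made for illustration, and the one-hour query duration in
the example is invented.

```python
from datetime import timedelta


def commits_to_retain(
    longest_query: timedelta,
    shortest_commit_interval: timedelta,
    extra_commits: int = 1,
) -> int:
    """Estimate a commit count that covers the longest expected query.

    Hypothetical helper. It uses the shortest observed interval between
    commits so that the estimate errs on the conservative (larger) side,
    then adds extra_commits as an arbitrary safety margin.
    """
    if longest_query <= timedelta(0):
        raise ValueError("longest_query must be positive")
    if shortest_commit_interval <= timedelta(0):
        raise ValueError("shortest_commit_interval must be positive")
    if extra_commits < 0:
        raise ValueError("extra_commits must not be negative")
    # Exact ceiling division: timedelta // timedelta yields an int.
    commits_in_window = -((-longest_query) // shortest_commit_interval)
    return commits_in_window + extra_commits


# Assumed workload: queries run for up to 1 hour, and the fastest
# ingestion runs commit about every 7 minutes.
print(commits_to_retain(timedelta(hours=1), timedelta(minutes=7)))  # 10
```

If ingestion runs become faster, or queries run longer, the same calculation
yields a larger count, so the setting is worth revisiting whenever either
figure changes.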

Thanks,
Balaji.V
