Posted to dev@hudi.apache.org by Tanuj <ta...@gmail.com> on 2020/06/02 10:18:24 UTC

Suggestion needed - Hudi performance wrt no. and depth of partitions

Hi,
We have a requirement to ingest 30M records into S3 backed by Hudi. I am working out the partition strategy and am ending up with a very large number of partitions: 25M (primary partition) --> 2.5M (secondary partition) --> 2.5M (third partition), so each parquet file would hold fewer than 10 rows of data.

Our dataset will be ingested once in full and then updated incrementally with fewer than 1k updates per day, so it is more read-heavy than write-heavy.

So what would you suggest in terms of Hudi performance - go ahead with the above partition strategy, or reduce the number of partitions and increase the number of rows in each parquet file?
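To make the layout concrete, here is a rough sketch of such a write through the Hudi Spark datasource (column names, the precombine field and S3 paths below are placeholders, not our real schema, and the hudi-spark bundle is assumed to be on the classpath):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-partition-sketch").getOrCreate()

// Placeholder source; in reality this is whatever produces the 30M records.
val df = spark.read.parquet("s3://some-bucket/raw/records/")

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "records")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  // Three nested partition columns -> a directory per (id, sub_id, region) combination.
  // With ~25M distinct values at the first level alone, that is ~25M partition
  // directories, each holding a parquet file with fewer than 10 rows.
  .option("hoodie.datasource.write.partitionpath.field", "id,sub_id,region")
  .option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.ComplexKeyGenerator") // package of this class varies across Hudi versions
  .mode(SaveMode.Overwrite)
  .save("s3://some-bucket/hudi/records/")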

Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

Posted by tanu dua <ta...@gmail.com>.
Thanks a lot Vinoth for your suggestion. I will look into it.


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

Posted by Vinoth Chandar <vi...@apache.org>.
This is a good conversation. The ask for support of bucketed tables has not
actually come up much, since if you are looking up things at that
granularity, it almost feels like you are doing OLTP/database-like queries?

Assuming you hash the primary key into a hash that denotes the partition,
then a simple workaround is to always add a where clause using a UDF in
Presto, i.e. where key = 123 and partition = hash_udf(123)

But of course the downside is that your ops team needs to remember to add
the second partition clause (which is not very different from querying
large time-partitioned tables today).

Our mid-term plan is to build out column indexes (RFC-15 has the details,
if you are interested).
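As a rough sketch of that query-side idea, here is the equivalent with a Spark SQL UDF rather than Presto (in Presto the function would typically ship as a UDF plugin); the bucket count, table and column names are placeholders, and the function must match whatever hashing the writer used:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hash-predicate-sketch").getOrCreate()

// Placeholder bucket count - must be the same value the writer used.
val numBuckets = 128

// The UDF must reproduce the writer's hashing exactly, otherwise the partition
// predicate points at the wrong bucket and the query silently returns nothing.
spark.udf.register("hash_udf", (key: Int) => math.floorMod(key, numBuckets))

// The extra partition predicate lets the engine prune down to one bucket
// instead of listing and scanning every partition.
spark.sql(
  """SELECT *
    |FROM records            -- placeholder table name
    |WHERE id = 123
    |  AND bucket = hash_udf(123)""".stripMargin).show()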


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

Posted by tanu dua <ta...@gmail.com>.
If I need to plug in this hashing algorithm to resolve the partitions in
Presto and Hive, which code should I look into?


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

Posted by tanu dua <ta...@gmail.com>.
Yes, that's also on the cards, and for developers that's OK, but we need to
provide an interface for our ops people to run queries from Presto. So I
need to figure out how to calculate the hash when they fire a query on the
primary key. They can fire a query including the primary key along with
other fields. That is the only problem I see with hash partitions, and to
get it to work I believe I need to go deeper into the Presto Hudi plugin.


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

Posted by Jaimin Shah <sh...@gmail.com>.
Hi Tanu,

If your primary key is an integer, you can add one more field as a hash of
that integer and partition based on the hash field. It will add some
complexity to reads and writes because the hash has to be computed prior to
each read or write. I'm not sure whether the overhead of doing this exceeds
the performance gain from having fewer partitions. I wonder why Hudi
doesn't directly support hash-based partitions?

Thanks
Jaimin
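A minimal sketch of what this could look like at write time with the Spark datasource, assuming an integer key column named id and an arbitrary bucket count of 128 (both placeholders, along with the paths and the precombine field):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, pmod}

val spark = SparkSession.builder().appName("hash-bucket-sketch").getOrCreate()

// Placeholder bucket count: with 30M records, 128 buckets is roughly 230k rows each.
val numBuckets = 128

val df = spark.read.parquet("s3://some-bucket/raw/records/") // placeholder source
  // Derive the partition column from the integer primary key. Any deterministic
  // function works, as long as readers compute the same value when they filter.
  .withColumn("bucket", pmod(col("id"), lit(numBuckets)))

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "records")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.datasource.write.partitionpath.field", "bucket") // 128 partitions instead of millions
  .mode("append")
  .save("s3://some-bucket/hudi/records/")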


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

Posted by tanu dua <ta...@gmail.com>.
Thanks, Vinoth, for the detailed explanation. I was thinking along the same
lines and will take another look. We can reduce the 2nd and 3rd partitions,
but it's very difficult to reduce the 1st partition, as that is the basic
primary key of our domain model, which analysts and developers need to query
against almost 90% of the time; it's an integer primary key and can't be
decomposed further.


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

Posted by Vinoth Chandar <vi...@apache.org>.
Hi tanu,

For good query performance, it's recommended to write optimally sized files,
and Hudi already ensures that.

Generally speaking, if you have too many partitions, you also have too many
files. Most people limit their datasets to thousands of partitions, since
queries typically crunch data based on time or a business domain (e.g. city
for Uber). Partitioning too granularly - say, based on user_id - is not very
useful unless your queries only crunch per user. If you are using the Hive
metastore, then 25M partitions also mean 25M rows in your backing MySQL
metastore db - not very scalable.

What I am trying to say is: even outside of Hudi, if analytics is your use
case, it might be worth partitioning at a coarser granularity and increasing
the number of rows per parquet file.

Thanks
Vinoth
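To put rough numbers on it: at ~120MB per file, 30M records fit in perhaps a few dozen optimally sized base files (depending on record width), versus one tiny file per partition under a 25M-partition layout. A sketch of a coarser layout that leans on Hudi's file sizing follows; the partition column, sizes and paths are placeholders:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("coarse-partition-sketch").getOrCreate()

val df = spark.read.parquet("s3://some-bucket/raw/records/") // placeholder source

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "records")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  // One coarse time/business column keeps the partition count in the hundreds,
  // not the millions, so the metastore and file listings stay manageable.
  .option("hoodie.datasource.write.partitionpath.field", "ingest_date")
  // Hudi bin-packs records into ~120MB parquet files and routes new inserts into
  // existing files below the small-file limit where possible, so the daily 1k
  // updates do not keep creating new tiny files.
  .option("hoodie.parquet.max.file.size", 120L * 1024 * 1024)
  .option("hoodie.parquet.small.file.limit", 100L * 1024 * 1024)
  .mode(SaveMode.Overwrite)
  .save("s3://some-bucket/hudi/records/")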
