Posted to users@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2020/07/19 03:07:31 UTC

Illustration of how Hudi's file sizing/temporal layout help query performance

Hi all,

You might have heard this mentioned repeatedly in tickets: Hudi pays some
"tax" at write time to ensure that query performance is good.

These are conscious decisions we made while designing Uber's data lake for
scale, and they are sometimes not appreciated when optimizing a single
Spark job, for example.

So, I decided to write a small demo (all running on a MacBook, on some 50GB
of data) to show how impactful these are. Hopefully you find it useful.

TL;DR:
- Keeping data sorted by time speeds up temporal queries by 2-3x.
- A 20x reduction in file size can cause up to a 3-4x degradation in query
performance.

https://gist.github.com/vinothchandar/5544a92e616094c049f58c152faf0a53
https://gist.github.com/vinothchandar/d7fa1338cddfae68390afcdfe310f94e
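
For intuition on the first point, here is a toy sketch in plain Python (not
taken from the gists, and not Hudi or Spark code). It assumes the speedup
comes from skipping files whose min/max timestamp range falls outside the
query window, which is the kind of pruning Parquet column statistics allow.
The row counts, file sizes, and helper names are all made up for
illustration.

```python
import random


def build_files(timestamps, rows_per_file):
    """Split rows into fixed-size 'files'; keep only each file's (min, max) timestamp."""
    files = []
    for start in range(0, len(timestamps), rows_per_file):
        chunk = timestamps[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files


def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the query window [lo, hi]."""
    return sum(1 for fmin, fmax in files if fmax >= lo and fmin <= hi)


random.seed(42)

# Hypothetical data set: 100,000 rows with one timestamp each.
timestamps = list(range(100_000))
shuffled = timestamps[:]
random.shuffle(shuffled)

# Same rows and same file size; only the write order differs.
sorted_files = build_files(timestamps, rows_per_file=1_000)
unsorted_files = build_files(shuffled, rows_per_file=1_000)

# Temporal query: the most recent 10% of the time range.
lo, hi = 90_000, 99_999
sorted_scan = files_to_scan(sorted_files, lo, hi)
unsorted_scan = files_to_scan(unsorted_files, lo, hi)

print(f"files scanned when sorted by time: {sorted_scan} of {len(sorted_files)}")
print(f"files scanned when unsorted:       {unsorted_scan} of {len(unsorted_files)}")
```

With time-sorted writes, each file covers a narrow slice of time, so the
query only has to touch the 10 files that overlap its window. With shuffled
writes, nearly every file spans the whole time range and cannot be skipped.
This counts files only; it does not model actual query time.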


Now, is anyone interested in turning these into blog posts on
hudi.apache.org, referencing the right config names and showing our users
how to nail this? :)

Thanks
Vinoth

Re: Illustration of how Hudi's file sizing/temporal layout help query performance

Posted by Vinoth Chandar <vi...@apache.org>.

Btw, just to show that these principles are generally true for Parquet, I
used the vanilla spark.read.parquet() for illustration.

On Sat, Jul 18, 2020 at 8:07 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi all,
>
> You might have heard this repeatedly mentioned over tickets, when we talk
> about Hudi paying some "tax" during write time to ensure query performance
> is good.
>
> These are conscious decisions we made, designing Uber's data lake for
> scale. and sometimes these are not appreciated when trying to optimize
> single Spark jobs for e.g
>
> So, I decided to write a small demo (all working on a macbook, on some
> 50GB of data and show how impactful these are). Hopefully you find it
> useful.
>
> TL;DR :
> - Keeping data sorted by time helps temporal queries 2-3x speed up.
> - 20x reduction in file size can cause upto 3-4x degradation in query
> performance.
>
> https://gist.github.com/vinothchandar/5544a92e616094c049f58c152faf0a53
> https://gist.github.com/vinothchandar/d7fa1338cddfae68390afcdfe310f94e
>
>
> Now, is anyone interested in turning these into blogs on hudi.apache.org?
> :). referencing the right config names and showing our users how to nail
> this.
>
> Thanks
> Vinoth
>
