Posted to dev@hudi.apache.org by Satish Kotha <sa...@uber.com.INVALID> on 2020/12/23 06:23:13 UTC

[Announce] Clustering feature available in beta

Hello all,

The clustering feature has landed <https://github.com/apache/hudi/pull/2263> on
the master branch and is available in beta. It can be used to do the
following:
1) Stitch small files into larger files
2) Change the data layout on disk by sorting data on different columns (for
query/storage optimization)
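
As a rough illustration of the two use cases above, here is a toy sketch in plain Python. It is not Hudi's implementation and uses no Hudi API: the `cluster` function, the `target_records` parameter, and the sample records are all hypothetical, and "files" are modelled simply as lists of records.

```python
# Toy illustration only: this is NOT Hudi's implementation or API.
# A "file" is modelled as a list of records; all names here are hypothetical.
from typing import Dict, List

Record = Dict[str, object]


def cluster(files: List[List[Record]], sort_key: str,
            target_records: int) -> List[List[Record]]:
    """Merge the records of many small files, sort them by sort_key,
    and rewrite them as files holding up to target_records records each."""
    if target_records <= 0:
        raise ValueError("target_records must be positive")
    # 1) Stitch: pool the records of all input files together.
    merged = [rec for f in files for rec in f]
    # 2) Re-layout: sort on the chosen column so that each output file
    #    covers a contiguous range of that column's values.
    merged.sort(key=lambda rec: rec[sort_key])
    return [merged[i:i + target_records]
            for i in range(0, len(merged), target_records)]


small_files = [
    [{"city": "sf", "fare": 12}],
    [{"city": "nyc", "fare": 30}, {"city": "chi", "fare": 8}],
    [{"city": "la", "fare": 21}],
]
clustered = cluster(small_files, sort_key="city", target_records=3)
```

In this sketch three small files become two, and because each output file holds a contiguous range of `city` values, a query filtering on `city` could in principle skip files whose range does not match.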

If you are interested in the above use cases, I would appreciate it if you
could try out this feature. I have included commands to run clustering in this
section
<https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance#RFC19Clusteringdataforspeedandqueryperformance-Commandstoscheduleandrunclustering>
(along with caveats, as this feature is still in beta).

Any feedback is welcome. I'm also in the #general room on Slack. Please feel
free to ping me if you have any questions/comments.

Thanks
Satish

Re: [Announce] Clustering feature available in beta

Posted by Vinoth Chandar <vi...@apache.org>.
This is really promising! I think the gains will be much higher if the data
is clustered over a larger window of commits!
We can keep improving this over time.

I'll be sure to link the results to the doc updates.

On Wed, Jan 20, 2021 at 10:40 PM Satish Kotha <sa...@uber.com.invalid>
wrote:

> Hello everyone,
>
> We see ~60% improvement in query runtime for some datasets. See an example
> documented here
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation
> >.
> Please try out this feature and share any feedback.
> I have included commands to run async clustering in the example section
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation
> >.
> You could also setup inline clustering using commands in this section
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-Commandstoscheduleandrunclustering
> >
> .
>
> Thanks
> Satish
>
> On Tue, Dec 22, 2020 at 10:32 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Please help us test this more, before RC is cut! :)
> >
> > On Tue, Dec 22, 2020 at 10:23 PM Satish Kotha
> <satishkotha@uber.com.invalid
> > >
> > wrote:
> >
> > > Hello all,
> > >
> > > Clustering feature landed <https://github.com/apache/hudi/pull/2263>
> on
> > > master branch and is available in beta. This feature can be used to do
> > > following
> > > 1) Stitch small files into larger files
> > > 2) Change data layout on disk by sorting data using different columns
> > (for
> > > query/storage optimization)
> > >
> > > If you are interested in the above use cases, appreciate it if you can
> > try
> > > out this feature. I have included commands to run clustering in this
> > > section
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance#RFC19Clusteringdataforspeedandqueryperformance-Commandstoscheduleandrunclustering
> > > >
> > > (along
> > > with caveats as this feature is still in beta).
> > >
> > > Any feedback is welcome. I'm also on #general room in slack. Please
> feel
> > > free to ping me if you have any questions/comments.
> > >
> > > Thanks
> > > Satish
> > >
> >
>

Re: [Announce] Clustering feature available in beta

Posted by Satish Kotha <sa...@uber.com.INVALID>.
Hello everyone,

We see a ~60% improvement in query runtime for some datasets. See an example
documented here
<https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation>.
Please try out this feature and share any feedback.
I have included commands to run async clustering in the example section
<https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation>.
You could also set up inline clustering using the commands in this section
<https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-Commandstoscheduleandrunclustering>.

Thanks
Satish

On Tue, Dec 22, 2020 at 10:32 PM Vinoth Chandar <vi...@apache.org> wrote:

> Please help us test this more, before RC is cut! :)
>
> On Tue, Dec 22, 2020 at 10:23 PM Satish Kotha <satishkotha@uber.com.invalid
> >
> wrote:
>
> > Hello all,
> >
> > Clustering feature landed <https://github.com/apache/hudi/pull/2263> on
> > master branch and is available in beta. This feature can be used to do
> > following
> > 1) Stitch small files into larger files
> > 2) Change data layout on disk by sorting data using different columns
> (for
> > query/storage optimization)
> >
> > If you are interested in the above use cases, appreciate it if you can
> try
> > out this feature. I have included commands to run clustering in this
> > section
> > <
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance#RFC19Clusteringdataforspeedandqueryperformance-Commandstoscheduleandrunclustering
> > >
> > (along
> > with caveats as this feature is still in beta).
> >
> > Any feedback is welcome. I'm also on #general room in slack. Please feel
> > free to ping me if you have any questions/comments.
> >
> > Thanks
> > Satish
> >
>

Re: [Announce] Clustering feature available in beta

Posted by Vinoth Chandar <vi...@apache.org>.
Please help us test this more before the RC is cut! :)

On Tue, Dec 22, 2020 at 10:23 PM Satish Kotha <sa...@uber.com.invalid>
wrote:

> Hello all,
>
> Clustering feature landed <https://github.com/apache/hudi/pull/2263> on
> master branch and is available in beta. This feature can be used to do
> following
> 1) Stitch small files into larger files
> 2) Change data layout on disk by sorting data using different columns (for
> query/storage optimization)
>
> If you are interested in the above use cases, appreciate it if you can try
> out this feature. I have included commands to run clustering in this
> section
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance#RFC19Clusteringdataforspeedandqueryperformance-Commandstoscheduleandrunclustering
> >
> (along
> with caveats as this feature is still in beta).
>
> Any feedback is welcome. I'm also on #general room in slack. Please feel
> free to ping me if you have any questions/comments.
>
> Thanks
> Satish
>
