Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2021/11/03 16:33:22 UTC

Re: feature request/proposal: leverage bloom indexes for reading

Hi,

You are right about the datasource API. This is one of the mismatches that
prevents us from exposing this more nicely.

We are definitely going the route of having a select query take hints and
use the index for faster lookups. We could try this in 0.11, once the new
multi-modal indexing lands.

For now, you can actually use the HoodieReadClient; it will work fine. I think
we used it internally at Uber.
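
Roughly, from a spark-shell, it could look like the untested Scala sketch below.
The constructor and method names (checkExists, filterExists) should be
double-checked against the HoodieReadClient class linked further down in this
thread; the base path and keys are placeholders.

import org.apache.hudi.client.HoodieReadClient
import org.apache.hudi.client.common.HoodieSparkEngineContext
import org.apache.hudi.common.model.HoodieKey
import org.apache.spark.api.java.JavaSparkContext

// Placeholder table location and keys; a HoodieKey is (record key, partition path).
val basePath = "s3://bucket/path/to/hudi_table"
val jsc = new JavaSparkContext(spark.sparkContext)
val readClient = new HoodieReadClient(new HoodieSparkEngineContext(jsc), basePath)
val keys = jsc.parallelize(java.util.Arrays.asList(
  new HoodieKey("user_123", "2021/10/01"),
  new HoodieKey("user_123", "2021/10/02")))

// Index lookup only: the bloom index maps each key to the file that may contain it,
// so only those files need to be read afterwards, instead of scanning the whole table.
val keyToFile = readClient.checkExists(keys)
keyToFile.collect()

filterExists() in the same class works on HoodieRecords instead of keys and is
the method I suggested timing as a proof of concept.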


On Thu, Oct 28, 2021 at 10:22 AM Nicolas Paris <ni...@riseup.net>
wrote:

> I tested the HoodieReadClient. It's a great start indeed. It looks like
> this client is meant for testing purposes and needs some enhancement. I
> will try to produce some general-purpose code around this and, who knows,
> contribute it.
>
> I guess the datasource API is not the best candidate, since Hudi keys
> would more naturally come as an RDD or dataframe than be crammed into an
> option string:
>
> spark.read.format('hudi').option('hudi.filter.keys',
> 'a,flat,list,of,keys,not,really,cool').load(...)
>
> There is also the option to introduce a new Hudi operation such as
> "select". But again, it is not supposed to return a dataframe, only to
> write to the Hudi table:
>
> df_hudi_keys.write.format('hudi').options(**hudi_options).save(...)
>
> Then a fully featured / documented Hoodie client is maybe the best option.
>
>
> Thoughts?
>
>
> On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
> > Sounds great!
> >
> > On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris <ni...@riseup.net>
> > wrote:
> >
> > > Hi Vinoth,
> > >
> > > Thanks for the starter. Definitely, once the new way to manage indexes
> > > lands and we get migrated to Hudi on our datalake, I'd be glad to give
> > > this a shot.
> > >
> > >
> > > Regards, Nicolas
> > >
> > > On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > > > Hi Nicolas,
> > > >
> > > > Thanks for raising this! I think it's a very valid ask.
> > > > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> > > >
> > > > As a proof of concept, would you be able to give filterExists() a shot
> > > > and see if the filtering time improves?
> > > >
> > > > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> > > >
> > > > In the upcoming 0.10.0 release, we are planning to move the bloom filters
> > > > out to a partition on the metadata table, to speed this up even for very
> > > > large tables.
> > > > https://issues.apache.org/jira/browse/HUDI-1295
> > > >
> > > > Please let us know if you are interested in testing that when the PR is
> > > > up.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris <nicolas.paris@riseup.net>
> > > > wrote:
> > > >
> > > > > hi !
> > > > >
> > > > > In my use case, for GDPR I have to export all information about a given
> > > > > user from several huge Hudi tables. Filtering the tables results in a
> > > > > full scan of around 10 hours, and this will get worse year after year.
> > > > >
> > > > > Since the filter criterion is based on the bloom key (user_id), it would
> > > > > be handy to exploit the bloom filters and produce a temporary table (in
> > > > > the metastore, for example) with the resulting rows.
> > > > >
> > > > > So far, bloom indexing is used for update/delete operations on a Hudi
> > > > > table.
> > > > >
> > > > > 1. There is an opportunity to exploit the bloom index for select
> > > > > operations. The Hudi options would be (illustrated just below):
> > > > > operation: select
> > > > > result-table: <table name>
> > > > > result-path: <s3 path|hdfs path>
> > > > > result-schema: <table schema in metastore> (optional; when empty, no
> > > > > sync with the HMS, only the raw path)
> > > > >
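> > > > > Purely to illustrate the shape of option 1 (hypothetical, untested Scala;
> > > > > none of these option values exist in Hudi today, they only mirror the
> > > > > proposal above, and dfUserKeys stands for a dataframe of keys to look up):
> > > > >
> > > > > dfUserKeys.write.format("hudi")
> > > > >   .option("operation", "select")                         // proposed operation
> > > > >   .option("result-table", "gdpr_export")                 // proposed option
> > > > >   .option("result-path", "s3://bucket/tmp/gdpr_export")  // proposed option
> > > > >   .save("s3://bucket/path/to/huge_hudi_table")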
> > > > >
> > > > > 2. It could also be implemented as predicate pushdown in the Spark
> > > > > datasource API, when filtering with an IN statement.
> > > > >
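> > > > > User-facing, option 2 would just be a normal datasource read with an IN
> > > > > filter on the key column. Today that ends up as a full scan; the idea is
> > > > > that Hudi could first prune files via the bloom index. Untested Scala
> > > > > sketch, with placeholder path, column name and keys:
> > > > >
> > > > > import org.apache.spark.sql.functions.col
> > > > >
> > > > > val userIds = Seq("user_123", "user_456")
> > > > > val matches = spark.read.format("hudi")
> > > > >   .load("s3://bucket/path/to/huge_hudi_table")
> > > > >   .filter(col("user_id").isin(userIds: _*))   // candidate for index pushdown
> > > > > matches.write.saveAsTable("gdpr_export")      // e.g. a temporary table in the metastore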
> > > > >
> > > > > Thoughts?
> > > > >
> > >
> > >
>
>