You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by Marco Colli <co...@gmail.com> on 2023/06/06 09:06:28 UTC

Bloom filters for full-text search and predicate pushdown

Hello,

I see that Parquet already supports Bloom filters.

For my understanding, it currently uses them only on the entire value.

Fo example, if I have a column "MovieTitle":

- "The title of my movie"
- "Another movie title"
- "The best movie title"
- ...

Then the current Bloom filters can be used to find only the column
chunks/pages that match an exact title. For example you can use the bloom
filter to search for "The best movie title".

It would be interesting to have *a bloom filter on the specific words*,
instead of using the entire value: in this way you can search the word
"best" in the "MovieTitle" column and find the titles that contain that
specific word in an efficient way.

It would enable a sort of full-text search of keywords inside text columns.
It would also allow predicate pushdown for searches based on keywords.

Would make sense to have such an addition? Is there any strategy already
used by Parquet for fast keyword searches inside text columns?


Best regards,
Marco Colli
AbstractBrain srls

Re: Bloom filters for full-text search and predicate pushdown

Posted by Micah Kornfield <em...@gmail.com>.

You probably need to be more specific on which language bindings you are
using.  I think the C++ community is just starting to work on being able to
write out bloom filters (so it isn't supported in C++, Python and R, Ruby,
etc).

The way I read the specification, yes each single value should be added to
the bloom filter independently (but it could also be that this is a gray
area where repeated fields were not considered).



On Wed, Jun 7, 2023 at 12:38 PM Marco Colli <co...@gmail.com> wrote:

> @Micah Does that mean that columns of type array already get a bloom filter
> on each single value?
> I am using Apache Arrow in particular to deal with Parquet files
>
> Il Mer 7 Giu 2023, 16:00 Micah Kornfield <em...@gmail.com> ha
> scritto:
>
> > Hi Marco,
> > Could you describe how your proposal differs from tokenizing the target
> > string and storing the list of tokens in a column that has a bloom filter
> > attached?  I think this should be supportable today by the format at
> least
> > if not existing libraries.
> >
> > Thanks,
> > Micah
> >
> > On Wednesday, June 7, 2023, Gang Wu <us...@gmail.com> wrote:
> >
> > > Hi Marco,
> > >
> > > That sounds interesting!
> > >
> > > However, this requires the parquet implementation to be able to
> tokenize
> > > both
> > > strings to write and literals in the filters. The actual efficiency
> > depends
> > > on the
> > > data distribution. I am also concerned with the possible explosion of
> > > distinct
> > > values introduced by splitting words, which may result in a large bloom
> > > filter.
> > >
> > > Have you tried any PoC to get a rough estimate of benefits in your use
> > > case?
> > >
> > > Best,
> > > Gang
> > >
> > >
> > >
> > > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <co...@gmail.com>
> > wrote:
> > >
> > > > Hello,
> > > >
> > > > I see that Parquet already supports Bloom filters.
> > > >
> > > > For my understanding, it currently uses them only on the entire
> value.
> > > >
> > > > Fo example, if I have a column "MovieTitle":
> > > >
> > > > - "The title of my movie"
> > > > - "Another movie title"
> > > > - "The best movie title"
> > > > - ...
> > > >
> > > > Then the current Bloom filters can be used to find only the column
> > > > chunks/pages that match an exact title. For example you can use the
> > bloom
> > > > filter to search for "The best movie title".
> > > >
> > > > It would be interesting to have *a bloom filter on the specific
> words*,
> > > > instead of using the entire value: in this way you can search the
> word
> > > > "best" in the "MovieTitle" column and find the titles that contain
> that
> > > > specific word in an efficient way.
> > > >
> > > > It would enable a sort of full-text search of keywords inside text
> > > columns.
> > > > It would also allow predicate pushdown for searches based on
> keywords.
> > > >
> > > > Would make sense to have such an addition? Is there any strategy
> > already
> > > > used by Parquet for fast keyword searches inside text columns?
> > > >
> > > >
> > > > Best regards,
> > > > Marco Colli
> > > > AbstractBrain srls
> > > >
> > >
> >
>

Re: Bloom filters for full-text search and predicate pushdown

Posted by Marco Colli <co...@gmail.com>.

@Micah Does that mean that columns of type array already get a bloom filter
on each single value?
I am using Apache Arrow in particular to deal with Parquet files

Il Mer 7 Giu 2023, 16:00 Micah Kornfield <em...@gmail.com> ha scritto:

> Hi Marco,
> Could you describe how your proposal differs from tokenizing the target
> string and storing the list of tokens in a column that has a bloom filter
> attached?  I think this should be supportable today by the format at least
> if not existing libraries.
>
> Thanks,
> Micah
>
> On Wednesday, June 7, 2023, Gang Wu <us...@gmail.com> wrote:
>
> > Hi Marco,
> >
> > That sounds interesting!
> >
> > However, this requires the parquet implementation to be able to tokenize
> > both
> > strings to write and literals in the filters. The actual efficiency
> depends
> > on the
> > data distribution. I am also concerned with the possible explosion of
> > distinct
> > values introduced by splitting words, which may result in a large bloom
> > filter.
> >
> > Have you tried any PoC to get a rough estimate of benefits in your use
> > case?
> >
> > Best,
> > Gang
> >
> >
> >
> > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <co...@gmail.com>
> wrote:
> >
> > > Hello,
> > >
> > > I see that Parquet already supports Bloom filters.
> > >
> > > For my understanding, it currently uses them only on the entire value.
> > >
> > > Fo example, if I have a column "MovieTitle":
> > >
> > > - "The title of my movie"
> > > - "Another movie title"
> > > - "The best movie title"
> > > - ...
> > >
> > > Then the current Bloom filters can be used to find only the column
> > > chunks/pages that match an exact title. For example you can use the
> bloom
> > > filter to search for "The best movie title".
> > >
> > > It would be interesting to have *a bloom filter on the specific words*,
> > > instead of using the entire value: in this way you can search the word
> > > "best" in the "MovieTitle" column and find the titles that contain that
> > > specific word in an efficient way.
> > >
> > > It would enable a sort of full-text search of keywords inside text
> > columns.
> > > It would also allow predicate pushdown for searches based on keywords.
> > >
> > > Would make sense to have such an addition? Is there any strategy
> already
> > > used by Parquet for fast keyword searches inside text columns?
> > >
> > >
> > > Best regards,
> > > Marco Colli
> > > AbstractBrain srls
> > >
> >
>

Re: Bloom filters for full-text search and predicate pushdown

Posted by Micah Kornfield <em...@gmail.com>.

Hi Marco,
Could you describe how your proposal differs from tokenizing the target
string and storing the list of tokens in a column that has a bloom filter
attached?  I think this should be supportable today by the format at least
if not existing libraries.

Thanks,
Micah

On Wednesday, June 7, 2023, Gang Wu <us...@gmail.com> wrote:

> Hi Marco,
>
> That sounds interesting!
>
> However, this requires the parquet implementation to be able to tokenize
> both
> strings to write and literals in the filters. The actual efficiency depends
> on the
> data distribution. I am also concerned with the possible explosion of
> distinct
> values introduced by splitting words, which may result in a large bloom
> filter.
>
> Have you tried any PoC to get a rough estimate of benefits in your use
> case?
>
> Best,
> Gang
>
>
>
> On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <co...@gmail.com> wrote:
>
> > Hello,
> >
> > I see that Parquet already supports Bloom filters.
> >
> > For my understanding, it currently uses them only on the entire value.
> >
> > Fo example, if I have a column "MovieTitle":
> >
> > - "The title of my movie"
> > - "Another movie title"
> > - "The best movie title"
> > - ...
> >
> > Then the current Bloom filters can be used to find only the column
> > chunks/pages that match an exact title. For example you can use the bloom
> > filter to search for "The best movie title".
> >
> > It would be interesting to have *a bloom filter on the specific words*,
> > instead of using the entire value: in this way you can search the word
> > "best" in the "MovieTitle" column and find the titles that contain that
> > specific word in an efficient way.
> >
> > It would enable a sort of full-text search of keywords inside text
> columns.
> > It would also allow predicate pushdown for searches based on keywords.
> >
> > Would make sense to have such an addition? Is there any strategy already
> > used by Parquet for fast keyword searches inside text columns?
> >
> >
> > Best regards,
> > Marco Colli
> > AbstractBrain srls
> >
>

Re: Bloom filters for full-text search and predicate pushdown

Posted by Xinli shang <sh...@uber.com.INVALID>.

Hi Marco,

This is an exciting idea! You think about more use cases of Parquet! As an
open community, we always welcome new ideas and innovations like yours.  I
encourage you to go deeper and broader with this idea and come up with a
proposal and POC. Today, generative AI came to reality. In addition to the
keyword search, you can think of other things like OpenAI embeddings. Maybe
later Parquet filters can do matches based on the closeness of two
embedding vectors.

With that said, other people's comments are also valid that Parquet is a
strict file format and we need to standardization. So we look forward to
your proposal and POC. If you want to come to discuss this week's sync
meeting, you are more than welcome.  I added you.

Xinli Shang

On Thu, Jun 15, 2023 at 4:38 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hi,
>
> This would require standardizing on a specific tokenization algorithm,
> right? I'm not sure it's a good idea to add such complexity to the
> Parquet spec (the tokenization might need to be language-specific
> and/or corpus-specific).
>
> I wonder if it would be more productive to try and find ways to build
> e.g. a Lucene index over Parquet columns (perhaps it's already
> possible?).
>
> Regards
>
> Antoine.
>
>
>
> On Wed, 7 Jun 2023 18:01:32 +0800
> Gang Wu <us...@gmail.com> wrote:
> > Hi Marco,
> >
> > That sounds interesting!
> >
> > However, this requires the parquet implementation to be able to tokenize
> > both
> > strings to write and literals in the filters. The actual efficiency
> depends
> > on the
> > data distribution. I am also concerned with the possible explosion of
> > distinct
> > values introduced by splitting words, which may result in a large bloom
> > filter.
> >
> > Have you tried any PoC to get a rough estimate of benefits in your use
> case?
> >
> > Best,
> > Gang
> >
> >
> >
> > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <
> collimarco91-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >
> > > Hello,
> > >
> > > I see that Parquet already supports Bloom filters.
> > >
> > > For my understanding, it currently uses them only on the entire value.
> > >
> > > Fo example, if I have a column "MovieTitle":
> > >
> > > - "The title of my movie"
> > > - "Another movie title"
> > > - "The best movie title"
> > > - ...
> > >
> > > Then the current Bloom filters can be used to find only the column
> > > chunks/pages that match an exact title. For example you can use the
> bloom
> > > filter to search for "The best movie title".
> > >
> > > It would be interesting to have *a bloom filter on the specific words*,
> > > instead of using the entire value: in this way you can search the word
> > > "best" in the "MovieTitle" column and find the titles that contain that
> > > specific word in an efficient way.
> > >
> > > It would enable a sort of full-text search of keywords inside text
> columns.
> > > It would also allow predicate pushdown for searches based on keywords.
> > >
> > > Would make sense to have such an addition? Is there any strategy
> already
> > > used by Parquet for fast keyword searches inside text columns?
> > >
> > >
> > > Best regards,
> > > Marco Colli
> > > AbstractBrain srls
> > >
> >
>
>
>
>

-- 
Xinli Shang

Re: Bloom filters for full-text search and predicate pushdown

Posted by Antoine Pitrou <an...@python.org>.

Hi,

This would require standardizing on a specific tokenization algorithm,
right? I'm not sure it's a good idea to add such complexity to the
Parquet spec (the tokenization might need to be language-specific
and/or corpus-specific).

I wonder if it would be more productive to try and find ways to build
e.g. a Lucene index over Parquet columns (perhaps it's already
possible?).

Regards

Antoine.



On Wed, 7 Jun 2023 18:01:32 +0800
Gang Wu <us...@gmail.com> wrote:
> Hi Marco,
> 
> That sounds interesting!
> 
> However, this requires the parquet implementation to be able to tokenize
> both
> strings to write and literals in the filters. The actual efficiency depends
> on the
> data distribution. I am also concerned with the possible explosion of
> distinct
> values introduced by splitting words, which may result in a large bloom
> filter.
> 
> Have you tried any PoC to get a rough estimate of benefits in your use case?
> 
> Best,
> Gang
> 
> 
> 
> On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <co...@public.gmane.org> wrote:
> 
> > Hello,
> >
> > I see that Parquet already supports Bloom filters.
> >
> > For my understanding, it currently uses them only on the entire value.
> >
> > Fo example, if I have a column "MovieTitle":
> >
> > - "The title of my movie"
> > - "Another movie title"
> > - "The best movie title"
> > - ...
> >
> > Then the current Bloom filters can be used to find only the column
> > chunks/pages that match an exact title. For example you can use the bloom
> > filter to search for "The best movie title".
> >
> > It would be interesting to have *a bloom filter on the specific words*,
> > instead of using the entire value: in this way you can search the word
> > "best" in the "MovieTitle" column and find the titles that contain that
> > specific word in an efficient way.
> >
> > It would enable a sort of full-text search of keywords inside text columns.
> > It would also allow predicate pushdown for searches based on keywords.
> >
> > Would make sense to have such an addition? Is there any strategy already
> > used by Parquet for fast keyword searches inside text columns?
> >
> >
> > Best regards,
> > Marco Colli
> > AbstractBrain srls
> >  
>

Re: Bloom filters for full-text search and predicate pushdown

Posted by Gang Wu <us...@gmail.com>.

Hi Marco,

That sounds interesting!

However, this requires the parquet implementation to be able to tokenize
both
strings to write and literals in the filters. The actual efficiency depends
on the
data distribution. I am also concerned with the possible explosion of
distinct
values introduced by splitting words, which may result in a large bloom
filter.

Have you tried any PoC to get a rough estimate of benefits in your use case?

Best,
Gang

On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <co...@gmail.com> wrote:

> Hello,
>
> I see that Parquet already supports Bloom filters.
>
> For my understanding, it currently uses them only on the entire value.
>
> Fo example, if I have a column "MovieTitle":
>
> - "The title of my movie"
> - "Another movie title"
> - "The best movie title"
> - ...
>
> Then the current Bloom filters can be used to find only the column
> chunks/pages that match an exact title. For example you can use the bloom
> filter to search for "The best movie title".
>
> It would be interesting to have *a bloom filter on the specific words*,
> instead of using the entire value: in this way you can search the word
> "best" in the "MovieTitle" column and find the titles that contain that
> specific word in an efficient way.
>
> It would enable a sort of full-text search of keywords inside text columns.
> It would also allow predicate pushdown for searches based on keywords.
>
> Would make sense to have such an addition? Is there any strategy already
> used by Parquet for fast keyword searches inside text columns?
>
>
> Best regards,
> Marco Colli
> AbstractBrain srls
>