Posted to user@spark.apache.org by Breno Arosa <br...@edumobi.com.br> on 2020/09/23 14:58:14 UTC

Bloom Filter to filter huge dataframes with PySpark

Hello,

I need to filter one huge table using other huge tables.
I could not avoid a sort operation when using `WHERE IN` or `INNER JOIN`.
Can this be avoided?

As I'm OK with false positives, maybe a Bloom filter is an alternative.
I saw that Scala has a built-in DataFrame function 
(https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) 
but there is no corresponding function in PySpark.
Is there any way to call it from within PySpark?
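The closest thing I could imagine is reaching the JVM-side function through the DataFrame's underlying `_jdf` object via py4j, roughly like the untested sketch below (I'm not sure this is a supported path); even if it works, the resulting filter seems to be a driver-side Java object, so it's not obvious how to use it to filter another huge dataframe on the executors.

    # Rough, untested sketch: call the Scala DataFrameStatFunctions.bloomFilter
    # through the DataFrame's underlying _jdf handle (py4j).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    tag_a_ids = spark.table("item_tags").where("tag = 'a'").select("item_id")

    # bloomFilter(colName, expectedNumItems, fpp) on the JVM side.
    jbf = tag_a_ids._jdf.stat().bloomFilter("item_id", 100 * 1000 * 1000, 0.01)

    # Membership checks work on the driver, but I don't see how to ship
    # this Java object into a Python UDF to filter the big table.
    jbf.mightContain(12345)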

I'm using Spark 2.4.3.

Thanks,
Breno Arosa.

PS:
I'm not sharing the real query, but here is a very similar use case; 
consider that both `items` and `item_tags` are huge.

    WITH items_tag_a AS (
         SELECT id
         FROM items
         WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'a')
    ),

    items_tag_b AS (
         SELECT id
         FROM items
         WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'b')
    ),

    items_tag_c AS (
         SELECT id
         FROM items
         WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'c')
    )

    SELECT id
    FROM items_tag_a
    WHERE id IN (SELECT id FROM items_tag_b)
    AND id IN (SELECT id FROM items_tag_c)
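For reference, the DataFrame-API version of this would look roughly like the simplified sketch below (names match the SQL above); since both sides are huge, each of these joins ends up as a sort-merge join, which is the sort I would like to avoid.

    # Simplified DataFrame-API equivalent of the SQL above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    items = spark.table("items")
    item_tags = spark.table("item_tags")

    def ids_with_tag(tag):
        # `WHERE id IN (SELECT item_id ...)` expressed as a left-semi join;
        # with two huge inputs Spark falls back to a sort-merge join.
        tagged = item_tags.where(item_tags.tag == tag).select("item_id")
        return items.join(tagged, items.id == tagged.item_id, "left_semi").select("id")

    result = (ids_with_tag("a")
              .join(ids_with_tag("b"), "id", "left_semi")
              .join(ids_with_tag("c"), "id", "left_semi"))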