Posted to user@spark.apache.org by Breno Arosa <br...@edumobi.com.br> on 2020/09/23 14:58:14 UTC
Bloom Filter to filter huge dataframes with PySpark
Hello,
I need to filter one huge table against several other huge tables.
I could not avoid a sort/shuffle when using `WHERE IN` or `INNER JOIN`.
Can this be avoided?
Since I'm OK with false positives, maybe a Bloom filter is an alternative.
I saw that the Scala API has a built-in DataFrame function
(https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions)
but there is no corresponding function in PySpark.
Is there any way to call this from PySpark?
I'm using Spark 2.4.3.
Thanks,
Breno Arosa.
PS:
I'm not sharing the real query, but here is a very similar use case;
consider that both `items` and `item_tags` are huge.
WITH items_tag_a AS (
SELECT id
FROM items
WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'a')
),
items_tag_b AS (
SELECT id
FROM items
WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'b')
),
items_tag_c AS (
SELECT id
FROM items
WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'c')
)
SELECT id
FROM items_tag_a
WHERE id IN (SELECT id FROM items_tag_b)
AND id IN (SELECT id FROM items_tag_c)
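For what it's worth, one workaround I've seen suggested is reaching the JVM object through py4j, e.g. `items._jdf.stat().bloomFilter("id", expected_num_items, fpp)` -- but that handle only lives on the driver, so to filter on executors the filter would have to be rebuilt in Python and broadcast. A minimal pure-Python sketch of such a broadcastable filter (the class, method names, and sizing formulas here are my own illustration, not Spark's API):

```python
import hashlib
import math

class BloomFilter:
    """Minimal pure-Python Bloom filter sketch (hypothetical, not a PySpark API)."""

    def __init__(self, expected_items, fpp):
        # Standard sizing formulas: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes.
        self.num_bits = max(1, int(-expected_items * math.log(fpp) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = 0  # a big int used as a bit array, so the object pickles cleanly

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing:
        # pos_i = (h1 + i*h2) mod m
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def put(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # No false negatives; false positives at roughly the configured fpp.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

The idea would then be to build one filter per tag from the (smaller) distinct `item_id` sets, broadcast it with `spark.sparkContext.broadcast(bf)`, and filter `items` with a UDF calling `might_contain` -- avoiding the shuffle at the cost of some false positives.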