You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/28 13:19:38 UTC

[GitHub] [spark] wangyum opened a new pull request, #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

wangyum opened a new pull request, #37697:
URL: https://github.com/apache/spark/pull/37697

   ### What changes were proposed in this pull request?
   
   This PR makes `BloomFilterAggregate` use larger number of bits to build Bloom filter.
   
   ### Why are the changes needed?
   To fix Bloom filter join cannot filter out more data when CBO is enabled. For example: TPC-DS q64:
   
   CBO enabled | CBO disabled
   -- | --
   <img width="200" alt="image" src="https://user-images.githubusercontent.com/5399861/187075448-747d2fae-9cf5-427b-883c-0bb84ea86570.png"> | <img width="250" alt="image" src="https://user-images.githubusercontent.com/5399861/187075395-8d9ed329-d1a4-472c-81b6-5412d69e2814.png">
   <img width="532" alt="image" src="https://user-images.githubusercontent.com/5399861/187075553-bd6956b7-8f1f-4df5-82b7-d010defb6d21.png"> | <img width="622" alt="image" src="https://user-images.githubusercontent.com/5399861/187075588-254c3246-b9af-403c-8df7-d8344fd1d2a4.png">
   
   After this PR:
   
   Build bloom filter | Filter data
   -- | --
   <img width="262" alt="image" src="https://user-images.githubusercontent.com/5399861/187075676-85b2afae-03a0-4430-9c4e-2679c6ef62f7.png"> | <img width="509" alt="image" src="https://user-images.githubusercontent.com/5399861/187075713-41173dc1-d01d-476a-b218-5c67be823e1b.png">
   
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

mskapilks commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r958245866


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),

Review Comment:
   Can you please explain this `estimatedNumItems / 3000000000L.toDouble`?
   Shouldn't we use `BloomFilter.DEFAULT_FPP` which is `0.03`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r958264454


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),

Review Comment:
   This is used to reduce the false positive probability if `estimatedNumItems` is small.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sigmod commented on pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

sigmod commented on PR #37697:
URL: https://github.com/apache/spark/pull/37697#issuecomment-1238356227

   cc @andylam-db @yunxiaoma-db


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r959142316


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),

Review Comment:
   It can benefit all queries if `estimatedNumItems` < 90000000. Especially the number of creationSide rows is small. For example:
   ```scala
   spark.range(1000).selectExpr("id as a", "id as b", "id as c").write.saveAsTable("t1")
   spark.range(1000000000).selectExpr("id as x", "id as y", "id as z").write.saveAsTable("t2")
   
   spark.sql("set spark.sql.cbo.enabled=true")
   spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS a, b, c")
   
   spark.sql("select * from t1 left join (select distinct * from t2) t on t1.a = t.y where t1.b > 0").collect()
   ```
   Before this PR | After this PR
   -- | --
   ![image](https://user-images.githubusercontent.com/5399861/187590018-214711c0-80f2-4c4c-9e80-4443a703bb25.png) | ![image](https://user-images.githubusercontent.com/5399861/187589973-941aa3a2-2888-4bd8-9942-2709ed9bf056.png)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sigmod commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

sigmod commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r960091832


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),

Review Comment:
   Thanks @wangyum! Great improvement! I added a comment to the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum closed pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

wangyum closed pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter
URL: https://github.com/apache/spark/pull/37697


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r960736515


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),
+        SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))))
+  }

Review Comment:
   How about?
   ```scala
       this(child, Literal(estimatedNumItems),
         Literal(math.min(
           BloomFilter.optimalNumOfBits(estimatedNumItems,
             estimatedNumItems / (SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS) /
               BloomFilter.DEFAULT_FPP)),
           SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))))
   ```
   The smaller `estimatedNumItems`, the smaller the `FPP`.
   
   
   estimatedNumItems | FPP | numBits
   -- | -- | --
   RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS | DEFAULT_FPP | 29193763
   2000000 | 0.015 | 17482271
   1000000 | 0.0075 | 10183830
   100000 | 7.50E-04 | 1497636
   10000 | 7.50E-05 | 197688
   1000 | 7.50E-06 | 24561
   100 | 7.50E-07 | 2935
   10 | 7.50E-08 | 341
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sigmod commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

sigmod commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r960939857


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),
+        SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))))
+  }

Review Comment:
   LGTM. Can we make sure that if estimatedNumItems >= RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS, it's always DEFAULT_FPP?
   
   Thanks, @wangyum !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sigmod commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

sigmod commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r958704106


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),

Review Comment:
   > estimatedNumItems / 3000000000L.toDouble
   
   Does 3000000000L fit other queries than the TPCDS benchmark? 
   
   



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),

Review Comment:
   > estimatedNumItems / 3000000000L.toDouble
   
   Just curious, does 3000000000L fit other queries than the TPCDS benchmark? 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

wangyum commented on PR #37697:
URL: https://github.com/apache/spark/pull/37697#issuecomment-1286429445

   cc @cloud-fan @sigmod 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

wangyum commented on PR #37697:
URL: https://github.com/apache/spark/pull/37697#issuecomment-1229617980

   cc @sigmod @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

wangyum commented on PR #37697:
URL: https://github.com/apache/spark/pull/37697#issuecomment-1299977341

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sigmod commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

Posted by GitBox <gi...@apache.org>.

sigmod commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r960030651


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##########
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
       Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 3000000000L.toDouble),
+        SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))))
+  }

Review Comment:
   - Is it possible to implement the logic in the constructor above by composing expressions instead of adding a new constructor?
   - I still do not understand why `estimatedNumItems / 3000000000L.toDouble` means for the false positive rate. 
    
   Is it possible to add a threshold config on `estimatedNumItems` and use the following way to address the benchmark problem.
   
   ```
   val fpRate = if (estimatedNumItems < threshold) {
     0.001 // can be an even smaller number.
   } else {
      BloomFilter.DEFAULT_FPP
   }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org