Posted to user@spark.apache.org by Nikodimos Nikolaidis <ni...@csd.auth.gr> on 2018/03/16 13:27:24 UTC

Time delay in multiple predicate Filter

Hello,

Here’s a behavior that I find strange. A filter with a single
zero-selectivity predicate is much quicker than a filter with multiple
predicates, even when the same zero-selectivity predicate comes first.

For example, on Google’s English One Million 1-grams dataset (Spark 2.2,
whole-stage codegen enabled, local mode):

val df = spark.read.parquet("../../1-grams.parquet")

def metric[T](f: => T): T = {
  val t0 = System.nanoTime()
  val res = f
  println("Executed in " + (System.nanoTime() - t0) / 1000000 + "ms")
  res
}

df.count()
// res11: Long = 261823186

metric { df.filter('gram === "does not exist").count() }
// Executed in 1794ms
// res13: Long = 0

metric { df.filter('gram === "does not exist" && 'year > 0 && 'times > 0 && 'books > 0).count() }
// Executed in 4233ms
// res15: Long = 0

In the generated code, the behavior is exactly the same: when the first
predicate fails, the loop continues from the same point in both cases.
Why does this time difference exist?
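To make the point above concrete, here is a minimal sketch (plain Scala, not actual Spark codegen output; the `Row` case class and field names are hypothetical stand-ins for the 1-grams schema) of the short-circuiting loop shape I would expect whole-stage codegen to produce. Since `&&` short-circuits just like the generated per-predicate `continue` chain, a failing first predicate should skip the remaining comparisons for every row:

```scala
// Hypothetical row type standing in for the 1-grams schema.
case class Row(gram: String, year: Int, times: Int, books: Int)

// Simplified analogue of the generated filter loop: one pass over the
// rows, with the conjuncts evaluated left to right. If the first
// conjunct is false, the later ones are never evaluated for that row,
// mirroring the guarded `continue` statements in the generated code.
def countMatching(rows: Seq[Row]): Long = {
  var matched = 0L
  var i = 0
  while (i < rows.length) {
    val r = rows(i)
    i += 1
    if (r.gram == "does not exist" && r.year > 0 && r.times > 0 && r.books > 0)
      matched += 1
  }
  matched
}
```

Under this model the single-predicate and multi-predicate filters should do the same work per non-matching row, which is why the measured gap is surprising.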

Thanks
