Posted to by Batselem <> on 2014/12/16 10:40:56 UTC

GC problem while filtering

Hi, I am trying to filter a large table with 3 columns. My goal is to filter
this bigtable using multiple clauses. I filtered bigtable 3 times, but the
first filter took about 50 seconds to complete, whereas the second and third
filter transformations took about 5 seconds each. I wonder if this is because
of lazy evaluation. However, I had already forced evaluation of my RDD: I
parsed it when I first read the data using sc.textFile and then counted it.
I got the following results:

Running times:
t1 => 50 seconds
t2 => 5 seconds
t3 => 4 seconds

    import org.apache.spark.broadcast.Broadcast

    val clause = List(
      ("<>", "?Z"),
      ("<>", "?Y")
    )

    val bcastedSubj: Broadcast[String] = sc.broadcast("?X")
    val bcastedCls: Broadcast[List[(String, String)]] = sc.broadcast(clause)
    var n = clause.length
    val t0 = System.currentTimeMillis()

    val subgraph1 = bigtable.mapPartitions (
      iterator => {
        val bcls = bcastedCls.value
        val bsubj = bcastedSubj.value
        n = bcls.length
        for ((s, grp) <- iterator;
             if {
               // A constant subject (no leading "?") must equal s.
               val flag = if (!bsubj.startsWith("?") && !bsubj.equals(s)) false
               else {
                 var k = 0
                 val m = grp.length
                 var flag1 = true

                 // Every clause k must be matched by at least one pair in grp.
                 while(k < n) {
                   var flag2 = false
                   var l = 0
                   while(l < m) {
                     if (grp(l)._1.equals(bcls(k)._1) && grp(l)._2.equals(bcls(k)._2)) flag2 = true
                     else if (bcls(k)._1.startsWith("?") && grp(l)._2.equals(bcls(k)._2)) flag2 = true
                     else if (bcls(k)._2.startsWith("?") && grp(l)._1.equals(bcls(k)._1)) flag2 = true
                     l += 1
                   }
                   if (!flag2) flag1 = false
                   k += 1
                 }
                 flag1
               }
               flag
             }
        ) yield (s, grp)
      }, preservesPartitioning = true).cache()
    val num1 = subgraph1.count()
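
To make the intent of the nested while loops easier to follow: as far as I can tell they check that every clause is matched by at least one pair in grp. Below is a minimal plain-Scala sketch of that check, with no Spark involved. The names ClauseMatch, matchesClause and matchesAll, and the sample values "p1", "p2", "a" and "b", are made up for illustration and are not part of my actual job. Unlike the loops above, forall/exists stop at the first decisive element, so this is a restatement of the logic rather than the exact code that produced the timings.

```scala
object ClauseMatch {
  type Pair = (String, String)

  // One pair satisfies one clause when both components are equal, or when
  // one clause component is a "?"-variable and the other component is equal.
  def matchesClause(pair: Pair, cls: Pair): Boolean =
    (pair._1 == cls._1 && pair._2 == cls._2) ||
    (cls._1.startsWith("?") && pair._2 == cls._2) ||
    (cls._2.startsWith("?") && pair._1 == cls._1)

  // A group is kept when every clause is satisfied by at least one pair.
  def matchesAll(grp: Seq[Pair], clauses: Seq[Pair]): Boolean =
    clauses.forall(cls => grp.exists(pair => matchesClause(pair, cls)))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data, only for illustration.
    val clauses = List(("p1", "?Z"), ("p2", "?Y"))
    val grp = Seq(("p1", "a"), ("p2", "b"))
    println(matchesAll(grp, clauses))         // both clauses are matched
    println(matchesAll(grp.take(1), clauses)) // the "p2" clause is unmatched
  }
}
```

Inside mapPartitions, the same predicate could presumably be applied as iterator.filter { case (s, grp) => ClauseMatch.matchesAll(grp, bcls) }, assuming grp is a Seq of string pairs as in the code above.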

Sent from the Apache Spark User List mailing list archive.
