Posted to user@spark.apache.org by Batselem <se...@gmail.com> on 2014/12/16 10:40:56 UTC
GC problem while filtering

Hi, I am trying to filter a large table with 3 columns using multiple
clauses. I filtered bigtable 3 times; the first filter took about 50 seconds
to complete, whereas the second and third filter transformations took about
5 seconds each. I wonder whether this is because of lazy evaluation. However,
I had already evaluated the RDD: I parsed it when I first read the data with
sc.textFile and then counted it. I got the following result:

Running times:
t1 => 50 seconds
t2 => 5 seconds
t3 => 4 seconds
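
A minimal sketch of how timings like these can be taken, assuming each figure is wall-clock time around a single action such as count(). The object name Timing and the helper timed are hypothetical and are not part of the original code or of Spark.

```scala
object Timing {
  // Runs `body` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

With this helper, a call such as `val (num1, t1) = Timing.timed(subgraph1.count())` would time only the action. Because transformations are lazy, any upstream work that is not already cached is included in the first action that triggers it.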
***************************CODE*******************************
    import org.apache.spark.broadcast.Broadcast

    val clause = List( 
      ("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>", "<www.ssu.ac.kr#GraduateStudent>"), 
      ("<www.ssu.ac.kr#memberOf>", "?Z"), 
      ("<www.ssu.ac.kr#undergraduateDegreeFrom>", "?Y") 
    ) 

    val bcastedSubj: Broadcast[String] = sc.broadcast("?X") 
    val bcastedCls: Broadcast[List[(String, String)]] = sc.broadcast(clause) 
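    
    val t0 = System.currentTimeMillis() 

    val subgraph1 = bigtable.mapPartitions ( 
      iterator => { 
        val bcls = bcastedCls.value 
        val bsubj = bcastedSubj.value 
        // Kept local to the task: assigning to a driver-side var from inside
        // the closure would only change the executor's copy.
        val n = bcls.length 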
        for ((s, grp) <- iterator; 
             if { 
               val flag = if (!bsubj.startsWith("?") && !bsubj.equals(s)) false 
               else { 
                 var k = 0 

                 val m = grp.length 
                 var flag1 = true 

                 while(k < n) { 
                   var flag2 = false 
                   var l = 0 
                   while(l < m) { 
                     if (grp(l)._1.equals(bcls(k)._1) && grp(l)._2.equals(bcls(k)._2)) flag2 = true 
                     else if (bcls(k)._1.startsWith("?") && grp(l)._2.equals(bcls(k)._2)) flag2 = true 
                     else if (bcls(k)._2.startsWith("?") && grp(l)._1.equals(bcls(k)._1)) flag2 = true 
                     l += 1 
                   } 
                   if (!flag2) flag1 = false 

                   k += 1 
                 } 

                 flag1 
               } 

               flag 
             } 
        ) yield (s, grp) 
      }, preservesPartitioning = true).cache() 
    val num1 = subgraph1.count()
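
The nested while loops above check that every clause is matched by at least one (predicate, object) pair in the group. A minimal sketch of the same check as a pure function is shown here, assuming grp is a Seq[(String, String)]. The names ClauseMatch, pairMatches and groupMatches are hypothetical, and the subject test on bsubj is left out.

```scala
object ClauseMatch {
  type Pair = (String, String)

  // True when a single clause is satisfied by a single (predicate, object) pair.
  // A clause term starting with "?" is treated as a variable; the other term
  // must still match exactly, as in the original if / else-if chain.
  def pairMatches(clause: Pair, pair: Pair): Boolean = {
    val (cp, co) = clause
    val (p, o) = pair
    (p == cp && o == co) ||
    (cp.startsWith("?") && o == co) ||
    (co.startsWith("?") && p == cp)
  }

  // True when every clause is matched by at least one pair in the group.
  def groupMatches(clauses: Seq[Pair], grp: Seq[Pair]): Boolean =
    clauses.forall(c => grp.exists(pairMatches(c, _)))
}
```

Inside mapPartitions this could be applied as `iterator.filter { case (s, grp) => ClauseMatch.groupMatches(bcls, grp) }`. Unlike the while loops, exists stops scanning a group as soon as a clause is matched.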



