You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Ognen Duzlevski <og...@gmail.com> on 2014/07/15 21:23:38 UTC

count vs countByValue in for/yield

Hello,

I am curious about something:

val result = for {
       (dt,evrdd) <- evrdds
       val ct = evrdd.count
     } yield (dt->ct)

works.

val result = for {
       (dt,evrdd) <- evrdds
       val ct = evrdd.countByValue
     } yield (dt->ct)

does not work. I get:
14/07/15 16:46:33 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/15 16:46:33 WARN TaskSetManager: Loss was due to 
java.lang.NullPointerException
java.lang.NullPointerException
     at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
     at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
     at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
     at org.apache.spark.scheduler.Task.run(Task.scala:51)
     at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:744)

What is the difference? Is it in the fact that countByValue passes back 
a Map and count passes back a Long?

Thanks!
Ognen

Re: count vs countByValue in for/yield

Posted by Ognen Duzlevski <og...@gmail.com>.

Hello all,

Can anyone offer any insight on the below?

Both are "legal" Spark but the first one works, the latter one does not. 
They both work on a local machine but in a standalone cluster the one 
with countByValue fails.

Thanks!
Ognen

On 7/15/14, 2:23 PM, Ognen Duzlevski wrote:
> Hello,
>
> I am curious about something:
>
> val result = for {
>       (dt,evrdd) <- evrdds
>       val ct = evrdd.count
>     } yield (dt->ct)
>
> works.
>
> val result = for {
>       (dt,evrdd) <- evrdds
>       val ct = evrdd.countByValue
>     } yield (dt->ct)
>
> does not work. I get:
> 14/07/15 16:46:33 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/07/15 16:46:33 WARN TaskSetManager: Loss was due to 
> java.lang.NullPointerException
> java.lang.NullPointerException
>     at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
>     at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
>
> What is the difference? Is it in the fact that countByValue passes 
> back a Map and count passes back a Long?
>
> Thanks!
> Ognen