Posted to user@spark.apache.org by Ognen Duzlevski <og...@gmail.com> on 2014/07/15 21:23:38 UTC
count vs countByValue in for/yield
Hello,
I am curious about something:
val result = for {
  (dt, evrdd) <- evrdds
  ct = evrdd.count
} yield (dt -> ct)
works.
val result = for {
  (dt, evrdd) <- evrdds
  ct = evrdd.countByValue
} yield (dt -> ct)
does not work. I get:
14/07/15 16:46:33 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/15 16:46:33 WARN TaskSetManager: Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
What is the difference? Is it because countByValue returns a Map
while count returns a Long?
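To make the question concrete, here is a minimal sketch of the two for/yield shapes using plain Scala collections as stand-ins for the RDDs (no Spark involved). The object name ForYieldSketch, the sample data, and the groupBy-based countByValue analogue are assumptions made for illustration only. It shows that the comprehension itself handles a Long result and a Map result in the same way, so the sketch says nothing about why the job fails on a cluster.

```scala
// Illustration only: plain Scala collections stand in for RDDs.
object ForYieldSketch {
  // Hypothetical stand-in for evrdds: (date, events) pairs.
  val evrdds: Seq[(String, Seq[String])] = Seq(
    "2014-07-14" -> Seq("login", "click", "login"),
    "2014-07-15" -> Seq("click")
  )

  // Analogue of evrdd.count: one Long per date.
  val counts: Seq[(String, Long)] = for {
    (dt, ev) <- evrdds
    ct = ev.size.toLong
  } yield dt -> ct

  // Analogue of evrdd.countByValue: one Map[value, Long] per date.
  val countsByValue: Seq[(String, Map[String, Long])] = for {
    (dt, ev) <- evrdds
    ct = ev.groupBy(identity).map { case (v, vs) => v -> vs.size.toLong }
  } yield dt -> ct

  // The same result as `counts`, written as an explicit map call.
  val viaMap: Seq[(String, Long)] =
    evrdds.map { case (dt, ev) => dt -> ev.size.toLong }

  def main(args: Array[String]): Unit = {
    println(counts)
    println(countsByValue)
    println(viaMap == counts)
  }
}
```

Only the element type of the result differs between the two comprehensions: (String, Long) in one case and (String, Map[String, Long]) in the other.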
Thanks!
Ognen
Re: count vs countByValue in for/yield
Posted by Ognen Duzlevski <og...@gmail.com>.
Hello all,
Can anyone offer any insight on my original question?
Both snippets are legal Spark code and both work on a local machine, but
on a standalone cluster only the one with count works; the one with
countByValue fails.
Thanks!
Ognen