Posted to user@spark.apache.org by Jeremy Lee <un...@gmail.com> on 2014/06/08 18:44:52 UTC

Are "scala.MatchError" messages a problem?

I shut down my first (working) cluster and brought up a fresh one... and
It's been a bit of a horror and I need to sleep now. Should I be worried
about these errors? Or did I just have the old log4j.config tuned so I
didn't see them?

14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming job 1402245172000 ms.2
scala.MatchError: 0101-01-10 (of class java.lang.String)
        at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
        at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
        at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)


The error comes from this code, which seemed like a sensible way to match
things (the "case cmd_plus(w)" statement, at the line marked as
SimpleApp.scala:218 below, is what generates the error):

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r
// find command user tweets
val commands = stream.map(
  status => (status.getUser().getId(), status.getText())
).foreachRDD(rdd => {
  rdd.join(superusers).map(
    x => x._2._1
  ).collect().foreach { cmd =>
    cmd match {                    // <- SimpleApp.scala:218
      case cmd_plus(w)  => { ... }
      case cmd_minus(w) => { ... }
    }
  }
})

It seems a bit excessive for Scala to throw exceptions because a regex
didn't match. Something feels wrong.

Re: Are "scala.MatchError" messages a problem?

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Jeremy,

On Mon, Jun 9, 2014 at 10:22 AM, Jeremy Lee
<un...@gmail.com> wrote:
>> When you use match, the match must be exhaustive. That is, a match error
>> is thrown if the match fails.
>
> Ahh, right. That makes sense. Scala is applying its "strong typing" rules
> here instead of "no ceremony"... but isn't the idea that type errors should
> get picked up at compile time? I suppose the compiler can't tell there's not
> complete coverage, but it seems strange to throw that at runtime when it is
> literally the 'default case'.

You can use subclasses of "sealed traits" to get a compiler warning
for non-exhaustive matches:
http://stackoverflow.com/questions/11203268/what-is-a-sealed-trait
I don't know if it can be applied to regular expression matching, though...
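
For example, a minimal sketch (with a made-up Command type, nothing to do
with your code):

sealed trait Command
case class Plus(word: String) extends Command
case class Minus(word: String) extends Command

def describe(cmd: Command): String = cmd match {
  case Plus(w) => "add " + w
  // no case for Minus: scalac warns "match may not be exhaustive"
}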

Tobias

Re: Are "scala.MatchError" messages a problem?

Posted by Jeremy Lee <un...@gmail.com>.
On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath <ni...@gmail.com>
 wrote:

> When you use match, the match must be exhaustive. That is, a match error
> is thrown if the match fails.


Ahh, right. That makes sense. Scala is applying its "strong typing" rules
here instead of "no ceremony"... but isn't the idea that type errors should
get picked up at compile time? I suppose the compiler can't tell there's
not complete coverage, but it seems strange to throw that at runtime when
it is literally the 'default case'.

I think I need a good "Scala Programming Guide"... any suggestions? I've
read and watched the usual resources and videos, but it feels like a
shotgun approach and I've clearly missed a lot.

On Mon, Jun 9, 2014 at 3:26 AM, Mark Hamstra <ma...@clearstorydata.com>
wrote:
>
> And you probably want to push down that filter into the cluster --
> collecting all of the elements of an RDD only to not use or filter out some
> of them isn't an efficient usage of expensive (at least in terms of
> time/performance) network resources.  There may also be a good opportunity
> to use the partial function form of collect to push even more processing
> into the cluster.
>

I almost certainly do :-) And I am really looking forward to spending time
optimizing the code, but I keep getting caught up on deployment issues,
uberjars, missing /mnt/spark directories, only being able to submit from
the master, and being thoroughly confused about sample code from three
versions ago.

I'm even thinking of learning Maven, if it means I never have to use sbt
again. Does it mean that?

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers

Re: Are "scala.MatchError" messages a problem?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
>
> The solution is either to add a default case which does nothing, or
> probably better to add a .filter such that you filter out anything that's
> not a command before matching.
>

And you probably want to push down that filter into the cluster --
collecting all of the elements of an RDD only to not use or filter out some
of them isn't an efficient usage of expensive (at least in terms of
time/performance) network resources.  There may also be a good opportunity
to use the partial function form of collect to push even more processing
into the cluster.
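
For instance, a rough sketch (assuming cmd_plus, cmd_minus and superusers
are as defined in the original post):

rdd.join(superusers).map(x => x._2._1)
  .collect {                           // partial-function form of collect:
    case cmd_plus(w)  => ("plus", w)   // matching runs on the cluster, and
    case cmd_minus(w) => ("minus", w)  // non-matches are dropped there
  }
  .collect()                           // only real commands reach the driver
  .foreach { case (op, w) => println(op + " " + w) }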

On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath <ni...@gmail.com>
wrote:

> When you use match, the match must be exhaustive. That is, a match error
> is thrown if the match fails.
>
> That's why you usually handle the default case using "case _ => ..."
>
> Here it looks like you're taking the text of all statuses, which means not
> all of them will be commands... which means your match will not be
> exhaustive.
>
> The solution is either to add a default case which does nothing, or
> probably better to add a .filter such that you filter out anything that's
> not a command before matching.
>
> Just looking at it again, it could also be that you take x => x._2._1...
> What type is that? Shouldn't it be a Seq if you're joining, in which case
> the match will also fail...
>
> Hope this helps.
>
> On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee <unorthodox.engineers@gmail.com>
> wrote:
>
>> I shut down my first (working) cluster and brought up a fresh one... and
>> It's been a bit of a horror and I need to sleep now. Should I be worried
>> about these errors? Or did I just have the old log4j.config tuned so I
>> didn't see them?
>>
>> [snip: stack trace and code, quoted in full at the top of this thread]

Re: Are "scala.MatchError" messages a problem?

Posted by Nick Pentreath <ni...@gmail.com>.
When you use match, the match must be exhaustive. That is, a match error is
thrown if the match fails.

That's why you usually handle the default case using "case _ => ..."
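
For example, a standalone sketch with placeholder actions:

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

def handle(cmd: String): Unit = cmd match {
  case cmd_plus(w)  => println("add " + w)     // placeholder action
  case cmd_minus(w) => println("remove " + w)  // placeholder action
  case _            => ()                      // anything else: do nothing
}

handle("+foo")        // prints "add foo"
handle("0101-01-10")  // quietly ignored instead of throwing scala.MatchError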

Here it looks like you're taking the text of all statuses, which means not
all of them will be commands... which means your match will not be
exhaustive.

The solution is either to add a default case which does nothing, or
probably better to add a .filter such that you filter out anything that's
not a command before matching.
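
Something like this, say (isCommand is a helper regex I'm making up here):

val isCommand = """[+-][\w]+""".r
val cmds = rdd.join(superusers).map(x => x._2._1)
  .filter(s => isCommand.pattern.matcher(s).matches())
// cmds now only holds strings shaped like "+word" or "-word", so the
// match over cmd_plus/cmd_minus can no longer fall through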

Just looking at it again, it could also be that you take x => x._2._1...
What type is that? Shouldn't it be a Seq if you're joining, in which case
the match will also fail...

Hope this helps.

On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee <un...@gmail.com>
wrote:

> I shut down my first (working) cluster and brought up a fresh one... and
> It's been a bit of a horror and I need to sleep now. Should I be worried
> about these errors? Or did I just have the old log4j.config tuned so I
> didn't see them?
>
> [snip: stack trace and code, quoted in full at the top of this thread]

Re: Are "scala.MatchError" messages a problem?

Posted by Sean Owen <so...@cloudera.com>.
A match clause needs to cover all the possibilities, and not matching any
regex is a distinct possibility. It's not really like 'switch', because it
requires this, and I think that has benefits, like being able to interpret
a match as something with a type. I think it's all in order, but it's more
of a Scala thing than a Spark thing.

You just need a "case _ => ..." to cover anything else.

(You can avoid two extra levels of scope with .foreach(_ match { ... }) BTW)
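
E.g., a standalone sketch with placeholder bodies:

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

Seq("+foo", "-bar", "not a command").foreach(_ match {
  case cmd_plus(w)  => println("plus: " + w)   // "+" handling goes here
  case cmd_minus(w) => println("minus: " + w)  // "-" handling goes here
  case _            =>                         // not a command: skip it
})

(A pattern-matching anonymous function, .foreach { case ... }, does the
same job.)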

On Sun, Jun 8, 2014 at 12:44 PM, Jeremy Lee
<un...@gmail.com> wrote:
>
> I shut down my first (working) cluster and brought up a fresh one... and
> It's been a bit of a horror and I need to sleep now. Should I be worried
> about these errors? Or did I just have the old log4j.config tuned so I
> didn't see them?
>
> [snip: stack trace and code, quoted in full at the top of this thread]