Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/01/04 16:42:35 UTC

SequenceFileRDDFunctions cannot be used outside of spark package

Hi,

I'm trying to create a custom version of saveAsObjectFile(). However, I do not
seem to be able to use SequenceFileRDDFunctions in my package.

I simply copy/pasted the saveAsObjectFile() body into my function:

out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
      .saveAsSequenceFile("output")

But that gives me this error:

value saveAsSequenceFile is not a member of
org.apache.spark.rdd.RDD[(org.apache.hadoop.io.NullWritable,
org.apache.hadoop.io.BytesWritable)]
possible cause: maybe a semicolon is missing before `value
saveAsSequenceFile'?
      .saveAsSequenceFile("output")
       ^

The Scala implicit conversion error is not of any help here, so I tried to
apply an explicit conversion:

org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
      .saveAsSequenceFile("output")

Giving me this error:

object SequenceFileRDDFunctions is not a member of package
org.apache.spark.rdd
*Note: class SequenceFileRDDFunctions exists, but it has no companion
object.*
    org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
                         ^

Is this scala compiler version mismatch hell?

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by Xuefeng Wu <be...@gmail.com>.
Would you try instantiating it with new?
new org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
      .saveAsSequenceFile("output")


And does importing org.apache.spark.SparkContext._ make the implicit
conversion work for you?
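
For reference, here is a sketch that combines both suggestions. It assumes out and serialize are the ones from the original post (with serialize returning Array[Byte]), and note that the SequenceFileRDDFunctions constructor takes two separate type parameters, K and V, rather than a single tuple type:

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext._  // brings the implicit RDD conversions into scope
import org.apache.spark.rdd.SequenceFileRDDFunctions

// Explicit construction: wrap the pair RDD by hand and save it.
new SequenceFileRDDFunctions[NullWritable, BytesWritable](
  out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
     .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
).saveAsSequenceFile("output")

// With org.apache.spark.SparkContext._ imported, the implicit conversion applies
// and the original, shorter form compiles as well:
out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
   .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
   .saveAsSequenceFile("output")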



On Sat, Jan 4, 2014 at 11:42 PM, Aureliano Buendia <bu...@gmail.com> wrote:

> Hi,
>
> I'm trying to create a custom version of saveAsObjectFile(). However, I do not
> seem to be able to use SequenceFileRDDFunctions in my package.
>
> I simply copy/pasted the saveAsObjectFile() body into my function:
>
> out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>       .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
>       .saveAsSequenceFile("output")
>
> But that gives me this error:
>
> value saveAsSequenceFile is not a member of
> org.apache.spark.rdd.RDD[(org.apache.hadoop.io.NullWritable,
> org.apache.hadoop.io.BytesWritable)]
> possible cause: maybe a semicolon is missing before `value
> saveAsSequenceFile'?
>       .saveAsSequenceFile("output")
>        ^
>
> The Scala implicit conversion error is not of any help here, so I tried to
> apply an explicit conversion:
>
> org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
> BytesWritable)](out.mapPartitions(iter =>
> iter.grouped(10).map(_.toArray))
>       .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
>       .saveAsSequenceFile("output")
>
> Giving me this error:
>
> object SequenceFileRDDFunctions is not a member of package
> org.apache.spark.rdd
> *Note: class SequenceFileRDDFunctions exists, but it has no companion
> object.*
>     org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
> BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>                          ^
>
> Is this scala compiler version mismatch hell?
>



-- 

~Yours, Xuefeng Wu/吴雪峰  敬上

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by pradeeps8 <sr...@gmail.com>.
Hi Sonal,

There are no custom objects in saveRDD; it is of type RDD[(String, String)].

Thanks,
Pradeep 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by Sonal Goyal <so...@gmail.com>.
What does your saveRDD contain? If you are using custom objects, they
should be serializable.
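
A one-line illustration of that requirement, with a made-up class name (not from this thread): a plain Scala class has to opt in to Serializable before the default Java serialization used when writing object files can handle it (case classes already mix it in).

// Hypothetical record type stored in an RDD; without `extends Serializable`,
// serializing instances of it would throw java.io.NotSerializableException.
class Measurement(val id: String, val value: Double) extends Serializable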

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 <sr...@gmail.com> wrote:

> Hi Aureliano,
>
> I followed this thread to create a custom saveAsObjectFile.
> The following is the code.
> new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](
>   saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>          .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
> ).saveAsSequenceFile("objFiles")
>
> But, I get the following error when executed.
> org.apache.spark.SparkException: Job aborted: Task not serializable:
> java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
>         at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>         at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>         at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
>         at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
>         at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
>         at
>
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
>         at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at
>
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
>
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Any idea about this error?
> or
> Is there anything wrong in the line of code?
>
> Thanks,
> Pradeep
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3442.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by pradeeps8 <sr...@gmail.com>.
Hi Aureliano,

I followed this thread to create a custom saveAsObjectFile. 
The following is the code.
new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](
  saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray))
         .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
).saveAsSequenceFile("objFiles")

But, I get the following error when executed.
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf 
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) 
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) 
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
        at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) 
        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794) 
        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737) 
        at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569) 
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) 
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) 
        at akka.actor.ActorCell.invoke(ActorCell.scala:456) 
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) 
        at akka.dispatch.Mailbox.run(Mailbox.scala:219) 
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) 
        at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
        at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any idea about this error?
or 
Is there anything wrong in the line of code?

Thanks,
Pradeep




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3442.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
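
The exception above names org.apache.hadoop.mapred.JobConf as the object that cannot be serialized, so one plausible cause (an assumption, since the surrounding code is not shown) is that serialize is defined on a class that also holds a JobConf or a SparkContext, and the closure drags that whole instance into the task. A sketch of a workaround is to keep the helper in a standalone object so nothing else gets captured; SerUtil and its method are made-up names:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}

// Standalone helper object: calling SerUtil.serialize inside an RDD closure does
// not pull in any enclosing class (or the JobConf/SparkContext it may hold).
object SerUtil {
  def serialize[T](obj: T): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close()
    bos.toByteArray
  }
}

new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](
  saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray))
         .map(x => (NullWritable.get(), new BytesWritable(SerUtil.serialize(x))))
).saveAsSequenceFile("objFiles")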

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by Aureliano Buendia <bu...@gmail.com>.
I think you bumped the wrong thread.

As I mentioned in the other thread:

saveAsHadoopFile only applies compression when the codec is available, and
it does not seem to respect the global hadoop compression properties.

I'm not sure if this is a feature or a bug in Spark.

If this is a feature, the docs should make it clear that the
mapred.output.compression.* properties are read-only.
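
For anyone hitting the same issue: later Spark releases added overloads that take the codec explicitly, which sidesteps the question of whether the global properties are honoured. A hedged sketch, with pairRDD standing in for the (NullWritable, BytesWritable) RDD built earlier in this thread, and only valid if the overload exists in your release:

import org.apache.hadoop.io.compress.GzipCodec

// Pass the codec directly instead of relying on mapred.output.compression.*;
// saveAsHadoopFile has a similar codec-taking overload.
pairRDD.saveAsSequenceFile("output", Some(classOf[GzipCodec]))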


On Sat, Mar 22, 2014 at 12:20 AM, deenar.toraskar <de...@db.com> wrote:

> Matei
>
> It turns out that saveAsObjectFile(), saveAsSequenceFile() and
> saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano
> found out in this post
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html
>
> Deenar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3019.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by "deenar.toraskar" <de...@db.com>.
Matei

It turns out that saveAsObjectFile(), saveAsSequenceFile() and
saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano
found out in this post

http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html

Deenar



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3019.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by Matei Zaharia <ma...@gmail.com>.
To use compression here, you might just have to set the correct Hadoop settings in SparkContext.hadoopConf.
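
A minimal sketch of that, assuming sc is a live SparkContext and that your version exposes the underlying Hadoop Configuration as sc.hadoopConfiguration (the field Matei is referring to). These are the standard Hadoop output-compression properties; whether every save path honours them is exactly what is questioned elsewhere in this thread:

import org.apache.hadoop.io.compress.GzipCodec

// Set the usual Hadoop MapReduce output compression properties once,
// before calling saveAsSequenceFile / saveAsHadoopFile.
sc.hadoopConfiguration.set("mapred.output.compress", "true")
sc.hadoopConfiguration.set("mapred.output.compression.codec", classOf[GzipCodec].getName)
sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")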

Matei

On Mar 21, 2014, at 10:53 AM, deenar.toraskar <de...@db.com> wrote:

> Hi Aureliano 
> 
> If you have managed to get a custom version of saveAsObjectFile() that handles
> compression working, I would appreciate it if you could share the code. I have
> come across the same issue, and it would save me from having to reinvent
> the wheel.
> 
> Deenar
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3005.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by Aureliano Buendia <bu...@gmail.com>.
On Fri, Mar 21, 2014 at 5:53 PM, deenar.toraskar <de...@db.com> wrote:

> Hi Aureliano
>
> If you have managed to get a custom version of saveAsObjectFile() that handles
> compression working, I would appreciate it if you could share the code. I have
> come across the same issue, and it would save me from having to reinvent
> the wheel.
>
>
My problem was not about compression.


> Deenar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3005.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by "deenar.toraskar" <de...@db.com>.
Hi Aureliano 

If you have managed to get a custom version of saveAsObjectFile() that handles
compression working, I would appreciate it if you could share the code. I have
come across the same issue, and it would save me from having to reinvent
the wheel.

Deenar



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3005.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by Aureliano Buendia <bu...@gmail.com>.
Thank you both for the clear explanation. Both using new and importing:

import org.apache.spark.SparkContext._

solved the problem.

When using SequenceFileRDDFunctions, IntelliJ only suggested:

import org.apache.spark.rdd.SequenceFileRDDFunctions

which was not helpful. I guess org.apache.spark.SparkContext._ had to be
imported manually.



On Sat, Jan 4, 2014 at 4:22 PM, Imran Rashid <im...@quantifind.com> wrote:

> nice work tracking down the problems w/ the codec getting applied
> consistently.  I think you're close to the fix, just need to understand
> scala implicit resolution rules.
>
> I'm not entirely sure what you mean when you say "I simply copy/pasted the
> saveAsObjectFile() body into my function:" -- where does your function live?
> Are you trying to modify SequenceFileRDDFunctions, then recompile your own
> version of spark?  or are you trying to leave the spark package alone, and
> add your own helper function elsewhere?
>
> If you are modifying SequenceFileRDDFunctions, then you should just be
> able to drop another function in there no problem.  Just be sure you have
> the implicit conversions in scope when you try to apply them.  The way to
> do that is to "import org.apache.spark.SparkContext._".  In the
> SparkContext *object*, you'll notice a bunch of "implicit def"s -- by
> importing those, you are telling the scala compiler that it should try to
> apply those rules when searching for function definitions.
>
> Your attempt at *explicit* conversion doesn't work b/c you aren't actually
> doing a conversion -- you are attempting to apply a function.  What you
> have gets desugared to:
>
> org.apache.spark.rdd.SequenceFileRDDFunctions*.apply*[(NullWritable,
> BytesWritable)](...)
>
> you'll notice that the compiler even told you it was looking for an object
> called "SequenceFileRDDFunctions" but didn't find one.   You want to
> create a new instance of the class, which you do by adding *new* in front.
>
> *new* org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
> BytesWritable)](...)
>
> This is really confusing when you are new to scala -- lots of companion
> objects have an "apply" method that acts like new.  But
> SequenceFileRDDFunctions doesn't.
>
>
> then the second part is, making your newly added functions available on
> all RDDs by implicit conversion.  first, define a wrapper class with your
> new function, and a companion object with an implicit conversion:
>
> class AwesomeRDD[T](self: RDD[T]) {
>   def saveAwesomeObjectFile(path: String) {
>     //put your def here
>   }
> }
>
> object AwesomeRDD {
>   implicit def addAwesomeFunctions[T](rdd: RDD[T]) = new AwesomeRDD(rdd)
> }
>
>
> then, just import the implicit conversion wherever you want it:
>
> class Demo {
>   val rdd: RDD[String] = ...
>   import AwesomeRDD._
>   rdd.saveAwesomeObjectFile("/path")
> }
>
>
>
>
>
> On Sat, Jan 4, 2014 at 9:42 AM, Aureliano Buendia <bu...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to create a custom version of saveAsObjectFile(). However, I do
>> not seem to be able to use SequenceFileRDDFunctions in my package.
>>
>> I simply copy/pasted the saveAsObjectFile() body into my function:
>>
>> out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>>       .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
>>       .saveAsSequenceFile("output")
>>
>> But that gives me this error:
>>
>> value saveAsSequenceFile is not a member of
>> org.apache.spark.rdd.RDD[(org.apache.hadoop.io.NullWritable,
>> org.apache.hadoop.io.BytesWritable)]
>> possible cause: maybe a semicolon is missing before `value
>> saveAsSequenceFile'?
>>       .saveAsSequenceFile("output")
>>        ^
>>
>> The Scala implicit conversion error is not of any help here, so I tried to
>> apply an explicit conversion:
>>
>> org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
>> BytesWritable)](out.mapPartitions(iter =>
>> iter.grouped(10).map(_.toArray))
>>       .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
>>       .saveAsSequenceFile("output")
>>
>> Giving me this error:
>>
>> object SequenceFileRDDFunctions is not a member of package
>> org.apache.spark.rdd
>> *Note: class SequenceFileRDDFunctions exists, but it has no companion
>> object.*
>>     org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
>> BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>>                          ^
>>
>> Is this scala compiler version mismatch hell?
>>
>
>

Re: SequenceFileRDDFunctions cannot be used outside of spark package

Posted by Imran Rashid <im...@quantifind.com>.
nice work tracking down the problems w/ the codec getting applied
consistently.  I think you're close to the fix, just need to understand
scala implicit resolution rules.

I'm not entirely sure what you mean when you say "I simply copy/pasted the
saveAsObjectFile() body into my function:" -- where does your function live?
Are you trying to modify SequenceFileRDDFunctions, then recompile your own
version of spark?  or are you trying to leave the spark package alone, and
add your own helper function elsewhere?

If you are modifying SequenceFileRDDFunctions, then you should just be able
to drop another function in there no problem.  Just be sure you have the
implicit conversions in scope when you try to apply them.  The way to do
that is to "import org.apache.spark.SparkContext._".  In the SparkContext
*object*, you'll notice a bunch of "implicit def"s -- by importing those,
you are telling the scala compiler that it should try to apply those rules
when searching for function definitions.

Your attempt at *explicit* conversion doesn't work b/c you aren't actually
doing a conversion -- you are attempting to apply a function.  What you
have gets desugared to:

org.apache.spark.rdd.SequenceFileRDDFunctions*.apply*[(NullWritable,
BytesWritable)](...)

you'll notice that the compiler even told you it was looking for an object
called "SequenceFileRDDFunctions" but didn't find one.   You want to create
a new instance of the class, which you do by adding *new* in front.

*new* org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
BytesWritable)](...)

This is really confusing when you are new to scala -- lots of companion
objects have an "apply" method that acts like new.  But
SequenceFileRDDFunctions doesn't.


then the second part is, making your newly added functions available on all
RDDs by implicit conversion.  first, define a wrapper class with your new
function, and a companion object with an implicit conversion:

class AwesomeRDD[T](self: RDD[T]) {
  def saveAwesomeObjectFile(path: String) {
    //put your def here
  }
}

object AwesomeRDD {
  implicit def addAwesomeFunctions[T](rdd: RDD[T]) = new AwesomeRDD(rdd)
}


then, just import the implicit conversion wherever you want it:

class Demo {
  val rdd: RDD[String] = ...
  import AwesomeRDD._
  rdd.saveAwesomeObjectFile("/path")
}
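
For completeness, the same pattern can be written with the implicit class syntax available since Scala 2.10; this sketch is functionally equivalent to the wrapper class plus implicit def above, and the object and class names here are just as made up as Imran's:

import org.apache.spark.rdd.RDD

object AwesomeSyntax {
  // Importing AwesomeSyntax._ makes saveAwesomeObjectFile available on any RDD.
  implicit class AwesomeRDDOps[T](val self: RDD[T]) {
    def saveAwesomeObjectFile(path: String): Unit = {
      // put your def here
    }
  }
}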





On Sat, Jan 4, 2014 at 9:42 AM, Aureliano Buendia <bu...@gmail.com> wrote:

> Hi,
>
> I'm trying to create a custom version of saveAsObjectFile(). However, I do not
> seem to be able to use SequenceFileRDDFunctions in my package.
>
> I simply copy/pasted the saveAsObjectFile() body into my function:
>
> out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>       .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
>       .saveAsSequenceFile("output")
>
> But that gives me this error:
>
> value saveAsSequenceFile is not a member of
> org.apache.spark.rdd.RDD[(org.apache.hadoop.io.NullWritable,
> org.apache.hadoop.io.BytesWritable)]
> possible cause: maybe a semicolon is missing before `value
> saveAsSequenceFile'?
>       .saveAsSequenceFile("output")
>        ^
>
> The Scala implicit conversion error is not of any help here, so I tried to
> apply an explicit conversion:
>
> org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
> BytesWritable)](out.mapPartitions(iter =>
> iter.grouped(10).map(_.toArray))
>       .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
>       .saveAsSequenceFile("output")
>
> Giving me this error:
>
> object SequenceFileRDDFunctions is not a member of package
> org.apache.spark.rdd
> *Note: class SequenceFileRDDFunctions exists, but it has no companion
> object.*
>     org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable,
> BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>                          ^
>
> Is this scala compiler version mismatch hell?
>