Posted to user@spark.apache.org by Sarath Chandra <sa...@algofusiontech.com> on 2014/09/05 16:06:52 UTC

Task not serializable

Hi,

I'm trying to migrate a map-reduce program to work with Spark. I migrated
the program from Java to Scala. The map-reduce program basically loads an
HDFS file and, for each line in the file, applies several transformation
functions available in various external libraries.

When I execute this over Spark, it throws "Task not serializable"
exceptions for each and every class being used from these external
libraries. I added serialization to a few classes which are in my scope,
but there are several other classes, like org.apache.hadoop.io.Text, which
are out of my scope.

How can I overcome these exceptions?

~Sarath.

Re: Task not serializable

Posted by Alok Kumar <al...@gmail.com>.
Hi,

See if this link helps -
http://stackoverflow.com/questions/22592811/scala-spark-task-not-serializable-java-io-notserializableexceptionon-when

Also, try extending the class and making your new child class Serializable
if you cannot get the source locally.
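
For example, something along these lines (just a sketch, untested; whether
plain Java serialization actually works depends on the fields of the parent
class):

import org.apache.hadoop.io.Text

// subclass that additionally implements java.io.Serializable
class SerializableText extends Text with java.io.Serializable

val t = new SerializableText()
t.set("some value")  // use it wherever a Text is needed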

Thanks
Alok

On Fri, Sep 5, 2014 at 7:51 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Get the class locally and Serialize it.
> http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/io/Text.java/?v=source
>
> PS: Some classes may require additional classes to get serialized.
> Hopefully there should be some other way doing it.
>
>
> Thanks
> Best Regards
>
>
> On Fri, Sep 5, 2014 at 7:45 PM, Sarath Chandra <
> sarathchandra.josyam@algofusiontech.com> wrote:
>
>> Hi Akhil,
>>
>> I've done this for the classes which are in my scope. But what to do with
>> classes that are out of my scope?
>> For example org.apache.hadoop.io.Text
>>
>> Also I'm using several 3rd part libraries like "jeval".
>>
>> ~Sarath
>>
>>
>> On Fri, Sep 5, 2014 at 7:40 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> You can bring those classes out of the library and Serialize it
>>> (implements Serializable). It is not the right way of doing it though it
>>> solved few of my similar problems.
>>>
>>> Thanks
>>> Best Regards
>>>
>>>
>>> On Fri, Sep 5, 2014 at 7:36 PM, Sarath Chandra <
>>> sarathchandra.josyam@algofusiontech.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to migrate a map-reduce program to work with spark. I
>>>> migrated the program from Java to Scala. The map-reduce program basically
>>>> loads a HDFS file and for each line in the file it applies several
>>>> transformation functions available in various external libraries.
>>>>
>>>> When I execute this over spark, it is throwing me "Task not
>>>> serializable" exceptions for each and every class being used from these
>>>> from external libraries. I included serialization to few classes which are
>>>> in my scope, but there there are several other classes which are out of my
>>>> scope like org.apache.hadoop.io.Text.
>>>>
>>>> How to overcome these exceptions?
>>>>
>>>> ~Sarath.
>>>>
>>>
>>>
>>
>


-- 
Alok Kumar
Email : alokawi@gmail.com
http://sharepointorange.blogspot.in/
http://www.linkedin.com/in/alokawi

Re: Task not serializable

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Get the class source locally and make it Serializable:
http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/io/Text.java/?v=source

PS: Some classes may require additional classes to be serialized as well.
Hopefully there is some other way of doing it.


Thanks
Best Regards


On Fri, Sep 5, 2014 at 7:45 PM, Sarath Chandra <
sarathchandra.josyam@algofusiontech.com> wrote:

> Hi Akhil,
>
> I've done this for the classes which are in my scope. But what to do with
> classes that are out of my scope?
> For example org.apache.hadoop.io.Text
>
> Also I'm using several 3rd part libraries like "jeval".
>
> ~Sarath
>
>
> On Fri, Sep 5, 2014 at 7:40 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> You can bring those classes out of the library and Serialize it
>> (implements Serializable). It is not the right way of doing it though it
>> solved few of my similar problems.
>>
>> Thanks
>> Best Regards
>>
>>
>> On Fri, Sep 5, 2014 at 7:36 PM, Sarath Chandra <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to migrate a map-reduce program to work with spark. I
>>> migrated the program from Java to Scala. The map-reduce program basically
>>> loads a HDFS file and for each line in the file it applies several
>>> transformation functions available in various external libraries.
>>>
>>> When I execute this over spark, it is throwing me "Task not
>>> serializable" exceptions for each and every class being used from these
>>> from external libraries. I included serialization to few classes which are
>>> in my scope, but there there are several other classes which are out of my
>>> scope like org.apache.hadoop.io.Text.
>>>
>>> How to overcome these exceptions?
>>>
>>> ~Sarath.
>>>
>>
>>
>

Re: Task not serializable

Posted by Sarath Chandra <sa...@algofusiontech.com>.
Hi Akhil,

I've done this for the classes which are in my scope. But what should I do
with classes that are out of my scope, for example org.apache.hadoop.io.Text?

Also, I'm using several 3rd-party libraries like "jeval".

~Sarath


On Fri, Sep 5, 2014 at 7:40 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> You can bring those classes out of the library and Serialize it
> (implements Serializable). It is not the right way of doing it though it
> solved few of my similar problems.
>
> Thanks
> Best Regards
>
>
> On Fri, Sep 5, 2014 at 7:36 PM, Sarath Chandra <
> sarathchandra.josyam@algofusiontech.com> wrote:
>
>> Hi,
>>
>> I'm trying to migrate a map-reduce program to work with spark. I migrated
>> the program from Java to Scala. The map-reduce program basically loads a
>> HDFS file and for each line in the file it applies several transformation
>> functions available in various external libraries.
>>
>> When I execute this over spark, it is throwing me "Task not serializable"
>> exceptions for each and every class being used from these from external
>> libraries. I included serialization to few classes which are in my scope,
>> but there there are several other classes which are out of my scope like
>> org.apache.hadoop.io.Text.
>>
>> How to overcome these exceptions?
>>
>> ~Sarath.
>>
>
>

Re: Task not serializable

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You can bring those classes out of the library and make them Serializable
(implement Serializable). It is not the right way of doing it, though it
solved a few of my similar problems.

Thanks
Best Regards


On Fri, Sep 5, 2014 at 7:36 PM, Sarath Chandra <
sarathchandra.josyam@algofusiontech.com> wrote:

> Hi,
>
> I'm trying to migrate a map-reduce program to work with spark. I migrated
> the program from Java to Scala. The map-reduce program basically loads a
> HDFS file and for each line in the file it applies several transformation
> functions available in various external libraries.
>
> When I execute this over spark, it is throwing me "Task not serializable"
> exceptions for each and every class being used from these from external
> libraries. I included serialization to few classes which are in my scope,
> but there there are several other classes which are out of my scope like
> org.apache.hadoop.io.Text.
>
> How to overcome these exceptions?
>
> ~Sarath.
>

Re: Task not serializable

Posted by Marcelo Vanzin <va...@cloudera.com>.
You're using "hadoopConf", a Configuration object, in your closure.
That type is not serializable.

You can use " -Dsun.io.serialization.extendedDebugInfo=true" to debug
serialization issues.
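
For example, one way to pass that flag to both the driver and the executors
(a sketch; the same values can also go into spark-defaults.conf or onto the
spark-submit command line):

val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
  .set("spark.executor.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
val sc = new org.apache.spark.SparkContext(conf)

With that enabled, the NotSerializableException should carry extra detail
about which field pulled the non-serializable object into the closure.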

On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra
<sa...@algofusiontech.com> wrote:
> Thanks Sean.
> Please find attached my code. Let me know your suggestions/ideas.
>
> Regards,
> Sarath
>
> On Wed, Sep 10, 2014 at 8:05 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> You mention that you are creating a UserGroupInformation inside your
>> function, but something is still serializing it. You should show your
>> code since it may not be doing what you think.
>>
>> If you instantiate an object, it happens every time your function is
>> called. map() is called once per data element; mapPartitions() once
>> per partition. It depends.
>>
>> On Wed, Sep 10, 2014 at 3:25 PM, Sarath Chandra
>> <sa...@algofusiontech.com> wrote:
>> > Hi Sean,
>> >
>> > The solution of instantiating the non-serializable class inside the map
>> > is
>> > working fine, but I hit a road block. The solution is not working for
>> > singleton classes like UserGroupInformation.
>> >
>> > In my logic as part of processing a HDFS file, I need to refer to some
>> > reference files which are again available in HDFS. So inside the map
>> > method
>> > I'm trying to instantiate UserGroupInformation to get an instance of
>> > FileSystem. Then using this FileSystem instance I read those reference
>> > files
>> > and use that data in my processing logic.
>> >
>> > This is throwing task not serializable exceptions for
>> > 'UserGroupInformation'
>> > and 'FileSystem' classes. I also tried using 'SparkHadoopUtil' instead
>> > of
>> > 'UserGroupInformation'. But it didn't resolve the issue.
>> >
>> > Request you provide some pointers in this regard.
>> >
>> > Also I have a query - when we instantiate a class inside map method,
>> > does it
>> > create a new instance for every RDD it is processing?
>> >
>> > Thanks & Regards,
>> > Sarath
>> >
>> > On Sat, Sep 6, 2014 at 4:32 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I disagree that the generally right change is to try to make the
>> >> classes serializable. Usually, classes that are not serializable are
>> >> not supposed to be serialized. You're using them in a way that's
>> >> causing them to be serialized, and that's probably not desired.
>> >>
>> >> For example, this is wrong:
>> >>
>> >> val foo: SomeUnserializableManagerClass = ...
>> >> rdd.map(d => foo.bar(d))
>> >>
>> >> This is right:
>> >>
>> >> rdd.map { d =>
>> >>   val foo: SomeUnserializableManagerClass = ...
>> >>   foo.bar(d)
>> >> }
>> >>
>> >> In the first instance, you create the object on the driver and try to
>> >> serialize and copy it to workers. In the second, you're creating
>> >> SomeUnserializableManagerClass in the function and therefore on the
>> >> worker.
>> >>
>> >> mapPartitions is better if this creation is expensive.
>> >>
>> >> On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
>> >> <sa...@algofusiontech.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I'm trying to migrate a map-reduce program to work with spark. I
>> >> > migrated
>> >> > the program from Java to Scala. The map-reduce program basically
>> >> > loads a
>> >> > HDFS file and for each line in the file it applies several
>> >> > transformation
>> >> > functions available in various external libraries.
>> >> >
>> >> > When I execute this over spark, it is throwing me "Task not
>> >> > serializable"
>> >> > exceptions for each and every class being used from these from
>> >> > external
>> >> > libraries. I included serialization to few classes which are in my
>> >> > scope,
>> >> > but there there are several other classes which are out of my scope
>> >> > like
>> >> > org.apache.hadoop.io.Text.
>> >> >
>> >> > How to overcome these exceptions?
>> >> >
>> >> > ~Sarath.
>> >
>> >
>
>
>
>



-- 
Marcelo



Re: Task not serializable

Posted by Sarath Chandra <sa...@algofusiontech.com>.
Thanks Sean.
Please find attached my code. Let me know your suggestions/ideas.

Regards,

*Sarath*

On Wed, Sep 10, 2014 at 8:05 PM, Sean Owen <so...@cloudera.com> wrote:

> You mention that you are creating a UserGroupInformation inside your
> function, but something is still serializing it. You should show your
> code since it may not be doing what you think.
>
> If you instantiate an object, it happens every time your function is
> called. map() is called once per data element; mapPartitions() once
> per partition. It depends.
>
> On Wed, Sep 10, 2014 at 3:25 PM, Sarath Chandra
> <sa...@algofusiontech.com> wrote:
> > Hi Sean,
> >
> > The solution of instantiating the non-serializable class inside the map
> is
> > working fine, but I hit a road block. The solution is not working for
> > singleton classes like UserGroupInformation.
> >
> > In my logic as part of processing a HDFS file, I need to refer to some
> > reference files which are again available in HDFS. So inside the map
> method
> > I'm trying to instantiate UserGroupInformation to get an instance of
> > FileSystem. Then using this FileSystem instance I read those reference
> files
> > and use that data in my processing logic.
> >
> > This is throwing task not serializable exceptions for
> 'UserGroupInformation'
> > and 'FileSystem' classes. I also tried using 'SparkHadoopUtil' instead of
> > 'UserGroupInformation'. But it didn't resolve the issue.
> >
> > Request you provide some pointers in this regard.
> >
> > Also I have a query - when we instantiate a class inside map method,
> does it
> > create a new instance for every RDD it is processing?
> >
> > Thanks & Regards,
> > Sarath
> >
> > On Sat, Sep 6, 2014 at 4:32 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> I disagree that the generally right change is to try to make the
> >> classes serializable. Usually, classes that are not serializable are
> >> not supposed to be serialized. You're using them in a way that's
> >> causing them to be serialized, and that's probably not desired.
> >>
> >> For example, this is wrong:
> >>
> >> val foo: SomeUnserializableManagerClass = ...
> >> rdd.map(d => foo.bar(d))
> >>
> >> This is right:
> >>
> >> rdd.map { d =>
> >>   val foo: SomeUnserializableManagerClass = ...
> >>   foo.bar(d)
> >> }
> >>
> >> In the first instance, you create the object on the driver and try to
> >> serialize and copy it to workers. In the second, you're creating
> >> SomeUnserializableManagerClass in the function and therefore on the
> >> worker.
> >>
> >> mapPartitions is better if this creation is expensive.
> >>
> >> On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
> >> <sa...@algofusiontech.com> wrote:
> >> > Hi,
> >> >
> >> > I'm trying to migrate a map-reduce program to work with spark. I
> >> > migrated
> >> > the program from Java to Scala. The map-reduce program basically
> loads a
> >> > HDFS file and for each line in the file it applies several
> >> > transformation
> >> > functions available in various external libraries.
> >> >
> >> > When I execute this over spark, it is throwing me "Task not
> >> > serializable"
> >> > exceptions for each and every class being used from these from
> external
> >> > libraries. I included serialization to few classes which are in my
> >> > scope,
> >> > but there there are several other classes which are out of my scope
> like
> >> > org.apache.hadoop.io.Text.
> >> >
> >> > How to overcome these exceptions?
> >> >
> >> > ~Sarath.
> >
> >
>

Re: Task not serializable

Posted by Sean Owen <so...@cloudera.com>.
You mention that you are creating a UserGroupInformation inside your
function, but something is still serializing it. You should show your
code since it may not be doing what you think.

If you instantiate an object, it happens every time your function is
called. map() is called once per data element; mapPartitions() once
per partition. It depends.
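
For example, a rough sketch of the mapPartitions form for something that is
expensive to create, like a FileSystem (simplified; adapt to your code):

rdd.mapPartitions { iter =>
  // created once per partition, on the worker, so it is never serialized
  val conf = new org.apache.hadoop.conf.Configuration()
  val fs = org.apache.hadoop.fs.FileSystem.get(conf)
  iter.map { line =>
    // read your reference files through fs, then transform the line
    ...
  }
}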

On Wed, Sep 10, 2014 at 3:25 PM, Sarath Chandra
<sa...@algofusiontech.com> wrote:
> Hi Sean,
>
> The solution of instantiating the non-serializable class inside the map is
> working fine, but I hit a road block. The solution is not working for
> singleton classes like UserGroupInformation.
>
> In my logic as part of processing a HDFS file, I need to refer to some
> reference files which are again available in HDFS. So inside the map method
> I'm trying to instantiate UserGroupInformation to get an instance of
> FileSystem. Then using this FileSystem instance I read those reference files
> and use that data in my processing logic.
>
> This is throwing task not serializable exceptions for 'UserGroupInformation'
> and 'FileSystem' classes. I also tried using 'SparkHadoopUtil' instead of
> 'UserGroupInformation'. But it didn't resolve the issue.
>
> Request you provide some pointers in this regard.
>
> Also I have a query - when we instantiate a class inside map method, does it
> create a new instance for every RDD it is processing?
>
> Thanks & Regards,
> Sarath
>
> On Sat, Sep 6, 2014 at 4:32 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I disagree that the generally right change is to try to make the
>> classes serializable. Usually, classes that are not serializable are
>> not supposed to be serialized. You're using them in a way that's
>> causing them to be serialized, and that's probably not desired.
>>
>> For example, this is wrong:
>>
>> val foo: SomeUnserializableManagerClass = ...
>> rdd.map(d => foo.bar(d))
>>
>> This is right:
>>
>> rdd.map { d =>
>>   val foo: SomeUnserializableManagerClass = ...
>>   foo.bar(d)
>> }
>>
>> In the first instance, you create the object on the driver and try to
>> serialize and copy it to workers. In the second, you're creating
>> SomeUnserializableManagerClass in the function and therefore on the
>> worker.
>>
>> mapPartitions is better if this creation is expensive.
>>
>> On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
>> <sa...@algofusiontech.com> wrote:
>> > Hi,
>> >
>> > I'm trying to migrate a map-reduce program to work with spark. I
>> > migrated
>> > the program from Java to Scala. The map-reduce program basically loads a
>> > HDFS file and for each line in the file it applies several
>> > transformation
>> > functions available in various external libraries.
>> >
>> > When I execute this over spark, it is throwing me "Task not
>> > serializable"
>> > exceptions for each and every class being used from these from external
>> > libraries. I included serialization to few classes which are in my
>> > scope,
>> > but there there are several other classes which are out of my scope like
>> > org.apache.hadoop.io.Text.
>> >
>> > How to overcome these exceptions?
>> >
>> > ~Sarath.
>
>



Re: Task not serializable

Posted by Sarath Chandra <sa...@algofusiontech.com>.
Hi Sean,

The solution of instantiating the non-serializable class inside the map is
working fine, but I've hit a roadblock: it does not work for singleton
classes like UserGroupInformation.

As part of processing an HDFS file, my logic needs to refer to some
reference files which are also available in HDFS. So inside the map method
I'm trying to instantiate UserGroupInformation to get an instance of
FileSystem. Using this FileSystem instance I then read those reference
files and use that data in my processing logic.
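
Roughly, what I'm attempting inside the map looks like this (a simplified
sketch; the user name and path are placeholders):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

rdd.map { line =>
  val ugi = UserGroupInformation.createRemoteUser("hdfsuser")  // placeholder user
  val fs = ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
    override def run(): FileSystem = FileSystem.get(new Configuration())
  })
  val in = fs.open(new Path("/reference/data.txt"))  // placeholder path
  // read the reference data from 'in' and use it to transform the line
  ...
}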

This is throwing "Task not serializable" exceptions for the
'UserGroupInformation' and 'FileSystem' classes. I also tried using
'SparkHadoopUtil' instead of 'UserGroupInformation', but it didn't resolve
the issue.

Could you provide some pointers in this regard?

Also, I have a query: when we instantiate a class inside the map method,
does it create a new instance for every element of the RDD it is processing?

Thanks & Regards,

*Sarath*

On Sat, Sep 6, 2014 at 4:32 PM, Sean Owen <so...@cloudera.com> wrote:

> I disagree that the generally right change is to try to make the
> classes serializable. Usually, classes that are not serializable are
> not supposed to be serialized. You're using them in a way that's
> causing them to be serialized, and that's probably not desired.
>
> For example, this is wrong:
>
> val foo: SomeUnserializableManagerClass = ...
> rdd.map(d => foo.bar(d))
>
> This is right:
>
> rdd.map { d =>
>   val foo: SomeUnserializableManagerClass = ...
>   foo.bar(d)
> }
>
> In the first instance, you create the object on the driver and try to
> serialize and copy it to workers. In the second, you're creating
> SomeUnserializableManagerClass in the function and therefore on the
> worker.
>
> mapPartitions is better if this creation is expensive.
>
> On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
> <sa...@algofusiontech.com> wrote:
> > Hi,
> >
> > I'm trying to migrate a map-reduce program to work with spark. I migrated
> > the program from Java to Scala. The map-reduce program basically loads a
> > HDFS file and for each line in the file it applies several transformation
> > functions available in various external libraries.
> >
> > When I execute this over spark, it is throwing me "Task not serializable"
> > exceptions for each and every class being used from these from external
> > libraries. I included serialization to few classes which are in my scope,
> > but there there are several other classes which are out of my scope like
> > org.apache.hadoop.io.Text.
> >
> > How to overcome these exceptions?
> >
> > ~Sarath.
>

Re: Task not serializable

Posted by Sarath Chandra <sa...@algofusiontech.com>.
Thanks Alok, Sean.

As suggested by Sean, I tried a sample program. I wrote a function which
referenced a class from a third-party library that is not serializable and
passed that function to my map call. On executing it, I got the same
exception.

Then I modified the program, removed the function, and wrote its contents
as an anonymous function inside the map call. This time the execution
succeeded.

I understood Sean's explanation, but I would appreciate references to a
more detailed explanation and examples of how to write efficient Spark
programs that avoid such pitfalls.
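
In case it helps, the two versions looked roughly like this (simplified;
ThirdPartyTransformer is just a stand-in for the actual library class):

// Failed with "Task not serializable": the object is created on the driver
// and referenced from the function, so Spark has to serialize it
val transformer = new ThirdPartyTransformer()
def transform(line: String): String = transformer.transform(line)
rdd.map(transform)

// Worked: the object is created inside the anonymous function, on the worker
rdd.map { line => new ThirdPartyTransformer().transform(line) }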

~Sarath
 On 06-Sep-2014 4:32 pm, "Sean Owen" <so...@cloudera.com> wrote:

> I disagree that the generally right change is to try to make the
> classes serializable. Usually, classes that are not serializable are
> not supposed to be serialized. You're using them in a way that's
> causing them to be serialized, and that's probably not desired.
>
> For example, this is wrong:
>
> val foo: SomeUnserializableManagerClass = ...
> rdd.map(d => foo.bar(d))
>
> This is right:
>
> rdd.map { d =>
>   val foo: SomeUnserializableManagerClass = ...
>   foo.bar(d)
> }
>
> In the first instance, you create the object on the driver and try to
> serialize and copy it to workers. In the second, you're creating
> SomeUnserializableManagerClass in the function and therefore on the
> worker.
>
> mapPartitions is better if this creation is expensive.
>
> On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
> <sa...@algofusiontech.com> wrote:
> > Hi,
> >
> > I'm trying to migrate a map-reduce program to work with spark. I migrated
> > the program from Java to Scala. The map-reduce program basically loads a
> > HDFS file and for each line in the file it applies several transformation
> > functions available in various external libraries.
> >
> > When I execute this over spark, it is throwing me "Task not serializable"
> > exceptions for each and every class being used from these from external
> > libraries. I included serialization to few classes which are in my scope,
> > but there there are several other classes which are out of my scope like
> > org.apache.hadoop.io.Text.
> >
> > How to overcome these exceptions?
> >
> > ~Sarath.
>

Re: Task not serializable

Posted by Sean Owen <so...@cloudera.com>.
I disagree that the generally right change is to try to make the
classes serializable. Usually, classes that are not serializable are
not supposed to be serialized. You're using them in a way that's
causing them to be serialized, and that's probably not desired.

For example, this is wrong:

val foo: SomeUnserializableManagerClass = ...
rdd.map(d => foo.bar(d))

This is right:

rdd.map { d =>
  val foo: SomeUnserializableManagerClass = ...
  foo.bar(d)
}

In the first instance, you create the object on the driver and try to
serialize and copy it to workers. In the second, you're creating
SomeUnserializableManagerClass in the function and therefore on the
worker.

mapPartitions is better if this creation is expensive.
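
For completeness, the mapPartitions version of the same sketch (one instance
per partition rather than one per element):

rdd.mapPartitions { iter =>
  val foo: SomeUnserializableManagerClass = ...
  iter.map(d => foo.bar(d))
}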

On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
<sa...@algofusiontech.com> wrote:
> Hi,
>
> I'm trying to migrate a map-reduce program to work with spark. I migrated
> the program from Java to Scala. The map-reduce program basically loads a
> HDFS file and for each line in the file it applies several transformation
> functions available in various external libraries.
>
> When I execute this over spark, it is throwing me "Task not serializable"
> exceptions for each and every class being used from these from external
> libraries. I included serialization to few classes which are in my scope,
> but there there are several other classes which are out of my scope like
> org.apache.hadoop.io.Text.
>
> How to overcome these exceptions?
>
> ~Sarath.
