Posted to user@spark.apache.org by David Thomas <dt...@gmail.com> on 2014/02/11 18:18:28 UTC

Task not serializable (java.io.NotSerializableException)

I'm trying to copy a file from HDFS to a temp local directory within a map
function using a static method of FileUtil, and I get the error below. Is
there a way to get around this?

org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException: org.apache.hadoop.fs.Path
    at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
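
The usual cause is that a Hadoop Path (or FileSystem/Configuration) object
created on the driver gets captured by the map closure. A minimal Scala sketch
of one workaround, building those objects inside the task instead; the RDD
and all paths are illustrative placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// `sc` is the SparkContext (e.g. in spark-shell); the RDD and paths are placeholders.
val someRdd = sc.parallelize(1 to 100)

someRdd.foreachPartition { iter =>
  // Built inside the task, so no Path/FileSystem/Configuration is captured by the closure
  val conf  = new Configuration()
  val fs    = FileSystem.get(conf)
  val local = new java.io.File("/tmp/local-copy")
  FileUtil.copy(fs, new Path("/path/on/hdfs"), local, false, conf)
  iter.foreach(elem => println(s"processing $elem against ${local.getPath}"))
}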

Re: Task not serializable (java.io.NotSerializableException)

Posted by David Thomas <dt...@gmail.com>.
The files that are in HDFS are pretty heavyweight, so I do not want to
create an RDD out of them. Instead, I have another, lightweight RDD, and I
want to apply a map function to it, within which I'll load the files onto
local disk, perform some operations with the RDD elements against these
files, and create another RDD.
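
A rough sketch of that shape using mapPartitions, so the heavy files are
copied to local disk once per partition rather than once per element;
lightRdd and the paths below are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// `sc` is the SparkContext; lightRdd and all paths are placeholders.
val lightRdd = sc.parallelize(Seq("a", "b", "c"))

val resultRdd = lightRdd.mapPartitions { iter =>
  // Copy the heavy HDFS files to local disk once per partition, on the executor
  val conf  = new Configuration()
  val fs    = FileSystem.get(conf)
  val local = new java.io.File("/tmp/heavy-input")
  FileUtil.copy(fs, new Path("/heavy/files/on/hdfs"), local, false, conf)
  // Produce the elements of the new RDD using the local copy
  iter.map(elem => elem + " checked against " + local.getPath)
}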


On Tue, Feb 11, 2014 at 10:41 AM, Andrew Ash <an...@andrewash.com> wrote:

> The full file on all the machines or just write the partitions that are
> already on each machine to disk?
>
> If the latter, try rdd.saveAsTextFile("file:///tmp/mydata")
>
>
> On Tue, Feb 11, 2014 at 9:39 AM, David Thomas <dt...@gmail.com> wrote:
>
>> I want it to be available on all machines in the cluster.
>>
>>
>> On Tue, Feb 11, 2014 at 10:35 AM, Andrew Ash <an...@andrewash.com> wrote:
>>
>>> Do you want the files scattered across the local temp directories of all
>>> your machines or just one of them?  If just one, I'd recommend having your
>>> driver program execute hadoop fs -getmerge /path/to/files...  using Scala's
>>> external process libraries.
>>>
>>>
>>> On Tue, Feb 11, 2014 at 9:18 AM, David Thomas <dt...@gmail.com> wrote:
>>>
>>>> I'm trying to copy a file from HDFS to a temp local directory within a
>>>> map function using a static method of FileUtil, and I get the error below.
>>>> Is there a way to get around this?
>>>>
>>>> org.apache.spark.SparkException: Job aborted: Task not serializable:
>>>> java.io.NotSerializableException: org.apache.hadoop.fs.Path
>>>>     at
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>>>>
>>>
>>>
>>
>

Re: Task not serializable (java.io.NotSerializableException)

Posted by Andrew Ash <an...@andrewash.com>.
The full file on all the machines or just write the partitions that are
already on each machine to disk?

If the latter, try rdd.saveAsTextFile("file:///tmp/mydata")
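
A sketch of that second option with placeholder paths; each worker writes
only the partitions it already holds to its own local filesystem:

// `sc` is the SparkContext; the paths are placeholders.
val rdd = sc.textFile("hdfs:///path/to/files")
// Each worker writes its own partitions locally as part-00000, part-00001, ...
rdd.saveAsTextFile("file:///tmp/mydata")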


On Tue, Feb 11, 2014 at 9:39 AM, David Thomas <dt...@gmail.com> wrote:

> I want it to be available on all machines in the cluster.
>
>
> On Tue, Feb 11, 2014 at 10:35 AM, Andrew Ash <an...@andrewash.com> wrote:
>
>> Do you want the files scattered across the local temp directories of all
>> your machines or just one of them?  If just one, I'd recommend having your
>> driver program execute hadoop fs -getmerge /path/to/files...  using Scala's
>> external process libraries.
>>
>>
>> On Tue, Feb 11, 2014 at 9:18 AM, David Thomas <dt...@gmail.com> wrote:
>>
>>> I'm trying to copy a file from HDFS to a temp local directory within a
>>> map function using a static method of FileUtil, and I get the error below.
>>> Is there a way to get around this?
>>>
>>> org.apache.spark.SparkException: Job aborted: Task not serializable:
>>> java.io.NotSerializableException: org.apache.hadoop.fs.Path
>>>     at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>>>
>>
>>
>

Re: Task not serializable (java.io.NotSerializableException)

Posted by David Thomas <dt...@gmail.com>.
I want it to be available on all machines in the cluster.


On Tue, Feb 11, 2014 at 10:35 AM, Andrew Ash <an...@andrewash.com> wrote:

> Do you want the files scattered across the local temp directories of all
> your machines or just one of them?  If just one, I'd recommend having your
> driver program execute hadoop fs -getmerge /path/to/files...  using Scala's
> external process libraries.
>
>
> On Tue, Feb 11, 2014 at 9:18 AM, David Thomas <dt...@gmail.com> wrote:
>
>> I'm trying to copy a file from HDFS to a temp local directory within a
>> map function using a static method of FileUtil, and I get the error below.
>> Is there a way to get around this?
>>
>> org.apache.spark.SparkException: Job aborted: Task not serializable:
>> java.io.NotSerializableException: org.apache.hadoop.fs.Path
>>     at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>>
>
>

Re: Task not serializable (java.io.NotSerializableException)

Posted by Andrew Ash <an...@andrewash.com>.
Do you want the files scattered across the local temp directories of all
your machines or just one of them?  If just one, I'd recommend having your
driver program execute hadoop fs -getmerge /path/to/files...  using Scala's
external process libraries.
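
A sketch of that driver-side call using scala.sys.process, with placeholder
paths:

import scala.sys.process._

// Run on the driver: merge the HDFS files into one local file (paths are placeholders)
val exitCode = Seq("hadoop", "fs", "-getmerge", "/path/to/files", "/tmp/merged-output").!
require(exitCode == 0, "hadoop fs -getmerge failed with exit code " + exitCode)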


On Tue, Feb 11, 2014 at 9:18 AM, David Thomas <dt...@gmail.com> wrote:

> I'm trying to copy a file from HDFS to a temp local directory within a map
> function using a static method of FileUtil, and I get the error below. Is
> there a way to get around this?
>
> org.apache.spark.SparkException: Job aborted: Task not serializable:
> java.io.NotSerializableException: org.apache.hadoop.fs.Path
>     at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>