Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2014/11/28 18:13:40 UTC

Understanding and optimizing spark disk usage during a job.

Dear all,

I have a job that crashes before it finishes with a "no space left on
device" error, and I noticed that it generates a lot of temporary data on
my disk.

To be precise, the job is a simple map job that takes a set of images,
extracts local features and saves these local features as a sequence file.
My images are represented as key-value pairs where the keys are strings
representing the id of the image (the filename) and the values are the
base64 encodings of the images.

To extract the features, I use an external C program that I call with
RDD.pipe. I stream the base64 image to the C program and it sends back the
extracted feature vectors through stdout. Each line represents one feature
vector from the current image. I don't use any serialization library; I
just write the feature vector elements to stdout separated by spaces.
Once back in Spark, I split each line, build a Scala vector from the
values, and save my sequence file.

The overall job looks like the following:

val images: RDD[(String, String)] = ...
val features: RDD[(String, Vector)] = images.pipe(...).map(_.split(" ")...)
features.saveAsSequenceFile(...)
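
To give a bit more context, here is a rough sketch of the whole pipeline.
The binary name, the paths, and the exact output format of the C program
are placeholders for details I elided above: in particular I assume here
that the program prints the image id at the start of every output line so
the key can be recovered, and that the vectors are written back as text
because a plain Scala Vector has no implicit Writable conversion.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object ExtractFeatures {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("extract-features"))

    // (imageId, base64-encoded image); how these pairs are built is elided above.
    val images: RDD[(String, String)] = sc.sequenceFile[String, String]("images.seq")

    // Assumed protocol: the C program reads "<imageId> <base64>" on stdin and,
    // for every feature it extracts, prints "<imageId> <v1> ... <v400>" on stdout,
    // so the key can be recovered from each output line.
    val lines: RDD[String] = images
      .map { case (id, b64) => s"$id $b64" }
      .pipe("./extract_features") // placeholder binary name

    val features: RDD[(String, Vector[Double])] = lines.map { line =>
      val fields = line.split(" ")
      (fields.head, fields.tail.map(_.toDouble).toVector)
    }

    // Vector[Double] is not directly convertible to a Hadoop Writable, so here
    // the values are written back as space-separated text before saving.
    features
      .mapValues(_.mkString(" "))
      .saveAsSequenceFile("features.seq")

    sc.stop()
  }
}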

The problem is that for about 3 GB of image data (about 12,000 images)
this job generates more than 180 GB of temporary data. This seems strange,
since for each image I have about 4000 double feature vectors of dimension
400.

I run the job on my laptop for testing purposes, which is why I can't add
additional disk space. In any case, I need to understand why this simple
job generates so much data and how I can reduce it.


Best,

Jao

Re: Understanding and optimizing spark disk usage during a job.

Posted by Vikas Agarwal <vi...@infoobjects.com>.
I may not be correct (in fact I may be completely off), but here is my
guess:

Assuming 8 bytes per double, 4000 vectors of dimension 400 for 12k images
would require 153.6 GB (12k * 4000 * 400 * 8) of data, which may explain
the amount of data written to disk. Without compression, it would be using
roughly that much space. You can also cross-check the storage level of
your RDDs; the default is MEMORY_ONLY. If it is also spilling data to
disk, that would further increase the storage needed.
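
Spelled out as a quick snippet (using only the figures quoted above), the
estimate is:

// 12,000 images x 4,000 vectors/image x 400 doubles/vector x 8 bytes/double
val bytes = 12000L * 4000L * 400L * 8L
println(f"${bytes / 1e9}%.1f GB") // prints 153.6 GB, before any compression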



-- 
Regards,
Vikas Agarwal
91 – 9928301411

InfoObjects, Inc.
Execution Matters
http://www.infoobjects.com
2041 Mission College Boulevard, #280
Santa Clara, CA 95054
+1 (408) 988-2000 Work
+1 (408) 716-2726 Fax