Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2014/09/18 10:50:37 UTC

Better way to process large image data set ?

Hi all,

I'm trying to process a large image data set and need a way to optimize
my implementation, since it's currently very slow. In my current
implementation I store my images in an object file with the following fields:

case class Image(groupId: String, imageId: String, buffer: String)

Images belong to groups and have an id; the buffer is the image file (jpg,
png) encoded as a base-64 string.

Before running an image processing algorithm on the image buffer, I have a
lot of jobs that filter, group, and join the images in my data set based on
groupId or imageId, and these steps are relatively slow. I suspect that
Spark moves my image buffers around even when they're not needed for these
specific jobs, wasting a lot of communication time.

Is there a better way to optimize my implementation?

Regards,

Jaonary

Re: Better way to process large image data set ?

Posted by Evan Chan <ve...@gmail.com>.
What Sean said.

You should also definitely turn on Kryo serialization. The default
Java serialization is really, really slow if you're going to move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
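
For reference, a minimal sketch of turning Kryo on. The serializer setting
is a standard Spark config key; the app name is just a placeholder, and
registering your own classes (e.g. the Image case class) with a
KryoRegistrator can speed things up further:

import org.apache.spark.{SparkConf, SparkContext}

// Switch from the default Java serializer to Kryo
val conf = new SparkConf()
  .setAppName("image-pipeline") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)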


Re: Better way to process large image data set ?

Posted by Sean Owen <so...@cloudera.com>.
Base 64 is an inefficient encoding for binary data, by about 2.6x: the
encoding itself inflates the data by 4/3, and holding it in a Java String
doubles that again with its two-byte chars. You could use byte[] directly.
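
For example, something like this sketch. DatatypeConverter ships with the
JDK, and Image/images mirror the names from the original post:

import javax.xml.bind.DatatypeConverter

// Same record, but holding raw bytes instead of a base-64 String
case class RawImage(groupId: String, imageId: String, buffer: Array[Byte])

// One-off conversion of the existing records (images: RDD[Image])
val raw = images.map { img =>
  RawImage(img.groupId, img.imageId,
           DatatypeConverter.parseBase64Binary(img.buffer))
}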

But you would still be storing and potentially shuffling lots of data in
your RDDs.

If the files exist separately on HDFS, perhaps you can just send around the
file location and load it directly using the HDFS APIs in the function that
needs it.
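
Roughly like this sketch, assuming each image is its own file on HDFS and
that process stands in for your algorithm:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Shuffle only the lightweight metadata; the image bytes never enter the RDD
case class ImageRef(groupId: String, imageId: String, path: String)

// refs: RDD[ImageRef], built from your metadata however you like
val results = refs.map { ref =>
  val fs = FileSystem.get(new Configuration())
  val path = new Path(ref.path)
  val in = fs.open(path)
  try {
    val buf = new Array[Byte](fs.getFileStatus(path).getLen.toInt)
    in.readFully(0, buf) // read the whole image into memory
    process(ref, buf)    // your image-processing algorithm (placeholder)
  } finally in.close()
}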