Posted to user@spark.apache.org by jtgenesis <jt...@gmail.com> on 2016/07/28 18:04:57 UTC

Custom Image RDD and Sequence Files

Hey all,

I was wondering what the best course of action is for processing an image
that has an involved internal structure (file headers, sub-headers, image
data, more sub-headers, more kinds of data, etc.). I was hoping to get some
insight on the approach I'm using and whether there is a better, more
idiomatic Spark way of handling it.

I'm coming from a Hadoop approach where I convert the image to a sequence
file. I'm new to both Spark and Hadoop, but I have a deeper understanding
of Hadoop, which is why I went with sequence files. The sequence file is
chopped into key/value pairs that contain file and image metadata, and
separate key/value pairs that contain the raw image data. I currently use
a LongWritable for the key and a BytesWritable for the value. This is a
naive approach, but I plan to create a custom Writable key type that
carries information pertinent to the corresponding image data. The idea is
to create a custom Spark Partitioner that takes advantage of the key
structure to reduce inter-node communication; for example, storing all
image tiles with the same key.id property on the same node.
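
To make that concrete, here is a rough sketch of the kind of partitioner I
have in mind (TileKey and the partition count are made up for illustration,
and I'd probably copy the image bytes out of the Writables into plain
arrays before any shuffle):

    import org.apache.spark.Partitioner

    // Hypothetical key: which image a tile belongs to, plus the tile's
    // index. A plain case class as a stand-in for the eventual custom
    // Writable key.
    case class TileKey(imageId: Long, tileIndex: Int)

    // Route every tile of the same image to the same partition, so that
    // per-image work doesn't have to pull tiles from other nodes.
    class ImagePartitioner(override val numPartitions: Int) extends Partitioner {
      require(numPartitions > 0)
      override def getPartition(key: Any): Int = key match {
        case TileKey(imageId, _) =>
          (((imageId % numPartitions) + numPartitions) % numPartitions).toInt
        case _ => 0
      }
    }

    // e.g. tilesRdd.partitionBy(new ImagePartitioner(64))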

1.) Is converting the image to a sequence file superfluous? Would it be
better to do this pre-processing and create a custom key/value type some
other way, and if so, through Spark or through Hadoop's Writables? It seems
like Spark just uses different flavors of Hadoop's InputFormat under the
hood anyway.

I see that Spark does have support for SequenceFiles, but I'm still not
fully clear on the extent of it.
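
For what it's worth, the two entry points I've found so far look roughly
like this (the path is just a placeholder), which is what makes me think
it's Hadoop InputFormats underneath either way:

    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

    // The convenience wrapper:
    val tiles = sc.sequenceFile("/data/tiles.seq",
      classOf[LongWritable], classOf[BytesWritable])

    // The generic route, where any Hadoop InputFormat/Writable
    // combination can be plugged in:
    val tiles2 = sc.newAPIHadoopFile("/data/tiles.seq",
      classOf[SequenceFileInputFormat[LongWritable, BytesWritable]],
      classOf[LongWritable], classOf[BytesWritable])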

2.) When you read in a .seq file through sc.sequenceFile(), it uses
SequenceFileInputFormat. This means the number of partitions will be
determined by the number of splits computed by
SequenceFileInputFormat.getSplits. Do the input splits fall on key/value
record boundaries?
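
For reference, this is how I've been checking the resulting partitioning
in the shell (the path and the partition hint are placeholders):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    val tiles = sc.sequenceFile("/data/tiles.seq",
      classOf[LongWritable], classOf[BytesWritable], minPartitions = 8)

    // The actual count comes out of getSplits(), so it may not be exactly 8.
    println(tiles.partitions.length)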

3.) The RDD created from a sequence file will have the translated Scala
key/value types, but if I use a custom Hadoop Writable, will I have to do
anything on the Spark/Scala side for it to be understood?
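
Right now, with the built-in Writables, I copy everything out into plain
Scala types immediately after the read (my understanding is that the
record reader reuses the same Writable instances, so holding on to them
directly is risky); I'm assuming a custom Writable would get the same
treatment:

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    val raw = sc.sequenceFile("/data/tiles.seq",
      classOf[LongWritable], classOf[BytesWritable])

    // Materialise plain (Long, Array[Byte]) pairs so nothing downstream
    // touches the reused Writable objects.
    val tiles = raw.map { case (k, v) => (k.get(), v.copyBytes()) }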

4.) Since I'm using a custom Hadoop Writable, is it best to register my
Writable types with Kryo?
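
If it helps, this is the registration I had in mind, just with the
built-in types for now (the custom classes would join the list once they
exist):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("image-tiles")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[LongWritable], classOf[BytesWritable]))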

Thanks for any help!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Image-RDD-and-Sequence-Files-tp27426.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Custom Image RDD and Sequence Files

Posted by Jörn Franke <jo...@gmail.com>.
Why don't you write your own Hadoop FileInputFormat? It can be used by Spark...
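
A minimal skeleton (class and path names made up, and the RecordReader
that actually walks your headers is the real work) could look like this:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // A format that understands the image container (headers, sub-headers,
    // tile data) and emits one record per tile.
    class ImageTileInputFormat extends FileInputFormat[LongWritable, BytesWritable] {

      // The container has internal structure, so don't let Hadoop split it
      // blindly at block boundaries; parse each file whole (or split only
      // at known tile offsets).
      override def isSplitable(context: JobContext, file: Path): Boolean = false

      override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
          : RecordReader[LongWritable, BytesWritable] = {
        // A RecordReader that walks the headers and emits (tileId, tileBytes)
        // pairs goes here; omitted.
        ???
      }
    }

    // Then from Spark:
    //   val tiles = sc.newAPIHadoopFile("/path/to/image",
    //     classOf[ImageTileInputFormat], classOf[LongWritable], classOf[BytesWritable])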

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org