Posted to user@spark.apache.org by jtgenesis <jt...@gmail.com> on 2016/07/15 17:31:44 UTC

Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)

I'm working with a single image file that consists of headers and a multitude
of different data segment types (each data segment having its own sub-header
that contains metadata). I'm currently using Hadoop's HDFS.

Example file layout:

| Header | Seg A-1 Sub-Header | Seg A-1 Data | Seg A-2 SubHdr | Seg A-2 Data
| Seg B-1 Subhdr | Seg B-1 Data | Seg C-1 SubHdr | Seg C-1 Data | etc....

The headers vary from 1 to 10 KB in size, and each data segment can be
anywhere from 10 KB to 10 GB. The headers are character data and the segments
are binary. The headers include some useful information such as the number of
segments and the sizes of the sub-headers and segment data (I'll need this to
create my splits).

To digest it all, I'm wondering if it's best to create a custom InputFormat
inheriting from (1) FileInputFormat or (2) SequenceFileInputFormat.

If I go with (1), I will create HeaderSplits and DataSplits (data splits will
be equivalent to the 128 MB block size). I would also create a custom
RecordReader for the DataSplits, where each record is one tile of 1024^2
bytes; the record reader would simply read one tile at a time. For the
headers, each split would contain a single record.
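
Roughly, this is the kind of RecordReader I'm picturing for the DataSplits
(just a sketch, untested, and the class name is made up):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical reader: emits one 1 MB tile per record from a DataSplit.
public class TileRecordReader extends RecordReader<LongWritable, BytesWritable> {
    private static final int TILE_SIZE = 1024 * 1024;

    private FSDataInputStream in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        pos = start;
        Path path = fileSplit.getPath();
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        in = fs.open(path);
        in.seek(start);                         // jump to this split's first tile
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (pos >= end) return false;
        int toRead = (int) Math.min(TILE_SIZE, end - pos);  // last tile may be short
        byte[] buffer = new byte[toRead];
        in.readFully(buffer, 0, toRead);        // one tile per record
        key.set(pos);                           // key = byte offset of the tile
        value.set(buffer, 0, toRead);
        pos += toRead;
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() {
        return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }
    @Override public void close() throws IOException { if (in != null) in.close(); }
}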

If I go with (2), I believe the bulk of my work would be in converting my
image file to a SequenceFile. I would create a key/value pair for each
header/sub-header, and a key/value pair for every 1024^2 bytes of segment
data. Once I do that, I would have to create a custom SequenceFileInputFormat
that also splits the headers from the partitioned data segments. I've read
that SequenceFiles are great for dealing with the "large number of small
files" problem, but I'm dealing with just one image file (although possibly
with many different data segments).
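
For that conversion, I imagine a one-off driver doing something like this
(just a sketch; the segment offset/length would come from parsing the header,
and the names are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical one-off converter: writes one data segment as tile-sized
// key/value records. Offset and length are assumed to come from the header.
public class SegmentToSequenceFile {
    private static final int TILE_SIZE = 1024 * 1024;

    public static void convert(Path imagePath, Path seqPath, String segmentId,
                               long segmentOffset, long segmentLength) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = imagePath.getFileSystem(conf);

        try (FSDataInputStream in = fs.open(imagePath);
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                 SequenceFile.Writer.file(seqPath),
                 SequenceFile.Writer.keyClass(Text.class),
                 SequenceFile.Writer.valueClass(BytesWritable.class))) {

            in.seek(segmentOffset);             // jump to the start of this segment
            long remaining = segmentLength;
            long tileIndex = 0;
            byte[] buffer = new byte[TILE_SIZE];
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (remaining > 0) {
                int toRead = (int) Math.min(TILE_SIZE, remaining);
                in.readFully(buffer, 0, toRead);
                key.set(segmentId + "-" + tileIndex);   // key = segment + tile index
                value.set(buffer, 0, toRead);           // value = raw tile bytes
                writer.append(key, value);
                remaining -= toRead;
                tileIndex++;
            }
        }
    }
}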

I also noticed that SequenceFileInputFormat uses FileInputFormat's getSplits
implementation. I'm assuming I would have to override it to get the kinds of
splits I want (extract the header key/value pair, parse out the size info,
etc.).
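
For (1), the getSplits override I have in mind would look roughly like this
(again just a sketch; parseSegments stands in for the header parsing I'd have
to write, and TileRecordReader is the reader sketched above):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical InputFormat that splits on segment boundaries taken from the
// parsed header instead of raw HDFS block boundaries.
public class SegmentedImageInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<>();
        for (FileStatus status : listStatus(job)) {
            Path path = status.getPath();
            // One split per (offset, length) pair reported by the header.
            for (long[] seg : parseSegments(job.getConfiguration(), path)) {
                splits.add(new FileSplit(path, seg[0], seg[1], null));
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new TileRecordReader();  // the reader sketched above
    }

    private List<long[]> parseSegments(Configuration conf, Path path) throws IOException {
        // Placeholder: open the file, parse the main header and sub-headers,
        // and break each data segment into <= 128 MB, tile-aligned splits.
        throw new UnsupportedOperationException("format-specific header parsing");
    }
}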

Is one approach better than the other? I feel (1) would be a simpler task,
but (2) has a lot of nice features. Is there a better way? 

This is probably more of a Hadoop question, but I was curious whether anyone
here has experience with this.

Thank you in advance!




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-InputFormat-SequenceFileInputFormat-vs-FileInputFormat-tp27344.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)

Posted by Jörn Franke <jo...@gmail.com>.
I am not sure I exactly understand your use case, but for my Hadoop/Spark format that reads the Bitcoin blockchain I extend FileInputFormat and use the default split mechanism. This can mean a split falls in the middle of a Bitcoin block, which is not an issue: the reader for the first split can read beyond the split's original end (the remaining bytes may be transferred from a remote node), and the reader for the second split seeks forward to the start of the next block.
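
A rough sketch of that pattern (not my actual code; the two record-detection methods are placeholders for the format-specific logic, e.g. scanning for the block magic number):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative reader for variable-length records with default splits: each
// split only emits records that *start* inside it, but may read past its end
// to finish the last record (possibly pulling bytes from a remote block).
public class BoundaryAwareRecordReader extends RecordReader<LongWritable, BytesWritable> {
    private FSDataInputStream in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        Path path = fileSplit.getPath();
        in = path.getFileSystem(ctx.getConfiguration()).open(path);
        // A record straddling 'start' belongs to the previous split, so seek
        // forward to the first record that starts inside this split.
        pos = (start == 0) ? 0 : seekToNextRecordStart(start);
        in.seek(pos);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (pos >= end) return false;           // next record belongs to a later split
        byte[] record = readOneRecord(in);      // may read bytes beyond 'end'
        key.set(pos);
        value.set(record, 0, record.length);
        pos += record.length;
        return true;
    }

    // Placeholders for the format-specific parts (magic-number scan, length
    // fields from a subheader, etc.).
    private long seekToNextRecordStart(long from) throws IOException { throw new UnsupportedOperationException(); }
    private byte[] readOneRecord(FSDataInputStream in) throws IOException { throw new UnsupportedOperationException(); }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() {
        return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }
    @Override public void close() throws IOException { if (in != null) in.close(); }
}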

However, the main difference from your case is that my blocks are of similar size. Your segment sizes can vary a lot, which means one task could be busy with a small segment while another handles a very big one, so parallel processing might be suboptimal. It also depends on this: what do you plan to do with the segments afterwards?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org