Posted to user@hadoop.apache.org by Travis Chung <jt...@gmail.com> on 2016/07/15 17:11:48 UTC

Hadoop Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)

I'm working with a single image file that consists of headers and many
different types of data segment (each data segment has its own sub-header
that contains metadata).

Example file layout:

| Header | Seg A-1 Sub-Header | Seg A-1 Data | Seg A-2 Sub-Header | Seg A-2 Data | Seg B-1 Sub-Header | Seg B-1 Data | Seg C-1 Sub-Header | Seg C-1 Data | etc.

The headers vary from 1 KB to 10 KB in size, and each data segment is
anywhere from 10 KB to 10 GB. The headers are stored as characters and the
data as binary. The headers include some useful information, such as the
number of segments and the sizes of the sub-headers and segment data
(I'll need this to create my splits).
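
To make the offset arithmetic concrete, here is a rough sketch of how the position of each segment could be derived once the sizes have been read from the headers. The names (SegmentLayout, Segment, locate) are hypothetical, and the sizes are assumed to be available as plain numbers already; the header parsing itself is not shown because it depends on the file format.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentLayout {
    /** Byte positions of one segment within the image file (hypothetical type). */
    static final class Segment {
        final long subHeaderStart;
        final long dataStart;
        final long dataLength;

        Segment(long subHeaderStart, long dataStart, long dataLength) {
            this.subHeaderStart = subHeaderStart;
            this.dataStart = dataStart;
            this.dataLength = dataLength;
        }
    }

    /**
     * Walks the file layout front to back.
     * sizes[i][0] is the sub-header length of segment i,
     * sizes[i][1] is its data length (both assumed to come from the headers).
     */
    static List<Segment> locate(long fileHeaderLength, long[][] sizes) {
        List<Segment> segments = new ArrayList<>();
        long offset = fileHeaderLength; // first sub-header follows the file header
        for (long[] s : sizes) {
            long dataStart = offset + s[0];
            segments.add(new Segment(offset, dataStart, s[1]));
            offset = dataStart + s[1]; // next sub-header follows this segment's data
        }
        return segments;
    }
}
```

Every later step (header splits, data splits, tiles) only needs these start positions and lengths.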

To ingest it all, I'm wondering if it's best to create a custom InputFormat
inheriting from (1) FileInputFormat or (2) SequenceFileInputFormat.

If I go with (1), I will create HeaderSplits and DataSplits (data splits
will be sized to match the 128 MB block size). I would also write a custom
RecordReader for the DataSplits, where each record is one tile of 1024^2
bytes, so the record reader just reads one tile at a time. Each header
split will contain a single record.
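
A minimal sketch of the split planning I have in mind for option (1), written as plain Java without Hadoop dependencies so the arithmetic is easy to check. The class and method names (DataSplitPlanner, planSplits, tileCount) are hypothetical; in a real InputFormat each {start, length} pair would presumably become a FileSplit returned from an overridden getSplits.

```java
import java.util.ArrayList;
import java.util.List;

class DataSplitPlanner {
    static final long TILE_BYTES = 1024L * 1024L;      // one record = one tile
    static final long SPLIT_BYTES = 128L * TILE_BYTES; // 128 MB = exactly 128 tiles

    /** Returns {start, length} pairs covering one segment's data region. */
    static List<long[]> planSplits(long dataStart, long dataLength) {
        List<long[]> splits = new ArrayList<>();
        for (long done = 0; done < dataLength; done += SPLIT_BYTES) {
            long length = Math.min(SPLIT_BYTES, dataLength - done);
            splits.add(new long[] {dataStart + done, length});
        }
        return splits;
    }

    /** Number of records in a split; the last tile may be shorter than TILE_BYTES. */
    static long tileCount(long splitLength) {
        return (splitLength + TILE_BYTES - 1) / TILE_BYTES;
    }
}
```

Because splits are measured from the start of each segment's data, no tile straddles two splits. The trade-off is that these splits will generally not line up with HDFS block boundaries, since the headers shift every offset, so some reads may not be local.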

If I go with (2), I believe the bulk of my work would be in converting my
image file to a SequenceFile. I would create a key/value pair for each
header/sub-header, and a key/value pair for every 1024^2 bytes of segment
data. Once I do that, I would have to create a custom
SequenceFileInputFormat that also separates the headers from the
partitioned data segments. I read that SequenceFiles are great for dealing
with the "large # of small files" problem, but I'm dealing with just 1
image file (although with possibly many different data segments).
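
For option (2), the conversion step would look roughly like the sketch below: cut one segment's data into tile-sized values and give each a key. This is plain Java with hypothetical names (TileChunker, chunk) and a made-up key scheme ("segmentId/tile-N"). It collects the records in memory only for illustration; a real converter would presumably append each pair to a SequenceFile.Writer as it goes, for example with Text keys and BytesWritable values.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class TileChunker {
    /** Reads the stream to its end and returns one key/value pair per tile. */
    static List<Map.Entry<String, byte[]>> chunk(String segmentId, InputStream data, int tileBytes) {
        List<Map.Entry<String, byte[]>> records = new ArrayList<>();
        byte[] buffer = new byte[tileBytes];
        int index = 0;
        try {
            while (true) {
                // read() may return fewer bytes than requested, so loop until
                // the tile is full or the stream ends.
                int filled = 0;
                while (filled < tileBytes) {
                    int n = data.read(buffer, filled, tileBytes - filled);
                    if (n < 0) break;
                    filled += n;
                }
                if (filled == 0) break; // nothing left to emit
                String key = segmentId + "/tile-" + index++;
                records.add(new SimpleEntry<String, byte[]>(key, Arrays.copyOf(buffer, filled)));
                if (filled < tileBytes) break; // short final tile, stream exhausted
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

The header and sub-header records would get their own keys in the same file, which is why a custom input format would still need some way to tell header records apart from tile records.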

I also noticed that SequenceFileInputFormat uses FileInputFormat's getSplits
implementation. I'm assuming I would have to override it to get the kinds of
splits that I want (extract the header key/value pair, parse out the size
info, etc.).

Is one approach better than the other? I feel (1) would be a simpler task,
but (2) has a lot of nice features. Is there a better way? Thank you in
advance!