Posted to common-user@hadoop.apache.org by Weishung Chung <we...@gmail.com> on 2011/03/21 15:40:53 UTC

Sync-marker in uncompressed sequenceFile

Hello my fellow Hadoop users/developers,

I'm reading the SequenceFile source code, and there is a checkAndWriteSync()
method that writes a sync marker every so many bytes. I was wondering what's
the use of the sync marker. I know one can use it to designate the end of a
header, but it's also used in the process of writing the uncompressed data.
 I wish I could have figured it out :(

Thank you so much

Re: Sync-marker in uncompressed sequenceFile

Posted by Harsh J <qw...@gmail.com>.

Hello,

On Mon, Mar 21, 2011 at 8:10 PM, Weishung Chung <we...@gmail.com> wrote:
> Hello my fellow Hadoop users/developers,
>
> I'm reading the SequenceFile source code, and there is a checkAndWriteSync()
> method that writes a sync marker every so many bytes. I was wondering what's
> the use of the sync marker. I know one can use it to designate the end of a
> header, but it's also used in the process of writing the uncompressed data.
>  I wish I could have figured it out :(

It marks a logical boundary between records or sets of records. This
is ultimately used to read records properly across block splits in
HDFS. With text files, a reader looks for a line ending such as '\n'.
Sequence files have no such delimiter, so a 'sync' marker is used
instead to find the 'ends' of records so that Map/Reduce can read them
back correctly.
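
As a rough illustration of the writing side, here is a minimal Python
sketch of a writer that drops a sync entry into the stream every so many
bytes. It is a toy model, not Hadoop's on-disk format: the function name
write_records, the 64-byte SYNC_INTERVAL and the record layout (a 4-byte
length followed by the payload) are assumptions made for the example.
The only ideas borrowed from the thread are that the marker is defined in
the header and that a check before each record decides whether to write it.

```python
import struct

SYNC_ESCAPE = -1      # assumed: written where a record length would normally go
SYNC_INTERVAL = 64    # toy value: minimum bytes between two sync entries


def write_records(records, sync, interval=SYNC_INTERVAL):
    """Serialize records as [int32 length][payload] with periodic sync entries.

    Returns the encoded bytes and the offsets at which sync entries start.
    """
    out = bytearray()
    out += sync                      # toy header: nothing but the sync marker
    last_sync_pos = len(out)
    sync_positions = []
    for payload in records:
        # Rough analogue of checkAndWriteSync(): enough bytes since the last marker?
        if len(out) >= last_sync_pos + interval:
            last_sync_pos = len(out)
            sync_positions.append(last_sync_pos)
            out += struct.pack(">i", SYNC_ESCAPE)   # impossible record length
            out += sync                             # same marker as in the header
        out += struct.pack(">i", len(payload))
        out += payload
    return bytes(out), sync_positions


if __name__ == "__main__":
    marker = bytes(range(16))        # fixed marker, for reproducibility only
    recs = [b"record-%03d" % i for i in range(20)]
    data, positions = write_records(recs, marker)
    print("file size:", len(data), "sync entries at:", positions)
```

Because a negative length can never belong to a real record, a reader that
meets the escape value knows the next bytes are a marker rather than data.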

-- 
Harsh J
http://harshj.com

Re: Sync-marker in uncompressed sequenceFile

Posted by Weishung Chung <we...@gmail.com>.

Thanks, exciting work!

On Mon, Mar 21, 2011 at 3:07 PM, Chris Douglas <cd...@apache.org> wrote:

> It's used to align input splits of the SequenceFile. A reader can
> start at an arbitrary offset, then find the boundary of the next block
> of records by looking for the sync marker defined in the header. -C
>
> On Mon, Mar 21, 2011 at 7:40 AM, Weishung Chung <we...@gmail.com> wrote:
> > Hello my fellow Hadoop users/developers,
> >
> > I'm reading the SequenceFile source code, and there is a checkAndWriteSync()
> > method that writes a sync marker every so many bytes. I was wondering what's
> > the use of the sync marker. I know one can use it to designate the end of a
> > header, but it's also used in the process of writing the uncompressed data.
> >  I wish I could have figured it out :(
> >
> > Thank you so much
> >
>

Re: Sync-marker in uncompressed sequenceFile

Posted by Chris Douglas <cd...@apache.org>.

It's used to align input splits of the SequenceFile. A reader can
start at an arbitrary offset, then find the boundary of the next block
of records by looking for the sync marker defined in the header. -C
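
The reading side described above can be sketched in the same spirit. The
Python below is a toy model under stated assumptions, not Hadoop's reader:
build_file and read_split are hypothetical names, the header is assumed to
hold only the 16-byte marker, and records are assumed to be a 4-byte length
followed by the payload. A reader given the byte range [start, end) skips
forward to the first sync entry at or after start, then reads until it
meets a sync entry at or after end, so every record is read by exactly one
split.

```python
import struct

SYNC_SIZE = 16
SYNC_ESCAPE = struct.pack(">i", -1)   # assumed escape value preceding a marker


def build_file(records, sync, every=3):
    """Toy writer: marker as header, then a sync entry before every `every` records."""
    out = bytearray(sync)
    for i, payload in enumerate(records):
        if i > 0 and i % every == 0:
            out += SYNC_ESCAPE + sync
        out += struct.pack(">i", len(payload)) + payload
    return bytes(out)


def read_split(data, start, end):
    """Return the records owned by the byte range [start, end)."""
    sync = data[:SYNC_SIZE]            # the marker is defined in the header
    entry = SYNC_ESCAPE + sync
    if start == 0:
        pos = SYNC_SIZE                # first split: records begin after the header
    else:
        # Arbitrary offset: scan forward to the next sync entry.
        # (A payload could in principle contain these bytes; a random
        # 16-byte marker makes that very unlikely.)
        pos = data.find(entry, start)
        if pos < 0:
            return []
    records = []
    while pos < len(data):
        if data[pos:pos + 4] == SYNC_ESCAPE:
            if pos >= end:
                break                  # this block belongs to the next split
            pos += len(entry)
            continue
        (length,) = struct.unpack(">i", data[pos:pos + 4])
        pos += 4
        records.append(data[pos:pos + length])
        pos += length
    return records


if __name__ == "__main__":
    marker = bytes(range(16))          # fixed marker, for reproducibility only
    recs = [b"rec-%d" % i for i in range(10)]
    data = build_file(recs, marker)
    middle = len(data) // 2            # a cut that ignores record boundaries
    print(read_split(data, 0, middle))
    print(read_split(data, middle, len(data)))
```

The cut point can fall in the middle of a record without harm: the first
reader finishes the block it started, and the second begins at the next
marker.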

On Mon, Mar 21, 2011 at 7:40 AM, Weishung Chung <we...@gmail.com> wrote:
> Hello my fellow Hadoop users/developers,
>
> I'm reading the SequenceFile source code, and there is a checkAndWriteSync()
> method that writes a sync marker every so many bytes. I was wondering what's
> the use of the sync marker. I know one can use it to designate the end of a
> header, but it's also used in the process of writing the uncompressed data.
>  I wish I could have figured it out :(
>
> Thank you so much
>