Posted to common-dev@hadoop.apache.org by Christopher Ng <cn...@gmail.com> on 2013/06/24 11:20:48 UTC

bug in SequenceFile.sync()?

Cross-posting this from the cdh-users group, where it received little interest:

Is there a bug in SequenceFile.sync()?  This is from cdh4.3.0:

    /** Seek to the next sync mark past a given position.*/
    public synchronized void sync(long position) throws IOException {
      if (position+SYNC_SIZE >= end) {
        seek(end);
        return;
      }

      if (position < headerEnd) {
        // seek directly to first record
        in.seek(headerEnd);  // <==== should this not call seek (i.e. this.seek) instead?
        // note the sync marker "seen" in the header
        syncSeen = true;
        return;
      }

The problem is that when you sync to the start of a compressed file,
noBufferedKeys and valuesDecompressed aren't reset, so a block read isn't
triggered.  When you subsequently call next(), you may get keys from the
buffer, which still contains keys from the previous position in the file.
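
To make the failure mode concrete, here is a minimal, self-contained Java sketch. It is a toy model, not Hadoop's code: ToyBlockReader, rawSeek, and the use of a block index as the "position" are hypothetical names made up for illustration. rawSeek stands in for calling in.seek(...) directly, and seek stands in for this.seek(...), which is assumed here to discard the buffered keys.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Toy model of a block-buffered reader. NOT Hadoop's SequenceFile.Reader. */
final class ToyBlockReader {
    // Each inner list plays the role of one compressed block of keys.
    private final List<List<String>> blocks;
    // Index of the next block to load; stands in for the stream position.
    private int pos = 0;
    // Keys already read from the current block but not yet returned.
    private final Deque<String> bufferedKeys = new ArrayDeque<>();

    ToyBlockReader(List<List<String>> blocks) {
        this.blocks = blocks;
    }

    /** Moves only the underlying position (hypothetical analogue of in.seek). */
    void rawSeek(int blockIndex) {
        pos = blockIndex;
    }

    /** Moves the position and drops buffered keys (hypothetical analogue of this.seek). */
    void seek(int blockIndex) {
        pos = blockIndex;
        bufferedKeys.clear();
    }

    /** Returns the next key, loading a block only when the buffer is empty; null at end. */
    String next() {
        if (bufferedKeys.isEmpty()) {
            if (pos >= blocks.size()) {
                return null;
            }
            bufferedKeys.addAll(blocks.get(pos));
            pos++;
        }
        return bufferedKeys.poll();
    }
}
```

In this model, if you read three keys from a file with two blocks of two keys each and then call rawSeek(0), the next call to next() returns the leftover key from the second block rather than the first key of the file. Calling seek(0) instead clears the buffer, so next() reloads the first block.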

Re: bug in SequenceFile.sync()?

Posted by Christopher Ng <cn...@gmail.com>.
Cool, thanks.  Is there an ETA on a fix, or a workaround for the case where I
want to seek to the start of the file?


On Mon, Jun 24, 2013 at 4:39 PM, Colin McCabe <cm...@alumni.cmu.edu> wrote:

> Hi Chris,
>
> Thanks for the report.  I filed
> https://issues.apache.org/jira/browse/HADOOP-9667 for this.
>
> Colin
> Software Engineer, Cloudera

Re: bug in SequenceFile.sync()?

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
Hi Chris,

Thanks for the report.  I filed
https://issues.apache.org/jira/browse/HADOOP-9667 for this.

Colin
Software Engineer, Cloudera



Re: bug in SequenceFile.sync()?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Christopher,

Indeed, I think that noBufferedKeys and valuesDecompressed should be
reset.

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com