Posted to dev@parquet.apache.org by Daniel Weeks <dw...@netflix.com.INVALID> on 2015/12/04 22:32:11 UTC

Can't read some parquet files after ByteBuffer Patch

Jason or Julien,

Just wanted to see if you or anyone else has run into problems reading
files after the ByteBuffer patch.  I've been running into issues and have
narrowed it down to the ByteBuffer commit using a small repro file (written
with 1.6.0, unfortunately can't share the data).

It doesn't happen for every file, but those that fail give this error:

can not read class org.apache.parquet.format.PageHeader: Required field
'uncompressed_page_size' was not found in serialized data! Struct:
PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)

I assume that the real problem is somehow being trapped and suppressed by
thrift.
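
For anyone who wants to poke at this without the data, here is a minimal
sketch of how this failure mode can surface (my assumption is that the read
path goes through parquet-format's Util.readPageHeader; the exact call path
may differ):

import java.io.ByteArrayInputStream;

import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.Util;

public class PageHeaderMisread {
  public static void main(String[] args) {
    // To thrift's compact protocol, a leading 0x00 byte looks like an
    // immediate STOP field, so deserialization returns before any required
    // field is populated and validation reports "Required field ... was not
    // found in serialized data!" instead of the real offset problem.
    byte[] misaligned = new byte[] {0, 0, 0, 0};
    try {
      PageHeader header = Util.readPageHeader(new ByteArrayInputStream(misaligned));
      System.out.println(header);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}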

Has anyone else seen this?

Thanks,
Dan

Re: Can't read some parquet files after ByteBuffer Patch

Posted by Jason Altekruse <al...@gmail.com>.
I got the file; I should have time to look at it today.


Re: Can't read some parquet files after ByteBuffer Patch

Posted by Daniel Weeks <dw...@netflix.com.INVALID>.
I sent Jason a file that can reproduce the issue with just 1K lines in it.

If you want, I can open a JIRA and attach the file.

5a45ae3b1deb5117cb9e9a13141eeab1e9ad3d71 can read the file without issue
6b605a4ea05b66e1a6bf843353abcb4834a4ced8 (the ByteBuffer commit) cannot read the file
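
A standalone check along these lines reproduces the difference between the
two commits (a sketch; the ParquetFileReader constructor has moved around
between versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ReadRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]); // path to the 1K-line repro file
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    ParquetFileReader reader = new ParquetFileReader(
        conf, file, footer.getBlocks(),
        footer.getFileMetaData().getSchema().getColumns());
    // Reading row groups forces the page headers to be parsed; the
    // ByteBuffer commit fails here with the PageHeader error above.
    PageReadStore pages;
    while ((pages = reader.readNextRowGroup()) != null) {
      System.out.println("read row group with " + pages.getRowCount() + " rows");
    }
    reader.close();
  }
}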

-Dan


Re: Can't read some parquet files after ByteBuffer Patch

Posted by Julien Le Dem <ju...@dremio.com>.
In the meantime, if you have the stack trace for this error, that would
help too.

-- 
Julien

Re: Can't read some parquet files after ByteBuffer Patch

Posted by Jason Altekruse <al...@gmail.com>.
I assume that the buffer that we are giving to thrift doesn't have the
header in it at the expected position. We hadn't seen this error in any of
our regression tests in Drill with the final version of the patch, but I
have debugged a few issues that produced this error in the past, including
some that came up when we merged our changes into master.
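
To illustrate the kind of misalignment I mean (hypothetical code, not the
actual patch): if the InputStream handed to thrift is built from the
buffer's backing array without honoring position() and arrayOffset(),
parsing starts at the wrong byte:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class BufferPositionPitfall {

  // BUG: ignores position() and arrayOffset(), so thrift starts parsing at
  // byte 0 of the backing array instead of at the page header.
  static InputStream wrong(ByteBuffer buf) {
    return new ByteArrayInputStream(buf.array());
  }

  // Honors the buffer's current window: start at arrayOffset() + position()
  // and expose only remaining() bytes.
  static InputStream right(ByteBuffer buf) {
    return new ByteArrayInputStream(
        buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
  }
}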

Can you try to generate data similar to the private dataset that produces
the issue? If you are having trouble reproducing, could you share the data
types and encodings used in the file, and I can try to reproduce it.
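
If generating synthetic data is easier than trimming the real file,
something along these lines should work (a sketch using the parquet-mr
example API; the schema here is a placeholder, so swap in the types and
encodings from the failing file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteRepro {
  public static void main(String[] args) throws Exception {
    // Placeholder schema; replace with the column types from the real file.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message repro { required int64 id; optional binary name (UTF8); }");
    Configuration conf = new Configuration();
    GroupWriteSupport.setSchema(schema, conf);
    ParquetWriter<Group> writer = new ParquetWriter<Group>(
        new Path("/tmp/repro.parquet"), conf, new GroupWriteSupport());
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    for (long i = 0; i < 1000; i++) {
      writer.write(factory.newGroup().append("id", i).append("name", "row-" + i));
    }
    writer.close();
  }
}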

Thanks,
Jason
