Posted to dev@parquet.apache.org by "Spiro Michaylov (JIRA)" <ji...@apache.org> on 2016/02/14 23:12:18 UTC

[jira] [Updated] (PARQUET-531) Can't read past first page in a column

     [ https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Spiro Michaylov updated PARQUET-531:
------------------------------------
    Attachment: part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet

This is a single shard of HDFS/Parquet output from Apache Spark 1.5.0. To reproduce, just run parquet_reader on it after enabling GZip decompression.

> Can't read past first page in a column
> --------------------------------------
>
>                 Key: PARQUET-531
>                 URL: https://issues.apache.org/jira/browse/PARQUET-531
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>         Environment: Ubuntu Linux 14.04 (no obvious platform dependence), Parquet file created by Apache Spark 1.5.0 on the same platform. 
>            Reporter: Spiro Michaylov
>         Attachments: part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>      case parquet::CompressionCodec::GZIP:
>        decompressor_.reset(new GZipCodec());
>        break;
> {code}
> I then tried to run the parquet_reader example on the file I'm about to attach, which was created by Apache Spark 1.5.0. It works surprisingly well until it hits the end of the first page, where it dies with:
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support is new and (b) I had to modify the code to enable it, but decompression itself seems to work just fine (congratulations: this is awesome!). Looking at the problem in the debugger and tracing through a bit, it seems to me that the buffering is broken more generally, with some kind of confusion between the buffering at the Scanner and Reader levels. I can also reproduce the problem by reading through just a single column.


