Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/02/15 02:21:18 UTC

[jira] [Commented] (PARQUET-531) Can't read past first page in a column

    [ https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146819#comment-15146819 ] 

Wes McKinney commented on PARQUET-531:
--------------------------------------

Thanks [~spirom] for having a look! This is a pretty active construction site at the moment, so bear with us -- [~mdeepak] is working on the column reader/scanner code path with multiple data pages. We'll take a look at your data and make sure it can be scanned properly once that work is completed in the next several days.

> Can't read past first page in a column
> --------------------------------------
>
>                 Key: PARQUET-531
>                 URL: https://issues.apache.org/jira/browse/PARQUET-531
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>         Environment: Ubuntu Linux 14.04 (no obvious platform dependence), Parquet file created by Apache Spark 1.5.0 on the same platform. 
>            Reporter: Spiro Michaylov
>         Attachments: part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>      case parquet::CompressionCodec::GZIP:
>        decompressor_.reset(new GZipCodec());
>        break;
> {code}
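> For context, these three lines slot into the codec switch in serialized-page.cc. Reconstructed around the fragment above, the switch looks roughly like this (the non-GZIP cases and surrounding names are my paraphrase for illustration, not a verbatim copy of the source):
> {code}
>     switch (codec) {
>       case parquet::CompressionCodec::UNCOMPRESSED:
>         // No decompressor needed.
>         break;
>       case parquet::CompressionCodec::SNAPPY:
>         decompressor_.reset(new SnappyCodec());
>         break;
>       case parquet::CompressionCodec::GZIP:  // the newly enabled case
>         decompressor_.reset(new GZipCodec());
>         break;
>       default:
>         throw ParquetException("Unsupported compression codec");
>     }
> {code}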
> I try to run the parquet_reader example on the file I'm about to attach, which was created by Apache Spark 1.5.0. It works surprisingly well until it hits the end of the first page, where it dies with
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support is new and (b) I had to modify the code to enable it. But decompression itself seems to work just fine (congratulations: this is awesome!). Stepping through the problem in the debugger, it seems to me like the buffering is a bit screwed up in general -- some kind of confusion between the buffering at the Scanner and Reader levels. I can reproduce the problem by reading through just a single column too, roughly like the sketch below.
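> The single-column loop is essentially this (a sketch only -- the header path, Int64Reader, and the ReadBatch signature are my assumptions about the reader API, and the column type will vary with the file's schema):
> {code}
> #include <memory>
> #include <string>
> #include "parquet/api/reader.h"
>
> void ScanFirstColumn(const std::string& path) {
>   // Open the file and grab the first column of the first row group.
>   std::unique_ptr<parquet::ParquetFileReader> reader =
>       parquet::ParquetFileReader::OpenFile(path);
>   std::shared_ptr<parquet::ColumnReader> col = reader->RowGroup(0)->Column(0);
>   auto* typed = static_cast<parquet::Int64Reader*>(col.get());
>
>   int64_t value;
>   int16_t def_level;
>   int64_t values_read = 0;
>   int64_t rows = 0;
>   while (typed->HasNext()) {
>     // Read one value at a time; this is where it dies after 128 rows.
>     typed->ReadBatch(1, &def_level, nullptr, &value, &values_read);
>     ++rows;
>   }
> }
> {code}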
> It fails after 128 rows, which is suspicious given this line in column/scanner.h:
> {code}
>     DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}
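> To make the suspicion concrete, my guess is a scanner refill path shaped like the sketch below, where crossing from the first data page into the next loses the buffered values while the definition levels still report non-null (every name here is hypothetical, for illustration only -- not the actual parquet-cpp internals):
> {code}
> // Hypothetical batch-refill loop in the scanner (illustration only).
> bool HasNext() {
>   if (value_offset_ == values_buffered_) {
>     // Refill one batch of DEFAULT_SCANNER_BATCH_SIZE (= 128) values.
>     // If this call fails to advance the underlying reader to the next
>     // data page before decoding, a definition level can say "non-null"
>     // while no value was buffered -- exactly the error reported above.
>     values_buffered_ = reader_->ReadBatch(
>         batch_size_, def_levels_.data(), rep_levels_.data(),
>         values_.data(), &values_read_);
>     value_offset_ = 0;
>   }
>   return values_buffered_ > 0;
> }
> {code}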


