
Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/05/28 20:05:00 UTC

[jira] [Commented] (PARQUET-1422) [C++] Use Arrow IO interfaces natively rather than current parquet:: wrappers

    [ https://issues.apache.org/jira/browse/PARQUET-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850097#comment-16850097 ] 

Wes McKinney commented on PARQUET-1422:
---------------------------------------

[~pitrou] I ran into a snag while working on this: {{parquet::BufferedInputStream}} and {{arrow::io::BufferedInputStream}} have different semantics. The only place this function is used in parquet-cpp is here:

https://github.com/apache/arrow/blob/1a0e976/cpp/src/parquet/column_reader.cc#L156

The idea is that we don't yet know how big the next page header is, so we keep trying to deserialize a larger and larger page header with Thrift until we reach the maximum allowable page size.
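
As a rough sketch of that retry loop (this is not the actual parquet-cpp code): {{ToyHeader}}, {{TryDecode}} and {{ReadHeaderWithGrowingPeek}} are hypothetical names, a one-byte length prefix stands in for the Thrift encoding, and doubling the window is an assumed growth policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a Thrift page header: one length byte followed by
// that many payload bytes. Real Thrift decoding is more involved.
struct ToyHeader {
  std::size_t encoded_size;  // bytes consumed, including the length byte
};

// Returns the header if `len` bytes are enough to decode it, else nullopt
// (the analogue of Thrift failing on truncated input).
std::optional<ToyHeader> TryDecode(const std::uint8_t* data, std::size_t len) {
  if (len == 0) return std::nullopt;
  const std::size_t needed = 1 + static_cast<std::size_t>(data[0]);
  if (len < needed) return std::nullopt;
  return ToyHeader{needed};
}

// Peek at a growing window until decoding succeeds or the cap is exceeded.
std::optional<ToyHeader> ReadHeaderWithGrowingPeek(
    const std::vector<std::uint8_t>& stream, std::size_t initial_size,
    std::size_t max_size) {
  assert(max_size > 0);
  // Start from at least one byte so that doubling always makes progress.
  std::size_t allowed = std::max<std::size_t>(initial_size, 1);
  while (allowed <= max_size) {
    // A peek returns at most the bytes that remain in the stream.
    const std::size_t available = std::min(allowed, stream.size());
    if (auto header = TryDecode(stream.data(), available)) return header;
    if (available == stream.size()) return std::nullopt;  // hit end of stream
    allowed *= 2;  // ask for a larger window on the next attempt
  }
  return std::nullopt;  // header does not fit within the allowed maximum
}
```

For example, with a 6-byte toy header, an initial window of 2 and a cap of 16, the loop fails at 2 and 4 bytes and succeeds at 8; with a cap of 4 it gives up.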

When passed larger and larger peek sizes, {{parquet::BufferedInputStream}} expands its buffer to suit the request.
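
To illustrate the behavior I mean, here is a minimal toy stream whose peek both grows the buffer and pulls in more data on demand. {{GrowingPeekStream}} and its members are hypothetical and are not the {{parquet::}} or {{arrow::io::}} class; the in-memory {{source_}} vector stands in for the underlying file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical toy stream. Peek(n) reads from the source and grows the
// internal buffer when n exceeds what is buffered, without consuming bytes.
class GrowingPeekStream {
 public:
  explicit GrowingPeekStream(std::vector<std::uint8_t> source)
      : source_(std::move(source)) {}

  // Returns a pointer to the unread bytes and how many of them are valid:
  // up to n, fewer only at end of source. The pointer is invalidated by the
  // next call to Peek, which may reallocate the buffer.
  std::pair<const std::uint8_t*, std::size_t> Peek(std::size_t n) {
    std::size_t buffered = buffer_.size() - buffer_pos_;
    if (n > buffered) {
      // Not enough buffered data: fetch the shortfall from the source.
      const std::size_t to_read =
          std::min(n - buffered, source_.size() - source_pos_);
      const auto first =
          source_.begin() + static_cast<std::ptrdiff_t>(source_pos_);
      buffer_.insert(buffer_.end(), first,
                     first + static_cast<std::ptrdiff_t>(to_read));
      source_pos_ += to_read;
      buffered += to_read;
    }
    return {buffer_.data() + buffer_pos_, std::min(n, buffered)};
  }

  // Consumes up to n buffered bytes.
  void Advance(std::size_t n) {
    buffer_pos_ += std::min(n, buffer_.size() - buffer_pos_);
  }

  std::size_t buffered_bytes() const { return buffer_.size() - buffer_pos_; }

 private:
  std::vector<std::uint8_t> source_;  // stand-in for the raw file
  std::size_t source_pos_ = 0;        // next unread byte in source_
  std::vector<std::uint8_t> buffer_;  // grows as larger peeks arrive
  std::size_t buffer_pos_ = 0;        // next unconsumed byte in buffer_
};
```

The toy never discards consumed bytes, which a real implementation would want to do; it only shows a peek that can trigger buffering and widen the window.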

The easiest thing would be to port the logic from parquet-cpp into

https://github.com/apache/arrow/blob/1a0e976/cpp/src/arrow/io/buffered.cc#L258

Another issue I found is that {{arrow::io::BufferedInputStream::Peek}} cannot trigger buffering, unlike its {{parquet::}} counterpart.

> [C++] Use Arrow IO interfaces natively rather than current parquet:: wrappers
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1422
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1422
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: cpp-1.6.0
>
>
> We are beginning to do some work on asynchronous IO in Arrow and it would be great to be able to leverage this in the Parquet core internals. 
> I am proposing to remove the Parquet-specific virtual file interfaces in
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/memory.h#L221
> and instead rely directly on the Arrow ones in arrow::io. In addition to reducing the amount of code we have to maintain, we will also be able to improve the performance of Parquet by utilizing common utilities for managing asynchronous / background IO.
> cc [~mdeepak] [~xhochy]


