Posted to dev@parquet.apache.org by "Rohit Aggarwal (JIRA)" <ji...@apache.org> on 2017/05/31 14:26:04 UTC

[jira] [Commented] (PARQUET-674) Add an abstraction to get the length of a stream

    [ https://issues.apache.org/jira/browse/PARQUET-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031236#comment-16031236 ] 

Rohit Aggarwal commented on PARQUET-674:
----------------------------------------

We have observed that this commit leaves file descriptors in the {{CLOSE_WAIT}} state instead of actually closing them, which will cause issues given enough calls to the {{readFooters}} method. We are using Hadoop 2.7.2.

> Add an abstraction to get the length of a stream
> ------------------------------------------------
>
>                 Key: PARQUET-674
>                 URL: https://issues.apache.org/jira/browse/PARQUET-674
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>             Fix For: 1.9.0
>
>
> PARQUET-400 introduces {{SeekableInputStream}} to wrap Hadoop v1 and v2 streams and provide ByteBuffer access transparently. This can also be used as an abstraction to allow Parquet to work without the Hadoop API. The missing component is an abstraction that knows how long the file stream is for reading the footer. This could be done by adding a {{getLength}} method to the new stream interface, but I think there is value in adding a higher-level abstraction that carries information about the file and can open streams for it. This abstraction could be passed to a PageReadStore, which could have more complicated logic including parallel streams to read column chunks.
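
The higher-level abstraction described above could look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the API that was committed for this issue: the names {{LengthAwareFile}}, {{InMemoryFile}} and {{TailReader}} are hypothetical, and plain {{java.io.InputStream}} stands in for {{SeekableInputStream}} so the example runs without Parquet or Hadoop on the classpath. The reader opens its stream in a try-with-resources block so the stream is closed on every call.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical abstraction: carries information about a file (its length)
// and can open new streams for it.
interface LengthAwareFile {
    long getLength() throws IOException;

    InputStream newStream() throws IOException;
}

// Hypothetical in-memory implementation, used only to keep the sketch self-contained.
final class InMemoryFile implements LengthAwareFile {
    private final byte[] data;

    InMemoryFile(byte[] data) {
        this.data = data.clone();
    }

    @Override
    public long getLength() {
        return data.length;
    }

    @Override
    public InputStream newStream() {
        return new ByteArrayInputStream(data);
    }
}

// Hypothetical helper: reads the last tailLength bytes of a file,
// which is where a footer would be located.
final class TailReader {
    static byte[] readTail(LengthAwareFile file, int tailLength) throws IOException {
        long length = file.getLength();
        if (tailLength < 0 || tailLength > length) {
            throw new IllegalArgumentException("tailLength out of range: " + tailLength);
        }
        // try-with-resources closes the stream even if reading fails.
        try (InputStream in = file.newStream()) {
            long toSkip = length - tailLength;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    throw new IOException("could not skip to the tail of the stream");
                }
                toSkip -= skipped;
            }
            byte[] tail = new byte[tailLength];
            int offset = 0;
            while (offset < tailLength) {
                int n = in.read(tail, offset, tailLength - offset);
                if (n < 0) {
                    throw new EOFException("stream ended before the tail was fully read");
                }
                offset += n;
            }
            return tail;
        }
    }
}
```

Because the caller only sees {{getLength}} and {{newStream}}, the same reading logic could sit on top of a Hadoop-backed or a non-Hadoop implementation, and each implementation stays responsible for releasing whatever its streams hold when they are closed.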


