Posted to dev@parquet.apache.org by "William Forson (JIRA)" <ji...@apache.org> on 2016/12/14 19:50:58 UTC

[jira] [Comment Edited] (PARQUET-799) concurrent usage of the file reader API

    [ https://issues.apache.org/jira/browse/PARQUET-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749289#comment-15749289 ] 

William Forson edited comment on PARQUET-799 at 12/14/16 7:50 PM:
------------------------------------------------------------------

Could you clarify the sense in which the "IO sublayer" is not threadsafe? 

More specifically, I'm interested in the thread safety of the {{ParquetFileReader}} class (which uses {{LocalFileSource}}). I would assume that a given _instance_ of this class is non-threadsafe (owing to a combination of common sense and the fact that {{ParquetFileReader::OpenFile}} returns a unique pointer). However, I would NOT assume that there is anything wrong with invoking {{ParquetFileReader::OpenFile}} concurrently, or using distinct {{ParquetFileReader}} instances concurrently. Are my assumptions wrong?
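
To make the usage pattern I have in mind concrete, here is a minimal sketch. Note that {{StubReader}}, its {{ReadNext}} method, and {{ReadAllConcurrently}} are hypothetical stand-ins I made up for illustration; they are not the parquet-cpp API, and only the "factory returns a unique pointer" shape mirrors {{ParquetFileReader::OpenFile}}. Each thread opens and uses its own reader instance, and no instance is ever shared across threads:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a file reader; NOT the parquet-cpp API.
// Only the factory shape (static OpenFile returning a unique_ptr) is
// modeled on ParquetFileReader::OpenFile.
class StubReader {
 public:
  static std::unique_ptr<StubReader> OpenFile(const std::string& path) {
    return std::unique_ptr<StubReader>(new StubReader(path));
  }

  // Mutates per-instance state without locking, so a single instance
  // must not be shared between threads.
  std::size_t ReadNext() { return path_.size() + cursor_++; }

 private:
  explicit StubReader(std::string path) : path_(std::move(path)) {}

  std::string path_;
  std::size_t cursor_ = 0;
};

// One thread per path. Every thread owns its reader exclusively and
// writes only to its own slot in `results`, so no locking is needed.
std::vector<std::size_t> ReadAllConcurrently(
    const std::vector<std::string>& paths) {
  std::vector<std::size_t> results(paths.size(), 0);
  std::vector<std::thread> workers;
  workers.reserve(paths.size());

  for (std::size_t i = 0; i < paths.size(); ++i) {
    workers.emplace_back([&paths, &results, i] {
      std::unique_ptr<StubReader> reader = StubReader::OpenFile(paths[i]);
      assert(reader != nullptr);
      results[i] = reader->ReadNext() + reader->ReadNext();
    });
  }

  for (std::thread& worker : workers) {
    worker.join();
  }
  return results;
}
```

In this sketch the only state touched by more than one thread is read-only ({{paths}}) or partitioned by index ({{results}}). My assumption is that the equivalent pattern with real {{ParquetFileReader}} instances should be safe as long as the library keeps no unsynchronized shared state behind the scenes, which is exactly what I am asking about.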

Finally, I'm curious as to why you refer to {{LocalFileSource}} as a "sample" implementation. Do you mean to say that certain parts of the codebase, which are not explicitly labeled as "test", "example", etc, are specifically not intended for usage in production? (and if so, is the delineation between the production-ready and non-production-ready parts of the codebase stated clearly somewhere in the project source?)

Thanks!


> concurrent usage of the file reader API
> ---------------------------------------
>
>                 Key: PARQUET-799
>                 URL: https://issues.apache.org/jira/browse/PARQUET-799
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: William Forson
>
> I've recently been debugging a segfault that occurs when concurrently reading (distinct) parquet files from multiple threads.
> I initially assumed this was a reasonable thing to do, since the project README doesn't say anything about concurrency one way or the other. But then I encountered [this TODO comment|https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/page.h#L35]:
> {quote}
> // TODO: Parallel processing is not yet safe because of memory-ownership
> // semantics (the PageReader may or may not own the memory referenced by a
> // page)
> {quote}
> This has me wondering: is parquet-cpp fundamentally NOT thread-safe, even when each thread reads only a single file at any given time? Or is it basically thread-safe with a couple of gotchas?
> Also, jfyi, I'm currently running against a build which incorporates [this change|https://github.com/apache/parquet-cpp/commit/002466539f6aba7bf1f885b66f61f302ed88fa6b].
> (aside: my motivation for recently posting an issue re. {{THRIFT_HOME}} was to rule out any ABI weirdness that might result from building parquet-cpp against a different version of thrift than the applications that ultimately consume parquet-cpp)
> Thanks!


