Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/12/14 18:22:58 UTC

[jira] [Commented] (PARQUET-799) concurrent usage of the file reader API

    [ https://issues.apache.org/jira/browse/PARQUET-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749050#comment-15749050 ] 

Wes McKinney commented on PARQUET-799:
--------------------------------------

Off the top of my head, there are two places where you need to be careful about concurrency:

- The IO sublayer (e.g. the LocalFileSource sample implementation, https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/input.h#L58, is not thread-safe)

- ByteArray / FixedLenByteArray values do not retain ownership of their memory -- so subsequent calls to ReadBatch will generally invalidate the memory in the previous batches
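
To make the two points above concrete, here is a minimal sketch. It does not use the parquet-cpp API: BorrowedBytes, ScratchReader, CopyOut, ReadAllOwned and LockedSource are hypothetical stand-ins, and the assumption that the reader reuses a single scratch buffer is only an illustration of "the next ReadBatch invalidates earlier values". The idea is to copy non-owning values into owned storage before reading again, and to make seek-plus-read on a shared source a single locked step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-owning value such as ByteArray:
// a length plus a pointer into memory owned by someone else.
struct BorrowedBytes {
  uint32_t len;
  const uint8_t* ptr;
};

// Hypothetical reader that reuses one scratch buffer, so every call to
// ReadBatch overwrites the bytes that earlier BorrowedBytes point at.
class ScratchReader {
 public:
  explicit ScratchReader(std::vector<std::string> values)
      : values_(std::move(values)) {}

  // Yields one value per call; returns false once the input is exhausted.
  bool ReadBatch(BorrowedBytes* out) {
    if (next_ >= values_.size()) return false;
    const std::string& v = values_[next_++];
    scratch_.assign(v.begin(), v.end());  // clobbers the previous batch
    out->len = static_cast<uint32_t>(scratch_.size());
    out->ptr = scratch_.data();
    return true;
  }

 private:
  std::vector<std::string> values_;
  std::vector<uint8_t> scratch_;
  std::size_t next_ = 0;
};

// Deep-copies a borrowed value so it survives later ReadBatch calls.
std::string CopyOut(const BorrowedBytes& b) {
  return std::string(reinterpret_cast<const char*>(b.ptr), b.len);
}

// Drains the reader, taking ownership of each value before reading on.
std::vector<std::string> ReadAllOwned(ScratchReader* reader) {
  std::vector<std::string> owned;
  BorrowedBytes b;
  while (reader->ReadBatch(&b)) owned.push_back(CopyOut(b));
  return owned;
}

// Hypothetical in-memory source with a shared cursor. Seek and read
// happen under one lock, so threads cannot interleave them.
class LockedSource {
 public:
  explicit LockedSource(std::string data) : data_(std::move(data)) {}

  std::string ReadAt(std::size_t offset, std::size_t nbytes) {
    std::lock_guard<std::mutex> lock(mu_);
    assert(offset <= data_.size());
    pos_ = offset;                                  // "seek"
    std::string out = data_.substr(pos_, nbytes);   // "read"
    pos_ += out.size();
    return out;
  }

 private:
  std::mutex mu_;
  std::string data_;
  std::size_t pos_ = 0;
};
```

With this shape, a caller keeps the strings returned by ReadAllOwned rather than the BorrowedBytes themselves. The simpler alternative to locking is to give each thread its own source object so that no cursor is shared at all.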

You might find other issues -- I think [~mdeepak] and [~florian.scheibner] may be able to comment from using the library inside database systems

> concurrent usage of the file reader API
> ---------------------------------------
>
>                 Key: PARQUET-799
>                 URL: https://issues.apache.org/jira/browse/PARQUET-799
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: William Forson
>
> I've recently been debugging a segfault that occurs when concurrently reading (distinct) parquet files from multiple threads.
> I initially assumed this was a reasonable thing to do, since the project README doesn't say anything about concurrency one way or the other. But then I encountered [this TODO comment|https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/page.h#L35]:
> {quote}
> // TODO: Parallel processing is not yet safe because of memory-ownership
> // semantics (the PageReader may or may not own the memory referenced by a
> // page)
> {quote}
> And it has got me wondering: is parquet-cpp fundamentally NOT thread-safe, even for the use case of reading a single file per thread at any given time? Or is it basically thread-safe with a couple gotchas?
> Also, jfyi, I'm currently running against a build which incorporates [this change|https://github.com/apache/parquet-cpp/commit/002466539f6aba7bf1f885b66f61f302ed88fa6b].
> (aside: my motivation for recently posting an issue re. {{THRIFT_HOME}} was to rule out any ABI weirdness that might result from building parquet-cpp against a different version of thrift than the applications that ultimately consume parquet-cpp)
> Thanks!


