Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2019/10/30 23:13:00 UTC

[jira] [Commented] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs

    [ https://issues.apache.org/jira/browse/ARROW-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963508#comment-16963508 ] 

Neal Richardson commented on ARROW-7035:
----------------------------------------

Thanks. I had similar questions in https://github.com/apache/arrow/pull/5451 where this feature was added. Would you be interested in submitting a pull request to improve the documentation? 

Some specific answers:

* The R bindings defer the default behavior to C++. You can find the defaults there at https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L87-L98 and https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L586-L590. The Python docstring may also be useful: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L347-L385
* Acceptable values for compression are listed at https://github.com/apache/arrow/blob/master/r/R/enums.R#L79, but not all of them may be available: that depends on which codecs your C++ library was built with. In the dev version of the package, you can query that with {{codec_is_available(codec)}}.
* I don't have an opinion about whether pyarrow and the R package should have the same defaults here, particularly if they differ from what the C++ defaults are. I could see both sides of the argument for changing the default compression to snappy (though if you did, you'd have to check whether snappy is available, and otherwise fall back to uncompressed). FWIW the pyarrow parquet writer also has other features (like {{flavor = "spark"}}) that aren't (yet) in the R implementation.
* ParquetWriterProperties and ParquetArrowWriterProperties should be documented and exported, probably in the same file. 
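
For illustration, the snappy fallback mentioned above could look roughly like the sketch below. This is only a sketch, not what the package does: {{choose_compression}} is a hypothetical helper name, and it assumes {{codec_is_available()}} (dev version) accepts a codec name as a string and returns a single logical value.

```r
# Hypothetical helper, not part of the arrow package: return the preferred
# codec if the C++ library was built with it, otherwise "uncompressed".
# The default for `is_available` is only evaluated when it is not supplied,
# so the function can be exercised without arrow by passing another predicate.
choose_compression <- function(preferred = "snappy",
                               is_available = arrow::codec_is_available) {
  if (isTRUE(is_available(preferred))) preferred else "uncompressed"
}

# Possible usage (assumes arrow is installed and `df` is a data frame):
# arrow::write_parquet(df, "data.parquet", compression = choose_compression())
```

Passing the availability check in as an argument keeps the fallback logic separate from the build-specific query, so the same helper works whether or not a given codec was compiled in.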

> [R] Default arguments are unclear in write_parquet docs
> -------------------------------------------------------
>
>                 Key: ARROW-7035
>                 URL: https://issues.apache.org/jira/browse/ARROW-7035
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 0.15.0
>         Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow 0.15.0.
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>              Labels: documentation
>             Fix For: 0.15.1
>
>
> Thank you so much for adding support for reading and writing parquet files in R! I have a few questions about the user interface and optional arguments, but I want to highlight how great it is to have this useful filetype to pass data back and forth.
> The defaults for the optional arguments in {{arrow::write_parquet}} aren't always clear. Here were my questions after reading the help docs from {{write_parquet}}:
>  * What's the default {{version}}? Should a user prefer "2.0" for new projects?
>  * What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, {{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.)
>  * What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least some of the time.
>  * What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}?
>  * Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default?
> As someone who works in both R and Python, I was a little surprised that pyarrow uses snappy compression by default, while R's default is uncompressed. My preference would be having the same default arguments, but that might be a fringe use-case.
> While I was digging into this, I was surprised that {{ParquetReaderProperties}} is exported and documented, but {{ParquetWriterProperties}} isn't. Is that intentional?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)