Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/11/11 16:23:00 UTC

[jira] [Commented] (ARROW-14677) [R][C++] macOS R package arrow segfault on `open_dataset()`

    [ https://issues.apache.org/jira/browse/ARROW-14677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442342#comment-17442342 ] 

Neal Richardson commented on ARROW-14677:
-----------------------------------------

Thanks for the report! autobrew pulls a bundle of static libraries. I'm not sure how it would clash with your local {{brew}} itself; the only thing I could think of would be if there were an issue with system/brew libcurl or openssl, which are not bundled and are required by the aws-sdk-cpp that reads from S3. Some thoughts:

1. Is there a reason you can't use the binary package from CRAN? (That is built with autobrew too, for what it's worth.)
2. You could try a source install and set the env var FORCE_BUNDLED_BUILD=true. This would build libarrow from source instead of using the prebuilt autobrew bundle. (I'd also recommend setting ARROW_R_DEV=true to get some output from the libarrow build, if for no other reason than to see that it is progressing.)
3. Can you download one or two of those parquet files from S3 and try calling open_dataset() on them from your local filesystem? The backtrace points at thrift, but I'm wondering if that's misleading.

It would be interesting to know if any/all of those segfault for you.
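
For suggestions 2 and 3, something along these lines should work; treat it as a rough sketch rather than a tested recipe. The two function names and the `local_dir` argument are placeholders made up for illustration, and the second helper assumes you have already copied a few of the parquet files into that directory by whatever means you prefer.

```r
# Suggestion 2 (sketch): source install that builds libarrow itself
# instead of using the prebuilt autobrew bundle.
# `reinstall_arrow_from_source` is a hypothetical helper name.
reinstall_arrow_from_source <- function() {
  # Env vars named in the comment above; ARROW_R_DEV only adds build output.
  Sys.setenv(FORCE_BUNDLED_BUILD = "true", ARROW_R_DEV = "true")
  install.packages("arrow", type = "source")
}

# Suggestion 3 (sketch): open parquet files that were downloaded beforehand.
# `local_dir` is a hypothetical path to a directory holding those files.
open_local_copy <- function(local_dir) {
  # Fail early with a clear error if the directory is missing.
  stopifnot(dir.exists(local_dir))
  arrow::open_dataset(local_dir)
}
```

Call `reinstall_arrow_from_source()` in a fresh R session, restart R, and then try `open_local_copy()` on the directory holding the downloaded files.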

> [R][C++] macOS R package arrow segfault on `open_dataset()`
> -----------------------------------------------------------
>
>                 Key: ARROW-14677
>                 URL: https://issues.apache.org/jira/browse/ARROW-14677
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>            Reporter: Martin Morgan
>            Priority: Major
>
> Following a slack post (https://ropensci.slack.com/archives/C026GCWKA/p1636588933095400), accessing a public bucket with the R client
> {code:java}
> df <- arrow::open_dataset("s3://gbif-open-data-af-south-1/occurrence/2021-11-01/occurrence.parquet/")
> {code}
> leads to a segfault
> {code:java}
>   *** caught segfault ***
> address 0x0, cause 'unknown'
> Traceback:
> 1: dataset__DatasetFactory_Finish1(self, unify_schemas)
> 2: factory$Finish(schema, isTRUE(unify_schemas))
> 3: doTryCatch(return(expr), name, parentenv, handler)
> 4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
> 5: tryCatchList(expr, classes, parentenv, handlers)
> 6: tryCatch(factory$Finish(schema, isTRUE(unify_schemas)), error = function(e) { handle_parquet_io_error(e, format) })
> 7: arrow::open_dataset("s3://gbif-open-data-af-south-1/occurrence/2021-11-01/occurrence.parquet/")
>  
> {code}
> The arrow portion of the lldb traceback is
> {code:java}
> (lldb) thread backtrace
> thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
> frame #0: 0x000000012ab2029c libthrift-0.15.0.dylib`std::__1::shared_ptr<apache::thrift::async::TAsyncProcessor>::~shared_ptr() + 46
> frame #1: 0x0000000128bb6ac2 arrow.so`void parquet::DeserializeThriftUnencryptedMsg<parquet::format::FileMetaData>(unsigned char const*, unsigned int*, parquet::format::FileMetaData*) + 309
> frame #2: 0x0000000128bb5f49 arrow.so`parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl(void const*, unsigned int*, std::__1::shared_ptr<parquet::InternalFileDecryptor>) + 517
> frame #3: 0x0000000128bace0d arrow.so`parquet::FileMetaData::FileMetaData(void const*, unsigned int*, std::__1::shared_ptr<parquet::InternalFileDecryptor>) + 85
> frame #4: 0x0000000128bacd1b arrow.so`parquet::FileMetaData::Make(void const*, unsigned int*, std::__1::shared_ptr<parquet::InternalFileDecryptor>) + 89
> frame #5: 0x0000000128b9cb4a arrow.so`parquet::SerializedFile::ParseUnencryptedFileMetadata(std::__1::shared_ptr<arrow::Buffer> const&, unsigned int) + 118
> frame #6: 0x0000000128b9df43 arrow.so`parquet::SerializedFile::ParseMetaData() + 607
> frame #7: 0x0000000128b9dc6c arrow.so`parquet::ParquetFileReader::Contents::Open(std::__1::shared_ptr<arrow::io::RandomAccessFile>, parquet::ReaderProperties const&, std::__1::shared_ptr<parquet::FileMetaData>) + 214
> frame #8: 0x0000000128b9eb72 arrow.so`parquet::ParquetFileReader::Open(std::__1::shared_ptr<arrow::io::RandomAccessFile>, parquet::ReaderProperties const&, std::__1::shared_ptr<parquet::FileMetaData>) + 58
> frame #9: 0x0000000128c8a988 arrow.so`arrow::dataset::ParquetFileFormat::GetReader(arrow::dataset::FileSource const&, arrow::dataset::ScanOptions*) const + 286
> frame #10: 0x0000000128c8a72e arrow.so`arrow::dataset::ParquetFileFormat::Inspect(arrow::dataset::FileSource const&) const + 44
> frame #11: 0x0000000128c0b994 arrow.so`arrow::dataset::FileSystemDatasetFactory::InspectSchemas(arrow::dataset::InspectOptions) + 336
> frame #12: 0x0000000128c09079 arrow.so`arrow::dataset::DatasetFactory::Inspect(arrow::dataset::InspectOptions) + 43
> frame #13: 0x0000000128c0c1cf arrow.so`arrow::dataset::FileSystemDatasetFactory::Finish(arrow::dataset::FinishOptions) + 541
> frame #14: 0x0000000128a66805 arrow.so`dataset__DatasetFactoryFinish1(std::__1::shared_ptr<arrow::dataset::DatasetFactory> const&, bool) + 69
> frame #15: 0x0000000128a105aa arrow.so`arrow_dataset_DatasetFactory_Finish1 + 154
> {code}
> arrow was installed from source on
> {code:java}
> > sessionInfo()
> R Under development (unstable) (2021-10-28 r81109)
> Platform: x86_64-apple-darwin19.6.0 (64-bit)
> Running under: macOS Catalina 10.15.7
> Matrix products: default
> BLAS: /Users/ma38727/bin/R-devel/lib/libRblas.dylib
> LAPACK: /Users/ma38727/bin/R-devel/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_6.0.0.2
> loaded via a namespace (and not attached):
> [1] tidyselect_1.1.1 bit_4.0.4 compiler_4.2.0
> [4] BiocManager_1.30.16 magrittr_2.0.1 assertthat_0.2.1
> [7] R6_2.5.1 glue_1.5.0 bit64_4.0.5
> [10] vctrs_0.3.8 rlang_0.4.12 purrr_0.3.4
> {code}
> During package installation, the one step that was 'new' to me was the use of autobrew
> {code:java}
> *** Downloading apache-arrow
> Using autobrew bundle: apache-arrow-6.0.0-high_sierra.tar.xz
> {code}
> I'm not sure how to validate that this use is consistent with my brew installation.


