Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2020/10/01 15:48:00 UTC

[jira] [Commented] (ARROW-10114) [R] Segfault in to_dataframe_parallel with deeply nested structs

    [ https://issues.apache.org/jira/browse/ARROW-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205629#comment-17205629 ] 

Neal Richardson commented on ARROW-10114:
-----------------------------------------

Thanks for the reprex. I did some exploration and think I've narrowed down the bug a bit. 

So we can read the JSON file into Arrow ok:
 
{code:r}
tab <- read_json_arrow("head.jsonl", as_data_frame=FALSE)
names(tab)
## [1] "master"            "publications"      "publication_count"
{code}

Next I tried to convert each column individually to R, hoping to isolate which column was problematic, but as it turned out, all three could convert. You can even re-assemble them into a data.frame, which is what you'd expect {{as.data.frame(tab)}} to do itself:

{code:r}
df <- data.frame(
  master = as.vector(tab$master), 
  publications = as.vector(tab$publications), 
  publication_count = as.vector(tab$publication_count)
)
{code}
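
For anyone repeating this check on a table with more columns, the per-column test above can be sketched as a loop. `check_columns` is a hypothetical helper, not part of arrow, and it assumes `[[` extraction works on the Table the same way `$` does. Note that `tryCatch` only traps R-level errors such as the `SET_VECTOR_ELT()` one; a real segfault will still take down the session.

```r
# Hypothetical helper (not part of arrow): convert each column on its own and
# record whether the conversion succeeded, so one bad column cannot hide the rest.
check_columns <- function(tab, convert = as.vector) {
  vapply(
    names(tab),
    function(nm) {
      # tryCatch turns an R-level error into FALSE instead of aborting the loop.
      tryCatch({
        convert(tab[[nm]])
        TRUE
      }, error = function(e) FALSE)
    },
    logical(1)
  )
}

# Demonstrated on a plain list so the sketch runs without arrow installed;
# with arrow, `tab` would be the Table returned by
# read_json_arrow("head.jsonl", as_data_frame = FALSE).
demo <- list(a = 1:3, b = letters[1:3])
check_columns(demo)
```

The result is a named logical vector, one entry per column, so a failing column would show up by name.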

I looked at the Table__to_dataframe source to see what it does differently from that per-column conversion, and saw that it has two code paths: one that uses multithreading (the default) and one that doesn't. So I tried switching to the non-parallel version, and *that* worked:

{code:r}
options(arrow.use_threads=FALSE)
df <- read_json_arrow("head.jsonl")
{code}
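
Setting the option globally as above leaves threading off for the rest of the session. As a workaround sketch, the change can be scoped to a single call and the previous value restored afterwards. `without_arrow_threads` is a hypothetical helper, not part of arrow; it relies only on the `arrow.use_threads` option shown above and on base R's `options()` and `on.exit()`.

```r
# Hypothetical helper (not part of arrow): run `f` with arrow.use_threads = FALSE,
# then restore whatever value the option had before.
without_arrow_threads <- function(f) {
  old <- options(arrow.use_threads = FALSE)  # options() returns the previous values
  on.exit(options(old), add = TRUE)          # restore even if f() errors
  f()
}

# Stand-in workload so the sketch runs without arrow installed; with arrow it
# would be: without_arrow_threads(function() arrow::read_json_arrow("head.jsonl"))
seen <- without_arrow_threads(function() getOption("arrow.use_threads"))
seen
```

Because the option is read when the conversion runs (the error in the report shows `use_threads = option_use_threads()`), wrapping just the read call should be enough to stay on the non-parallel path.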

So my conclusion is that the problem is in the parallel (multithreaded) conversion path, not in the JSON reader or in the conversion of any individual column.

> [R] Segfault in to_dataframe_parallel with deeply nested structs
> ----------------------------------------------------------------
>
>                 Key: ARROW-10114
>                 URL: https://issues.apache.org/jira/browse/ARROW-10114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.1
>         Environment: > sessionInfo()
> R version 3.6.3 (2020-02-29)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Linux Mint 19.3
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
>  [5] LC_MONETARY=sv_SE.UTF-8    LC_MESSAGES=en_US.UTF-8   
>  [7] LC_PAPER=sv_SE.UTF-8       LC_NAME=C                 
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C       
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
> [1] arrow_1.0.1
> loaded via a namespace (and not attached):
>  [1] tidyselect_1.1.0 bit_4.0.4        compiler_3.6.3   magrittr_1.5    
>  [5] assertthat_0.2.1 R6_2.4.1         glue_1.4.1       Rcpp_1.0.5      
>  [9] bit64_4.0.2      vctrs_0.3.2      rlang_0.4.7      purrr_0.3.4     
>            Reporter: Markus Skyttner
>            Priority: Major
>         Attachments: Dockerfile, Makefile, reprex_10114.R
>
>
> A .jsonl file (newline separated JSON) created from open data available at [ftp://ftp.libris.kb.se/pub/spa/swepub-deduplicated-2019-12-29.zip] is used with the R package arrow (installed from CRAN) using the following statement:
> > arrow::read_json_arrow("~/.config/swepub/head.jsonl")
> It crashes RStudio with no error message. At the R prompt, the error message is:
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>  SET_VECTOR_ELT() can only be applied to a 'list', not a 'integer'
> The file "head.jsonl" above was created from the extracted zip's .jsonl file with the *nix "head -1 $BIG_JSONL_FILE" command. It can be parsed with jsonlite and tidyjson.
> Also got this error message at one point:
> > arrow::read_json_arrow("head.jsonl", as_data_frame = TRUE)
> *** caught segfault ***
> address 0x8, cause 'memory not mapped'
> Traceback:
>  1: structure(x, extra_cols = colonnade[extra_cols], class = "pillar_squeezed_colonnade")
>  2: new_colonnade_sqeezed(out, colonnade = x, extra_cols = extra_cols)
>  3: pillar::squeeze(x$mcf, width = width)
>  4: format.trunc_mat(mat)
>  5: format(mat)
>  6: format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
>  7: format(x, ..., n = n, width = width, n_extra = n_extra)
>  8: paste0(..., collapse = "\n")
>  9: cli::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
> 10: print.tbl(x)
> 11: (function (x, ...) UseMethod("print"))(x)


