Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/09/02 17:32:00 UTC

[jira] [Assigned] (ARROW-13803) [C++] Segfault on filtering taxi dataset

     [ https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li reassigned ARROW-13803:
--------------------------------

    Assignee: David Li

> [C++] Segfault on filtering taxi dataset
> ----------------------------------------
>
>                 Key: ARROW-13803
>                 URL: https://issues.apache.org/jira/browse/ARROW-13803
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>         Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
>            Reporter: Neal Richardson
>            Assignee: David Li
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 6.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows:
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
>     frame #0: 0x000000013a79d9cc libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
>     0x13a79d9d0 <+300>: cmp    w9, #0x8                  ; =0x8 
>     0x13a79d9d4 <+304>: cset   w11, lo
>     0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb) 
> {code}
> Interestingly, each of those filter expressions evaluates fine on its own; the crash only seems to occur when both are provided. I can also group by both expressions and count over the data:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`          n
>   <lgl>              <lgl>                      <int>
> 1 FALSE              FALSE                        805
> 2 FALSE              TRUE                      368680
> 3 TRUE               FALSE                    5810556
> 4 TRUE               TRUE                  1541561340
> {code}


