Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2022/10/08 05:15:00 UTC
[jira] [Commented] (ARROW-17943) [Python] Coredump when joining big large_strings

    [ https://issues.apache.org/jira/browse/ARROW-17943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614361#comment-17614361 ] 

Yibo Cai commented on ARROW-17943:
----------------------------------

The code below triggers the same error. Try it online: https://onlinegdb.com/UpqUsk4Zv
It looks like this might be caused by an integer overflow, which leads to a requested buffer size greater than {{std::vector::max_size()}}.

{code:cpp}
#include <vector>

int main() {
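    std::vector<int> v;
    // -1ULL wraps to the largest unsigned long long value, which exceeds
    // v.max_size(), so resize() throws std::length_error. Nothing catches
    // the exception, so the program terminates.
    v.resize(-1ULL);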
    return 0;
}
{code}
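
For illustration only, here is a minimal sketch of how a wrapped size could be detected before it reaches {{resize()}}. The helpers {{to_unsigned_size}} and {{try_resize}} are hypothetical names made up for this example; they are not taken from the Arrow code base, and I have not checked where the overflow actually happens there.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Arrow code: a signed size that has gone negative
// (for example after an overflow) becomes a huge unsigned value when it is
// converted to size_t.
std::size_t to_unsigned_size(std::int64_t n) {
    return static_cast<std::size_t>(n);
}

// Hypothetical helper: refuses a request that resize() would reject with
// std::length_error, instead of letting the exception terminate the process.
bool try_resize(std::vector<int>& v, std::size_t n) {
    if (n > v.max_size()) {
        return false;  // v is left unchanged
    }
    v.resize(n);
    return true;
}
```

With these helpers, {{try_resize(v, to_unsigned_size(-1))}} returns false and leaves {{v}} untouched, whereas the bare {{v.resize(-1ULL)}} above aborts.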


> [Python] Coredump when joining big large_strings
> ------------------------------------------------
>
>                 Key: ARROW-17943
>                 URL: https://issues.apache.org/jira/browse/ARROW-17943
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 9.0.0
>         Environment: run inside a fedora container:
> registry.fedoraproject.org/fedora-toolbox:36
> host information:
> uname -a:
> Linux ws1 5.18.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 3 15:44:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> /etc/os-release:
> NAME="Fedora Linux"
> VERSION="36 (Container Image)"
> ID=fedora
> VERSION_ID=36
> VERSION_CODENAME=""
> PLATFORM_ID="platform:f36"
> PRETTY_NAME="Fedora Linux 36 (Container Image)"
> ANSI_COLOR="0;38;2;60;110;180"
> LOGO=fedora-logo-icon
> CPE_NAME="cpe:/o:fedoraproject:fedora:36"
> HOME_URL="https://fedoraproject.org/"
> DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f36/system-administrators-guide/"
> SUPPORT_URL="https://ask.fedoraproject.org/"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
> REDHAT_BUGZILLA_PRODUCT="Fedora"
> REDHAT_BUGZILLA_PRODUCT_VERSION=36
> REDHAT_SUPPORT_PRODUCT="Fedora"
> REDHAT_SUPPORT_PRODUCT_VERSION=36
> PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
> VARIANT="Container Image"
> VARIANT_ID=container
>            Reporter: flowpoint
>            Priority: Major
>              Labels: JOIN, large_string, python, string
>
> Joining large strings in pyarrow results in this error:
> {code:java}
> terminate called after throwing an instance of 'std::length_error'
>   what():  vector::_M_default_append
> Aborted (core dumped) {code}
> Example code (note that this needs a lot of RAM; it was run with 128 GB):
> {code:python}
> import pyarrow as pa
>
> # 2**24 rows of 1 KiB strings = 16 GiB of string data per table
> ids = [x for x in range(2**24)]
> text = ['a'*2**10]*2**24
> schema = pa.schema([
>     ('Id', pa.int32()),
>     ('Text', pa.large_string()),
>     ])
>
> tab1 = pa.Table.from_arrays([ids, text], schema=schema)
> tab2 = pa.Table.from_arrays([ids, text], schema=schema)
>
> joined = tab1.join(tab2, keys='Id', right_keys='Id', left_suffix='tab1')
> {code}
> The same code results in a segfault if I use this schema:
> {code:python}
> schema = pa.schema([
>     ('Id', pa.int32()),
>     ('Text', pa.string()),
>     ])
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)