Posted to jira@arrow.apache.org by "Patrick (Jira)" <ji...@apache.org> on 2021/03/23 15:40:00 UTC
[jira] [Updated] (ARROW-12065) segfault in pyarrow read_json
[ https://issues.apache.org/jira/browse/ARROW-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick updated ARROW-12065:
----------------------------
Description:
I noticed this while analysing a reasonably large, but not very complex, JSON file, and I've simplified it to a fairly minimal reproduction:
{{import pyarrow.json}}
{{pyarrow.json.read_json('test.json')}}
and test.json is
{{{"A":"<0 repeated 1.6 million times>"}}}
{{{"B":[]}}}
This seems like it shouldn't be too large to load into memory all at once, so I'm surprised there is a segfault.
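For anyone who wants to recreate the input file, here is a minimal sketch of a generator script. It assumes the long value is exactly 1.6 million "0" characters, that each object sits on its own line, and that the file ends with a newline; `write_repro` is a hypothetical helper name and not part of pyarrow.

```python
import json
from pathlib import Path


def write_repro(path="test.json", n_zeros=1_600_000):
    """Write the two-line JSON file described in the report.

    Assumption: the first value is n_zeros copies of the character "0".
    """
    rows = [
        {"A": "0" * n_zeros},  # one very long string under key "A"
        {"B": []},             # a different key holding an empty list
    ]
    target = Path(path)
    with target.open("w") as f:
        for row in rows:
            # Compact separators reproduce the report's formatting (no spaces).
            f.write(json.dumps(row, separators=(",", ":")) + "\n")
    return target


if __name__ == "__main__":
    out = write_repro()
    print(f"wrote {out} ({out.stat().st_size} bytes)")
```

Passing the resulting file to `pyarrow.json.read_json` as shown above is the step that crashes on 3.0.0.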
Running it under gdb and getting a backtrace gives:
{{(gdb) bt}}
{{ #0 0x00007ffff5c1965d in std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300}}
{{ #1 0x00007ffff5ca8d9e in arrow::json::ChunkedListArrayBuilder::Insert(long, std::shared_ptr<arrow::Field> const&, std::shared_ptr<arrow::Array> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300}}
{{ #2 0x00007ffff5cabcc8 in arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr<arrow::ChunkedArray>*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300}}
{{ #3 0x00007ffff5c1fc16 in arrow::json::TableReaderImpl::Read() () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300}}
{{ #4 0x00007fffcf73da69 in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, _object*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/_json.cpython-39-x86_64-linux-gnu.so}}
{{ #5 0x00007ffff7d35a43 in ?? () from /usr/lib/libpython3.9.so.1.0}}
{{ #6 0x00007ffff7d1be6d in _PyObject_MakeTpCall () from /usr/lib/libpython3.9.so.1.0}}
{{ #7 0x00007ffff7d17b3a in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0}}
{{ #8 0x00007ffff7d119ad in ?? () from /usr/lib/libpython3.9.so.1.0}}
{{ #9 0x00007ffff7d11371 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.9.so.1.0}}
{{ #10 0x00007ffff7dd3f83 in PyEval_EvalCode () from /usr/lib/libpython3.9.so.1.0}}
{{ #11 0x00007ffff7de43dd in ?? () from /usr/lib/libpython3.9.so.1.0}}
{{ #12 0x00007ffff7ddfc7b in ?? () from /usr/lib/libpython3.9.so.1.0}}
{{ #13 0x00007ffff7cf38ab in ?? () from /usr/lib/libpython3.9.so.1.0}}
{{ #14 0x00007ffff7cf3a63 in PyRun_InteractiveLoopFlags () from /usr/lib/libpython3.9.so.1.0}}
{{ #15 0x00007ffff7c81f6b in PyRun_AnyFileExFlags () from /usr/lib/libpython3.9.so.1.0}}
{{ #16 0x00007ffff7c7665c in ?? () from /usr/lib/libpython3.9.so.1.0}}
{{ #17 0x00007ffff7dc6fa9 in Py_BytesMain () from /usr/lib/libpython3.9.so.1.0}}
{{ #18 0x00007ffff7a43b25 in __libc_start_main () from /usr/lib/libc.so.6}}
{{ #19 0x000055555555504e in _start ()}}
{{ (gdb)}}
was:
I noticed this while analysing a reasonably large, but not very complex, JSON file, and I've simplified it to a fairly minimal reproduction:
```
import pyarrow.json
pyarrow.json.read_json('test.json')
```
and `test.json` is
```
{"A":"<0 repeated 1.6 million times>"}
{"B":[]}
```
This seems like it shouldn't be too large to load into memory all at once, so I'm surprised there is a segfault.
Running it under gdb and getting a backtrace gives:
```
(gdb) bt
#0 0x00007ffff5c1965d in std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#1 0x00007ffff5ca8d9e in arrow::json::ChunkedListArrayBuilder::Insert(long, std::shared_ptr<arrow::Field> const&, std::shared_ptr<arrow::Array> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#2 0x00007ffff5cabcc8 in arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr<arrow::ChunkedArray>*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#3 0x00007ffff5c1fc16 in arrow::json::TableReaderImpl::Read() () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#4 0x00007fffcf73da69 in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, _object*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/_json.cpython-39-x86_64-linux-gnu.so
#5 0x00007ffff7d35a43 in ?? () from /usr/lib/libpython3.9.so.1.0
#6 0x00007ffff7d1be6d in _PyObject_MakeTpCall () from /usr/lib/libpython3.9.so.1.0
#7 0x00007ffff7d17b3a in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#8 0x00007ffff7d119ad in ?? () from /usr/lib/libpython3.9.so.1.0
#9 0x00007ffff7d11371 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.9.so.1.0
#10 0x00007ffff7dd3f83 in PyEval_EvalCode () from /usr/lib/libpython3.9.so.1.0
#11 0x00007ffff7de43dd in ?? () from /usr/lib/libpython3.9.so.1.0
#12 0x00007ffff7ddfc7b in ?? () from /usr/lib/libpython3.9.so.1.0
#13 0x00007ffff7cf38ab in ?? () from /usr/lib/libpython3.9.so.1.0
#14 0x00007ffff7cf3a63 in PyRun_InteractiveLoopFlags () from /usr/lib/libpython3.9.so.1.0
#15 0x00007ffff7c81f6b in PyRun_AnyFileExFlags () from /usr/lib/libpython3.9.so.1.0
#16 0x00007ffff7c7665c in ?? () from /usr/lib/libpython3.9.so.1.0
#17 0x00007ffff7dc6fa9 in Py_BytesMain () from /usr/lib/libpython3.9.so.1.0
#18 0x00007ffff7a43b25 in __libc_start_main () from /usr/lib/libc.so.6
#19 0x000055555555504e in _start ()
(gdb)
```
> segfault in pyarrow read_json
> -----------------------------
>
> Key: ARROW-12065
> URL: https://issues.apache.org/jira/browse/ARROW-12065
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Environment: arch linux, 31G ram
> Reporter: Patrick
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)