Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/09/18 17:21:00 UTC
[jira] [Commented] (ARROW-6573) Segfault when writing to parquet
[ https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932691#comment-16932691 ]
Wes McKinney commented on ARROW-6573:
-------------------------------------
This now raises an exception on master:
{code}
import pyarrow as pa
import pyarrow.parquet as pq
data = dict()
data["key"] = [0, 1, 2, 3] # segfault
#data["key"] = ["0", "1", "2", "3"] # no segfault
schema = pa.schema({"key" : pa.string()})
table = pa.Table.from_pydict(data, schema = schema)
print("now writing out test file")
pq.write_table(table, "test.parquet")
## -- End pasted text --
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-1-1ff07de63b32> in <module>
8 schema = pa.schema({"key" : pa.string()})
9
---> 10 table = pa.Table.from_pydict(data, schema = schema)
11 print("now writing out test file")
12 pq.write_table(table, "test.parquet")
~/code/arrow/python/pyarrow/types.pxi in __iter__()
~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowTypeError: Expected a string or bytes object, got a 'int' object
In ../src/arrow/python/common.h, line 241, code: FromBinary(obj, "a string or bytes object")
In ../src/arrow/python/python_to_arrow.cc, line 549, code: string_view_.FromString(obj, &is_utf8)
In ../src/arrow/python/python_to_arrow.cc, line 570, code: Append(obj, &is_full)
In ../src/arrow/python/iterators.h, line 70, code: func(value, static_cast<int64_t>(i), &keep_going)
In ../src/arrow/python/python_to_arrow.cc, line 1097, code: converter->AppendMultiple(seq, size)
{code}
We might want to add a unit test for this, though.
> Segfault when writing to parquet
> --------------------------------
>
> Key: ARROW-6573
> URL: https://issues.apache.org/jira/browse/ARROW-6573
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.14.1
> Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. Using Anaconda distribution of Python 3.7.
> Reporter: Josh Weinstock
> Priority: Minor
>
> When attempting to write out a pyarrow table to parquet, I am observing a segfault when there is a mismatch between the schema and the data types.
> Here is a reproducible example:
>
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = dict()
> data["key"] = [0, 1, 2, 3] # segfault
> #data["key"] = ["0", "1", "2", "3"] # no segfault
> schema = pa.schema({"key" : pa.string()})
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet")
> {code}
> This results in a segfault when writing the table. Running
>
> {code:java}
> gdb -ex r --args python test.py
> {code}
> Yields
>
> {noformat}
> Program received signal SIGSEGV, Segmentation fault. 0x00007fffe8173917 in virtual thunk to parquet::DictEncoderImpl<parquet::DataType<(parquet::Type::type)6> >::Put(parquet::ByteArray const*, int) () from /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>
> Thanks for all of your arrow work,
> Josh
--
This message was sent by Atlassian Jira
(v8.3.4#803005)