Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/09/18 17:21:00 UTC

[jira] [Commented] (ARROW-6573) Segfault when writing to parquet

    [ https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932691#comment-16932691 ] 

Wes McKinney commented on ARROW-6573:
-------------------------------------

This now raises an exception on master instead of crashing:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

data = dict()
data["key"] = [0, 1, 2, 3] # segfault
#data["key"] = ["0", "1", "2", "3"] # no segfault

schema = pa.schema({"key" : pa.string()})

table = pa.Table.from_pydict(data, schema = schema)
print("now writing out test file")
pq.write_table(table, "test.parquet")

## -- End pasted text --
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-1-1ff07de63b32> in <module>
      8 schema = pa.schema({"key" : pa.string()})
      9 
---> 10 table = pa.Table.from_pydict(data, schema = schema)
     11 print("now writing out test file")
     12 pq.write_table(table, "test.parquet")

~/code/arrow/python/pyarrow/types.pxi in __iter__()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: Expected a string or bytes object, got a 'int' object
In ../src/arrow/python/common.h, line 241, code: FromBinary(obj, "a string or bytes object")
In ../src/arrow/python/python_to_arrow.cc, line 549, code: string_view_.FromString(obj, &is_utf8)
In ../src/arrow/python/python_to_arrow.cc, line 570, code: Append(obj, &is_full)
In ../src/arrow/python/iterators.h, line 70, code: func(value, static_cast<int64_t>(i), &keep_going)
In ../src/arrow/python/python_to_arrow.cc, line 1097, code: converter->AppendMultiple(seq, size)
{code}

Might want to add a unit test, though

> Segfault when writing to parquet
> --------------------------------
>
>                 Key: ARROW-6573
>                 URL: https://issues.apache.org/jira/browse/ARROW-6573
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.14.1
>         Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. Using Anaconda distribution of Python 3.7. 
>            Reporter: Josh Weinstock
>            Priority: Minor
>
> When attempting to write a pyarrow table to Parquet, I observe a segfault when there is a mismatch between the schema and the data types. 
> Here is a reproducible example:
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = dict()
> data["key"] = [0, 1, 2, 3] # segfault
> #data["key"] = ["0", "1", "2", "3"] # no segfault
> schema = pa.schema({"key" : pa.string()})
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet") 
> {code}
> This results in a segfault when writing the table. Running 
>  
> {code:bash}
> gdb -ex r --args python test.py 
> {code}
> Yields
>  
>  
> {noformat}
> Program received signal SIGSEGV, Segmentation fault. 0x00007fffe8173917 in virtual thunk to parquet::DictEncoderImpl<parquet::DataType<(parquet::Type::type)6> >::Put(parquet::ByteArray const*, int) () from /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>  
>  
> Thanks for all of your arrow work,
> Josh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)