Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/11/05 01:17:00 UTC
[jira] [Closed] (ARROW-10498) pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte

     [ https://issues.apache.org/jira/browse/ARROW-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney closed ARROW-10498.
--------------------------------
    Resolution: Not A Problem

Closing, since we've adopted NumPy's convention that a NUL terminator is used to embed shorter strings in a fixed-size string dtype, so the first zero byte marks the end of the value.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/numpy_to_arrow.cc#L550
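
To make the convention concrete, here is a minimal pure-Python sketch of fixed-width storage with NUL padding. It is an illustration only, not Arrow's or NumPy's actual code: the helper names (store_fixed_width, read_strip_trailing, read_until_first_nul) are hypothetical, and the width of 13 simply mirrors the '|S13' dtype in the report below. It contrasts two ways of reading a padded field back: stripping only trailing NULs, which matches what the report shows on the NumPy/pandas side, and stopping at the first NUL, which matches the values the report shows after conversion to Arrow.

```python
def store_fixed_width(value: bytes, width: int) -> bytes:
    """Pad a value with NUL bytes so it occupies exactly `width` bytes."""
    if len(value) > width:
        raise ValueError("value does not fit in the fixed-width field")
    return value.ljust(width, b"\x00")


def read_strip_trailing(field: bytes) -> bytes:
    """Drop only the trailing NUL padding; embedded NULs survive."""
    return field.rstrip(b"\x00")


def read_until_first_nul(field: bytes) -> bytes:
    """Treat the first NUL as a terminator; everything after it is dropped."""
    return field.split(b"\x00", 1)[0]


WIDTH = 13  # assumption: mirrors the '|S13' dtype used in the report

for value in (b"Foo!!", b"\x00Baz!", b"half\x00baked"):
    field = store_fixed_width(value, WIDTH)
    print(value, read_strip_trailing(field), read_until_first_nul(field))
```

Under the terminator reading, a fixed-width field cannot distinguish padding from data once a zero byte appears, which is why values with embedded zeros come back shortened. A likely way around this, assuming the data can be handed over as Python bytes objects rather than a fixed-size NumPy dtype, is to build the column from a list of bytes with an explicit pa.binary() type, which stores an explicit length per value.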

> pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10498
>                 URL: https://issues.apache.org/jira/browse/ARROW-10498
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1, 2.0.0
>         Environment: > python
> Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
>            Reporter: Jason Sachs
>            Priority: Critical
>
> The pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte.
> {code}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>>
> >>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
> ...        b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> >>> t = pa.Table.from_pydict({'data':data})
> >>> t.to_pandas()
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> >>> import pandas as pd
> >>> pd.DataFrame(data)
>                   0
> 0               b''
> 1               b''
> 2               b''
> 3          b'Foo!!'
> 4          b'Bar!!'
> 5       b'\x00Baz!'
> 6  b'half\x00baked'
> 7               b''
> {code}
> Another test case (perhaps it's in the pyarrow.Table -> to_pandas() conversion step?):
> {code}
> import numpy as np
> import pyarrow as pa
> import pandas as pd
> print('PyArrow version: %s' % pa.__version__)
> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
>                  b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> df1 = pd.DataFrame(data, columns=['data'])
> print('\ndf1:\n', df1)
> pqfile = '10498.pq'
> df1.to_parquet(pqfile)
>
> tables = {'from_pydict': pa.Table.from_pydict({'data':data}),
>           'from_arrays': pa.Table.from_arrays([data],['data']),
>           'from_pandas': pa.Table.from_pandas(df1),
>           'read_table':  pa.parquet.read_table(pqfile)
>          }
> for k,v in tables.items():
>     print("\ntables['%s'].to_pandas():\n" % k,
>           v.to_pandas())
>
> print('Pandas from parquet file:\n', pd.read_parquet(pqfile))
> for k,v in tables.items():
>     print("tables['%s']['data'][6]=%s" % (k,v['data'][6]))
> {code}
> which prints on my machine
> {noformat}
> >python arrow10498.py
> PyArrow version: 2.0.0
> df1:
>                 data
> 0               b''
> 1               b''
> 2               b''
> 3          b'Foo!!'
> 4          b'Bar!!'
> 5       b'\x00Baz!'
> 6  b'half\x00baked'
> 7               b''
> tables['from_pydict'].to_pandas():
>         data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['from_arrays'].to_pandas():
>         data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['from_pandas'].to_pandas():
>         data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['read_table'].to_pandas():
>         data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> Pandas from parquet file:
>         data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['from_pydict']['data'][6]=b'half'
> tables['from_arrays']['data'][6]=b'half'
> tables['from_pandas']['data'][6]=b'half'
> tables['read_table']['data'][6]=b'half'
> {noformat}