Posted to jira@arrow.apache.org by "Jason Sachs (Jira)" <ji...@apache.org> on 2020/11/04 23:35:00 UTC
[jira] [Updated] (ARROW-10498) pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte
[ https://issues.apache.org/jira/browse/ARROW-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Sachs updated ARROW-10498:
--------------------------------
Description:
The pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte.
{code}
>>> import numpy as np
>>> import pyarrow as pa
>>>
>>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
... b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
>>> t = pa.Table.from_pydict({'data':data})
>>> t.to_pandas()
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
>>> import pandas as pd
>>> pd.DataFrame(data)
0
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b'\x00Baz!'
6 b'half\x00baked'
7 b''
{code}
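The difference between the two outputs above can be modelled without pyarrow. The following is only a minimal sketch, not pyarrow's or NumPy's actual code: it assumes each value sits in a fixed-width 13-byte slot padded with zero bytes (as with dtype='|S13') and compares two hypothetical ways of recovering the value from that slot. The helper names are made up for illustration.

```python
WIDTH = 13  # slot width, matching dtype='|S13' in the report


def pad(value, width=WIDTH):
    """Store a value in a fixed-width slot, padded with zero bytes."""
    return value.ljust(width, b'\x00')


def strip_trailing(slot):
    """Drop only the trailing padding; embedded zero bytes survive."""
    return slot.rstrip(b'\x00')


def cut_at_first_zero(slot):
    """Treat the slot like a C string: stop at the first zero byte."""
    return slot.split(b'\x00', 1)[0]


values = [b'', b'Foo!!', b'\x00Baz!', b'half\x00baked']
slots = [pad(v) for v in values]

kept = [strip_trailing(s) for s in slots]
cut = [cut_at_first_zero(s) for s in slots]

print(kept)  # agrees with the pd.DataFrame(data) rows for these values
print(cut)   # agrees with the t.to_pandas() rows for these values
```

Under this model, stripping only trailing padding reproduces what pandas shows, while stopping at the first zero byte reproduces the truncated values. Whether pyarrow actually does the latter is an assumption that would need to be confirmed in its conversion code.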
Another test case, covering several ways of building the table (perhaps the truncation happens in the pyarrow.Table -> to_pandas() conversion step?):
{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # import the submodule explicitly instead of relying on pa.parquet

print('PyArrow version: %s' % pa.__version__)

# Fixed-width byte strings, two of them containing an embedded zero byte
data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
                 b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
df1 = pd.DataFrame(data, columns=['data'])
print('\ndf1:\n', df1)

# Also round-trip the data through a Parquet file
pqfile = '10498.pq'
df1.to_parquet(pqfile)

tables = {'from_pydict': pa.Table.from_pydict({'data': data}),
          'from_arrays': pa.Table.from_arrays([data], ['data']),
          'from_pandas': pa.Table.from_pandas(df1),
          'read_table': pq.read_table(pqfile),
          }
for k, v in tables.items():
    print("\ntables['%s'].to_pandas():\n" % k,
          v.to_pandas())
print('Pandas from parquet file:\n', pd.read_parquet(pqfile))
{code}
which prints on my machine
{noformat}
>python arrow10498.py
PyArrow version: 2.0.0
df1:
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b'\x00Baz!'
6 b'half\x00baked'
7 b''
tables['from_pydict'].to_pandas():
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
tables['from_arrays'].to_pandas():
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
tables['from_pandas'].to_pandas():
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
tables['read_table'].to_pandas():
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
Pandas from parquet file:
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
{noformat}
was:
The pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte.
{code}
>>> import numpy as np
>>> import pyarrow as pa
>>>
>>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
... b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
>>> t = pa.Table.from_pydict({'data':data})
>>> t.to_pandas()
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
>>> import pandas as pd
>>> pd.DataFrame(data)
0
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b'\x00Baz!'
6 b'half\x00baked'
7 b''
{code}
> pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte
> --------------------------------------------------------------------------------------
>
> Key: ARROW-10498
> URL: https://issues.apache.org/jira/browse/ARROW-10498
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: > python
> Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
> Reporter: Jason Sachs
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)