Posted to user@arrow.apache.org by Jason Sachs <jm...@gmail.com> on 2020/11/04 23:05:13 UTC
bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?
It looks like pyarrow.Table.from_pydict() cuts off binary data after an embedded 00 byte. Is this a known bug?
(py3) C:\>python
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pyarrow as pa
>>>
>>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
... b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
>>> t = pa.Table.from_pydict({'data':data})
>>> t.to_pandas()
data
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b''
6 b'half'
7 b''
>>> import pandas as pd
>>> pd.DataFrame(data)
0
0 b''
1 b''
2 b''
3 b'Foo!!'
4 b'Bar!!'
5 b'\x00Baz!'
6 b'half\x00baked'
7 b''
>>>
Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?
Posted by Jason Sachs <jm...@gmail.com>.
> Seems a bit buggy
Yeah that's a bit of an understatement :/
Done. https://issues.apache.org/jira/browse/ARROW-10498
I'm trying to poke around, but it looks like it may affect all of the from_* methods. I don't grok Cython very well, so am not sure I can get to a root cause easily.
On 2020/11/04 23:09:37, Wes McKinney <we...@gmail.com> wrote:
> Seems a bit buggy, can you open a Jira issue? Thanks
Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?
Posted by Wes McKinney <we...@gmail.com>.
Seems a bit buggy, can you open a Jira issue? Thanks
Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?
Posted by Jason Sachs <jm...@gmail.com>.
GAH! It looks like it might be my problem, not pyarrow's: NumPy type code S is documented as zero-terminated bytes:
https://numpy.org/doc/stable/reference/arrays.dtypes.html
'S', 'a' zero-terminated bytes (not recommended)
Now I have to figure out why I'm getting that S code (it's generated through some sort of operation via numpy)
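For anyone landing on this thread later, here is a minimal NumPy-only sketch of how the 'S' dtype treats NUL bytes, using the same values as my original report. It assumes nothing about pyarrow internals and does not settle whether the truncation is a pyarrow bug (that is what ARROW-10498 is for). It only shows that a fixed-width 'S13' slot is NUL-padded with no stored length, so a consumer that reads each slot as a C string would stop at the first NUL. The variable names are mine, and the workaround in the last comment is a suggestion, not something confirmed in this thread.

```python
import numpy as np

# Same values as in the original report, stored as fixed-width 13-byte strings.
values = [b'', b'', b'', b'Foo!!', b'Bar!!', b'\x00Baz!', b'half\x00baked', b'']
data = np.array(values, dtype='S13')

# Each slot is padded to 13 bytes with NULs; the true length is not stored.
raw_slot = data.tobytes()[6 * 13:7 * 13]   # raw bytes of element 6

# NumPy strips only *trailing* NULs when an element is read back,
# so embedded NULs survive but a genuine trailing NUL would not.
roundtrip = data.tolist()
trailing = np.array([b'abc\x00'], dtype='S13').tolist()

# Possible workaround (an assumption, not verified in this thread): hand Arrow
# plain Python bytes objects instead of the 'S' array, for example
# pa.array(as_python_bytes, type=pa.binary()).
as_python_bytes = data.tolist()

print(raw_slot)
print(roundtrip[5], roundtrip[6])
print(trailing)
```

The point of as_python_bytes is that each Python bytes object carries its own length, so nothing downstream has to guess where the value ends.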