Posted to user@arrow.apache.org by Jason Sachs <jm...@gmail.com> on 2020/11/04 23:05:13 UTC

bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

It looks like pyarrow.Table.from_pydict() cuts off binary data after an embedded 00 byte. Is this a known bug?

(py3) C:\>python
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pyarrow as pa
>>>
>>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
...        b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
>>> t = pa.Table.from_pydict({'data':data})
>>> t.to_pandas()
       data
0       b''
1       b''
2       b''
3  b'Foo!!'
4  b'Bar!!'
5       b''
6   b'half'
7       b''
>>> import pandas as pd
>>> pd.DataFrame(data)
                  0
0               b''
1               b''
2               b''
3          b'Foo!!'
4          b'Bar!!'
5       b'\x00Baz!'
6  b'half\x00baked'
7               b''
>>>

Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

Posted by Jason Sachs <jm...@gmail.com>.
> Seems a bit buggy

Yeah that's a bit of an understatement :/ 

Done. https://issues.apache.org/jira/browse/ARROW-10498

I'm trying to poke around, but it looks like it may affect all of the from_* methods. I don't grok Cython very well, so I'm not sure I can track down the root cause easily.
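
For anyone checking which constructors are affected, a small comparison helper makes the damaged rows easy to spot. This is only a sketch: find_mismatches is a hypothetical name, and the two lists below are copied by hand from the session in the original report rather than produced by calling pyarrow.

```python
def find_mismatches(expected, actual):
    """Return the indices at which two equal-length sequences differ."""
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


# Values from the report: what went into the table ...
sent = [b'', b'', b'', b'Foo!!', b'Bar!!',
        b'\x00Baz!', b'half\x00baked', b'']
# ... and what t.to_pandas() displayed afterwards.
received = [b'', b'', b'', b'Foo!!', b'Bar!!', b'', b'half', b'']

# Only the rows containing an embedded 00 byte come back different.
bad_rows = find_mismatches(sent, received)
print(bad_rows)
```

The same helper could be pointed at the output of any other from_* method to see whether it truncates the same rows.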

On 2020/11/04 23:09:37, Wes McKinney <we...@gmail.com> wrote: 
> Seems a bit buggy, can you open a Jira issue? Thanks

Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

Posted by Wes McKinney <we...@gmail.com>.
Seems a bit buggy, can you open a Jira issue? Thanks


Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

Posted by Jason Sachs <jm...@gmail.com>.
GAH! It looks like it might be my problem, not pyarrow's; type code S is null-terminated data:

https://numpy.org/doc/stable/reference/arrays.dtypes.html
'S', 'a' zero-terminated bytes (not recommended)

Now I have to figure out why I'm getting that S code (it's generated by some NumPy operation).
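
To make the 'S' behaviour concrete, here is a minimal NumPy-only sketch. It reuses some values from the report plus one made-up value with trailing 00 bytes, and compares a fixed-width 'S13' array with an object array. It says nothing about what pyarrow does internally; it only shows what NumPy itself stores and returns.

```python
import numpy as np

# Three values from the report, plus a hypothetical one with trailing NULs.
values = [b'Foo!!', b'\x00Baz!', b'half\x00baked', b'tail\x00\x00']

# Fixed-width bytes: every element occupies exactly 13 bytes, NUL-padded.
fixed = np.array(values, dtype='S13')

# On element access NumPy strips trailing NULs, so embedded NULs survive
# but the padding and any genuine trailing NULs cannot be told apart.
fixed_items = fixed.tolist()

# The underlying buffer for 'half\x00baked' is 10 data bytes + 3 pad bytes.
raw_slot = fixed.tobytes()[2 * 13:3 * 13]

# An object array holds the Python bytes objects themselves, unchanged.
exact_items = np.array(values, dtype=object).tolist()

print(fixed_items)
print(raw_slot)
print(exact_items)
```

If the exact bytes matter, keeping them as Python bytes objects (a list or an object array) avoids the fixed-width 'S' representation altogether; whether that also sidesteps the truncation in a given pyarrow version is an assumption to verify, not something this thread confirms.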
