Posted to jira@arrow.apache.org by "Adrien Hoarau (Jira)" <ji...@apache.org> on 2020/12/15 16:45:00 UTC
[jira] [Commented] (ARROW-10919) [Python] Wrong values with Table slicing and conversion to/from pandas ExtensionArray

    [ https://issues.apache.org/jira/browse/ARROW-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249789#comment-17249789 ] 

Adrien Hoarau commented on ARROW-10919:
---------------------------------------

This seems to be a problem with how the slice offset is handled while the slice is still a view of the original structure. Forcing the slice to be a copy, rather than a reference to the original structure, seems to solve it (though that is obviously not ideal). The way I currently force the copy is a round trip to/from Parquet on the slice. Open to better suggestions :)

> [Python] Wrong values with Table slicing and conversion to/from pandas ExtensionArray
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-10919
>                 URL: https://issues.apache.org/jira/browse/ARROW-10919
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: INSTALLED VERSIONS
> ------------------
> commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
> python           : 3.8.6.final.0
> python-bits      : 64
> OS               : Linux
> OS-release       : 5.4.0-58-generic
> Version          : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
> machine          : x86_64
> processor        : x86_64
> byteorder        : little
> LC_ALL           : None
> LANG             : en_US.UTF-8
> LOCALE           : en_US.UTF-8
> pandas           : 1.1.5
> numpy            : 1.19.4
> pytz             : 2020.4
> dateutil         : 2.8.1
> pip              : 20.2.1
> setuptools       : 49.2.1
> Cython           : None
> pytest           : 5.4.3
> hypothesis       : None
> sphinx           : None
> blosc            : None
> feather          : None
> xlsxwriter       : None
> lxml.etree       : None
> html5lib         : None
> pymysql          : None
> psycopg2         : None
> jinja2           : None
> IPython          : None
> pandas_datareader: None
> bs4              : None
> bottleneck       : None
> fsspec           : 0.8.4
> fastparquet      : None
> gcsfs            : None
> matplotlib       : None
> numexpr          : None
> odfpy            : None
> openpyxl         : None
> pandas_gbq       : None
> pyarrow          : 2.0.0
> pytables         : None
> pyxlsb           : None
> s3fs             : 0.4.2
> scipy            : None
> sqlalchemy       : None
> tables           : None
> tabulate         : None
> xarray           : None
> xlrd             : None
> xlwt             : None
> numba            : None
>            Reporter: Adrien Hoarau
>            Priority: Major
>         Attachments: Screenshot from 2020-12-15 13-28-38.png
>
>
>  
> {code:python}
> import pandas as pd
> from pyarrow import Table
> df = pd.DataFrame({'int_na': [0, None, 2, 3, None, 5, 6, None, 8]}, dtype=pd.Int64Dtype())
> print(df)
> {code}
>    int_na
> 0       0
> 1    <NA>
> 2       2
> 3       3
> 4    <NA>
> 5       5
> 6       6
> 7    <NA>
> 8       8
> {code:python}
> Table.from_pandas(df).slice(2, None).to_pandas()
> {code}
>    int_na
> 0       2
> 1    <NA>
> 2       1
> 3       5
> 4    <NA>
> 5       1
> 6       8
