Posted to dev@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/12/20 16:38:58 UTC

[jira] [Created] (ARROW-436) [Python] pandas-parquet roundtrip dtype mismatch

Wes McKinney created ARROW-436:
----------------------------------

             Summary: [Python] pandas-parquet roundtrip dtype mismatch
                 Key: ARROW-436
                 URL: https://issues.apache.org/jira/browse/ARROW-436
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Wes McKinney


As a follow-up to ARROW-434, I observed the following odd failure:

{code}
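# Imports assumed from the aliases used below (not shown in the original report).
import io

import numpy as np
import pandas as pd
import pandas.util.testing as pdt
import pyarrow as A
import pyarrow.parquet as pq

# `parquet` is assumed to be a pytest marker defined in the test module.
@parquet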
def test_pandas_parquet_pyfile_failure(tmpdir):
    filename = tmpdir.join('pandas_pyfile_roundtrip.parquet').strpath
    size = 5
    np.random.seed(0)
    df = pd.DataFrame({
        'uint8': np.arange(size, dtype=np.uint8),
        'uint16': np.arange(size, dtype=np.uint16),
        'uint32': np.arange(size, dtype=np.uint32),
        'uint64': np.arange(size, dtype=np.uint64),
        'int8': np.arange(size, dtype=np.int16),
        'int16': np.arange(size, dtype=np.int16),
        'int32': np.arange(size, dtype=np.int32),
        'int64': np.arange(size, dtype=np.int64),
        'float32': np.arange(size, dtype=np.float32),
        'float64': np.arange(size, dtype=np.float64),
        'bool': np.random.randn(size) > 0
    })

    arrow_table = A.from_pandas_dataframe(df)

    with open(filename, 'wb') as f:
        A.parquet.write_table(arrow_table, f, version="1.0")

    # Read the file back into memory, closing the handle once done
    with open(filename, 'rb') as f:
        data = io.BytesIO(f.read())

    table_read = pq.read_table(data)
    df_read = table_read.to_pandas()
    pdt.assert_frame_equal(df, df_read)
{code}

Debugging locally, I see that the uint32 column is read back as int64, while all other dtypes match:

{code}
(Pdb) df.dtypes
bool          bool
float32    float32
float64    float64
int16        int16
int32        int32
int64        int64
int8         int16
uint16      uint16
uint32      uint32
uint64      uint64
uint8        uint8
dtype: object
(Pdb) df_read.dtypes
bool          bool
float32    float32
float64    float64
int16        int16
int32        int32
int64        int64
int8         int16
uint16      uint16
uint32       int64
uint64      uint64
uint8        uint8
dtype: object
{code}
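
To make the difference between the two listings easier to spot, here is a minimal sketch that compares them column by column. The dictionaries are transcribed from the pdb output above; the helper name `dtype_mismatches` is hypothetical and not part of pyarrow or pandas.

```python
# Column name -> dtype name, transcribed from the two pdb listings above.
written = {
    'bool': 'bool', 'float32': 'float32', 'float64': 'float64',
    'int16': 'int16', 'int32': 'int32', 'int64': 'int64', 'int8': 'int16',
    'uint16': 'uint16', 'uint32': 'uint32', 'uint64': 'uint64',
    'uint8': 'uint8',
}
read_back = {
    'bool': 'bool', 'float32': 'float32', 'float64': 'float64',
    'int16': 'int16', 'int32': 'int32', 'int64': 'int64', 'int8': 'int16',
    'uint16': 'uint16', 'uint32': 'int64', 'uint64': 'uint64',
    'uint8': 'uint8',
}


def dtype_mismatches(expected, actual):
    """Map each shared column whose dtype changed to (expected, actual)."""
    return {
        col: (expected[col], actual[col])
        for col in sorted(expected)
        if col in actual and expected[col] != actual[col]
    }


mismatches = dtype_mismatches(written, read_back)
print(mismatches)  # {'uint32': ('uint32', 'int64')}
```

Only `uint32` differs. The `int8` column shows `int16` in both listings because the test builds it with `dtype=np.int16`. With real DataFrames, the same dictionaries could presumably be built with something like `df.dtypes.astype(str).to_dict()`.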


