
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/05/19 07:51:00 UTC

[jira] [Commented] (ARROW-12762) [Python] ListType doesn't preserve field name after pickle and unpickle

    [ https://issues.apache.org/jira/browse/ARROW-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347381#comment-17347381 ] 

Joris Van den Bossche commented on ARROW-12762:
-----------------------------------------------

This should be a relatively easy fix: pickle the value_field instead of the value_type at https://github.com/apache/arrow/blob/aa37d197a63a7efbc0660f9cea2f75cc08c30587/python/pyarrow/types.pxi#L279-L280
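
To illustrate the mechanism, here is a minimal, pyarrow-free sketch. ToyField, ToyListType and FixedToyListType are hypothetical stand-ins, not the actual pyarrow classes or the actual patch. It assumes, as in Arrow, that a list type built from a bare value type falls back to the default child field name "item", so a pickle hook that only stores the value type loses any custom name such as "element".

```python
import pickle
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for a field: a name plus a type string."""
    name: str
    type: str


class ToyListType:
    """Hypothetical stand-in for a list type wrapping a child field."""

    def __init__(self, value_field):
        # A bare type (here: a string) gets the default child name "item".
        if isinstance(value_field, str):
            value_field = ToyField("item", value_field)
        self.value_field = value_field

    @property
    def value_type(self):
        return self.value_field.type

    def __eq__(self, other):
        return isinstance(other, ToyListType) and self.value_field == other.value_field

    def __reduce__(self):
        # Lossy: only the child type is pickled, so the field name is dropped.
        return ToyListType, (self.value_type,)


class FixedToyListType(ToyListType):
    def __reduce__(self):
        # Pickle the whole child field so its name survives the round trip.
        return FixedToyListType, (self.value_field,)


original = ToyListType(ToyField("element", "string"))
lossy = pickle.loads(pickle.dumps(original))

fixed_original = FixedToyListType(ToyField("element", "string"))
fixed = pickle.loads(pickle.dumps(fixed_original))

print(lossy.value_field.name)  # item
print(fixed.value_field.name)  # element
```

In this sketch the lossy round trip compares unequal to the original because the child field name changed, which mirrors the failing assertion in the reproducer quoted below.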

> [Python] ListType doesn't preserve field name after pickle and unpickle
> -----------------------------------------------------------------------
>
>                 Key: ARROW-12762
>                 URL: https://issues.apache.org/jira/browse/ARROW-12762
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Juan Galvez
>            Priority: Major
>             Fix For: 5.0.0
>
>
> Here is a small reproducer:
> {code:python}
> import pickle
>
> import pandas as pd
> import pyarrow.parquet as pq
> from pyspark.sql import SparkSession
>
> # A single column holding lists of strings.
> df = pd.DataFrame(
>     {
>         "A": [
>             ["aa", "bb "],
>             ["c"],
>             ["d", "ee", "", "f"],
>             ["ggg", "H"],
>             [""],
>         ]
>     }
> )
>
> # Write the data to Parquet through Spark.
> spark = SparkSession.builder.appName("GenSparkData").getOrCreate()
> spark_df = spark.createDataFrame(df)
> spark_df.write.parquet("list_str.pq", "overwrite")
>
> # Read the dataset back with pyarrow.
> ds = pq.ParquetDataset("list_str.pq")
>
> # The Parquet schema survives a pickle round trip.
> assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema # PASSES
> # The converted Arrow schema does not: the list's field name is not preserved.
> assert pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema() # FAILS
> {code}


