Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/05/17 11:38:00 UTC
[jira] [Comment Edited] (ARROW-12762) [Python] pyarrow.lib.Schema equality fails after pickle and unpickle
[ https://issues.apache.org/jira/browse/ARROW-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346100#comment-17346100 ]
Joris Van den Bossche edited comment on ARROW-12762 at 5/17/21, 11:37 AM:
--------------------------------------------------------------------------
[~jjgalvez] thanks for opening the issue.
I can't reproduce this without pyspark; when writing the pandas dataframe to parquet with pyarrow, it seems to work:
{code}
In [11]: import pyarrow.parquet as pq
In [12]: table = pa.table(df)
In [13]: pq.write_table(table, "test_list_str.parquet")
In [14]: ds = pq.ParquetDataset("test_list_str.parquet")
In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema
Out[15]: True
In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema()
Out[16]: True
{code}
Could you check what differs between the schemas before and after pickling? (E.g., if you print both, do you see a difference? Or is the difference in schema.metadata?)
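To make that comparison concrete, here is a minimal sketch of one way to list the differences. The helper names `schema_diff` and `roundtrip_diff` are hypothetical (not part of pyarrow), and the sketch assumes both objects are pyarrow.Schema instances exposing `names`, `metadata` and `field(name)`:

```python
import pickle


def schema_diff(before, after):
    """List human-readable differences between two pyarrow.Schema objects.

    Hypothetical helper; it relies only on Schema.names, Schema.metadata,
    Schema.field(name) and Field.type / Field.nullable / Field.metadata.
    """
    diffs = []
    if before.names != after.names:
        diffs.append(f"field names: {before.names!r} != {after.names!r}")
    if before.metadata != after.metadata:
        diffs.append(f"schema metadata: {before.metadata!r} != {after.metadata!r}")
    # Compare the fields present in both schemas one attribute at a time.
    for name in before.names:
        if name not in after.names:
            continue
        f1, f2 = before.field(name), after.field(name)
        for attr in ("type", "nullable", "metadata"):
            v1, v2 = getattr(f1, attr), getattr(f2, attr)
            if v1 != v2:
                diffs.append(f"field {name!r} {attr}: {v1!r} != {v2!r}")
    return diffs


def roundtrip_diff(schema):
    """Pickle and unpickle a schema, then report what changed (if anything)."""
    return schema_diff(schema, pickle.loads(pickle.dumps(schema)))
```

With the dataset from the issue, this could be called as `roundtrip_diff(ds.schema.to_arrow_schema())`. An empty list only means that these top-level attributes compare equal; in that case the difference would have to be somewhere the sketch does not look.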
was (Author: jorisvandenbossche):
[~jjgalvez] thanks for opening the issue.
I can't reproduce this without pyspark; when writing the pandas dataframe to parquet with pyarrow, it seems to work:
{code}
In [12]: import pyarrow.parquet as pq
In [13]: pq.write_table(table, "test_list_str.parquet")
In [14]: ds = pq.ParquetDataset("test_list_str.parquet")
In [15]: pickle.loads(pickle.dumps(ds.schema)) == ds.schema
Out[15]: True
In [16]: pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema()
Out[16]: True
{code}
> [Python] pyarrow.lib.Schema equality fails after pickle and unpickle
> --------------------------------------------------------------------
>
> Key: ARROW-12762
> URL: https://issues.apache.org/jira/browse/ARROW-12762
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 4.0.0
> Reporter: Juan Galvez
> Priority: Major
>
> Here is a small reproducer:
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyarrow.parquet as pq
> import pickle
> df = pd.DataFrame(
>     {
>         "A": [
>             ["aa", "bb "],
>             ["c"],
>             ["d", "ee", "", "f"],
>             ["ggg", "H"],
>             [""],
>         ]
>     }
> )
> spark = SparkSession.builder.appName("GenSparkData").getOrCreate()
> spark_df = spark.createDataFrame(df)
> spark_df.write.parquet("list_str.pq", "overwrite")
> ds = pq.ParquetDataset("list_str.pq")
> assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema # PASSES
> assert pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema() # FAILS
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)