Posted to jira@arrow.apache.org by "Kari Schoonbee (Jira)" <ji...@apache.org> on 2021/01/19 16:08:00 UTC
[jira] [Comment Edited] (ARROW-11257) [C++][Parquet] PyArrow Table
contains different data after writing and reloading from Parquet
[ https://issues.apache.org/jira/browse/ARROW-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267998#comment-17267998 ]
Kari Schoonbee edited comment on ARROW-11257 at 1/19/21, 4:07 PM:
------------------------------------------------------------------
Thanks Joris, that is great. I'll keep an eye out.
I can also add that the same Parquet round-trip works with `pyspark==3.0.0`, using `data_frame.write.parquet()`.
was (Author: kari_s):
Thanks Joris, that is great. I'll keep an eye out.
I can also add that doing the same parquet round-trip using `pyspark==3.0.0` works.
> [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-11257
> URL: https://issues.apache.org/jira/browse/ARROW-11257
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Kari Schoonbee
> Priority: Critical
> Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb
>
>
> * I'm loading a JSONlines object into a table using
> {code:java}
> pa.json.read_json{code}
> It contains one column that is a nested dictionary.
> * I select a row by key and inspect its nested dictionary.
> * I write the table to parquet
> * I load the table again from the parquet file
> * I check the same key and the nested dictionary is not the same.
>
> To reproduce:
>
> Find the attached JSONLines file and Jupyter Notebook.
> The JSON file contains entries per customer with a generated `msisdn`, `scoring_request_id` and `scorecard_result` object. Each `scorecard_result` consists of a list of feature objects, all with the same value as the `msisdn`, and a score.
> The notebook reads the file and demonstrates the issue.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)