Posted to issues@arrow.apache.org by "Kari Schoonbee (Jira)" <ji...@apache.org> on 2021/01/15 07:43:00 UTC
[jira] [Created] (ARROW-11257) PyArrow Table contains different data after writing and reloading from Parquet
Kari Schoonbee created ARROW-11257:
--------------------------------------
Summary: PyArrow Table contains different data after writing and reloading from Parquet
Key: ARROW-11257
URL: https://issues.apache.org/jira/browse/ARROW-11257
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Reporter: Kari Schoonbee
Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb
* I'm loading a JSON Lines object into a table using
{code:java}
pa.json.read_json{code}
It contains one column that is a nested dictionary.
* I select a row by key and inspect its nested dictionary.
* I write the table to Parquet.
* I load the table again from the Parquet file.
* I check the same key and the nested dictionary is not the same.
To reproduce:
Find the attached JSONLines file and Jupyter Notebook.
The JSON file contains one entry per customer with a generated `msisdn`, a `scoring_request_id` and a `scorecard_result` object. Each `scorecard_result` consists of a list of feature objects, all with the same value as the `msisdn`, and a score.
The notebook reads the file and demonstrates the issue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)