Posted to issues@arrow.apache.org by "Kari Schoonbee (Jira)" <ji...@apache.org> on 2021/01/15 07:43:00 UTC

[jira] [Created] (ARROW-11257) PyArrow Table contains different data after writing and reloading from Parquet

Kari Schoonbee created ARROW-11257:
--------------------------------------

             Summary: PyArrow Table contains different data after writing and reloading from Parquet
                 Key: ARROW-11257
                 URL: https://issues.apache.org/jira/browse/ARROW-11257
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Kari Schoonbee
         Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb

 * I'm loading a JSON Lines file into a table using 
{code:python}
pa.json.read_json{code}
It contains one column that is a nested dictionary.
 * I select a row by key and inspect its nested dictionary.
 * I write the table to Parquet.
 * I load the table again from the Parquet file.
 * I look up the same key, and its nested dictionary differs from the one I inspected before writing.

 

To reproduce:

 

Find the attached JSONLines file and Jupyter Notebook. 

The JSON file contains one entry per customer with a generated `msisdn`, `scoring_request_id` and `scorecard_result` object. Each `scorecard_result` consists of a list of feature objects, all with the same value as the `msisdn`, and a score.
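
For readers without the attachment, a file of roughly this shape could be generated as sketched below. The top-level names come from the description above. The names inside `scorecard_result` ("features", "name", "value", "score") and the use of strings for `msisdn` are assumptions and may not match the attached file.

```python
import json
import uuid


def make_record(msisdn, n_features=3, score=0.5):
    """Build one customer entry; every feature value repeats the msisdn."""
    return {
        "msisdn": msisdn,
        "scoring_request_id": str(uuid.uuid4()),
        "scorecard_result": {
            # Hypothetical inner field names, chosen for illustration only.
            "features": [
                {"name": f"feature_{i}", "value": msisdn}
                for i in range(n_features)
            ],
            "score": score,
        },
    }


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def features_match_msisdn(record):
    """Check that every feature value equals the record's msisdn."""
    features = record["scorecard_result"]["features"]
    return all(f["value"] == record["msisdn"] for f in features)
```

Because every feature value repeats the `msisdn`, a check like features_match_msisdn gives a quick way to spot rows whose nested data no longer belongs to their key after reloading.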

The notebook reads the file and demonstrates the issue.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)