Posted to jira@arrow.apache.org by "Kari Schoonbee (Jira)" <ji...@apache.org> on 2021/01/19 16:08:00 UTC

[jira] [Comment Edited] (ARROW-11257) [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet

    [ https://issues.apache.org/jira/browse/ARROW-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267998#comment-17267998 ] 

Kari Schoonbee edited comment on ARROW-11257 at 1/19/21, 4:07 PM:
------------------------------------------------------------------

Thanks Joris, that is great. I'll keep an eye out.

I can also add that the same Parquet round-trip works with `pyspark==3.0.0` when writing with `data_frame.write.parquet()`.


was (Author: kari_s):
Thanks Joris, that is great. I'll keep an eye out.

I can also add that doing the same parquet round-trip using `pyspark==3.0.0` works.

> [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11257
>                 URL: https://issues.apache.org/jira/browse/ARROW-11257
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Kari Schoonbee
>            Priority: Critical
>         Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb
>
>
> * I'm loading a JSON Lines file into a table using 
> {code:python}
> pa.json.read_json{code}
> It contains one column that is a nested dictionary.
>  * I select a row by key and inspect its nested dictionary.
>  * I write the table to parquet 
>  * I load the table again from the parquet file 
>  * I check the same key and the nested dictionary is not the same.
>  
> To reproduce:
>  
> Find the attached JSONLines file and Jupyter Notebook. 
> The JSON file contains one entry per customer, with a generated `msisdn`, a `scoring_request_id` and a `scorecard_result` object. Each `scorecard_result` consists of a score and a list of feature objects, all of which have the same value as the `msisdn`.
> The notebook reads the file and demonstrates the issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)