Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/15 09:05:00 UTC

[jira] [Commented] (ARROW-11257) [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet

    [ https://issues.apache.org/jira/browse/ARROW-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265841#comment-17265841 ] 

Joris Van den Bossche commented on ARROW-11257:
-----------------------------------------------

[~kari_s] thanks for the reproducible example!

Running your notebook with pyarrow 2.0.0, I do see a different result for the selected entry after the Parquet roundtrip.
But when trying it with pyarrow master, I _think_ it is solved (it's a complex structure, so I might be missing something, but at least the result looks correct now). Master contains a few fixes compared to 2.0.0 related to reading/writing such heavily nested data with Parquet (support for which was new in pyarrow 2.0.0).

Could you try out the latest pyarrow version and verify that it is indeed fixed? That would be very helpful.
You can install the nightly version with either pip or conda (see https://arrow.apache.org/docs/python/install.html#installing-nightly-packages for instructions).

cc [~emkornfield] this is a nice example of (quite real-world, I assume) nested data with several levels of mixed structs and lists. The schema looks like:

{code}
pyarrow.Table
msisdn: string
scoring_request_id: string
scorecard_result: struct<feature_values: list<item: struct<feature: struct<id: string, implementation_id: string>, missing: bool, value: string>>, score: double>
  child 0, feature_values: list<item: struct<feature: struct<id: string, implementation_id: string>, missing: bool, value: string>>
      child 0, item: struct<feature: struct<id: string, implementation_id: string>, missing: bool, value: string>
          child 0, feature: struct<id: string, implementation_id: string>
              child 0, id: string
              child 1, implementation_id: string
          child 1, missing: bool
          child 2, value: string
  child 1, score: double
{code}

It might still be worth adding such examples as test cases? (in either the python or C++ tests)


> [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11257
>                 URL: https://issues.apache.org/jira/browse/ARROW-11257
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Kari Schoonbee
>            Priority: Critical
>         Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb
>
>
> * I'm loading a JSON Lines file into a table using 
> {code:python}
> pa.json.read_json{code}
> It contains one column that is a nested dictionary.
>  * I select a row by key and inspect its nested dictionary.
>  * I write the table to parquet 
>  * I load the table again from the parquet file 
>  * I check the same key and the nested dictionary is not the same.
>  
> To reproduce:
>  
> Find the attached JSONLines file and Jupyter Notebook. 
> The JSON file contains entries per customer with a generated `msisdn`, `scoring_request_id` and `scorecard_result` object. Each `scorecard_result` consists of a list of feature objects, all with the same value as the `msisdn`, and a score.
> The notebook reads the file and demonstrates the issue.
>  


