Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/10/16 04:48:00 UTC

[jira] [Resolved] (SPARK-33149) Why does ArrayType schema change between read/write for parquet files?

     [ https://issues.apache.org/jira/browse/SPARK-33149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33149.
----------------------------------
    Resolution: Invalid

Also, questions are better directed to the mailing list. Tentatively resolving it.

> Why does ArrayType schema change between read/write for parquet files?
> ----------------------------------------------------------------------
>
>                 Key: SPARK-33149
>                 URL: https://issues.apache.org/jira/browse/SPARK-33149
>             Project: Spark
>          Issue Type: Question
>          Components: Input/Output
>    Affects Versions: 3.0.0
>            Reporter: Juha Iso-Sipilä
>            Priority: Major
>
> I have parquet files that were produced with the org.apache.parquet Java library (not Spark). The schema contains a list of authors, which [https://github.com/apache/parquet-mr/tree/master/parquet-tools] reports like this:
> repeated binary authors (STRING);
> If I run spark.read.parquet(input_dir).write.parquet(output_dir) and inspect the output files the same way, the authors column has changed to:
> optional group authors (LIST) {
>   repeated group list {
>     optional binary element (STRING);
>   }
> }
> It seems to mean the same thing, but from a schema perspective it is different.
> I have another set of tools that read the output from this step (with real logic), but I fail to match the schemas. The original data works fine. Also, df.printSchema() shows the same for both (except for a possible nullability change, which we can ignore in this case).
> Any thoughts on whether this is intentional and whether I have any control over this from within Spark?
>  
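
The observation that df.printSchema() shows the same for both files can be illustrated with a minimal sketch. It is not Spark's actual schema converter. The Field class and the logical_type and scalar_type helpers are hypothetical names introduced only for this illustration, and the sketch handles only the two layouts quoted in the report. It assumes the Parquet format's backward-compatibility rule that a repeated field outside a LIST-annotated group is read as a list of that field's type, and it ignores nullability, as the report does.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for one node of a Parquet schema."""
    name: str
    repetition: str                    # "required", "optional" or "repeated"
    physical: Optional[str] = None     # e.g. "binary"; None for groups
    annotation: Optional[str] = None   # e.g. "STRING" or "LIST"
    children: List["Field"] = field(default_factory=list)


def scalar_type(f: Field) -> str:
    # Only the one primitive used in the report is mapped explicitly.
    if f.physical == "binary" and f.annotation == "STRING":
        return "string"
    return str(f.physical)


def logical_type(f: Field) -> str:
    """Logical type of a field, ignoring nullability."""
    # Three-level layout: group (LIST) { repeated group list { element } }
    if f.annotation == "LIST" and len(f.children) == 1:
        inner = f.children[0]
        if inner.repetition == "repeated" and len(inner.children) == 1:
            return f"array<{logical_type(inner.children[0])}>"
    # Bare repeated field outside a LIST group: read as a list of its own type.
    if f.repetition == "repeated":
        return f"array<{scalar_type(f)}>"
    return scalar_type(f)


# Layout reported for the input files: repeated binary authors (STRING);
original = Field("authors", "repeated", "binary", "STRING")

# Layout reported for the files written by Spark.
rewritten = Field("authors", "optional", annotation="LIST", children=[
    Field("list", "repeated", children=[
        Field("element", "optional", "binary", "STRING"),
    ]),
])

print(logical_type(original))
print(logical_type(rewritten))
```

Both calls resolve to the same logical array of strings, which is consistent with the two files looking identical at the DataFrame level while a tool that compares the physical Parquet schemas sees a difference.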


