Posted to issues@spark.apache.org by "Juha Iso-Sipilä (Jira)" <ji...@apache.org> on 2020/10/14 13:08:00 UTC

[jira] [Created] (SPARK-33149) Why does ArrayType schema change between read/write for parquet files?

Juha Iso-Sipilä created SPARK-33149:
---------------------------------------

             Summary: Why does ArrayType schema change between read/write for parquet files?
                 Key: SPARK-33149
                 URL: https://issues.apache.org/jira/browse/SPARK-33149
             Project: Spark
          Issue Type: Question
          Components: Input/Output
    Affects Versions: 3.0.0
            Reporter: Juha Iso-Sipilä


I have parquet files that were produced with the org.apache.parquet Java library (not Spark). The schema has a list of authors that [https://github.com/apache/parquet-mr/tree/master/parquet-tools] reports like this:

repeated binary authors (STRING);

If I do spark.read.parquet(input_dir).write.parquet(output_dir) and then inspect the output files the same way, the authors column has changed into:

optional group authors (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}

It seems to mean the same thing, but from a schema perspective the two are different.

I have another set of tools that read the output of this step (with real logic), but I fail to match the schemas. The original data works fine. Also, df.printSchema() shows the same schema for both (except for a possible nullability change, which we can ignore in this case).

Any thoughts on whether this is intentional, and whether I have any control over this from within Spark?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org