Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2017/05/04 07:13:04 UTC

[jira] [Commented] (SPARK-20593) Writing Parquet: Cannot build an empty group

    [ https://issues.apache.org/jira/browse/SPARK-20593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996283#comment-15996283 ] 

Herman van Hovell commented on SPARK-20593:
-------------------------------------------

This is not a bug. You cannot write an empty struct in Parquet.

This is due to the way the Parquet format works. A Parquet file stores only leaf field data; the intermediate structure is not stored and is instead inferred from the schema together with the repetition levels and definition levels of the written leaf fields. An empty struct (which is written as a group) has no leaf fields, so there is nothing to store for it, and that is why Parquet fails to write it.
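
To make the "only leaf fields are stored" point concrete, here is a minimal sketch in plain Python. It does not use the Spark or Parquet APIs: the nested dict/list schema model and the helper names leaf_columns and empty_structs are assumptions made up for illustration. It lists the leaf columns that would actually carry data for the schema in the ticket, and the struct fields that contribute no leaf column at all.

```python
# Toy schema model (an assumption for illustration, NOT the Spark/Parquet API):
#   struct -> dict of field name -> type
#   array  -> one-element list holding the element type
#   leaf   -> type-name string


def leaf_columns(dtype, prefix=()):
    """Return the dotted paths of all leaf (primitive) columns under dtype."""
    if isinstance(dtype, dict):      # struct: recurse into every field
        paths = []
        for name, child in dtype.items():
            paths.extend(leaf_columns(child, prefix + (name,)))
        return paths
    if isinstance(dtype, list):      # array: recurse into the element type
        return leaf_columns(dtype[0], prefix + ("element",))
    return [".".join(prefix)]        # primitive: a column that holds data


def empty_structs(dtype, prefix=()):
    """Return the paths of structs that contain no leaf columns at all."""
    if isinstance(dtype, list):
        return empty_structs(dtype[0], prefix + ("element",))
    if not isinstance(dtype, dict):
        return []
    if not leaf_columns(dtype, prefix):
        return [".".join(prefix)]    # nothing underneath to store
    found = []
    for name, child in dtype.items():
        found.extend(empty_structs(child, prefix + (name,)))
    return found


# The schema from the ticket, transcribed into the toy model.
muon = {
    "reco::Candidate": {},
    "qx3_": "integer",
    "pt_": "float",
    "eta_": "float",
    "phi_": "float",
    "mass_": "float",
    "vertex_": {"fCoordinates": {"fX": "float", "fY": "float", "fZ": "float"}},
    "pdgId_": "integer",
    "status_": "integer",
    "cachePolarFixed_": {},
    "cacheCartesianFixed_": {},
}
schema = {"muons": [muon]}

leaves = leaf_columns(schema)
offenders = empty_structs(schema)
```

In this model, leaves holds ten column paths (for example muons.element.vertex_.fCoordinates.fX), while offenders holds the three empty structs from the ticket. Those three have no entry among the leaf columns, which is the situation the "Cannot build an empty group" check rejects.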

> Writing Parquet: Cannot build an empty group
> --------------------------------------------
>
>                 Key: SPARK-20593
>                 URL: https://issues.apache.org/jira/browse/SPARK-20593
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core, Spark Shell
>    Affects Versions: 2.1.1
>         Environment: I use Apache Spark 2.1.1 (used 2.1.0 and it was the same, switched today). Tested only on Mac
>            Reporter: Viktor Khristenko
>            Priority: Minor
>
> Hi,
> This is my first ticket and I apologize for/if I'm doing certain things in an improper way.
>  I have a dataset:
> {noformat}
> root
> |-- muons: array (nullable = true)
> |    |-- element: struct (containsNull = true)
> |    |    |-- reco::Candidate: struct (nullable = true)
> |    |    |-- qx3_: integer (nullable = true)
> |    |    |-- pt_: float (nullable = true)
> |    |    |-- eta_: float (nullable = true)
> |    |    |-- phi_: float (nullable = true)
> |    |    |-- mass_: float (nullable = true)
> |    |    |-- vertex_: struct (nullable = true)
> |    |    |    |-- fCoordinates: struct (nullable = true)
> |    |    |    |    |-- fX: float (nullable = true)
> |    |    |    |    |-- fY: float (nullable = true)
> |    |    |    |    |-- fZ: float (nullable = true)
> |    |    |-- pdgId_: integer (nullable = true)
> |    |    |-- status_: integer (nullable = true)
> |    |    |-- cachePolarFixed_: struct (nullable = true)
> |    |    |-- cacheCartesianFixed_: struct (nullable = true)
> {noformat}
> As you can see, there are 3 empty structs in this schema. I know 100% that I can read/manipulate/do whatever. However, when I try writing to disk in parquet, I get the following Exception:
> ds.write.format("parquet").save(outputPathName):
> java.lang.IllegalStateException: Cannot build an empty group
> at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
> at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:622)
> at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:497)
> at org.apache.parquet.schema.Types$Builder.named(Types.java:286)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:535)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:534)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:533)
> So, basically I would like to understand if it's a bug or an intended behavior??? I also assume that it's related to the empty structs. Any help would be really appreciated!
> I've quickly created a stripped version and that one works without any issues!
> For reference, I put a link to the original question on SO[1]
> VK
> [1] http://stackoverflow.com/questions/43767358/apache-spark-parquet-cannot-build-an-empty-group


