Posted to user@spark.apache.org by storm <pe...@gmail.com> on 2015/08/25 14:13:34 UTC
SparkSQL saveAsParquetFile does not preserve AVRO schema
Hi,
I have a serious problem with saving a DataFrame as a Parquet file.
I read the data from the parquet file like this:
val df = sparkSqlCtx.parquetFile(inputFile.toString)
and print the schema (note that both fields are required):
root
|-- time: long (nullable = false)
|-- time_ymdhms: long (nullable = false)
...omitted...
Now I try to save the DataFrame as a parquet file like this:
df.saveAsParquetFile(outputFile.toString)
The code runs normally, but loading the file saved in the previous step
(outputFile) together with the same inputFile fails with this error:
Caused by: parquet.schema.IncompatibleSchemaModificationException:
repetition constraint is more restrictive: can not merge type required int64
time into optional int64 time
The problem is that saveAsParquetFile does not preserve the nullable flags!
So once I try to load the outputFile parquet file and print its schema, I get this:
root
|-- time: long (nullable = true)
|-- time_ymdhms: long (nullable = true)
...omitted...
I use Spark 1.3.0 with Parquet 1.6.0.
Is it somehow possible to keep these flags as well? Or is it a bug?
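One possible workaround (an untested sketch; withOriginalSchema is a name I made up, and this only fixes the in-memory schema, not the optional/required repetition written into the Parquet footer on disk):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.StructType

// Untested sketch: reload the written file and reinterpret its rows
// under the original schema, restoring the nullable = false flags.
def withOriginalSchema(sqlCtx: SQLContext, path: String,
                       original: StructType): DataFrame = {
  val reloaded = sqlCtx.parquetFile(path)
  // createDataFrame(RDD[Row], StructType) applies the supplied schema as-is.
  sqlCtx.createDataFrame(reloaded.rdd, original)
}
```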
Any help will be appreciated.
Thanks in advance!
Petr
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-saveAsParquetFile-does-not-preserve-AVRO-schema-tp24444.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: SparkSQL saveAsParquetFile does not preserve AVRO schema
Posted by storm <pe...@gmail.com>.
Note:
In the code (org.apache.spark.sql.parquet.DefaultSource) I've found this:
val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull
  // to true for the schema of a parquet data.
  val df =
    sqlContext.createDataFrame(
      data.queryExecution.toRdd,
      data.schema.asNullable)
  val createdRelation =
    createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}
The culprit is "data.schema.asNullable". What is the real reason for this?
Why not simply use the nullable flags from the existing schema?
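To illustrate the effect, here is a toy model of that transformation (my own simplified types, not Spark's actual StructType/StructField, since asNullable is an internal helper): every field is rewritten to nullable = true before the Parquet schema is derived from it.

```scala
// Toy model of the asNullable behaviour (not Spark's real classes):
case class Field(name: String, dataType: String, nullable: Boolean)

def asNullable(schema: Seq[Field]): Seq[Field] =
  schema.map(_.copy(nullable = true))

val schema = Seq(
  Field("time", "long", nullable = false),
  Field("time_ymdhms", "long", nullable = false))

// Every field comes out nullable = true, which is why the written
// Parquet file declares the columns as optional rather than required.
val written = asNullable(schema)
```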
--