Posted to user@spark.apache.org by emlyn <em...@swiftkey.com> on 2016/01/13 12:06:46 UTC
Merging compatible schemas on Spark 1.6.0

I have a series of directories on S3 containing Parquet data, all with
compatible (but not identical) schemas. As the schemas evolve, we verify that
they stay compatible using
org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility. On Spark
1.5, I could read these into a single DataFrame with
sqlCtx.read().parquet(path1, path2), and Spark would take care of merging the
compatible schemas. I have just tried the same thing on Spark 1.6, and it now
fails with:

java.lang.AssertionError: assertion failed: Conflicting directory structures
detected. Suspicious paths:
	s3n://bucket/data/app1/version1/event1
	s3n://bucket/data/app2/version1/event1
If provided paths are partition directories, please set "basePath" in the
options of the data source to specify the root directory of the table. If
there are multiple root directories, please load them separately and then
union them.

Under these paths I have partitioned data, like
s3n://bucket/data/appN/versionN/eventN/dat_received=YYYY-MM-DD/fingerprint=XXXX/part-r-0000-xxxx.lzo.parquet
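
To make the layout concrete, here is a simplified sketch in plain Java (no
Spark dependency) of how a path like the one above splits into a root
directory and key=value partition columns. It is only an illustration of the
directory structure, not Spark's actual partition discovery code; the class
and method names are hypothetical and the partition values in main are made
up. With this layout the two input paths end up with two different roots,
which seems to be what the assertion is complaining about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
class PartitionPathSketch {

    // Returns the directory above the trailing run of key=value segments.
    static String baseDir(String filePath) {
        List<String> segments = new ArrayList<>(Arrays.asList(filePath.split("/")));
        segments.remove(segments.size() - 1); // drop the file name
        while (!segments.isEmpty()
                && segments.get(segments.size() - 1).contains("=")) {
            segments.remove(segments.size() - 1); // drop a partition directory
        }
        return String.join("/", segments);
    }

    // Collects key=value directory names (ignoring the file name) in path order.
    static Map<String, String> partitionValues(String filePath) {
        String[] segments = filePath.split("/");
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < segments.length - 1; i++) {
            int eq = segments[i].indexOf('=');
            if (eq > 0) {
                values.put(segments[i].substring(0, eq), segments[i].substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up partition values and file name, following the layout above.
        String file = "s3n://bucket/data/app1/version1/event1/"
                + "dat_received=2016-01-01/fingerprint=abc/part.lzo.parquet";
        System.out.println(baseDir(file));
        System.out.println(partitionValues(file));
    }
}
```
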
If I load both paths into separate DataFrames and then try to union them, as
suggested in the error message, that fails with:

org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:203)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
	at org.apache.spark.sql.DataFrame.unionAll(DataFrame.scala:1052)

How can I combine these data sets in Spark 1.6? Is there a way to union
DataFrames with different but compatible schemas?
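
For illustration, the kind of alignment such a union would need can be
sketched in plain Java (no Spark dependency). The sketch assumes that
unionAll matches columns by position and needs both sides to have the same
number of columns with matching types, which would explain the unresolved
'Union above. It only computes a shared column order and the columns each
side is missing; the class, method, and column names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class ColumnAlignmentSketch {

    // Union of both column lists, keeping first-seen order.
    static List<String> mergedColumns(List<String> left, List<String> right) {
        Set<String> merged = new LinkedHashSet<>(left);
        merged.addAll(right);
        return new ArrayList<>(merged);
    }

    // Columns in the merged list that one side does not have.
    static List<String> missingColumns(List<String> merged, List<String> present) {
        List<String> missing = new ArrayList<>(merged);
        missing.removeAll(present);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up column names standing in for two compatible schema versions.
        List<String> v1 = Arrays.asList("id", "timestamp", "payload");
        List<String> v2 = Arrays.asList("id", "timestamp", "payload", "locale");
        List<String> merged = mergedColumns(v1, v2);
        System.out.println("merged order:    " + merged);
        System.out.println("missing from v1: " + missingColumns(merged, v1));
        System.out.println("missing from v2: " + missingColumns(merged, v2));
    }
}
```

In DataFrame terms, each missing column would presumably have to be added as
a typed null (for example a null literal cast to the column's type), and both
sides selected in the merged order before calling unionAll. That part is an
assumption and is not shown here.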


