Posted to user@spark.apache.org by Efe Selcuk <ef...@gmail.com> on 2016/10/14 03:25:38 UTC

[Spark 2.0.0] error when unioning to an empty dataset

I have a use case where I want to build a dataset based off of
conditionally available data. I thought I'd do something like this:

case class SomeData( ... ) // parameters are basic encodable types like
strings and BigDecimals

var data = spark.emptyDataset[SomeData]

// loop, determining what data to ingest and process into datasets
  data = data.union(someCode.thatReturnsADataset)
// end loop

However I get a runtime exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
unresolved operator 'Union;
        at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
        at
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
        at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
        at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
        at
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
        at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
        at
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
        at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
        at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
        at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
        at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)

Granted, I'm new to Spark, so this might be an anti-pattern; I'm open to
suggestions. However, it doesn't seem like I'm doing anything incorrect
here, and the types are correct. Searching for this error online mostly
returns results about DataFrames with mismatched schemas or a different
order of fields, and it seems like bug fixes are already in place for those
cases.

Thanks in advance.
Efe

Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Efe Selcuk <ef...@gmail.com>.
All right, I looked at the schemas. There is one nullability mismatch, on a
scala.Boolean field. In an empty Dataset, that field *cannot* be nullable.
However, when I run my code to generate the Dataset, the schema comes back
with nullable = true. Effectively:

scala> val empty = spark.emptyDataset[SomeClass]
scala> empty.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = false)


scala> val data = // Dataset#flatMap that returns a Dataset[SomeClass]
scala> data.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = true)

scala> empty.union(data)
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

If I switch the Boolean to a java.lang.Boolean, I get nullable = true in
the empty schema and the union starts working.

1) Is there a fix for this that I can do without jumping through hoops? I
don't know the implications of switching to java.lang.Boolean.

2) It looks like this is probably the issue that these PRs fix:
https://github.com/apache/spark/pull/15595 and
https://github.com/apache/spark/pull/15602  Is there a timeline for 2.0.2?
I'm in a situation where I can't easily build from source.


Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Cheng Lian <li...@gmail.com>.

On 10/22/16 1:42 PM, Efe Selcuk wrote:
> Ah, looks similar. Next opportunity I get, I'm going to do a 
> printSchema on the two datasets and see if they don't match up.
>
> I assume that unioning the underlying RDDs doesn't run into this 
> problem because of less type checking or something along those lines?
Exactly.


Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Efe Selcuk <ef...@gmail.com>.
Ah, looks similar. Next opportunity I get, I'm going to do a printSchema on
the two datasets and see whether they match up.

I assume that unioning the underlying RDDs doesn't run into this problem
because there is less type checking, or something along those lines?

On Fri, Oct 21, 2016 at 3:39 PM Cheng Lian <li...@gmail.com> wrote:

> Efe - You probably hit this bug:
> https://issues.apache.org/jira/browse/SPARK-18058

Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Cheng Lian <li...@gmail.com>.
Efe - You probably hit this bug: 
https://issues.apache.org/jira/browse/SPARK-18058


On 10/21/16 2:03 AM, Agraj Mangal wrote:
> I have seen this error sometimes when the elements in the schema have 
> different nullabilities. Could you print the schema for data and for 
> someCode.thatReturnsADataset() and see if there is any difference 
> between the two ?


Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Agraj Mangal <ag...@gmail.com>.
I have seen this error sometimes when the elements in the schema have
different nullabilities. Could you print the schema for data and for
someCode.thatReturnsADataset() and see if there is any difference between
the two?
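A quick way to run that check, as a sketch (using the names from the
original post, which are placeholders for the real code):

```scala
// Print both schemas and eyeball them, nullability included.
data.printSchema()
someCode.thatReturnsADataset().printSchema()

// Or compare programmatically; StructType equality is exact, so it also
// catches nullable = true vs nullable = false on otherwise equal fields.
println(data.schema == someCode.thatReturnsADataset().schema)
```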

On Fri, Oct 21, 2016 at 9:14 AM, Efe Selcuk <ef...@gmail.com> wrote:

> Thanks for the response. What do you mean by "semantically" the same?
> They're both Datasets of the same type, which is a case class, so I would
> expect compile-time integrity of the data. Is there a situation where this
> wouldn't be the case?
>
> Interestingly enough, if I instead create an empty rdd with
> sparkContext.emptyRDD of the same case class type, it works!
>
> So something like:
> var data = spark.sparkContext.emptyRDD[SomeData]
>
> // loop
>   data = data.union(someCode.thatReturnsADataset().rdd)
> // end loop
>
> data.toDS //so I can union it to the actual Dataset I have elsewhere


-- 
Thanks & Regards,
Agraj Mangal

Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Efe Selcuk <ef...@gmail.com>.
Thanks for the response. What do you mean by "semantically" the same?
They're both Datasets of the same type, which is a case class, so I would
expect compile-time integrity of the data. Is there a situation where this
wouldn't be the case?

Interestingly enough, if I instead create an empty RDD with
sparkContext.emptyRDD of the same case class type, it works!

So something like:
var data = spark.sparkContext.emptyRDD[SomeData]

// loop
  data = data.union(someCode.thatReturnsADataset().rdd)
// end loop

data.toDS //so I can union it to the actual Dataset I have elsewhere
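Spelled out a bit more, the workaround might look like this (a sketch
assuming a live SparkSession named spark; SomeData and
fetchConditionalBatches are hypothetical placeholders):

```scala
import org.apache.spark.sql.SparkSession

case class SomeData(name: String, flag: Boolean)

val spark = SparkSession.builder().appName("union-via-rdd").getOrCreate()
import spark.implicits._

// RDD.union doesn't go through the Catalyst analyzer, so the nullability
// mismatch that breaks Dataset.union never comes into play.
var data = spark.sparkContext.emptyRDD[SomeData]
for (batch <- fetchConditionalBatches()) { // hypothetical Seq[Seq[SomeData]]
  data = data.union(spark.sparkContext.parallelize(batch))
}

val ds = data.toDS() // back to a Dataset for the rest of the pipeline
```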

On Thu, Oct 20, 2016 at 8:34 PM Agraj Mangal <ag...@gmail.com> wrote:

I believe this normally comes when Spark is unable to perform union due to
"difference" in schema of the operands. Can you check if the schema of both
the datasets are semantically same ?


Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Agraj Mangal <ag...@gmail.com>.
I believe this normally comes up when Spark is unable to perform the union
due to a difference in the schemas of the operands. Can you check whether
the schemas of the two datasets are semantically the same?

On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk <ef...@gmail.com> wrote:

> Bump!
>

-- 
Thanks & Regards,
Agraj Mangal

Re: [Spark 2.0.0] error when unioning to an empty dataset

Posted by Efe Selcuk <ef...@gmail.com>.
Bump!
