Posted to dev@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2016/02/27 06:54:58 UTC

some joins stopped working with spark 2.0.0 SNAPSHOT

dataframe df1:
schema:
StructType(StructField(x,IntegerType,true))
explain:
== Physical Plan ==
MapPartitions <function1>, obj#135: object, [if (input[0, object].isNullAt) null else input[0, object].get AS x#128]
+- MapPartitions <function1>, createexternalrow(if (isnull(x#9)) null else x#9), [input[0, object] AS obj#135]
   +- WholeStageCodegen
      :  +- Project [_1#8 AS x#9]
      :     +- Scan ExistingRDD[_1#8]
show:
+---+
|  x|
+---+
|  2|
|  3|
+---+


dataframe df2:
schema:
StructType(StructField(x,IntegerType,true), StructField(y,StringType,true))
explain:
== Physical Plan ==
MapPartitions <function1>, createexternalrow(x#2, if (isnull(y#3)) null else y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, object].get, true) AS y#131]
+- WholeStageCodegen
   :  +- Project [_1#0 AS x#2,_2#1 AS y#3]
   :     +- Scan ExistingRDD[_1#0,_2#1]
show:
+---+---+
|  x|  y|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
+---+---+


i run:
df1.join(df2, Seq("x")).show

i get:
java.lang.UnsupportedOperationException: No size estimation available for objects.
at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
at org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87)
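
judging from the trace, the planner is computing size statistics (UnaryNode.statistics) to decide whether one side of the join can be broadcast (CanBroadcast.unapply), and hits an ObjectType attribute, which has no default size. assuming the broadcast check is what triggers the statistics call, disabling broadcast joins might sidestep it for now (an untested guess, not a fix):

    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")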

not sure what changed; this ran about a week ago without issues (in our
internal unit tests). it is fully reproducible, however when i tried to
minimize the issue i could not reproduce it by just creating data frames in
the repl with the same contents, so it probably has something to do with the
way these are created (from Row objects and StructTypes).
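
for reference, a minimal sketch of the construction (not our actual code; as noted above, this plain version did not reproduce the error in the repl):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

    // frames built from Row objects and explicit StructTypes
    val schema1 = StructType(Seq(StructField("x", IntegerType, true)))
    val df1 = sqlContext.createDataFrame(
      sc.parallelize(Seq(Row(2), Row(3))), schema1)

    val schema2 = StructType(Seq(
      StructField("x", IntegerType, true),
      StructField("y", StringType, true)))
    val df2 = sqlContext.createDataFrame(
      sc.parallelize(Seq(Row(1, "1"), Row(2, "2"), Row(3, "3"))), schema2)

    // the call that fails in our tests
    df1.join(df2, Seq("x")).show()

the MapPartitions nodes in the plans above suggest our real frames also go through a row-level map after construction, which this sketch omits and which may be the missing ingredient.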

best, koert

Re: some joins stopped working with spark 2.0.0 SNAPSHOT

Posted by Jonathan Kelly <jo...@gmail.com>.
If you want to find what commit caused it, try out the "git bisect" command.
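
Roughly, assuming you have a known-good commit from about a week ago and a quick way to run the failing test (the build/test commands below are just placeholders):

    git bisect start
    git bisect bad                    # current HEAD reproduces the error
    git bisect good <known-good-sha>  # last revision where the join worked
    # git now checks out a commit halfway in between; build and test it:
    build/sbt package                 # or however you normally build
    # ... run the failing join test, then tell git the result:
    git bisect good                   # or: git bisect bad
    # repeat build/test/mark until git prints the first bad commit, then:
    git bisect reset
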
On Sat, Feb 27, 2016 at 11:06 AM Koert Kuipers <ko...@tresata.com> wrote:

> https://issues.apache.org/jira/browse/SPARK-13531
>
> On Sat, Feb 27, 2016 at 3:49 AM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Can you file a JIRA ticket?
>>
>>
>> On Friday, February 26, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> [...]
>>
>

Re: some joins stopped working with spark 2.0.0 SNAPSHOT

Posted by Koert Kuipers <ko...@tresata.com>.
https://issues.apache.org/jira/browse/SPARK-13531

On Sat, Feb 27, 2016 at 3:49 AM, Reynold Xin <rx...@databricks.com> wrote:

> Can you file a JIRA ticket?
>
>
> On Friday, February 26, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
>> [...]
>

Re: some joins stopped working with spark 2.0.0 SNAPSHOT

Posted by Reynold Xin <rx...@databricks.com>.
Can you file a JIRA ticket?

On Friday, February 26, 2016, Koert Kuipers <ko...@tresata.com> wrote:

> [...]
>