Posted to user@spark.apache.org by "Kelly, Jonathan" <jo...@amazon.com> on 2014/11/27 02:23:58 UTC

SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

I've noticed some strange behavior when I try to use
SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
that contains elements with nested arrays.  For example, with a file
test.json that contains the single line:

	{"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.  Here's
the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
	at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
	at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
	at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
	at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference between the two
schemas:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)


scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table
and then try inserting the data into the table, it works:

scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
sqlContext.table("test").schema).insertInto("test")
scala> sqlContext.sql("select * from test").collect().foreach(println)
[ArrayBuffer(1, 2, 3)]

Does this mean that there is a bug with how the schema is being
automatically determined when you use HiveContext.jsonFile() for JSON
files that contain nested arrays?  (i.e., should containsNull be true for
the array elements?)  Or is there a bug with how the Hive table is created
from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
probably get around this by defining the schema myself rather than using
auto-detection, but for now I'd like to use auto-detection.
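
As a rough sketch of that manual-schema workaround (assuming the same
spark-shell session as above and the Catalyst type classes that appear in the
stack trace; containsNull is set to true so the schema matches what the Hive
table ends up with):

scala> import org.apache.spark.sql.catalyst.types._
scala> // Declare the array elements as nullable up front instead of relying on auto-detection
scala> val schema = StructType(Seq(StructField("values", ArrayType(IntegerType, containsNull = true), nullable = true)))
scala> sqlContext.jsonFile("test.json", schema).insertInto("test")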

By the way, I'm using Spark 1.1.0.

Thanks,
Jonathan


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

Posted by "Kelly, Jonathan" <jo...@amazon.com>.
Yeah, only a few hours after I sent my message, I saw some correspondence on this other thread, which turns out to be the exact same issue: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html.  Glad to find that this should be fixed in 1.2.0!  I'll give that a try later.

Thanks a lot,
Jonathan


Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

Posted by Yin Huai <hu...@gmail.com>.
Hello Jonathan,

There was a bug in casting data types before inserting into a Hive
table. Hive does not have the notion of "containsNull" for array values.
So, for a Hive table, containsNull will always be true for an array, and
we should ignore this field for Hive. This issue has been fixed by
https://issues.apache.org/jira/browse/SPARK-4245, which will be released
with 1.2.
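
To illustrate the idea (this is only a sketch of the approach, not the actual
SPARK-4245 change), a type-compatibility check that ignores containsNull when
matching array types might look like:

import org.apache.spark.sql.catalyst.types._

// Sketch: treat two Catalyst types as compatible for Hive purposes if they
// differ only in the array containsNull flag (a notion Hive itself does not track).
def compatibleForHive(from: DataType, to: DataType): Boolean = (from, to) match {
  case (ArrayType(fromElem, _), ArrayType(toElem, _)) =>
    compatibleForHive(fromElem, toElem)
  case (StructType(fromFields), StructType(toFields)) =>
    fromFields.size == toFields.size &&
      fromFields.zip(toFields).forall { case (f, t) =>
        f.name == t.name && compatibleForHive(f.dataType, t.dataType)
      }
  case _ => from == to
}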

Thanks,

Yin


Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

Posted by "Kelly, Jonathan" <jo...@amazon.com>.
After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the
schema auto-detected by sqlContext.jsonFile() will have "element: integer
(containsNull = true)", and then
SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
makes sense but doesn't really help).
2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
StructType(Seq(StructField("values", ArrayType(IntegerType, true),
true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
work, though as I mentioned before, this is less than ideal.

Why don't saveAsTable/insertInto work when the containsNull properties
don't match?  I can understand how inserting data with containsNull=true
into a column where containsNull=false might fail, but I think the other
way around (which is the case here) should work.
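
For what it's worth, a middle ground that keeps auto-detection might be to
take the inferred schema, flip containsNull to true on any array fields, and
reload with that schema before inserting. A rough, untested sketch (same
spark-shell session; the variable names are just for illustration):

scala> import org.apache.spark.sql.catalyst.types._
scala> val inferred = sqlContext.jsonFile("test.json").schema
scala> // Mark array elements as nullable so the schema lines up with the Hive table's
scala> val relaxed = StructType(inferred.fields.map {
     |   case StructField(name, ArrayType(elem, _), nullable) =>
     |     StructField(name, ArrayType(elem, containsNull = true), nullable)
     |   case other => other
     | })
scala> sqlContext.jsonFile("test.json", relaxed).insertInto("test")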

~ Jonathan

