Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/09/01 15:08:20 UTC

[jira] [Comment Edited] (SPARK-17335) Creating Hive table from Spark data

    [ https://issues.apache.org/jira/browse/SPARK-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455709#comment-15455709 ] 

Herman van Hovell edited comment on SPARK-17335 at 9/1/16 3:07 PM:
-------------------------------------------------------------------

{{ArrayType}} and {{MapType}} do not have a proper {{catalogString}} implementation. They call {{simpleString}} on their child data types, which is problematic if the child data type is a struct.

You can reproduce this with the following code:
{noformat}
import org.apache.spark.sql.types._
def complex = new StructType((0 to 25).map(i => new StructField(('a' + i).toChar.toString, IntegerType, false)).toArray)
val schema = new StructType().add("elements", ArrayType(complex))
println(schema.catalogString)

>struct<elements:array<struct<a:int,b:int,c:int,d:int,e:int,f:int,g:int,h:int,i:int,j:int,k:int,l:int,m:int,n:int,o:int,p:int,q:int,r:int,s:int,t:int,u:int,v:int,w:int,x:int,... 2 more fields>>>
{noformat}
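The effect can also be illustrated without Spark on the classpath. The following is a minimal sketch, not Spark's actual code: {{DType}}, {{IntT}}, {{Field}}, {{StructT}}, {{BuggyArrayT}}, {{FixedArrayT}} and {{MaxFields}} are hypothetical names, and the truncation rule (limit of 25, keep 24 fields) is an assumption chosen so that it reproduces the output above. It contrasts an array type whose catalog string falls back to the child's display string with one that recurses through the child's catalog string.

```scala
// Toy model of a type tree. These are NOT Spark's classes; all names are hypothetical.
object CatalogStringSketch {
  // Assumed display limit, chosen so that 26 fields print as "24 shown, 2 more".
  val MaxFields = 25

  sealed trait DType {
    def simpleString: String  // display form, may be truncated
    def catalogString: String // full form, intended for the metastore
  }

  case object IntT extends DType {
    val simpleString = "int"
    val catalogString = "int"
  }

  final case class Field(name: String, dataType: DType)

  final case class StructT(fields: Seq[Field]) extends DType {
    // Display form: truncates wide structs with a "... N more fields" marker.
    def simpleString: String = {
      val parts = fields.map(f => s"${f.name}:${f.dataType.simpleString}")
      if (parts.length > MaxFields) {
        val kept = MaxFields - 1
        parts.take(kept).mkString("struct<", ",", s",... ${parts.length - kept} more fields>")
      } else {
        parts.mkString("struct<", ",", ">")
      }
    }
    // Full form: never truncates, and recurses with catalogString.
    def catalogString: String =
      fields.map(f => s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">")
  }

  // Models the behaviour described above: the catalog string is just the display string.
  final case class BuggyArrayT(element: DType) extends DType {
    def simpleString: String = s"array<${element.simpleString}>"
    def catalogString: String = simpleString
  }

  // One possible correction: recurse through the child's catalogString.
  final case class FixedArrayT(element: DType) extends DType {
    def simpleString: String = s"array<${element.simpleString}>"
    def catalogString: String = s"array<${element.catalogString}>"
  }

  // 26 int fields named a..z, as in the reproduction above.
  def wide: StructT = StructT((0 to 25).map(i => Field(('a' + i).toChar.toString, IntT)))

  val buggy: String = StructT(Seq(Field("elements", BuggyArrayT(wide)))).catalogString
  val fixed: String = StructT(Seq(Field("elements", FixedArrayT(wide)))).catalogString

  def main(args: Array[String]): Unit = {
    println(buggy) // contains the "... 2 more fields" marker
    println(fixed) // lists all 26 fields
  }
}
```

In this model the outer struct is narrow, so its own {{catalogString}} is fine; the truncation comes entirely from the array delegating to the child's {{simpleString}}. A map type would presumably need the same treatment for both its key and value types.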





> Creating Hive table from Spark data
> -----------------------------------
>
>                 Key: SPARK-17335
>                 URL: https://issues.apache.org/jira/browse/SPARK-17335
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Michal Kielbowicz
>
> Recently my team started using Spark to analyze huge JSON objects. Spark itself handles them well. The problem starts when we try to create a Hive table from the data, following the steps in this part of the docs: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
> After running the command `spark.sql("CREATE TABLE x AS (SELECT * FROM y)")` we get the following exception (sorry for the obfuscation, the data is confidential):
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: : expected at the position 993 of 'string:struct<a:boolean,b:array<string>,c:boolean,d:struct<e:boolean,f:boolean,[...(few others)],z:boolean,... 4 more fields>,[...(rest of valid struct string)]>' but ' ' is found.;
> {code}
> It turned out that the exception was raised because of the `... 4 more fields` part, which is not a valid representation of the data structure.
> An easy workaround is to set `spark.debug.maxToStringFields` to some large value. Nevertheless, this shouldn't be required: the stringifying process should use methods that produce a valid type string for Hive.
> In my opinion the root problem is here:
> https://github.com/apache/spark/blob/9d7a47406ed538f0005cdc7a62bc6e6f20634815/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L318 which calls the `simpleString` method instead of `catalogString`. However, this class is used in many places and I don't feel experienced enough with Spark to submit a PR right away.
> We believe this issue is indirectly caused by this PR: https://github.com/apache/spark/pull/13537
> There has been almost the same issue in the past. You can find it here: https://issues.apache.org/jira/browse/SPARK-16415


