Posted to user@spark.apache.org by sr...@gmail.com on 2016/09/01 00:54:41 UTC

RE: AnalysisException exception while parsing XML

How do we explode nested arrays?

Thanks,
Sreekanth Jella

From: Peyman Mohajerian

Re: AnalysisException exception while parsing XML

Posted by Peyman Mohajerian <mo...@gmail.com>.
here is an example:
df1 = df0.select(explode("manager.subordinates.subordinate_clerk.duties").alias("duties-flat"),
                 col("duties-flat.duty.name").alias("duty-name"))

this is in PySpark; I may have some part of this wrong since I didn't test
it, but something similar should work.
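
Since subordinate_clerk is itself an array and duty is another array
nested inside it, here is a two-step Scala version (a minimal, untested
sketch; the variable names are just illustrative) that explodes the outer
array first and then the inner one:

import org.apache.spark.sql.functions.{col, explode}

// Explode the outer array of clerk structs: one row per clerk.
val clerks = df.select(explode(col("manager.subordinates.subordinate_clerk")).alias("clerk"))

// Explode the inner array of duty structs: one row per duty.
val duties = clerks.select(explode(col("clerk.duties.duty")).alias("duty"))

duties.select(col("duty.name").alias("duty_name")).show()

Two separate selects are used because Spark allows only one generator
(such as explode) per select clause.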

On Wed, Aug 31, 2016 at 5:54 PM, <sr...@gmail.com> wrote:

> How do we explode nested arrays?
>
>
>
> Thanks,
> Sreekanth Jella
>
>
>
> *From: *Peyman Mohajerian <mo...@gmail.com>
> *Sent: *Wednesday, August 31, 2016 7:41 PM
> *To: *srikanth.jella@gmail.com
> *Cc: *user@spark.apache.org
> *Subject: *Re: AnalysisException exception while parsing XML
>
>
>
> Once you get to an 'Array' type, you have to use explode; you cannot
> keep traversing it with dot notation.
>
>
>
> On Wed, Aug 31, 2016 at 2:19 PM, <sr...@gmail.com> wrote:
>
> Hello Experts,
>
>
>
> I am using the Spark XML package to parse XML. The exception below is
> thrown when trying to *select a tag which exists at arrays-of-arrays
> depth*, i.e. in this case subordinate_clerk.xxxx.duty.name
>
>
>
> With below sample XML, issue is reproducible:
>
>
>
> <emplist>
>
>   <emp>
>
>    <manager>
>
>     <id>1</id>
>
>     <name>mgr1</name>
>
>     <dateOfJoin>2005-07-31</dateOfJoin>
>
>     <subordinates>
>
>       <subordinate_clerk>
>
>         <cid>2</cid>
>
>         <cname>clerk2</cname>
>
>         <dateOfJoin>2005-07-31</dateOfJoin>
>
>       </subordinate_clerk>
>
>       <subordinate_clerk>
>
>         <cid>3</cid>
>
>         <cname>clerk3</cname>
>
>         <dateOfJoin>2005-07-31</dateOfJoin>
>
>       </subordinate_clerk>
>
>     </subordinates>
>
>    </manager>
>
>   </emp>
>
>   <emp>
>
>    <manager>
>
>    <id>11</id>
>
>    <name>mgr11</name>
>
>     <subordinates>
>
>       <subordinate_clerk>
>
>         <cid>12</cid>
>
>         <cname>clerk12</cname>
>
>         <duties>
>
>          <duty>
>
>            <name>first duty</name>
>
>          </duty>
>
>          <duty>
>
>            <name>second duty</name>
>
>          </duty>
>
>        </duties>
>
>       </subordinate_clerk>
>
>     </subordinates>
>
>    </manager>
>
>   </emp>
>
> </emplist>
>
>
>
>
>
> scala> df.select( "manager.subordinates.subordinate_clerk.duties.duty.name").show
>
>
>
> Exception is:
>
>  org.apache.spark.sql.AnalysisException: cannot resolve 'manager.subordinates.subordinate_clerk.duties.duty[name]' due to data type mismatch: argument 2 requires integral type, however, 'name' is of string type.;
>
>        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>
>        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>
>        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>
>        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>
>        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
>        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
>        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
>        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>
>        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
>        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>
>        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>
>        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>
>        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>
>        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>
>        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>
>        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
>
>        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
>
> ... more
>
>
>
>
>
>
>
>
>
> scala> df.printSchema
>
> root
>
>  |-- manager: struct (nullable = true)
>
>  |    |-- dateOfJoin: string (nullable = true)
>
>  |    |-- id: long (nullable = true)
>
>  |    |-- name: string (nullable = true)
>
>  |    |-- subordinates: struct (nullable = true)
>
>  |    |    |-- subordinate_clerk: array (nullable = true)
>
>  |    |    |    |-- element: struct (containsNull = true)
>
>  |    |    |    |    |-- cid: long (nullable = true)
>
>  |    |    |    |    |-- cname: string (nullable = true)
>
>  |    |    |    |    |-- dateOfJoin: string (nullable = true)
>
>  |    |    |    |    |-- duties: struct (nullable = true)
>
>  |    |    |    |    |    |-- duty: array (nullable = true)
>
>  |    |    |    |    |    |    |-- element: struct (containsNull = true)
>
>  |    |    |    |    |    |    |    |-- name: string (nullable = true)
>
>
>
>
>
>
>
> Versions info:
>
> Spark - 1.6.0
>
> Scala - 2.10.5
>
> Spark XML - com.databricks:spark-xml_2.10:0.3.3
>
>
>
> Please let me know if there is a solution or workaround for this.
>
>
>
> Thanks,
>
> Sreekanth
>
>
>
>
>
>
>