Posted to user@spark.apache.org by mars76 <sk...@yahoo.com.INVALID> on 2020/06/30 16:28:05 UTC

XmlReader not Parsing the Nested elements in XML properly

Hi,

  I am trying to read XML data from a Kafka topic and then use XmlReader (spark-xml) to
convert the resulting RDD[String] into a DataFrame that conforms to a predefined schema.
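
  For reference, the XmlReader path I mean is roughly the sketch below (a minimal sketch,
assuming the builder-style XmlReader API in spark-xml around the 0.9.x releases; xmlRdd and
xmlSchema here are placeholders for my actual inputs):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types.StructType
    import com.databricks.spark.xml.XmlReader

    // xmlRdd: RDD[String] with one XML document per record (placeholder)
    // xmlSchema: the predefined StructType for those documents (placeholder)
    val bookDF = new XmlReader()
      .withRowTag("Book")          // each <Book> element becomes one row
      .withSchema(xmlSchema)       // use the predefined schema, skip inference
      .xmlRdd(spark.sqlContext, xmlRdd)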

  The issue I am running into: after saving the final DataFrame to Avro, most of the
elements' data shows up correctly in the Avro files. However, the nested element of array
type is not parsed properly; it is loaded into the DataFrame as null, so whenever I save to
Avro or JSON that field is always null.

  I am not sure why this element is not getting parsed.
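
  To make it concrete, here is a simplified, made-up example of the shape involved (element
names are placeholders, not my real feed): each <Book> has a wrapper element with a repeated
child, and the predefined schema declares that child as an array of structs, which is how
spark-xml normally represents repeated elements:

    import org.apache.spark.sql.types._

    // Example payload shape (placeholder names):
    //   <Book>
    //     <Title>Some title</Title>
    //     <Authors>
    //       <Author><Name>A</Name></Author>
    //       <Author><Name>B</Name></Author>
    //     </Authors>
    //   </Book>
    //
    // Corresponding predefined schema: the repeated <Author> child is an
    // array of structs nested inside the <Authors> wrapper struct.
    val xmlSchema = StructType(Seq(
      StructField("Title", StringType, nullable = true),
      StructField("Authors", StructType(Seq(
        StructField("Author", ArrayType(StructType(Seq(
          StructField("Name", StringType, nullable = true)))), nullable = true))),
        nullable = true)))

    // In my data, it is this array-typed field that always comes back null.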

 
  Here is the code I am using:


  // Imports needed for this fragment (Spark core, Spark SQL, and spark-xml's XmlRelation)
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.DataFrame
  import com.databricks.spark.xml.XmlRelation
  import spark.implicits._   // for .as[String]; spark is the active SparkSession

  // kafkaDF is the streaming DataFrame read from the Kafka source;
  // xmlSchema is the predefined StructType for the XML documents.
  val kafkaValueAsStringDF = kafkaDF.selectExpr(
    "CAST(key AS STRING) msgKey",
    "CAST(value AS STRING) xmlString")

  val parameters = collection.mutable.Map.empty[String, String]
  parameters.put("rowTag", "Book")

  kafkaValueAsStringDF.writeStream
    .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
      val xmlStringDF: DataFrame = batchDF.selectExpr("xmlString")
      xmlStringDF.printSchema()

      val rdd: RDD[String] = xmlStringDF.as[String].rdd

      // Build an XmlRelation over the RDD of XML strings, using the
      // predefined schema instead of letting spark-xml infer one.
      val relation = XmlRelation(
        () => rdd,
        None,
        parameters.toMap,
        xmlSchema)(spark.sqlContext)

      logger.info(s".convert() : XmlRelation Schema = ${relation.schema.treeString}")
    }
    .start()
    .awaitTermination()
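
  For completeness, my understanding is that newer spark-xml releases (0.6.0 and later, I
believe) also ship a from_xml column function that would avoid the RDD/XmlRelation detour
inside foreachBatch. I have not tried it yet, but a rough sketch would be:

    import org.apache.spark.sql.functions.col
    import com.databricks.spark.xml.functions.from_xml

    // Parse the XML string column directly against the predefined schema,
    // passing the same options map as above.
    val parsedDF = kafkaValueAsStringDF.select(
      col("msgKey"),
      from_xml(col("xmlString"), xmlSchema, parameters.toMap).as("book"))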
  

Thanks
Sateesh



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: XmlReader not Parsing the Nested elements in XML properly

Posted by Sean Owen <sr...@gmail.com>.
This is more a question about spark-xml, which is not part of Spark itself.
You can ask at https://github.com/databricks/spark-xml/ but if you do,
please include an example of the XML input, the schema, and the output.


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org