You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Ritika Maheshwari (Jira)" <ji...@apache.org> on 2023/11/07 23:17:00 UTC
[jira] [Commented] (SPARK-45414) spark-xml misplaces string tag content
[ https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783832#comment-17783832 ]
Ritika Maheshwari commented on SPARK-45414:
-------------------------------------------
Change _SerialNumberFlag to. SerialNumberFlag and _Action to Action. Then the query runs without error.
***************************
<root_tag>
<row_tag>
<ItemID Type="ItemId">123</ItemID>
<UPC Type="UPC">123</UPC>
<SerialNumberFlag></SerialNumberFlag>
<Description>MyDescription</Description>
<MerchandiseHierarchy ID="" Level=""/>
<ItemPrice ValueTypeCode="" Value=""/>
<Color></Color>
<IntendedIndustry></IntendedIndustry>
<Manufacturer>
<Name></Name>
</Manufacturer>
<Marketing>
<Season></Season>
</Marketing>
<BrandOwner Name=""/>
<ItemAttribute_culinary Name="Attribute1">
<AttributeValue>Value1</AttributeValue>
</ItemAttribute_culinary>
<ItemAttribute_noculinary Name="">
<item>
<AttributeCode></AttributeCode>
<AttributeValue></AttributeValue>
</item>
</ItemAttribute_noculinary>
<ItemMeasurements>
<Depth UnitOfMeasure=""></Depth>
<Height UnitOfMeasure=""></Height>
<Width UnitOfMeasure=""></Width>
<Diameter UnitOfMeasure=""></Diameter>
</ItemMeasurements>
<TaxInformation>
<TaxGroupID>GroupA</TaxGroupID>
<TaxExemptCode>CodeA</TaxExemptCode>
<TaxAmount>1</TaxAmount>
</TaxInformation>
<ItemImageUrl></ItemImageUrl>
<ItemFranchisees>
<item action="" franchiseeId="" franchiseeName=""/>
</ItemFranchisees>
<Action>Add</Action>
</row_tag>
</root_tag>
> spark-xml misplaces string tag content
> --------------------------------------
>
> Key: SPARK-45414
> URL: https://issues.apache.org/jira/browse/SPARK-45414
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.3.0
> Reporter: Giuseppe Ceravolo
> Priority: Critical
> Attachments: IllegalArgumentException.txt
>
>
> h1. Intro
> Hi all! Please expect some degree of incompleteness in this issue as this is the very first one I post, and feel free to edit it as you like - I welcome your feedback.
> My goal is to provide you with as many details and indications as I can on this issue that I am currently facing with a Client of mine on its Production environment (we use Azure Databricks DBR 11.3 LTS).
> I was told by Sean Owen [[srowen (Sean Owen) (github.com)|https://github.com/srowen]], who maintains the spark-xml maven repository on GitHub [[https://github.com/srowen/spark-xml]] to post an issue here because "This code has been ported to Apache Spark now anyway so won't be updated here" (refer to his comment [here|#issuecomment-1744792958]).
> h1. Issue
> When I write a DataFrame into xml format via the spark-xml library either (1) I get an error if empty string columns are in between non-string nested ones or (2) if I put all string columns at the end then I get a wrong xml where the content of string tags are misplaced into the following ones.
> h1. Code to reproduce the issue
> Please find below the end-to-end code snippet that results into the error
> h2. CASE (1): ERROR
> When empty strings are in between non-string nested ones, the write fails with the following error.
> _Caused by: java.lang.IllegalArgumentException: Failed to convert value MyDescription (class of class java.lang.String) in type ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true) to XML._
> Please find attached the full trace of the error.
> {code:python}
> fake_file_df = spark \
> .sql(
> """SELECT
> CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
> CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC,
> CAST('' AS STRING) AS _SerialNumberFlag,
> CAST('MyDescription' AS STRING) AS Description,
> CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
> CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,
> CAST('' AS STRING) AS Color,
> CAST('' AS STRING) AS IntendedIndustry,
> CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer,
> CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing,
> CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner,
> CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary,
> CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
> CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,
> CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation,
> CAST('' AS STRING) AS ItemImageUrl,
> CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
> CAST('Add' AS STRING) AS _Action
> ;"""
> )
> # fake_file_df.display()
> fake_file_df \
> .coalesce(1) \
> .write \
> .format('com.databricks.spark.xml') \
> .option('declaration', 'version="1.0" encoding="UTF-8"') \
> .option("nullValue", "") \
> .option('rootTag', "root_tag") \
> .option('rowTag', "row_tag") \
> .mode('overwrite') \
> .save(xml_folder_path) {code}
> I noticed that it works if I try to write all columns up to "Color" (excluded), namely:
> {code:python}
> fake_file_df \
> .select(
> "ItemID",
> "UPC",
> "_SerialNumberFlag",
> "Description",
> "MerchandiseHierarchy",
> "ItemPrice"
> ) \
> .coalesce(1) \
> .write \
> .format('com.databricks.spark.xml') \
> .option('declaration', 'version="1.0" encoding="UTF-8"') \
> .option("nullValue", "") \
> .option('rootTag', "root_tag") \
> .option('rowTag', "row_tag") \
> .mode('overwrite') \
> .save(xml_folder_path){code}
> h2. CASE (2): MISPLACED XML
> When I put all string columns at the end of the 1-row DataFrame it mistakenly writes the content of one column into the tag right after it.
> {code:python}
> fake_file_df = spark \
> .sql(
> """SELECT
> CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
> CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC,
> CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
> CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,
> CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer,
> CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing,
> CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner,
> CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary,
> CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
> CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,
> CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation,
> CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
> CAST('' AS STRING) AS _SerialNumberFlag,
> CAST('MyDescription' AS STRING) AS Description,
> CAST('' AS STRING) AS Color,
> CAST('' AS STRING) AS IntendedIndustry,
> CAST('' AS STRING) AS ItemImageUrl,
> CAST('Add' AS STRING) AS _Action
> ;"""
> )
> fake_file_df \
> .coalesce(1) \
> .write \
> .format('com.databricks.spark.xml') \
> .option('declaration', 'version="1.0" encoding="UTF-8"') \
> .option("nullValue", "") \
> .option('rootTag', "root_tag") \
> .option('rowTag', "row_tag") \
> .mode('overwrite') \
> .save(xml_folder_path) {code}
> The output is a wrong xml where "MyDescription" is written inside the "Color" tag instead of the "Description" tag (but if you display the "fake_file_df" DataFrame it looks good as "MyDescription" is under the "Description" column).
> {code:xml}
> <?xml version="1.0" encoding="UTF-8"?>
> <root_tag>
> <row_tag SerialNumberFlag="" Action="Add">
> <ItemID Type="ItemId">123</ItemID>
> <UPC Type="UPC">123</UPC>
> <MerchandiseHierarchy ID="" Level=""/>
> <ItemPrice ValueTypeCode="" Value=""/>
> <Manufacturer>
> <Name></Name>
> </Manufacturer>
> <Marketing>
> <Season></Season>
> </Marketing>
> <BrandOwner Name=""/>
> <ItemAttribute_culinary Name="Attribute1">
> <AttributeValue>Value1</AttributeValue>
> </ItemAttribute_culinary>
> <ItemAttribute_noculinary Name="">
> <item>
> <AttributeCode></AttributeCode>
> <AttributeValue></AttributeValue>
> </item>
> </ItemAttribute_noculinary>
> <ItemMeasurements>
> <Depth UnitOfMeasure=""></Depth>
> <Height UnitOfMeasure=""></Height>
> <Width UnitOfMeasure=""></Width>
> <Diameter UnitOfMeasure=""></Diameter>
> </ItemMeasurements>
> <TaxInformation>
> <TaxGroupID>GroupA</TaxGroupID>
> <TaxExemptCode>CodeA</TaxExemptCode>
> <TaxAmount>1</TaxAmount>
> </TaxInformation>
> <ItemFranchisees>
> <item action="" franchiseeId="" franchiseeName=""/>
> </ItemFranchisees>
> <Description></Description>
> <Color>MyDescription</Color>
> <IntendedIndustry></IntendedIndustry>
> <ItemImageUrl></ItemImageUrl>
> </row_tag>
> </root_tag> {code}
> h1. Current workaround I put into Production
> As it looks like spark-xml is having a hard time when non-empty and empty string columns are separated by non-string ones (e.g., a nested struct or array column) I programmatically move all string columns at the end of the DataFrame right before the write command executes.
> Not only that, I add a "fake" string column before each and every string column ("Col1 AS FAKE_Col1") as it also looks like spark-xml is misplacing ahead of 1 tag the content of string columns when writing the xml. And, of course, I have to read back the xml file and get rid of all these "fake" tags before I can feed it into the downward process.
>
> Thanks!
> ~Giuseppe Ceravolo
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org