You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Giuseppe Ceravolo (Jira)" <ji...@apache.org> on 2023/10/04 13:33:00 UTC

[jira] [Updated] (SPARK-45414) spark-xml misplaces string tag content

     [ https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Ceravolo updated SPARK-45414:
--------------------------------------
    Attachment: IllegalArgumentException.txt

> spark-xml misplaces string tag content
> --------------------------------------
>
>                 Key: SPARK-45414
>                 URL: https://issues.apache.org/jira/browse/SPARK-45414
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Giuseppe Ceravolo
>            Priority: Critical
>         Attachments: IllegalArgumentException.txt
>
>
> h1. Intro
> Hi all! Please expect some degree of incompleteness in this issue as this is the very first I post, and feel free to edit it as you like - I welcome your feedback.
> My goal is to provide you with as many details and indications as I can on this issue that I am currently facing with a Client of mine on its Production environment (we use Azure Databricks DBR 11.3 LTS).
> I was told by [Sean Owen|[srowen (Sean Owen) (github.com)|https://github.com/srowen]], who maintains the spark-xml maven repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an issue here because "This code has been ported to Apache Spark now anyway so won't be updated here" (refer to his comment [here|[https://github.com/databricks/spark-xml/issues/431#issuecomment-1744792958]).]
> h1. Issue
> When I write a DataFrame into xml format via the spark-xml library either (1) I get an error if empty string columns are in between non-string nested ones or (2) if I put all string columns at the end then I get a wrong xml where the content of string tags are misplaced into the following ones.
> h1. Code to reproduce the issue
> Please find below the end-to-end code snippet that results into the error
> h2. CASE (1): ERROR
> When empty strings are in between non-string nested ones, the write fails. Please find attached the full trace of the error.
> fake_file_df = spark \
>     .sql(        """SELECT            CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,            CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC,            CAST('' AS STRING) AS _SerialNumberFlag,            CAST('MyDescription' AS STRING) AS Description,            CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,            CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,            CAST('' AS STRING) AS Color,            CAST('' AS STRING) AS IntendedIndustry,            CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer,            CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing,            CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner,            CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary,            CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,            CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,            CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation,            CAST('' AS STRING) AS ItemImageUrl,            CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,            CAST('Add' AS STRING) AS _Action        ;"""    )# fake_file_df.display()fake_file_df \
>     .coalesce(1) \
>     .write \
>     .format('com.databricks.spark.xml') \
>     .option('declaration', 'version="1.0" encoding="UTF-8"') \
>     .option("nullValue", "") \
>     .option('rootTag', "root_tag") \
>     .option('rowTag', "row_tag") \
>     .mode('overwrite') \
>     .save(xml_folder_path)
> I noticed that it works if I try to write all columns up to "Color" (excluded), namely:
> fake_file_df \
>     .select(        "ItemID",        "UPC",        "_SerialNumberFlag",        "Description",        "MerchandiseHierarchy",        "ItemPrice"    ) \
>     .coalesce(1) \
>     .write \
>     .format('com.databricks.spark.xml') \
>     .option('declaration', 'version="1.0" encoding="UTF-8"') \
>     .option("nullValue", "") \
>     .option('rootTag', "root_tag") \
>     .option('rowTag', "row_tag") \
>     .mode('overwrite') \
>     .save(xml_folder_path)
> h2. CASE (2): MISPLACED XML
> When I put all string columns at the end of the 1-row DataFrame it mistakenly writes the content of one column into the tag right after it.
> fake_file_df = spark \
>     .sql(        """SELECT            CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,            CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC,            CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,            CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,            CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer,            CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing,            CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner,            CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary,            CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,            CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,            CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation,            CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,            CAST('' AS STRING) AS _SerialNumberFlag,            CAST('MyDescription' AS STRING) AS Description,            CAST('' AS STRING) AS Color,            CAST('' AS STRING) AS IntendedIndustry,            CAST('' AS STRING) AS ItemImageUrl,            CAST('Add' AS STRING) AS _Action        ;"""    )fake_file_df \
>     .coalesce(1) \
>     .write \
>     .format('com.databricks.spark.xml') \
>     .option('declaration', 'version="1.0" encoding="UTF-8"') \
>     .option("nullValue", "") \
>     .option('rootTag', "root_tag") \
>     .option('rowTag', "row_tag") \
>     .mode('overwrite') \
>     .save(xml_folder_path)
> The output is a wrong xml where "MyDescription" is written inside the "Color" tag instead of the "Description" tag (but if you display the "fake_file_df" DataFrame it looks good as "MyDescription" is under the "Description" column (see image below):).
> <?xml version="1.0" encoding="UTF-8"?>
> <root_tag>
>     <row_tag SerialNumberFlag="" Action="Add">
>         <ItemID Type="ItemId">123</ItemID>
>         <UPC Type="UPC">123</UPC>
>         <MerchandiseHierarchy ID="" Level=""/>
>         <ItemPrice ValueTypeCode="" Value=""/>
>         <Manufacturer>
>             <Name></Name>
>         </Manufacturer>
>         <Marketing>
>             <Season></Season>
>         </Marketing>
>         <BrandOwner Name=""/>
>         <ItemAttribute_culinary Name="Attribute1">
>             <AttributeValue>Value1</AttributeValue>
>         </ItemAttribute_culinary>
>         <ItemAttribute_noculinary Name="">
>             <item>
>                 <AttributeCode></AttributeCode>
>                 <AttributeValue></AttributeValue>
>             </item>
>         </ItemAttribute_noculinary>
>         <ItemMeasurements>
>             <Depth UnitOfMeasure=""></Depth>
>             <Height UnitOfMeasure=""></Height>
>             <Width UnitOfMeasure=""></Width>
>             <Diameter UnitOfMeasure=""></Diameter>
>         </ItemMeasurements>
>         <TaxInformation>
>             <TaxGroupID>GroupA</TaxGroupID>
>             <TaxExemptCode>CodeA</TaxExemptCode>
>             <TaxAmount>1</TaxAmount>
>         </TaxInformation>
>         <ItemFranchisees>
>             <item action="" franchiseeId="" franchiseeName=""/>
>         </ItemFranchisees>
>         <Description></Description>
>         <Color>{color:#FF0000}MyDescription{color}</Color>
>         <IntendedIndustry></IntendedIndustry>
>         <ItemImageUrl></ItemImageUrl>
>     </row_tag>
> </root_tag>
>  
> Thanks!  Giuseppe Ceravolo
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org