Posted to issues@nifi.apache.org by "René Zeidler (Jira)" <ji...@apache.org> on 2024/01/25 09:55:00 UTC

[jira] [Created] (NIFI-12669) EvaluateXQuery processor incorrectly encodes result attributes

René Zeidler created NIFI-12669:
-----------------------------------

             Summary: EvaluateXQuery processor incorrectly encodes result attributes
                 Key: NIFI-12669
                 URL: https://issues.apache.org/jira/browse/NIFI-12669
             Project: Apache NiFi
          Issue Type: Bug
          Components: Configuration, Extensions
    Affects Versions: 1.24.0, 2.0.0-M1
         Environment: JVM with non-UTF-8 default encoding (e.g. default Windows installation)
            Reporter: René Zeidler
         Attachments: EvaluateXQuery_Encoding_Bug.json, image-2024-01-25-10-24-17-005.png, image-2024-01-25-10-31-35-200.png

h2. Environment

This issue affects environments where the JVM default encoding is not {{{}UTF-8{}}}. Standard Java installations on Windows are affected, as they usually use the default encoding {{{}windows-1252{}}}. To reproduce the issue on Linux, change the default encoding to {{windows-1252}} by adding the following line to your {{{}bootstrap.conf{}}}:
{quote}{{java.arg.21=-Dfile.encoding=windows-1252}}
{quote}
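To check which charset a JVM actually uses after such a change, a minimal sketch like the following may help. The class name {{DefaultCharsetCheck}} is hypothetical and not part of NiFi; it only reads the JVM default charset and would need to be started with the same {{-Dfile.encoding}} argument to be comparable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NiFi: reports the JVM default charset.
final class DefaultCharsetCheck {

    // Returns a one-line description of the default charset of this JVM.
    static String describe() {
        Charset cs = Charset.defaultCharset();
        boolean utf8 = StandardCharsets.UTF_8.equals(cs);
        return "default charset: " + cs.name() + (utf8 ? " (UTF-8)" : " (not UTF-8)");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```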
h2. Summary

The EvaluateXQuery processor encodes result values incorrectly when storing them in attributes, which garbles non-ASCII characters.
Example:
!image-2024-01-25-10-24-17-005.png!
h2. Steps to reproduce
 # Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
 # Create a GenerateFlowFile processor with the following content:
{quote}{{<?xml version="1.0" encoding="UTF-8"?>}}
{{<myRoot>}}
{{  <myData>This text contains non-ASCII characters: ÄÖÜäöüßéèóò</myData>}}
{{</myRoot>}}
{quote}
 # Connect the processor to an EvaluateXQuery processor.
Set the {{Destination}} to {{{}flowfile-attribute{}}}.
Create a custom property {{myData}} with value {{{}string(/myRoot/myData){}}}.
 # Connect the outputs of the EvaluateXQuery processor to funnels to be able to observe the result in the queue.
 # Start the EvaluateXQuery processor and run the GenerateFlowFile processor once.
The flow should look similar to this:
!image-2024-01-25-10-31-35-200.png!
I also attached a JSON export of the example flow.
 # Observe the attributes of the resulting FlowFile in the queue.

h3. Expected Result

The FlowFile should contain an attribute {{myData}} with the value {{{}"This text contains non-ASCII characters: ÄÖÜäöüßéèóò"{}}}.
h3. Actual Result

The attribute has the value {{{}"This text contains non-ASCII characters: Ã„Ã–ÃœÃ¤Ã¶Ã¼ÃŸÃ©Ã¨Ã³Ã²"{}}}.
h2. Root Cause Analysis

EvaluateXQuery uses the method [{{formatItem}}|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L368-L372] to write the query result to an attribute. This method calls {{{}ByteArrayOutputStream{}}}'s [toString|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/ByteArrayOutputStream.html#toString()] method without an encoding argument, so the bytes are decoded with the default charset of the environment. However, bytes are always written to this output stream as UTF-8 ([.getBytes(StandardCharsets.UTF_8)|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L397]). When the default charset is not UTF-8, the UTF-8 bytes are therefore decoded with a different encoding when they are converted to a string, which produces the garbled text shown above.
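
The mismatch can be reproduced outside NiFi. The following is a minimal sketch, not the NiFi code itself: the class {{ToStringCharsetDemo}} and its {{decode}} method are hypothetical, and the decoding charset is passed explicitly to stand in for the JVM default. It assumes Java 10 or later, because it uses {{ByteArrayOutputStream.toString(Charset)}}.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from NiFi.
final class ToStringCharsetDemo {

    // Writes the text as UTF-8 bytes, then decodes the buffer with the given charset.
    // Passing a non-UTF-8 charset imitates toString() on a JVM with that default.
    static String decode(String text, Charset decodeCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        out.write(utf8, 0, utf8.length);
        return out.toString(decodeCharset);
    }

    public static void main(String[] args) {
        // Unicode escapes for "ÄÖÜäöüß" keep the source file independent of its own encoding.
        String text = "\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc\u00df";

        // Decoding with the charset used for writing round-trips the text: prints true.
        System.out.println(decode(text, StandardCharsets.UTF_8).equals(text));

        // Decoding the same bytes as windows-1252 yields different characters: prints false.
        System.out.println(decode(text, Charset.forName("windows-1252")).equals(text));
    }
}
```

Passing the charset explicitly, as in the UTF-8 call, makes the result independent of the JVM default. Whether an eventual fix in NiFi takes this form is not stated in this report.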



--
This message was sent by Atlassian Jira
(v8.20.10#820010)