Posted to issues@camel.apache.org by "Robert Half (JIRA)" <ji...@apache.org> on 2017/11/09 12:24:02 UTC
[jira] [Comment Edited] (CAMEL-11846) xtokenize and apply xslt to a string does not work with UTF-16BE

    [ https://issues.apache.org/jira/browse/CAMEL-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245555#comment-16245555 ] 

Robert Half edited comment on CAMEL-11846 at 11/9/17 12:23 PM:
---------------------------------------------------------------

Hi Viral,

I have a workaround for now: I wrap the InputStream in a BufferedInputStream so that I can reset it later and don't need to open the file twice. I pass the stream to an XMLStreamReader, which reports the encoding after reading the XML prolog. Then I set that encoding for Camel in the Exchange.CHARSET_NAME header:


{code:java}
EncodingUtil.DetectedEncodingStream detectedEncodingStream =
        EncodingUtil.detectEncoding(inputStream, new StaxConverter().getInputFactory());
inputStream = detectedEncodingStream.inputStream;
exchange.getIn().setHeader(Exchange.CHARSET_NAME, detectedEncodingStream.encoding);
{code}


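{code:java}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class EncodingUtil {

    /** The rewound stream together with the encoding declared in its XML prolog. */
    public static class DetectedEncodingStream {
        public InputStream inputStream;
        public String encoding;

        public DetectedEncodingStream(InputStream inputStream, String encoding) {
            this.inputStream = inputStream;
            this.encoding = encoding;
        }
    }

    // Upper bound on the bytes the parser may consume before reset() stops working.
    private static final int MAX_REWINDABLE_STREAM_BUFFER = 2 * 4196;

    public static final Logger LOGGER = LoggerFactory.getLogger(EncodingUtil.class);

    public static DetectedEncodingStream detectEncoding(InputStream inputStream, XMLInputFactory xmlInputFactory) {
        final BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream, MAX_REWINDABLE_STREAM_BUFFER);
        bufferedInputStream.mark(MAX_REWINDABLE_STREAM_BUFFER);
        String encoding = null;
        XMLStreamReader xmlStreamReader = null;
        try {
            // Passing an InputStream (not a Reader) lets the parser detect the encoding itself.
            xmlStreamReader = xmlInputFactory.createXMLStreamReader(bufferedInputStream);
            // Read the declared encoding before the reader is closed.
            encoding = xmlStreamReader.getCharacterEncodingScheme();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        } finally {
            try {
                // Rewind so that the caller can read the document from the start.
                bufferedInputStream.reset();
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                // The reader is still null if createXMLStreamReader failed.
                if (xmlStreamReader != null) {
                    try {
                        xmlStreamReader.close();
                    } catch (XMLStreamException e) {
                        throw new RuntimeException("Failed to close XMLStreamReader", e);
                    }
                }
            }
        }

        if (encoding == null) {
            // No encoding declared in the prolog: fall back to the XML default.
            encoding = StandardCharsets.UTF_8.name();
        }
        return new DetectedEncodingStream(bufferedInputStream, encoding);
    }
}
{code}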
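
The mark/reset idea can also be tried outside Camel with plain JDK classes. The following is only a minimal sketch under some assumptions: Java 9+ (for InputStream.readAllBytes), a StAX implementation that parses the XML declaration when the reader is created, and a document small enough that the parser stays within the mark limit. RewindDemo, peekEncoding and decodeWithDeclaredEncoding are hypothetical names made up for this illustration; they are not part of Camel or of EncodingUtil.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of Camel.
final class RewindDemo {

    // Assumed limit: the parser must not read more than this before reset().
    static final int PEEK_LIMIT = 8192;

    /** Returns the encoding declared in the prolog and rewinds the stream to where it was. */
    static String peekEncoding(BufferedInputStream in) throws IOException, XMLStreamException {
        in.mark(PEEK_LIMIT);
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                String declared = reader.getCharacterEncodingScheme();
                // No declaration: assume the XML default.
                return declared != null ? declared : StandardCharsets.UTF_8.name();
            } finally {
                reader.close(); // does not close the underlying stream
            }
        } finally {
            in.reset(); // the next read starts at the marked position again
        }
    }

    /** Decodes the whole document with the encoding found in its own prolog. */
    static String decodeWithDeclaredEncoding(byte[] xmlBytes) throws IOException, XMLStreamException {
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(xmlBytes), PEEK_LIMIT);
        String encoding = peekEncoding(in);
        return new String(in.readAllBytes(), Charset.forName(encoding));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16BE\"?><root>hello</root>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(decodeWithDeclaredEncoding(bytes));
    }
}
```

In a Camel route, the value returned by peekEncoding would be the one to put into the Exchange.CHARSET_NAME header.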



> xtokenize and apply xslt to a string does not work  with UTF-16BE
> -----------------------------------------------------------------
>
>                 Key: CAMEL-11846
>                 URL: https://issues.apache.org/jira/browse/CAMEL-11846
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-core
>    Affects Versions: 2.17.5
>            Reporter: Robert Half
>
> In XML, the encoding is often declared inside the <?xml ...?> declaration. In general you cannot read the declaration without knowing the encoding, but XML parsers can detect several encodings, which allows them to read it. With that information they can read the whole file without knowing the charset in the first place.
> xtokenize and xslt use XMLInputFactory#createXMLStreamReader(Reader). But by providing a Reader, Camel tells the parser that it already knows the encoding, so the XML parser does not detect it.
> Camel also sets the charset to UTF-8 if none is provided in a header. This makes the underlying reader fail when reading UTF-16.
> Using XMLInputFactory#createXMLStreamReader(InputStream) inside XMLTokenExpressionIterator works (tried in a patch). But the next xslt step fails again because it also uses a Reader.
> See the Stack Overflow question for reference:
> [https://stackoverflow.com/questions/46322376/apache-camel-to-handle-encoding-declared-in-xml-file]
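
The difference between the two createXMLStreamReader overloads described above can be reproduced with the JDK's StAX API alone. This is a minimal sketch rather than Camel code: StreamVsReaderDemo and its methods are hypothetical names, and the UTF-8 Reader only imitates Camel's default charset as described in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of Camel.
final class StreamVsReaderDemo {

    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    /** InputStream overload: the parser detects the encoding from the bytes and the prolog. */
    static String rootTextFromStream(byte[] xmlBytes) throws XMLStreamException {
        XMLStreamReader reader = FACTORY.createXMLStreamReader(new ByteArrayInputStream(xmlBytes));
        try {
            return rootText(reader);
        } finally {
            reader.close();
        }
    }

    /** Reader overload: the caller has already chosen a charset, so the parser cannot correct it. */
    static String rootTextFromReader(byte[] xmlBytes, Charset assumed) throws XMLStreamException {
        Reader chars = new InputStreamReader(new ByteArrayInputStream(xmlBytes), assumed);
        XMLStreamReader reader = FACTORY.createXMLStreamReader(chars);
        try {
            return rootText(reader);
        } finally {
            reader.close();
        }
    }

    // Advances to the first start element and returns its text content.
    private static String rootText(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader.getElementText();
            }
        }
        throw new XMLStreamException("no root element found");
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"UTF-16BE\"?><root>hello</root>"
                .getBytes(StandardCharsets.UTF_16BE);
        System.out.println("InputStream: " + rootTextFromStream(doc));
        try {
            rootTextFromReader(doc, StandardCharsets.UTF_8);
        } catch (XMLStreamException e) {
            // UTF-16BE bytes decoded as UTF-8 are not well-formed XML.
            System.out.println("Reader with UTF-8 failed: " + e.getMessage());
        }
    }
}
```

Passing the right charset to rootTextFromReader (here UTF-16BE) parses fine, which is why detecting the encoding first and setting Exchange.CHARSET_NAME works as a workaround.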



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)