Posted to dev@cxf.apache.org by Tom Eastmond <to...@gmail.com> on 2011/05/02 18:35:54 UTC

FileUtils.getStringFromFile issue when using XML

I was using the FileUtils.getStringFromFile() method for some Camel
testing and was receiving a SAXParseException: The processing
instruction target matching "[xX][mM][lL]" is not allowed.

It turns out that this was due to the
FileUtils.normalizeCRLF() method, which replaces whitespace characters
(\s) with two spaces. This method prepends leading spaces to the
contents (before the <?xml version="1.0" encoding="UTF-8"?> in this
case), which chokes the XML parser. Would it be feasible to forgo the
leading spaces at the start of a file in order to avoid this issue?
I'd be happy to submit a test case/patch if this seems like a valid
bug/fix. Please let me know if I should use another forum for this
request.

Thanks for the excellent work,

Tom Eastmond
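
The failure described above can be reproduced without FileUtils at all.
The following is a minimal sketch, assuming only the JDK's built-in DOM
parser; the class name LeadingSpaceDemo and the method isWellFormed are
hypothetical names chosen for illustration. It checks the same XML text
with and without a single leading space, and the copy with the leading
space should be rejected because the XML declaration is no longer at the
very start of the input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LeadingSpaceDemo {

    /** Returns true if the JDK's default DOM parser accepts the text as well-formed XML. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // SAXParseException is a SAXException; this is the failure from the report.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what this sketch is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";
        System.out.println("no leading space:  " + isWellFormed(xml));
        System.out.println("one leading space: " + isWellFormed(" " + xml));
    }
}
```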

Re: FileUtils.getStringFromFile issue when using XML

Posted by Tom Eastmond <to...@gmail.com>.
Ok, thanks for the feedback. I'll look into the suggestions that you offered.

Thanks again,
Tom

On Wed, May 4, 2011 at 1:17 AM, Aki Yoshida <el...@googlemail.com> wrote:
> Hi Tom,
> I think the problem with this method is that it adds an extra
> space at the beginning. If the file content is XML and it starts
> with the XML declaration, there will be an extra space in front of the
> declaration, which violates well-formedness.
>
> You can create a JIRA issue for this particular bug. But this will not
> really help you in the long run. I will explain the reason below.
>
> As I understand your use case, you want to use this method for reading
> an XML file and creating its java string representation in your
> application.  As I see this method, it doesn't look like it was really
> meant to be used for such purposes. Furthermore, it seems that this
> class is only used in some unit test classes for performing a simple
> content comparison.
>
> For your particular use case, you need to take care of the character
> encoding and possibly the newline handling. This FileUtils method
> ignores the encoding of the file.  If the file is using the UTF-8
> encoding, you need to read the stream and convert it into a Java String
> using the UTF-8 encoding. If it is in some other encoding like UTF-16,
> ISO-8859-1, etc., you need to use that encoding for conversion.
> Otherwise, you will have a corrupted String for some characters.
> Regarding the newline handling, this method currently removes all the
> CR/LFs. This is probably okay for the existing test use cases, but for
> your use case, you may want to either preserve the new line characters
> or to normalize them using the standard XML rule. So, there will be
> some other issues you will encounter if you use this simple method.
>
> Therefore, I would recommend that you not use this FileUtils method and
> instead use an alternative approach that uses the XML parser to convert a
> file for further processing (e.g., using InputSource to work on the
> Source or XMLUtils.parse() to work on the Document).
>
> Regards, Aki

Re: FileUtils.getStringFromFile issue when using XML

Posted by Aki Yoshida <el...@googlemail.com>.
Hi Tom,
I think the problem with this method is that it adds an extra
space at the beginning. If the file content is XML and it starts
with the XML declaration, there will be an extra space in front of the
declaration, which violates well-formedness.

You can create a JIRA issue for this particular bug. But this will not
really help you in the long run. I will explain the reason below.

As I understand your use case, you want to use this method for reading
an XML file and creating its java string representation in your
application.  As I see this method, it doesn't look like it was really
meant to be used for such purposes. Furthermore, it seems that this
class is only used in some unit test classes for performing a simple
content comparison.

For your particular use case, you need to take care of the character
encoding and possibly the newline handling. This FileUtils method
ignores the encoding of the file.  If the file is using the UTF-8
encoding, you need to read the stream and convert it into a Java String
using the UTF-8 encoding. If it is in some other encoding like UTF-16,
ISO-8859-1, etc., you need to use that encoding for conversion.
Otherwise, you will have a corrupted String for some characters.
Regarding the newline handling, this method currently removes all the
CR/LFs. This is probably okay for the existing test use cases, but for
your use case, you may want to either preserve the new line characters
or to normalize them using the standard XML rule. So, there will be
some other issues you will encounter if you use this simple method.

Therefore, I would recommend that you not use this FileUtils method and
instead use an alternative approach that uses the XML parser to convert a
file for further processing (e.g., using InputSource to work on the
Source or XMLUtils.parse() to work on the Document).

Regards, Aki
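
The two points above (decoding with the file's real encoding, and
handing the file to an XML parser) might look roughly like the sketch
below. It uses only JDK classes rather than CXF's XMLUtils, and the
class name XmlFileReading and its methods are hypothetical names for
illustration. The parse() method gives the parser a byte stream through
an InputSource, so the parser can take the encoding from the XML
declaration itself; readAsString() is for the case where a String is
really needed and the caller knows the charset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlFileReading {

    /** Decodes the whole file with the charset the caller names; nothing is normalized. */
    public static String readAsString(Path file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses from bytes, so the parser reads the encoding from the XML declaration. */
    public static Document parse(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            InputSource source = new InputSource(in);
            source.setSystemId(file.toUri().toString());
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse " + file, e);
        }
    }

    /** Helper for the demo only: writes text to a temporary file in the given charset. */
    public static Path writeTemp(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("sample", ".xml");
            file.toFile().deleteOnExit();
            return Files.write(file, text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Jos\u00e9</name>";
        Path file = writeTemp(xml, StandardCharsets.ISO_8859_1);
        // The parser honours the declared encoding.
        System.out.println(parse(file).getDocumentElement().getTextContent());
        // Decoding with the right charset round-trips; the wrong one corrupts the text.
        System.out.println(readAsString(file, StandardCharsets.ISO_8859_1).equals(xml));
        System.out.println(readAsString(file, StandardCharsets.UTF_8).equals(xml));
    }
}
```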

2011/5/3 Tom Eastmond <to...@gmail.com>:
> That would be great to get this fixed - should I create a defect? I'd
> also love to not have it replace a single space with 2 spaces since
> that has caught me by surprise in my testing as well. Let me know what
> you'd like me to do.
>
> Thanks again,
> Tom Eastmond

Re: FileUtils.getStringFromFile issue when using XML

Posted by Tom Eastmond <to...@gmail.com>.
That would be great to get this fixed - should I create a defect? I'd
also love to not have it replace a single space with 2 spaces since
that has caught me by surprise in my testing as well. Let me know what
you'd like me to do.

Thanks again,
Tom Eastmond

On Tue, May 3, 2011 at 6:19 AM, Aki Yoshida <el...@googlemail.com> wrote:
> Sorry,
> I realized this method has actually nothing to do with XML.
> please ignore my comments on XML normalization.
> regards, aki

Re: FileUtils.getStringFromFile issue when using XML

Posted by Aki Yoshida <el...@googlemail.com>.
Sorry,
I realized this method has actually nothing to do with XML.
please ignore my comments on XML normalization.
regards, aki

2011/5/3 Aki Yoshida <el...@googlemail.com>:
> Hi,
> you are right. The normalizeCRLF() method should not add an extra
> space at the beginning. We can fix this particular issue.
>
> But there is one open question, as the exact purpose (use case) of
> this method is not clear to me. Why do we need this normalization
> method that just removes all the CRs and LFs and replaces each
> space/tab character with a single space, and why is this method
> automatically called in FileUtils.getStringFromFile()?
>
> Does anyone else want other normalization options, such as
> the standard XML white space "ignore" handling or the
> end-of-line handling (i.e., replacing each CRLF pair with a single LF)?
>
> Regards, aki

Re: FileUtils.getStringFromFile issue when using XML

Posted by Aki Yoshida <el...@googlemail.com>.
Hi,
you are right. The normalizeCRLF() method should not add an extra
space at the beginning. We can fix this particular issue.

But there is one open question, as the exact purpose (use case) of
this method is not clear to me. Why do we need this normalization
method that just removes all the CRs and LFs and replaces each
space/tab character with a single space, and why is this method
automatically called in FileUtils.getStringFromFile()?

Does anyone else want other normalization options, such as
the standard XML white space "ignore" handling or the
end-of-line handling (i.e., replacing each CRLF pair with a single LF)?

Regards, aki
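
For comparison, the two kinds of normalization mentioned above could be
sketched as follows. This is not the FileUtils implementation; the class
name WhitespaceNormalization and both method names are hypothetical.
normalizeLineEndings() follows the XML 1.0 end-of-line rule (a CRLF pair,
or a CR not followed by LF, becomes a single LF), and
collapseWhitespace() shows one way to flatten whitespace for a simple
content comparison without leaving a space at the beginning.

```java
public class WhitespaceNormalization {

    /** XML-style end-of-line handling: each CRLF pair or lone CR becomes one LF. */
    public static String normalizeLineEndings(String text) {
        // Replace the pairs first so a CRLF does not turn into two LFs.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Trims both ends, then collapses every run of whitespace to a single space. */
    public static String collapseWhitespace(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String raw = "\r\n<?xml version=\"1.0\"?>\r\n\t<root/>\r\n";
        System.out.println(collapseWhitespace(raw));
        System.out.println(normalizeLineEndings("a\r\nb\rc").equals("a\nb\nc"));
    }
}
```

Trimming before collapsing is the design choice that matters for this
thread: the result starts with "<?xml" rather than with a space, so it
can still be handed to an XML parser.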

2011/5/2 Tom Eastmond <to...@gmail.com>:
> I was using the FileUtils.getStringFromFile() method for some Camel
> testing and was receiving a SAXParseException: The processing
> instruction target matching "[xX][mM][lL]" is not allowed.
>
> It turns out that this was due to the
> FileUtils.normalizeCRLF() method, which replaces whitespace characters
> (\s) with two spaces. This method prepends leading spaces to the
> contents (before the <?xml version="1.0" encoding="UTF-8"?> in this
> case), which chokes the XML parser. Would it be feasible to forgo the
> leading spaces at the start of a file in order to avoid this issue?
> I'd be happy to submit a test case/patch if this seems like a valid
> bug/fix. Please let me know if I should use another forum for this
> request.
>
> Thanks for the excellent work,
>
> Tom Eastmond
>