Posted to dev@camel.apache.org by fedd <fe...@hotmail.com> on 2016/03/05 21:39:30 UTC

A possible bug in IOConverter with Win-1251 charset

Hi, I believe I have found a bug. It is recommended to discuss it on the
forum before posting to Jira, so here it is.

I found it impossible to unmarshal a Win-1251 CSV file with Cyrillic
strings, at least on a machine where the JVM default charset is Win-1251. I
dug into the code and saw that IOConverter subclasses InputStream with this
read method:

                @Override
                public int read() throws IOException {
                    if (bufferBytes == null || bufferBytes.remaining() <= 0) {
                        bufferedChars.clear();
                        int len = reader.read(bufferedChars);
                        bufferedChars.flip();
                        if (len == -1) {
                            return -1;
                        }
                        bufferBytes = defaultStreamCharset.encode(bufferedChars);
                    }
                    return bufferBytes.get();
                }

I tried to find out why you are converting the character buffer to a byte
buffer when you have chars and need to return integers. It may work for other
languages, but it doesn't work for Russian, where the character "ya" has the
code FF in Win-1251, an encoding invented by Microsoft and the most
widespread encoding in Russia, Ukraine and some other countries that use
Cyrillic letters. (And "ya" is a very frequent character :)

As a signed byte, FF is -1. Callers of this read() method treat -1 as the
EOF signal, so reading stops at a legitimate Cyrillic letter.
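
The sign problem described above can be reproduced outside Camel. The
following is a minimal sketch, not the IOConverter code: the class name
Win1251EofDemo and its two helper methods are made up for illustration, and
it assumes the JVM ships the windows-1251 charset. It shows that "ya"
(U+044F) encodes to the single byte FF, that ByteBuffer.get() hands it back
as a signed byte, and that masking with 0xFF yields the 0..255 value that
InputStream.read() is specified to return.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class Win1251EofDemo {

    // Encodes the Cyrillic letter "ya" (U+044F) in windows-1251 and returns
    // the raw byte exactly as ByteBuffer.get() delivers it (signed).
    static byte encodeYa() {
        Charset cp1251 = Charset.forName("windows-1251");
        ByteBuffer bytes = cp1251.encode(CharBuffer.wrap("\u044F"));
        return bytes.get();
    }

    // Converts a signed byte to the unsigned 0..255 range, so that a data
    // byte can never be mistaken for the -1 end-of-stream marker.
    static int asStreamValue(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte raw = encodeYa();
        System.out.println(raw);                // prints -1
        System.out.println(asStreamValue(raw)); // prints 255
    }
}
```

Under that assumption, one possible fix would be to return
bufferBytes.get() & 0xFF from read(), which keeps the encode step but no
longer collides with the EOF marker.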

Probably you have some reasoning... but I would totally omit this "encode"
part and return the next integer right out of the "bufferedChars" buffer.

What do you think?

Regards,
Fyodor





--
View this message in context: http://camel.465427.n5.nabble.com/A-possible-bug-in-IOConverter-with-Win-1251-charset-tp5778665.html
Sent from the Camel Development mailing list archive at Nabble.com.

Re: A possible bug in IOConverter with Win-1251 charset

Posted by fedd <fe...@hotmail.com>.

Hi Claus,

Thank you for your attention. I am sorry I didn't have a chance to do it.

I understand what you are saying, and I agree. I would love to contribute
to this great project, but unfortunately I can't immediately meet the
requirement of providing a unit test and a pull request, simply because I
don't have enough experience with JUnit, git and open source contributions
in general, let alone to a serious project that the whole world is using. I
have no idea how to write unit tests; I will have to learn, and it will take
time. The only thing I could provide in the near future was the above
textual description of my findings and a couple of CSV files, but I
understood that this was not enough.

I have actually moved forward and am not currently experiencing this
particular problem, because I wrote my own Unicode-codepoint-aware
reader-splitter, which apparently bypasses the IOConverter; see the second
half of my comment at http://stackoverflow.com/a/35844207/499377 . This is my
side project, and I can only hope that it will grow into something, exploit
the full power of Camel and be maintained by professionals. Currently I
am experiencing some strange behaviour in Hibernate, and it takes all my
brain power on weekends and evenings =)

Regards,
Fyodor



--
View this message in context: http://camel.465427.n5.nabble.com/A-possible-bug-in-IOConverter-with-Win-1251-charset-tp5778665p5778983.html

Re: A possible bug in IOConverter with Win-1251 charset

Posted by Claus Ibsen <cl...@gmail.com>.

Hi

Did you get a chance to work on this? We are working on releasing
Camel 2.17, so it's time to step up if you want to have this issue
resolved.

You are in a better position to fix or track down the issue, as you have
the problem and use the Russian locale.



On Wed, Mar 9, 2016 at 11:11 AM, Claus Ibsen <cl...@gmail.com> wrote:
> Hi
>
> Yeah, it would be good if you could try the suggestions from Antoine, and
> if you can reproduce this in a unit test and possibly provide a fix as a
> PR / patch. We love contributions
> http://camel.apache.org/contributing
>
> On Tue, Mar 8, 2016 at 12:53 AM, Antoine Toulme <an...@toulme.name> wrote:
>> What happens is that your default charset is win-1251 while the file is UTF-8.
>>
>> The file is read correctly according to the charset argument passed to the toInputStream method; however, the charset used to parse and send the stream is the JVM default charset.
>>
>> The immediate workaround for you is to add an explicit charset when launching the JVM: -Dfile.encoding=UTF-8
>>
>> I would recommend you go ahead, file a bug and add a simple test case in IOConverterTest around line 83.
>>
>>> On Mar 5, 2016, at 11:05 PM, fedd <fe...@hotmail.com> wrote:
>>>
>>> I made an experiment and saw that the situation is much worse than just
>>> losing one frequent Russian letter.
>>>
>>> I made a UTF-8 file with both Russian text and one German A Umlaut letter,
>>> and Camel was unable to read the German letter, replacing it with a
>>> question mark, just because my Windows dev machine's native charset
>>> happened to be win-1251.
>>>
>>> I don't really think it's okay
>>>
>>> 1) to ever flatten Unicode strings to a single byte character set;
>>>
>>> 2) for the behaviour of server-side code to depend on the host operating
>>> system settings (it becomes non-portable)
>>>
>>> May I file a Jira bug report?
>>>
>>> Here's my route:
>>>
>>>        <dataFormats>
>>>            <json id="jack" library="Jackson" prettyPrint="true"/>
>>>        </dataFormats>
>>>
>>>        <route>
>>>            <from uri="file:///C:/tries/collApp/exchange/in?fileName=registerSampleUtf.csv&amp;charset=UTF-8"/>
>>>            <log message="file: ${body.class.name} ${body}" loggingLevel="WARN"/>
>>>            <unmarshal>
>>>                <csv delimiter=";" useMaps="true"/>
>>>            </unmarshal>
>>>            <log message="unmarshalled: ${body.class.name} ${body}" loggingLevel="WARN"/>
>>>            <marshal ref="jack"/>
>>>            <log message="marshalled: ${body}" loggingLevel="WARN"/>
>>>            <to uri="file:///C:/tries/collApp/exchange/out?fileName=out.json"/>
>>>        </route>
>>>
>>> At the first "log" only the German letter is replaced with a question mark.
>>>
>>> At the second, all Russian letters are replaced with question marks.
>>>
>>> The resulting JSON can't even display the question marks when read in any of
>>> the world's encodings.
>>>
>>> Shall I provide a test CSV file here? (warning: it contains Russian letters)
>>>
>>>
>>>
>>> --
>>> View this message in context: http://camel.465427.n5.nabble.com/A-possible-bug-in-IOConverter-with-Win-1251-charset-tp5778665p5778666.html
>>> Sent from the Camel Development mailing list archive at Nabble.com.
>>
>
>
>
> --
> Claus Ibsen
> -----------------
> http://davsclaus.com @davsclaus
> Camel in Action 2: https://www.manning.com/ibsen2



-- 
Claus Ibsen
-----------------
http://davsclaus.com @davsclaus
Camel in Action 2: https://www.manning.com/ibsen2

Re: A possible bug in IOConverter with Win-1251 charset

Posted by Claus Ibsen <cl...@gmail.com>.

Hi

Yeah, it would be good if you could try the suggestions from Antoine, and
if you can reproduce this in a unit test and possibly provide a fix as a
PR / patch. We love contributions
http://camel.apache.org/contributing




-- 
Claus Ibsen
-----------------
http://davsclaus.com @davsclaus
Camel in Action 2: https://www.manning.com/ibsen2

Re: A possible bug in IOConverter with Win-1251 charset

Posted by Antoine Toulme <an...@toulme.name>.

What happens is that your default charset is win-1251 while the file is UTF-8.

The file is read correctly according to the charset argument passed to the toInputStream method; however, the charset used to parse and send the stream is the JVM default charset.

The immediate workaround for you is to add an explicit charset when launching the JVM: -Dfile.encoding=UTF-8

I would recommend you go ahead, file a bug and add a simple test case in IOConverterTest around line 83.
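
The mismatch described above, reading with one charset and re-encoding with another, can be illustrated with plain JDK calls. This is only a sketch under stated assumptions: roundTrip is a hypothetical helper, not a Camel API, and windows-1251 stands in for the JVM default charset on the reporter's machine. It decodes text with the file charset and then re-encodes it with the stream charset, which is lossless when both are UTF-8 and lossy when the stream charset is windows-1251.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {

    // Hypothetical helper: models a file read with fileCharset whose content
    // is then re-encoded with streamCharset before a consumer decodes it.
    static String roundTrip(String text, Charset fileCharset, Charset streamCharset) {
        byte[] fileBytes = text.getBytes(fileCharset);        // bytes on disk
        String decoded = new String(fileBytes, fileCharset);  // read with the explicit charset
        byte[] streamBytes = decoded.getBytes(streamCharset); // re-encoded for the stream
        return new String(streamBytes, streamCharset);        // what the consumer sees
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        String text = "\u00E4 \u044F"; // a-umlaut, space, Cyrillic "ya"

        // Same charset on both sides: the text survives unchanged.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));

        // windows-1251 has no a-umlaut, so it is replaced with '?'.
        System.out.println("? \u044F".equals(roundTrip(text, StandardCharsets.UTF_8, cp1251)));
    }
}
```

Passing the charset explicitly at every step, instead of relying on Charset.defaultCharset(), is what makes the result independent of the host settings.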


Re: A possible bug in IOConverter with Win-1251 charset

Posted by fedd <fe...@hotmail.com>.

I made an experiment and saw that the situation is much worse than just
losing one frequent Russian letter.

I made a UTF-8 file with both Russian text and one German A Umlaut letter,
and Camel was unable to read the German letter, replacing it with a
question mark, just because my Windows dev machine's native charset
happened to be win-1251.
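
Whether a given character can survive a conversion to a particular charset can be checked up front. The sketch below uses the standard CharsetEncoder.canEncode method; the class name CanEncodeDemo and the helper representable are made up for illustration, and the example assumes the JVM provides the windows-1251 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CanEncodeDemo {

    // Returns true if every character of text has a mapping in the
    // named charset, i.e. encoding would not need a replacement byte.
    static boolean representable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(representable("\u044F", "windows-1251"));  // Cyrillic "ya": true
        System.out.println(representable("\u00E4", "windows-1251"));  // a-umlaut: false
        System.out.println(representable("\u00E4\u044F", "UTF-8"));   // both: true
    }
}
```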

I don't really think it's okay

1) to ever flatten Unicode strings to a single byte character set;

2) for the behaviour of server-side code to depend on the host operating
system settings (it becomes non-portable)

May I file a Jira bug report?

Here's my route:

        <dataFormats>
            <json id="jack" library="Jackson" prettyPrint="true"/>
        </dataFormats>

        <route>
            <from uri="file:///C:/tries/collApp/exchange/in?fileName=registerSampleUtf.csv&amp;charset=UTF-8"/>
            <log message="file: ${body.class.name} ${body}" loggingLevel="WARN"/>
            <unmarshal>
                <csv delimiter=";" useMaps="true"/>
            </unmarshal>
            <log message="unmarshalled: ${body.class.name} ${body}" loggingLevel="WARN"/>
            <marshal ref="jack"/>
            <log message="marshalled: ${body}" loggingLevel="WARN"/>
            <to uri="file:///C:/tries/collApp/exchange/out?fileName=out.json"/>
        </route>

At the first "log" only the German letter is replaced with a question mark.

At the second, all Russian letters are replaced with question marks.

The resulting JSON can't even display the question marks when read in any of
the world's encodings.

Shall I provide a test CSV file here? (warning: it contains Russian letters)



--
View this message in context: http://camel.465427.n5.nabble.com/A-possible-bug-in-IOConverter-with-Win-1251-charset-tp5778665p5778666.html