You are viewing a plain text version of this content. The canonical link for it is here.
Posted to c-dev@xerces.apache.org by "Boris Kolpackov (JIRA)" <xe...@xml.apache.org> on 2010/09/07 22:58:33 UTC

[jira] Updated: (XERCESC-1936) ICUTransService and IconvGNUransService CAN NOT deal with huge file.

     [ https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Kolpackov updated XERCESC-1936:
-------------------------------------

    Fix Version/s: 3.1.2
                   3.2.0
                   4.0.0

Yes, I just tried your test with ICU and I get the error. Scheduling this bug for the next release.

> ICUTransService and IconvGNUransService CAN NOT deal with huge file.
> --------------------------------------------------------------------
>
>                 Key: XERCESC-1936
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1936
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.8.0, 3.1.1
>         Environment: RHEL-5.5
> glibc-2.5-49.el5_5.2
> libicu-3.6-5.11.4
>            Reporter: kirby zhou
>             Fix For: 3.1.2, 3.2.0, 4.0.0
>
>
> If a huge file passed to XMLReader, it will call TransService mulitple times, and splite the file content into several fragments.
> Unfortunately, the fragment will contain incomplete multi-byte characters.
> But neither ICUTransService nor IconvGNUransService deal with it. ICUTransService did not deal with U_TRUNCATED_CHAR_FOUND, and IconvGNUransService did not deal with EINVAL.
> Both 2.8.0 and 3.1.1 have the same bug.
> For example, make 2 XML like that:
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<2;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/small.xml
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<100000;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/big.xml
> # the small.xml and big.xml are analogical. 
> ]# samples/SAXPrint -x=gbk ~/small.xml 
> <?xml version="1.0" encoding="gbk"?>
> <data>
> 中文汉字A中文汉字A
> </data>
> # with icu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
>   Message: char 0x6C49 is not representable in 'gbk' encoding
> # with iconvgnu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> ]# samples/SAXPrint -x=gbk ~/big.xml 
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
>   Message: invalid multi-byte sequence

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org