Posted to c-dev@xerces.apache.org by "Ken Hansen (Commented) (JIRA)" <xe...@xml.apache.org> on 2011/10/27 19:56:32 UTC

[jira] [Commented] (XERCESC-1936) ICUTransService and IconvGNUTransService CAN NOT deal with huge file.

    [ https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137332#comment-13137332 ] 

Ken Hansen commented on XERCESC-1936:
-------------------------------------

I'm having the same problem. Is there any hope of a patch for this, or a workaround?
                
> ICUTransService and IconvGNUTransService CAN NOT deal with huge file.
> ---------------------------------------------------------------------
>
>                 Key: XERCESC-1936
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1936
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.8.0, 3.1.1
>         Environment: RHEL-5.5
> glibc-2.5-49.el5_5.2
> libicu-3.6-5.11.4
>            Reporter: kirby zhou
>             Fix For: 3.1.2, 3.2.0, 4.0.0
>
>
> When a huge file is passed to XMLReader, it calls the TransService multiple times and splits the file content into several fragments.
> Unfortunately, a fragment can end with an incomplete multi-byte character.
> Neither ICUTransService nor IconvGNUTransService handles this case: ICUTransService does not handle U_TRUNCATED_CHAR_FOUND, and IconvGNUTransService does not handle EINVAL.
> Both 2.8.0 and 3.1.1 have the same bug.
> For example, create two XML files like this:
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<2;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/small.xml
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<100000;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/big.xml
> # small.xml and big.xml have the same structure; only the repeat count differs.
> ]# samples/SAXPrint -x=gbk ~/small.xml 
> <?xml version="1.0" encoding="gbk"?>
> <data>
> 中文汉字A中文汉字A
> </data>
> # with icu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
>   Message: char 0x6C49 is not representable in 'gbk' encoding
> # with iconvgnu
> ]# samples/SAXPrint -x=gbk ~/big.xml 
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
>   Message: invalid multi-byte sequence
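
The EINVAL case named in the issue can be illustrated with plain POSIX iconv(). When the input buffer ends in the middle of a multi-byte character, iconv() returns (size_t)-1 with errno set to EINVAL and leaves the input pointer at the start of the incomplete sequence. The sketch below keeps those leftover bytes and prepends them to the next chunk. ChunkedConverter and its members are hypothetical names made up for this illustration; this is not the Xerces-C++ code and not the patch for XERCESC-1936, only a minimal sketch of the general technique.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Xerces-C++): converts a byte stream that
// arrives in arbitrary chunks. An incomplete character at the end of a chunk
// is carried over to the next call instead of being reported as an error.
class ChunkedConverter {
public:
    ChunkedConverter(const char* toCode, const char* fromCode)
        : cd_(iconv_open(toCode, fromCode)) {
        if (cd_ == (iconv_t)-1)
            throw std::runtime_error("iconv_open failed");
    }
    ~ChunkedConverter() { iconv_close(cd_); }
    ChunkedConverter(const ChunkedConverter&) = delete;
    ChunkedConverter& operator=(const ChunkedConverter&) = delete;

    // Converts one chunk and returns the bytes produced so far.
    std::string feed(const std::string& chunk) {
        // Prepend the bytes left over from the previous chunk.
        std::string input = pending_ + chunk;
        pending_.clear();
        std::string out;
        if (input.empty())
            return out;

        char* in = &input[0];
        size_t inLeft = input.size();
        while (inLeft > 0) {
            char buf[4096];
            char* outPtr = buf;
            size_t outLeft = sizeof buf;
            size_t rc = iconv(cd_, &in, &inLeft, &outPtr, &outLeft);
            int err = errno;  // read errno before any other library call
            out.append(buf, sizeof buf - outLeft);
            if (rc != (size_t)-1)
                break;                       // whole input consumed
            if (err == E2BIG)
                continue;                    // output buffer full, go again
            if (err == EINVAL) {
                pending_.assign(in, inLeft); // incomplete character: keep it
                break;
            }
            throw std::runtime_error("invalid multi-byte sequence");
        }
        return out;
    }

    // True if the last chunk ended in the middle of a character.
    bool hasPending() const { return !pending_.empty(); }

private:
    iconv_t cd_;
    std::string pending_;
};
```

With this sketch, feeding a document in two pieces that split a character should produce the same bytes as feeding it in one piece. Whether an encoding name such as "GBK" is accepted depends on the iconv implementation installed on the platform. If hasPending() is still true after the last chunk, the input really was truncated and the caller should report an error.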
