Posted to dev@lucene.apache.org by bu...@apache.org on 2004/08/25 14:55:40 UTC

DO NOT REPLY [Bug 30844] New: - demo HTML parser corrupts foreign characters

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=30844>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=30844

           Summary: demo HTML parser corrupts foreign characters
           Product: Lucene
           Version: 1.3
          Platform: Other
               URL: https://bugs.eclipse.org/bugs/show_bug.cgi?id=72552
        OS/Version: All
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Other
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: konradk@ca.ibm.com


We use the demo HTML parser to parse English and other NL documents in 
Eclipse.  After Lucene 1.2, a regression appeared in the parser: characters 
read from the Reader (obtained from getReader()) are corrupted.  Only 
characters that can be encoded in the machine's default encoding come 
through correctly.  For example, parsing a Chinese document on an English 
machine corrupts every character except the few English words.
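
The symptom described above matches what happens when text passes through 
a byte stream using a charset that cannot represent it.  The sketch below 
only illustrates that general effect; it is not the parser's code, and the 
report does not confirm that this is the actual cause.  The class name 
EncodingRoundTrip and the method roundTrip are hypothetical, and 
ISO-8859-1 is an assumed stand-in for the default encoding of an English 
machine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class; not part of Lucene.
class EncodingRoundTrip {

    // Encodes the text to bytes with the given charset, then decodes the
    // bytes back to a String with the same charset.
    static String roundTrip(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            out.write(text);
        }
        StringBuilder result = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(bytes.toByteArray()), charset)) {
            int c;
            while ((c = in.read()) != -1) {
                result.append((char) c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // One English word followed by two Chinese characters.
        String text = "Lucene \u4e2d\u6587";

        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));

        // ISO-8859-1 (assumed stand-in for an English default encoding)
        // cannot represent the Chinese characters; only "Lucene " survives.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

If the text must cross a byte stream at all, naming an explicit charset 
that covers the input (UTF-8 in this sketch) rather than relying on the 
platform default avoids this kind of loss.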

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org