Posted to issues@commons.apache.org by "Peter Liu (JIRA)" <ji...@apache.org> on 2013/04/10 02:08:16 UTC
[jira] [Updated] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Liu updated IO-354:
-------------------------
Attachment: Tailer-commonsio-354.patch
The patch includes test cases and was tested on Linux and Windows.
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
> Key: IO-354
> URL: https://issues.apache.org/jira/browse/IO-354
> Project: Commons IO
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 2.3
> Environment: JDK 7
> RHEL Linux
> Apache Commons IO version 2.4
> Reporter: Liyu Yi
> Labels: Charset, Encoding, Tailer
> Attachments: Tailer-commonsio-354.patch
>
>
> I found a defect in the source code of org.apache.commons.io.input.Tailer: the current implementation does not work for files in a multi-byte encoding such as UTF-8. See the following snippet from readLines:
> 448     private long readLines(RandomAccessFile reader) throws IOException {
> 449         StringBuilder sb = new StringBuilder();
> 450
> 451         long pos = reader.getFilePointer();
> 452         long rePos = pos; // position to re-read
> 453
> 454         int num;
> 455         boolean seenCR = false;
> 456         while (run && ((num = reader.read(inbuf)) != -1)) {
> 457             for (int i = 0; i < num; i++) {
> 458                 byte ch = inbuf[i];
> 459                 switch (ch) {
> 460                 case '\n':
> 461                     seenCR = false; // swallow CR before LF
> 462                     listener.handle(sb.toString());
> 463                     sb.setLength(0);
> 464                     rePos = pos + i + 1;
> 465                     break;
> 466                 case '\r':
> 467                     if (seenCR) {
> 468                         sb.append('\r');
> 469                     }
> 470                     seenCR = true;
> 471                     break;
> 472                 default:
> 473                     if (seenCR) {
> 474                         seenCR = false; // swallow final CR
> 475                         listener.handle(sb.toString());
> 476                         sb.setLength(0);
> 477                         rePos = pos + i + 1;
> 478                     }
> 479                     sb.append((char) ch); // add character, not its ascii value
> 480                 }
> 481             }
> 482
> 483             pos = reader.getFilePointer();
> 484         }
> 485
> 486         reader.seek(rePos); // Ensure we can re-read if necessary
> 487         return rePos;
> 488     }
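> At line 479, the cast from byte to char converts each byte on its own, so a multi-byte sequence (for example a UTF-8 encoded non-ASCII character) is split into several wrong characters and the encoding is broken.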
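
The following is a minimal sketch of one possible way to avoid the per-byte cast. It is not the attached patch and not Commons IO code. The class CharsetLineSplitter and its methods are hypothetical names introduced only for illustration. The idea is to collect the raw bytes of each line and decode the whole line with an explicit Charset. The sketch assumes an ASCII-compatible encoding such as UTF-8, where the bytes 0x0A and 0x0D never occur inside a multi-byte sequence; it would not be correct for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Commons IO.
final class CharsetLineSplitter {

    // Mirrors the cast at line 479: every byte becomes its own char,
    // so a two-byte UTF-8 sequence turns into two unrelated chars.
    static String castEachByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Buffers the raw bytes of a line and decodes them together.
    // Line-ending handling follows the snippet above: LF, CR LF, or a
    // lone CR followed by other data ends a line. Bytes after the last
    // line ending are not returned (the Tailer would re-read them later).
    static List<String> decodeLines(byte[] data, Charset charset) {
        List<String> lines = new ArrayList<>();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean seenCR = false;
        for (byte b : data) {
            switch (b) {
            case '\n':
                seenCR = false; // swallow CR before LF
                lines.add(new String(line.toByteArray(), charset));
                line.reset();
                break;
            case '\r':
                if (seenCR) {
                    line.write('\r');
                }
                seenCR = true;
                break;
            default:
                if (seenCR) {
                    seenCR = false; // a lone CR also ends the line
                    lines.add(new String(line.toByteArray(), charset));
                    line.reset();
                }
                line.write(b); // keep the raw byte, decode later
            }
        }
        return lines;
    }
}
```

Under these assumptions, calling CharsetLineSplitter.decodeLines(bytes, StandardCharsets.UTF_8) returns each complete line decoded as UTF-8, while castEachByte shows how the existing cast yields two chars for a single two-byte character.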