Posted to issues@commons.apache.org by "Liyu Yi (JIRA)" <ji...@apache.org> on 2012/10/27 02:11:11 UTC
[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ]
Liyu Yi commented on IO-354:
----------------------------
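I used a "hacky" fix in the handler class to rebuild the String with the right encoding.
// Requires: import java.nio.charset.StandardCharsets; (JDK 7+)
// The Tailer widens each raw byte to one char, so narrowing each char
// back to a byte recovers the original bytes, which are then decoded as UTF-8.
private String rebuildUTF8String(String line) {
    int len = line.length();
    byte[] bytes = new byte[len];
    for (int i = 0; i < len; i++) {
        bytes[i] = (byte) line.charAt(i);
    }
    return new String(bytes, StandardCharsets.UTF_8);
}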
However, the right approach is to pass the encoding in to the "create" method and handle it in the Tailer.
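A minimal sketch of what handling the encoding could look like, assuming a hypothetical helper named "decodeLines" (it is not part of Commons IO). It collects the raw bytes of each line and decodes them with the caller-supplied Charset only once the line is complete. For brevity it recognises only LF and CRLF terminators and ignores a trailing partial line. It also relies on the fact that in UTF-8 the bytes 0x0A and 0x0D never occur inside a multi-byte sequence, which would not hold for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Tailer implementation.
class CharsetLineSplitter {

    /** Splits buf[0..num) on LF or CRLF and decodes each complete line with cs. */
    static List<String> decodeLines(byte[] buf, int num, Charset cs) {
        List<String> lines = new ArrayList<String>();
        ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
        for (int i = 0; i < num; i++) {
            byte b = buf[i];
            if (b == '\n') {
                byte[] raw = lineBytes.toByteArray();
                int len = raw.length;
                if (len > 0 && raw[len - 1] == '\r') {
                    len--; // drop the CR of a CRLF pair
                }
                // Decode the whole line at once so multi-byte sequences stay intact.
                lines.add(new String(raw, 0, len, cs));
                lineBytes.reset();
            } else {
                lineBytes.write(b); // keep the raw byte, do not cast to char
            }
        }
        return lines; // bytes after the last newline are ignored in this sketch
    }

    public static void main(String[] args) {
        byte[] data = "h\u00e9llo\r\nw\u00f6rld\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = decodeLines(data, data.length, StandardCharsets.UTF_8);
        System.out.println(lines.size());          // prints 2
        System.out.println(lines.get(0).length()); // prints 5
    }
}
```

A real fix would also need to carry incomplete lines over between reads, as the existing "rePos" logic does.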
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
> Key: IO-354
> URL: https://issues.apache.org/jira/browse/IO-354
> Project: Commons IO
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 2.3
> Environment: JDK 7
> RHEL Linux
> Apache Commons IO version 2.4
> Reporter: Liyu Yi
> Labels: Charset, Encoding, Tailer
>
> I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". The current implementation does not work for files in a multi-byte encoding. See the following snippet:
> 448     private long readLines(RandomAccessFile reader) throws IOException {
> 449         StringBuilder sb = new StringBuilder();
> 450
> 451         long pos = reader.getFilePointer();
> 452         long rePos = pos; // position to re-read
> 453
> 454         int num;
> 455         boolean seenCR = false;
> 456         while (run && ((num = reader.read(inbuf)) != -1)) {
> 457             for (int i = 0; i < num; i++) {
> 458                 byte ch = inbuf[i];
> 459                 switch (ch) {
> 460                 case '\n':
> 461                     seenCR = false; // swallow CR before LF
> 462                     listener.handle(sb.toString());
> 463                     sb.setLength(0);
> 464                     rePos = pos + i + 1;
> 465                     break;
> 466                 case '\r':
> 467                     if (seenCR) {
> 468                         sb.append('\r');
> 469                     }
> 470                     seenCR = true;
> 471                     break;
> 472                 default:
> 473                     if (seenCR) {
> 474                         seenCR = false; // swallow final CR
> 475                         listener.handle(sb.toString());
> 476                         sb.setLength(0);
> 477                         rePos = pos + i + 1;
> 478                     }
> 479                     sb.append((char) ch); // add character, not its ascii value
> 480                 }
> 481             }
> 482
> 483             pos = reader.getFilePointer();
> 484         }
> 485
> 486         reader.seek(rePos); // Ensure we can re-read if necessary
> 487         return rePos;
> 488     }
> At line 479, casting each byte directly to a char breaks the encoding: every byte of a multi-byte sequence becomes a separate char instead of being decoded together.
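The effect of that cast can be reproduced in isolation. The following is a small sketch, with the hypothetical names "ByteCastDemo" and "castBytes", that mimics line 479 on the two UTF-8 bytes of 'é' (0xC3 0xA9). Because a Java byte is signed, each high byte is sign-extended, so the result is two unrelated chars rather than the one intended character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the byte-to-char cast; not Commons IO code.
class ByteCastDemo {

    /** Mimics line 479: widen each byte to a char without decoding. */
    static String castBytes(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // 'é', encoded in UTF-8 as 0xC3 0xA9
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);
        String broken = castBytes(raw);

        System.out.println(raw.length);              // prints 2
        System.out.println(broken.length());         // prints 2
        System.out.println(broken.equals(original)); // prints false
        // Decoding the same bytes with the right charset restores the text.
        System.out.println(new String(raw, StandardCharsets.UTF_8).equals(original)); // prints true
    }
}
```

Narrowing each char back to a byte discards the sign-extended high bits, which is why the per-char cast in the handler workaround recovers the original bytes.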