Posted to issues@commons.apache.org by "Liyu Yi (JIRA)" <ji...@apache.org> on 2012/10/27 02:09:14 UTC
[jira] [Created] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Liyu Yi created IO-354:
--------------------------
Summary: Commons IO Tailer does not respect UTF-8 Charset
Key: IO-354
URL: https://issues.apache.org/jira/browse/IO-354
Project: Commons IO
Issue Type: Bug
Components: Utilities
Affects Versions: 2.3
Environment: JDK 7
RHEL Linux
Apache Commons IO version 2.4
Reporter: Liyu Yi
I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". The current implementation does not work for files in a multi-byte encoding such as UTF-8. See the following snippet:
448     private long readLines(RandomAccessFile reader) throws IOException {
449         StringBuilder sb = new StringBuilder();
450
451         long pos = reader.getFilePointer();
452         long rePos = pos; // position to re-read
453
454         int num;
455         boolean seenCR = false;
456         while (run && ((num = reader.read(inbuf)) != -1)) {
457             for (int i = 0; i < num; i++) {
458                 byte ch = inbuf[i];
459                 switch (ch) {
460                 case '\n':
461                     seenCR = false; // swallow CR before LF
462                     listener.handle(sb.toString());
463                     sb.setLength(0);
464                     rePos = pos + i + 1;
465                     break;
466                 case '\r':
467                     if (seenCR) {
468                         sb.append('\r');
469                     }
470                     seenCR = true;
471                     break;
472                 default:
473                     if (seenCR) {
474                         seenCR = false; // swallow final CR
475                         listener.handle(sb.toString());
476                         sb.setLength(0);
477                         rePos = pos + i + 1;
478                     }
479                     sb.append((char) ch); // add character, not its ascii value
480                 }
481             }
482
483             pos = reader.getFilePointer();
484         }
485
486         reader.seek(rePos); // Ensure we can re-read if necessary
487         return rePos;
488     }
At line 479, casting each byte straight to char breaks the encoding: every byte of a multi-byte sequence becomes a separate char instead of being decoded with the file's charset.
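To make the effect of the cast at line 479 concrete, here is a minimal, self-contained sketch. The class name ByteCastDemo and the method castEachByte are made up for illustration and are not part of Commons IO; the method only mimics the per-byte cast, under the assumption that the file content is UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class ByteCastDemo {

    /** Mimics line 479: appends each byte as its own char, ignoring the charset. */
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // byte -> char sign-extends first, so 0xC3 becomes '\uFFC3'
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // the last letter takes two bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String garbled = castEachByte(utf8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(utf8.length);              // 5 bytes for 4 characters
        System.out.println(garbled.length());         // 5 chars, one per byte
        System.out.println(garbled.equals(original)); // false
        System.out.println(decoded.equals(original)); // true
    }
}
```

Pure ASCII content is unaffected because every ASCII byte is non-negative and maps to the same char value, which is probably why the problem only shows up with multi-byte text.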
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ]
Liyu Yi edited comment on IO-354 at 10/27/12 12:10 AM:
-------------------------------------------------------
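As a workaround, I used a "hacky" fix in my handler class that rebuilds each String with the right encoding: it narrows every char back to the byte it came from and decodes those bytes as UTF-8.
// Workaround: undo Tailer's per-byte (char) cast, then decode as UTF-8.
// Needs: import java.nio.charset.StandardCharsets; (JDK 7+)
private String rebuildUTF8String(String line) {
    int len = line.length();
    byte[] bytes = new byte[len];
    for (int i = 0; i < len; i++) {
        bytes[i] = (byte) line.charAt(i); // recover the original byte
    }
    return new String(bytes, StandardCharsets.UTF_8);
}
However, the right approach is to pass the encoding in through the "create" method and handle it inside the Tailer.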
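One way the charset-aware handling could look is sketched below. This is only an illustration under stated assumptions, not the Commons IO API or an actual patch: CharsetLineSplitter and feed are hypothetical names, and the byte-level line splitting relies on the fact that in UTF-8 the bytes 0x0A and 0x0D never occur inside a multi-byte sequence (this would not hold for UTF-16, for example). The idea is to collect raw bytes per line and decode them with the given Charset only when the line ends.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Commons IO: splits raw bytes into decoded lines. */
final class CharsetLineSplitter {
    private final Charset charset;
    private final ByteArrayOutputStream lineBuf = new ByteArrayOutputStream();
    private boolean seenCR = false;

    CharsetLineSplitter(Charset charset) {
        this.charset = charset;
    }

    /** Feeds one chunk of file bytes; returns every line completed by this chunk. */
    List<String> feed(byte[] buf, int len) {
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < len; i++) {
            byte b = buf[i];
            switch (b) {
            case '\n':
                seenCR = false; // swallow CR before LF
                lines.add(flush());
                break;
            case '\r':
                if (seenCR) {
                    lineBuf.write('\r');
                }
                seenCR = true;
                break;
            default:
                if (seenCR) {
                    seenCR = false; // a lone CR also ends a line
                    lines.add(flush());
                }
                lineBuf.write(b); // keep the raw byte; decode only at end of line
            }
        }
        return lines;
    }

    /** Decodes the buffered bytes with the configured charset and clears the buffer. */
    private String flush() {
        String line = new String(lineBuf.toByteArray(), charset);
        lineBuf.reset();
        return line;
    }

    public static void main(String[] args) {
        byte[] data = "h\u00e9llo\nw\u00f6rld\r\n".getBytes(StandardCharsets.UTF_8);
        CharsetLineSplitter splitter = new CharsetLineSplitter(StandardCharsets.UTF_8);

        // Cut the input in the middle of a two-byte character to mimic a buffer boundary.
        List<String> lines = new ArrayList<String>(
                splitter.feed(Arrays.copyOfRange(data, 0, 2), 2));
        lines.addAll(splitter.feed(Arrays.copyOfRange(data, 2, data.length), data.length - 2));

        System.out.println(lines.size());                        // 2
        System.out.println(lines.get(0).equals("h\u00e9llo"));   // true
        System.out.println(lines.get(1).equals("w\u00f6rld"));   // true
    }
}
```

Because bytes are buffered until the line terminator, a character whose bytes straddle two reads is still decoded correctly, which a per-chunk decode would not guarantee.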
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
> Key: IO-354
> URL: https://issues.apache.org/jira/browse/IO-354
> Project: Commons IO
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 2.3
> Environment: JDK 7
> RHEL Linux
> Apache Commons IO version 2.4
> Reporter: Liyu Yi
> Labels: Charset, Encoding, Tailer
[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Posted by "Gary Gregory (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485535#comment-13485535 ]
Gary Gregory commented on IO-354:
---------------------------------
Feel free to provide a patch! :)
[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489272#comment-13489272 ]
Liyu Yi commented on IO-354:
----------------------------
OK, I'll give it a shot, as a way of giving back to the community. I hope the process is a smooth one :-)
[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ]
Liyu Yi commented on IO-354:
----------------------------
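As a workaround, I used a "hacky" fix in my handler class that rebuilds each String with the right encoding: it narrows every char back to the byte it came from and decodes those bytes as UTF-8.
// Workaround: undo Tailer's per-byte (char) cast, then decode as UTF-8.
// Needs: import java.nio.charset.StandardCharsets; (JDK 7+)
private String rebuildUTF8String(String line) {
    int len = line.length();
    byte[] bytes = new byte[len];
    for (int i = 0; i < len; i++) {
        bytes[i] = (byte) line.charAt(i); // recover the original byte
    }
    return new String(bytes, StandardCharsets.UTF_8);
}
However, the right approach is to pass the encoding in through the "create" method and handle it inside the Tailer.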
[jira] [Updated] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liyu Yi updated IO-354:
-----------------------
Description:
I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". The current implementation does not work for files in a multi-byte encoding such as UTF-8. See the following snippet:
448     private long readLines(RandomAccessFile reader) throws IOException {
449         StringBuilder sb = new StringBuilder();
450
451         long pos = reader.getFilePointer();
452         long rePos = pos; // position to re-read
453
454         int num;
455         boolean seenCR = false;
456         while (run && ((num = reader.read(inbuf)) != -1)) {
457             for (int i = 0; i < num; i++) {
458                 byte ch = inbuf[i];
459                 switch (ch) {
460                 case '\n':
461                     seenCR = false; // swallow CR before LF
462                     listener.handle(sb.toString());
463                     sb.setLength(0);
464                     rePos = pos + i + 1;
465                     break;
466                 case '\r':
467                     if (seenCR) {
468                         sb.append('\r');
469                     }
470                     seenCR = true;
471                     break;
472                 default:
473                     if (seenCR) {
474                         seenCR = false; // swallow final CR
475                         listener.handle(sb.toString());
476                         sb.setLength(0);
477                         rePos = pos + i + 1;
478                     }
479                     sb.append((char) ch); // add character, not its ascii value
480                 }
481             }
482
483             pos = reader.getFilePointer();
484         }
485
486         reader.seek(rePos); // Ensure we can re-read if necessary
487         return rePos;
488     }
At line 479, casting each byte straight to char breaks the encoding: every byte of a multi-byte sequence becomes a separate char instead of being decoded with the file's charset.