Posted to issues@commons.apache.org by "Liyu Yi (JIRA)" <ji...@apache.org> on 2012/10/27 02:09:14 UTC

[jira] [Created] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Liyu Yi created IO-354:
--------------------------

             Summary: Commons IO Tailer does not respect UTF-8 Charset
                 Key: IO-354
                 URL: https://issues.apache.org/jira/browse/IO-354
             Project: Commons IO
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 2.3
         Environment: JDK 7, RHEL Linux, Apache Commons IO version 2.4
            Reporter: Liyu Yi


I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". Basically, the current implementation does not work for multi-byte encoded files. See the following snippet,

448    private long readLines(RandomAccessFile reader) throws IOException {
449        StringBuilder sb = new StringBuilder();
450
451        long pos = reader.getFilePointer();
452        long rePos = pos; // position to re-read
453
454        int num;
455        boolean seenCR = false;
456        while (run && ((num = reader.read(inbuf)) != -1)) {
457            for (int i = 0; i < num; i++) {
458                byte ch = inbuf[i];
459                switch (ch) {
460                case '\n':
461                    seenCR = false; // swallow CR before LF
462                    listener.handle(sb.toString());
463                    sb.setLength(0);
464                    rePos = pos + i + 1;
465                    break;
466                case '\r':
467                    if (seenCR) {
468                        sb.append('\r');
469                    }
470                    seenCR = true;
471                    break;
472                default:
473                    if (seenCR) {
474                        seenCR = false; // swallow final CR
475                        listener.handle(sb.toString());
476                        sb.setLength(0);
477                        rePos = pos + i + 1;
478                    }
479                    sb.append((char) ch); // add character, not its ascii value
480                }
481            }
482
483            pos = reader.getFilePointer();
484        }
485
486        reader.seek(rePos); // Ensure we can re-read if necessary
487        return rePos;
488    }

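At line 479, the cast from byte to char breaks the encoding: every byte of a multi-byte sequence is appended as its own char instead of being decoded with the file's charset.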
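
To make the failure concrete, here is a minimal sketch. It is not taken from Commons IO, and the class and method names (ByteCastDemo, castEachByte, decode) are hypothetical. It applies the same per-byte cast as line 479 to a UTF-8 encoded string and compares the result with decoding the bytes through an explicit charset.

```java
import java.nio.charset.StandardCharsets;

public class ByteCastDemo {
    // Mimics line 479: each byte is cast to char and appended on its own.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decodes the whole byte sequence with an explicit charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo"; // U+00E9 (e with acute) takes two bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println("bytes in file:        " + utf8.length);
        System.out.println("chars after cast:     " + castEachByte(utf8).length());
        System.out.println("cast equals original: " + castEachByte(utf8).equals(original));
        System.out.println("decode equals original: " + decode(utf8).equals(original));
    }
}
```

The cast produces one char per byte, so any character encoded in more than one byte comes out as several unrelated chars, while pure ASCII content is unaffected.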

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ] 

Liyu Yi edited comment on IO-354 at 10/27/12 12:10 AM:
-------------------------------------------------------

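I used a "hacky" fix in the handler class to reconstruct the String with the right encoding. It works because the Tailer appended each raw byte as a single char, so narrowing every char back to a byte recovers the original byte sequence, which can then be decoded as UTF-8.

	// Requires: import java.nio.charset.StandardCharsets; (available since JDK 7)
	private String rebuildUTF8String(String line) {
		int len = line.length();
		byte[] bytes = new byte[len];
		for (int i = 0; i < len; i++) {
			// The narrowing cast keeps the low 8 bits, i.e. the byte read from the file.
			bytes[i] = (byte) line.charAt(i);
		}
		return new String(bytes, StandardCharsets.UTF_8);
	}

However, the right approach is to pass in the encoding in the "create" method and handle it in the Tailer.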
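
One way the charset-aware approach could look is sketched below. This is only an illustration under stated assumptions, not the Commons IO API or the eventual patch: the class CharsetLineSplitter and its feed method are hypothetical names. The idea is to keep the CR/LF handling of readLines but buffer raw bytes per line and decode them with the supplied Charset only when the line ends. Splitting on the bytes 0x0A and 0x0D is safe for UTF-8, where those values never occur inside a multi-byte sequence, but it would not be for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CharsetLineSplitter {
    private final Charset charset;
    // Raw bytes of the line currently being assembled.
    private final ByteArrayOutputStream lineBuf = new ByteArrayOutputStream();
    private boolean seenCR = false;

    public CharsetLineSplitter(Charset charset) {
        this.charset = charset;
    }

    // Consumes the first num bytes of buf and returns the lines they complete.
    public List<String> feed(byte[] buf, int num) {
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < num; i++) {
            byte b = buf[i];
            switch (b) {
            case '\n':
                seenCR = false; // swallow CR before LF
                lines.add(flush());
                break;
            case '\r':
                if (seenCR) {
                    lineBuf.write('\r');
                }
                seenCR = true;
                break;
            default:
                if (seenCR) {
                    seenCR = false; // a lone CR also ends the line
                    lines.add(flush());
                }
                lineBuf.write(b); // keep the raw byte; decode later
            }
        }
        return lines;
    }

    // Decodes the buffered bytes as one line and clears the buffer.
    private String flush() {
        String line = new String(lineBuf.toByteArray(), charset);
        lineBuf.reset();
        return line;
    }
}
```

Because decoding happens once per line rather than once per byte, a character whose bytes are split across two reads of the input buffer is still decoded correctly.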
                
      was (Author: liyuyi):
    I used a "hacky" fix to reconstruct the String with right encoding in the handler class. 

	private String rebuildUTF8String(String line) {
		int len = line.length();
		byte[] bytes = new byte[len];
		for (int i=0; i<len; i++) {
			bytes[i] = (byte)line.charAt(i);
		}
		return new String(bytes, UTF8);
	}

However, the right approach is to pass in the encoding in the "create" method and handling it in the Tailer.
                  
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
>                 Key: IO-354
>                 URL: https://issues.apache.org/jira/browse/IO-354
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.3
>         Environment: JDK 7 
> RHEL Linux
> Apache Commons IO version 2.4
>            Reporter: Liyu Yi
>              Labels: Charset, Encoding, Tailer
>
> I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". Basically, the current implementation does not work for multi-byte encoded files. See the following snippet,
> 448    private long readLines(RandomAccessFile reader) throws IOException {
> 449        StringBuilder sb = new StringBuilder();
> 450
> 451        long pos = reader.getFilePointer();
> 452        long rePos = pos; // position to re-read
> 453
> 454        int num;
> 455        boolean seenCR = false;
> 456        while (run && ((num = reader.read(inbuf)) != -1)) {
> 457            for (int i = 0; i < num; i++) {
> 458                byte ch = inbuf[i];
> 459                switch (ch) {
> 460                case '\n':
> 461                    seenCR = false; // swallow CR before LF
> 462                    listener.handle(sb.toString());
> 463                    sb.setLength(0);
> 464                    rePos = pos + i + 1;
> 465                    break;
> 466                case '\r':
> 467                    if (seenCR) {
> 468                        sb.append('\r');
> 469                    }
> 470                    seenCR = true;
> 471                    break;
> 472                default:
> 473                    if (seenCR) {
> 474                        seenCR = false; // swallow final CR
> 475                        listener.handle(sb.toString());
> 476                        sb.setLength(0);
> 477                        rePos = pos + i + 1;
> 478                    }
> 479                    sb.append((char) ch); // add character, not its ascii value
> 480                }
> 481            }
> 482
> 483            pos = reader.getFilePointer();
> 484        }
> 485
> 486        reader.seek(rePos); // Ensure we can re-read if necessary
> 487        return rePos;
> 488    }
> At line 479, the conversion of byte to char types breaks the encoding.


[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Posted by "Gary Gregory (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485535#comment-13485535 ] 

Gary Gregory commented on IO-354:
---------------------------------

Feel free to provide a patch! :)
                


[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489272#comment-13489272 ] 

Liyu Yi commented on IO-354:
----------------------------

OK, I'll give it a shot, as a return to the community. Hope this process is a smooth one :-)

                


[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ] 

Liyu Yi commented on IO-354:
----------------------------

I used a "hacky" fix to reconstruct the String with the right encoding in the handler class.

	private String rebuildUTF8String(String line) {
		int len = line.length();
		byte[] bytes = new byte[len];
		for (int i=0; i<len; i++) {
			bytes[i] = (byte)line.charAt(i);
		}
		return new String(bytes, UTF8);
	}

However, the right approach is to pass in the encoding in the "create" method and handle it in the Tailer.
                


[jira] [Updated] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Posted by "Liyu Yi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyu Yi updated IO-354:
-----------------------

    Description: 
I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". Basically, the current implementation does not work for multi-byte encoded files. See the following snippet,

448    private long readLines(RandomAccessFile reader) throws IOException {
449        StringBuilder sb = new StringBuilder();
450
451        long pos = reader.getFilePointer();
452        long rePos = pos; // position to re-read
453
454        int num;
455        boolean seenCR = false;
456        while (run && ((num = reader.read(inbuf)) != -1)) {
457            for (int i = 0; i < num; i++) {
458                byte ch = inbuf[i];
459                switch (ch) {
460                case '\n':
461                    seenCR = false; // swallow CR before LF
462                    listener.handle(sb.toString());
463                    sb.setLength(0);
464                    rePos = pos + i + 1;
465                    break;
466                case '\r':
467                    if (seenCR) {
468                        sb.append('\r');
469                    }
470                    seenCR = true;
471                    break;
472                default:
473                    if (seenCR) {
474                        seenCR = false; // swallow final CR
475                        listener.handle(sb.toString());
476                        sb.setLength(0);
477                        rePos = pos + i + 1;
478                    }
479                    sb.append((char) ch); // add character, not its ascii value
480                }
481            }
482
483            pos = reader.getFilePointer();
484        }
485
486        reader.seek(rePos); // Ensure we can re-read if necessary
487        return rePos;
488    }

At line 479, the conversion of byte to char type breaks the encoding.

  was:
I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". Basically, the current implementation does not work for multi-byte encoded files. See the following snippet,

448    private long readLines(RandomAccessFile reader) throws IOException {
449        StringBuilder sb = new StringBuilder();
450
451        long pos = reader.getFilePointer();
452        long rePos = pos; // position to re-read
453
454        int num;
455        boolean seenCR = false;
456        while (run && ((num = reader.read(inbuf)) != -1)) {
457            for (int i = 0; i < num; i++) {
458                byte ch = inbuf[i];
459                switch (ch) {
460                case '\n':
461                    seenCR = false; // swallow CR before LF
462                    listener.handle(sb.toString());
463                    sb.setLength(0);
464                    rePos = pos + i + 1;
465                    break;
466                case '\r':
467                    if (seenCR) {
468                        sb.append('\r');
469                    }
470                    seenCR = true;
471                    break;
472                default:
473                    if (seenCR) {
474                        seenCR = false; // swallow final CR
475                        listener.handle(sb.toString());
476                        sb.setLength(0);
477                        rePos = pos + i + 1;
478                    }
479                    sb.append((char) ch); // add character, not its ascii value
480                }
481            }
482
483            pos = reader.getFilePointer();
484        }
485
486        reader.seek(rePos); // Ensure we can re-read if necessary
487        return rePos;
488    }

At line 479, the conversion of byte to char types breaks the encoding.

    
