Posted to issues@commons.apache.org by "Liyu Yi (JIRA)" <ji...@apache.org> on 2012/10/27 02:11:11 UTC
[jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

    [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ] 

Liyu Yi commented on IO-354:
----------------------------

I used a "hacky" fix in the handler class to reconstruct the String with the right encoding.

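	import java.nio.charset.StandardCharsets;

	// Tailer cast each raw byte to a char, so narrowing every char back to a
	// byte recovers the original bytes, which can then be decoded as UTF-8.
	private String rebuildUTF8String(String line) {
		int len = line.length();
		byte[] bytes = new byte[len];
		for (int i = 0; i < len; i++) {
			bytes[i] = (byte) line.charAt(i); // keep only the low 8 bits
		}
		return new String(bytes, StandardCharsets.UTF_8);
	}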

However, the right approach is to pass the encoding to the "create" method and handle it in the Tailer.
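
A rough sketch of what handling the encoding inside the reader could look like: collect the raw bytes of each line and decode them only once the line is complete, using a caller-supplied Charset. The class CharsetLineDecoder and its feed method are hypothetical names for illustration, not Commons IO API. The sketch assumes an encoding such as UTF-8 where the bytes 0x0A and 0x0D never occur inside a multi-byte sequence, and it only treats LF and CRLF as line endings (unlike the Tailer code, a bare CR does not end a line here).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Commons IO: splits incoming bytes into
// lines and decodes each complete line with the caller-supplied charset.
final class CharsetLineDecoder {
    private final Charset charset;
    // Raw bytes of the line currently being read; kept across feed() calls
    // so a multi-byte character split between two reads is still decoded whole.
    private final ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();

    CharsetLineDecoder(Charset charset) {
        this.charset = charset;
    }

    // Consumes the first len bytes of buf and returns the lines they complete.
    List<String> feed(byte[] buf, int len) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < len; i++) {
            byte b = buf[i];
            if (b == '\n') {
                byte[] raw = lineBytes.toByteArray();
                int n = raw.length;
                if (n > 0 && raw[n - 1] == '\r') {
                    n--; // drop the CR of a CRLF pair
                }
                lines.add(new String(raw, 0, n, charset)); // decode once per line
                lineBytes.reset();
            } else {
                lineBytes.write(b);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        CharsetLineDecoder decoder = new CharsetLineDecoder(StandardCharsets.UTF_8);
        byte[] data = "h\u00e9llo\r\nw\u00f6rld\n".getBytes(StandardCharsets.UTF_8);
        for (String line : decoder.feed(data, data.length)) {
            System.out.println(line.length()); // each decoded line has 5 characters
        }
    }
}
```

A listener would then receive the decoded lines directly, and the rebuildUTF8String workaround would no longer be needed.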
                
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
>                 Key: IO-354
>                 URL: https://issues.apache.org/jira/browse/IO-354
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.3
>         Environment: JDK 7 
> RHEL Linux
> Apache Commons IO version 2.4
>            Reporter: Liyu Yi
>              Labels: Charset, Encoding, Tailer
>
> I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". Basically, the current implementation does not work for files in a multi-byte encoding. See the following snippet:
> 448    private long readLines(RandomAccessFile reader) throws IOException {
> 449        StringBuilder sb = new StringBuilder();
> 450
> 451        long pos = reader.getFilePointer();
> 452        long rePos = pos; // position to re-read
> 453
> 454        int num;
> 455        boolean seenCR = false;
> 456        while (run && ((num = reader.read(inbuf)) != -1)) {
> 457            for (int i = 0; i < num; i++) {
> 458                byte ch = inbuf[i];
> 459                switch (ch) {
> 460                case '\n':
> 461                    seenCR = false; // swallow CR before LF
> 462                    listener.handle(sb.toString());
> 463                    sb.setLength(0);
> 464                    rePos = pos + i + 1;
> 465                    break;
> 466                case '\r':
> 467                    if (seenCR) {
> 468                        sb.append('\r');
> 469                    }
> 470                    seenCR = true;
> 471                    break;
> 472                default:
> 473                    if (seenCR) {
> 474                        seenCR = false; // swallow final CR
> 475                        listener.handle(sb.toString());
> 476                        sb.setLength(0);
> 477                        rePos = pos + i + 1;
> 478                    }
> 479                    sb.append((char) ch); // add character, not its ascii value
> 480                }
> 481            }
> 482
> 483            pos = reader.getFilePointer();
> 484        }
> 485
> 486        reader.seek(rePos); // Ensure we can re-read if necessary
> 487        return rePos;
> 488    }
> At line 479, the cast from byte to char treats every byte as a separate character, so multi-byte sequences are never decoded and the encoding is broken.
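
To illustrate the effect of the cast at line 479, here is a minimal standalone sketch. ByteCastDemo and its two methods are hypothetical names used only for this illustration. It compares appending each byte as a char, as readLines does, with decoding the same bytes using an explicit Charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons IO.
final class ByteCastDemo {

    // Mirrors sb.append((char) ch): every byte becomes its own char.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b); // bytes >= 0x80 are negative and sign-extend to 0xFFxx
        }
        return sb.toString();
    }

    // Decodes the whole byte sequence with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(castEachByte(utf8).length());                    // prints 2
        System.out.println(decode(utf8, StandardCharsets.UTF_8).length()); // prints 1
    }
}
```

Because the cast keeps the low 8 bits of every byte intact, narrowing each char back to a byte recovers the original bytes, which is why a byte-by-byte rebuild in the listener can work as a stopgap.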
