Posted to issues@commons.apache.org by "Keith D Gregory (JIRA)" <ji...@apache.org> on 2008/08/16 15:59:44 UTC

[jira] Created: (IO-178) BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark

BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
-----------------------------------------------------------------------------------------------

                 Key: IO-178
                 URL: https://issues.apache.org/jira/browse/IO-178
             Project: Commons IO
          Issue Type: New Feature
          Components: Streams/Writers
    Affects Versions: 1.4
            Reporter: Keith D Gregory
            Priority: Minor


Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.

The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.

This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
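
To illustrate the behaviour described above, here is a minimal sketch using only the JDK. The class and method names (BomDecodeDemo, decodeUtf8, stripLeadingBom) are made up for this example and are not part of the attached files; StandardCharsets also assumes Java 7 or later, which is newer than the JDK discussed in this issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the attachments on this issue.
class BomDecodeDemo {

    /** Decodes the bytes as UTF-8 with the JDK's built-in decoder. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Returns the text without a leading U+FEFF character, if one is present. */
    static String stripLeadingBom(String text) {
        boolean startsWithBom = !text.isEmpty() && text.charAt(0) == 0xFEFF;
        return startsWithBom ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the two ASCII bytes of "hi"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String decoded = decodeUtf8(data);
        System.out.println(decoded.length());           // 3: the BOM survives as one char
        System.out.println(decoded.charAt(0) == 0xFEFF); // true
        System.out.println(stripLeadingBom(decoded));    // hi
    }
}
```

Every caller that decodes text this way needs a check like stripLeadingBom, which is the chore the proposed stream is meant to remove.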

The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
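
A rough sketch of the idea, for readers without the attachments: the code below is not the attached BOMExclusionInputStream, just one possible way to skip a leading UTF-8 BOM using the JDK's PushbackInputStream. The names Utf8BomSkipper, skipUtf8Bom and firstByte are hypothetical, and UncheckedIOException assumes Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not the class attached to this issue.
class Utf8BomSkipper {
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Wraps the stream so that a leading UTF-8 BOM, if present, is never returned. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // Read up to three bytes, tolerating short reads and early end of stream.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = (n == UTF8_BOM.length);
        for (int i = 0; isBom && i < UTF8_BOM.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        // Not a BOM: put back whatever was read so the caller sees the stream unchanged.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Convenience for experiments: first byte seen after BOM skipping, or -1 if empty. */
    static int firstByte(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would wrap the raw stream once, for example new InputStreamReader(Utf8BomSkipper.skipUtf8Bom(rawStream), "UTF-8"), and then read text as usual.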

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (IO-178) BOMInputStream - an InputStream for detecting and optionally excluding an initial Byte Order mark

Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917434#action_12917434 ] 

Niall Pemberton commented on IO-178:
------------------------------------

I also meant to say that I've renamed it from BOMExclusionInputStream to BOMInputStream and added a ByteOrderMark implementation:

    http://svn.apache.org/viewvc?view=revision&revision=1004073



[jira] Resolved: (IO-178) BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark

Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton resolved IO-178.
--------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 1.5)
                   2.0
         Assignee: Niall Pemberton

Thanks Keith, I have added this with superficial changes, mostly formatting:

http://svn.apache.org/viewvc?view=rev&revision=721749



[jira] Updated: (IO-178) BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark

Posted by "Keith D Gregory (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith D Gregory updated IO-178:
-------------------------------

    Attachment: TestBOMExclusionInputStream.java
                BOMExclusionInputStream.java

I apologize for attaching actual files, but I couldn't find a way to get Subversion diff to recognize new files (unlike CVS diff, which takes a "-N" argument).



[jira] Updated: (IO-178) BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated IO-178:
-----------------------------

    Fix Version/s: 1.5

Seems like a good idea to me. Definitely seen BOM bouncing around as an issue. Setting for 1.5.



[jira] Updated: (IO-178) BOMInputStream - an InputStream for detecting and optionally excluding an initial Byte Order mark

Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton updated IO-178:
-------------------------------

    Summary: BOMInputStream - an InputStream for detecting and optionally excluding an initial Byte Order mark  (was: BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark)

I have enhanced the functionality of this implementation from just excluding UTF-8 BOMs to detecting and optionally excluding any BOM.

So, for example, you could configure it to detect UTF-8, UTF-16BE or UTF-16LE BOMs and then find out which BOM was found. Whether the BOM is excluded is now also configurable.

So to detect and exclude a UTF-8 BOM:

{code}
BOMInputStream bomIn = new BOMInputStream(in);
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}
{code}

So to detect and include a UTF-8 BOM:

{code}
boolean include = true;
BOMInputStream bomIn = new BOMInputStream(in, include);
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}
{code}

So to detect either a UTF-16BE or a UTF-16LE BOM:

{code}
BOMInputStream bomIn = new BOMInputStream(in, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE);
if (!bomIn.hasBOM()) {
    // No BOM found
} else if (bomIn.hasBOM(ByteOrderMark.UTF_16LE)) {
    // has a UTF-16LE BOM
} else if (bomIn.hasBOM(ByteOrderMark.UTF_16BE)) {
    // has a UTF-16BE BOM
}
{code}
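
For readers curious how detection among several candidate BOMs can work, here is a small JDK-only sketch. It is not the Commons IO implementation; the BomDetector class, its Bom enum and the detect method are hypothetical names used only for illustration, and it works on a byte array rather than a stream to keep the example short.

```java
import java.util.Arrays;

// Hypothetical illustration of prefix matching; not the Commons IO classes.
class BomDetector {

    /** A few well-known byte order marks and their byte sequences. */
    enum Bom {
        UTF_8(0xEF, 0xBB, 0xBF),
        UTF_16BE(0xFE, 0xFF),
        UTF_16LE(0xFF, 0xFE);

        final byte[] bytes;

        Bom(int... values) {
            bytes = new byte[values.length];
            for (int i = 0; i < values.length; i++) {
                bytes[i] = (byte) values[i];
            }
        }
    }

    /** Returns the first candidate whose bytes are a prefix of data, or null if none match. */
    static Bom detect(byte[] data, Bom... candidates) {
        for (Bom candidate : candidates) {
            int len = candidate.bytes.length;
            if (data.length >= len
                    && Arrays.equals(Arrays.copyOf(data, len), candidate.bytes)) {
                return candidate;
            }
        }
        return null;
    }
}
```

Only the BOMs passed as candidates are considered, which mirrors the idea of configuring the stream with the marks you care about. Note that this sketch returns the first match in argument order, so if BOMs that share a prefix were added (the UTF-32LE mark starts with the same two bytes as UTF-16LE), the longer one would have to be listed first.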




[jira] Updated: (IO-178) BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark

Posted by "Keith D Gregory (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith D Gregory updated IO-178:
-------------------------------

    Attachment: BOMExclusionInputStream.patch

I guess in retrospect this should have been obvious: add the files before doing a diff.

