Posted to issues@commons.apache.org by "Keith D Gregory (JIRA)" <ji...@apache.org> on 2008/08/16 15:59:44 UTC
[jira] Created: (IO-178) BOMExclusionInputStream - an InputStream
for UTF-8 data that ignores an initial Byte Order mark
BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
-----------------------------------------------------------------------------------------------
Key: IO-178
URL: https://issues.apache.org/jira/browse/IO-178
Project: Commons IO
Issue Type: New Feature
Components: Streams/Writers
Affects Versions: 1.4
Reporter: Keith D Gregory
Priority: Minor
Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.
The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
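To make the problem concrete, here is a minimal sketch using only the JDK. The class and method names (BomDecodeDemo, decode, stripLeadingBom) are made up for illustration and are not part of any attachment. It decodes a BOM-prefixed UTF-8 byte array and then performs the manual clean-up that callers otherwise have to write themselves.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    /** Decodes UTF-8 bytes with the JDK decoder; a leading BOM survives as U+FEFF. */
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** The check every caller has to repeat: drop a leading U+FEFF if present. */
    static String stripLeadingBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of U+FEFF, followed by the text "hi".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String decoded = decode(data);
        System.out.println(decoded.length());          // prints 3: BOM char + 'h' + 'i'
        System.out.println(stripLeadingBom(decoded));  // prints hi
    }
}
```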
The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
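The idea can be sketched with the JDK's PushbackInputStream. This is not the attached BOMExclusionInputStream, only a rough illustration of the technique under the assumption that peeking at the first three bytes is acceptable; the names SkipUtf8Bom and skipBom are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

class SkipUtf8Bom {
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int count = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (count < head.length) {
            int n = pushback.read(head, count, head.length - count);
            if (n < 0) {
                break;
            }
            count += n;
        }
        boolean isBom = count == UTF8_BOM.length;
        for (int i = 0; isBom && i < UTF8_BOM.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        // Not a BOM: put the bytes back so the caller sees the stream unchanged.
        if (!isBom && count > 0) {
            pushback.unread(head, 0, count);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        InputStream in = skipBom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints A
    }
}
```

Working at the byte level means the check happens once, before any decoder is involved, so the rest of the program never sees U+FEFF.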
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (IO-178) BOMInputStream - an InputStream for
detecting and optionally excluding an initial Byte Order mark
Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917434#action_12917434 ]
Niall Pemberton commented on IO-178:
------------------------------------
I also meant to say that I've renamed it from BOMExclusionInputStream to BOMInputStream and added a ByteOrderMark implementation:
http://svn.apache.org/viewvc?view=revision&revision=1004073
> BOMInputStream - an InputStream for detecting and optionally excluding an initial Byte Order mark
> -------------------------------------------------------------------------------------------------
>
> Key: IO-178
> URL: https://issues.apache.org/jira/browse/IO-178
> Project: Commons IO
> Issue Type: New Feature
> Components: Streams/Writers
> Affects Versions: 1.4
> Reporter: Keith D Gregory
> Assignee: Niall Pemberton
> Priority: Minor
> Fix For: 2.0
>
> Attachments: BOMExclusionInputStream.java, BOMExclusionInputStream.patch, TestBOMExclusionInputStream.java
>
>
> Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.
> The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
> This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
> The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
[jira] Resolved: (IO-178) BOMExclusionInputStream - an InputStream
for UTF-8 data that ignores an initial Byte Order mark
Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niall Pemberton resolved IO-178.
--------------------------------
Resolution: Fixed
Fix Version/s: 2.0 (was: 1.5)
Assignee: Niall Pemberton
Thanks Keith, I have added this with superficial changes, mostly formatting:
http://svn.apache.org/viewvc?view=rev&revision=721749
> BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
> -----------------------------------------------------------------------------------------------
>
> Key: IO-178
> URL: https://issues.apache.org/jira/browse/IO-178
> Project: Commons IO
> Issue Type: New Feature
> Components: Streams/Writers
> Affects Versions: 1.4
> Reporter: Keith D Gregory
> Assignee: Niall Pemberton
> Priority: Minor
> Fix For: 2.0
>
> Attachments: BOMExclusionInputStream.java, BOMExclusionInputStream.patch, TestBOMExclusionInputStream.java
>
>
> Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.
> The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
> This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
> The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
[jira] Updated: (IO-178) BOMExclusionInputStream - an InputStream
for UTF-8 data that ignores an initial Byte Order mark
Posted by "Keith D Gregory (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith D Gregory updated IO-178:
-------------------------------
Attachment: TestBOMExclusionInputStream.java, BOMExclusionInputStream.java
I apologize for attaching actual files, but I didn't find any way to get Subversion diff to recognize new files (unlike CVS diff, which takes an "N" argument).
> BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
> -----------------------------------------------------------------------------------------------
>
> Key: IO-178
> URL: https://issues.apache.org/jira/browse/IO-178
> Project: Commons IO
> Issue Type: New Feature
> Components: Streams/Writers
> Affects Versions: 1.4
> Reporter: Keith D Gregory
> Priority: Minor
> Attachments: BOMExclusionInputStream.java, TestBOMExclusionInputStream.java
>
>
> Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.
> The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
> This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
> The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
[jira] Updated: (IO-178) BOMExclusionInputStream - an InputStream
for UTF-8 data that ignores an initial Byte Order mark
Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Henri Yandell updated IO-178:
-----------------------------
Fix Version/s: 1.5
Seems like a good idea to me. Definitely seen BOM bouncing around as an issue. Setting for 1.5.
> BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
> -----------------------------------------------------------------------------------------------
>
> Key: IO-178
> URL: https://issues.apache.org/jira/browse/IO-178
> Project: Commons IO
> Issue Type: New Feature
> Components: Streams/Writers
> Affects Versions: 1.4
> Reporter: Keith D Gregory
> Priority: Minor
> Fix For: 1.5
>
> Attachments: BOMExclusionInputStream.java, BOMExclusionInputStream.patch, TestBOMExclusionInputStream.java
>
>
> Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.
> The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
> This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
> The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
[jira] Updated: (IO-178) BOMInputStream - an InputStream for
detecting and optionally excluding an initial Byte Order mark
Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niall Pemberton updated IO-178:
-------------------------------
Summary: BOMInputStream - an InputStream for detecting and optionally excluding an initial Byte Order mark (was: BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark)
I have enhanced the functionality on this implementation from just excluding UTF-8 BOMs to detecting and optionally excluding any BOM.
So, for example, you could configure it to detect UTF-8, UTF-16BE or UTF-16LE BOMs and then find out which BOM was found. Whether the BOM is excluded is also now configurable.
So to detect and exclude a UTF-8 BOM:
{code}
BOMInputStream bomIn = new BOMInputStream(in);
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}
{code}
So to detect and include a UTF-8 BOM:
{code}
boolean include = true;
BOMInputStream bomIn = new BOMInputStream(in, include);
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}
{code}
So to detect either a UTF-16LE or a UTF-16BE BOM:
{code}
BOMInputStream bomIn = new BOMInputStream(in, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE);
if (!bomIn.hasBOM()) {
    // No BOM found
} else if (bomIn.hasBOM(ByteOrderMark.UTF_16LE)) {
    // has a UTF-16LE BOM
} else if (bomIn.hasBOM(ByteOrderMark.UTF_16BE)) {
    // has a UTF-16BE BOM
}
{code}
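For readers without the library at hand, the matching behind "detect one of several BOMs, then ask which one was found" can be sketched in plain JDK code. This is only an illustration of prefix matching, not the committed ByteOrderMark or BOMInputStream code; BomSniffer, Bom and detect are hypothetical names.

```java
class BomSniffer {
    /** Hypothetical stand-in for a byte order mark definition. */
    enum Bom {
        UTF_8(0xEF, 0xBB, 0xBF),
        UTF_16BE(0xFE, 0xFF),
        UTF_16LE(0xFF, 0xFE);

        final int[] bytes;

        Bom(int... bytes) {
            this.bytes = bytes;
        }

        /** True if the given bytes start with this BOM. */
        boolean matches(byte[] head) {
            if (head.length < bytes.length) {
                return false;
            }
            for (int i = 0; i < bytes.length; i++) {
                if ((head[i] & 0xFF) != bytes[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns the first candidate whose bytes prefix head, or null if none match. */
    static Bom detect(byte[] head, Bom... candidates) {
        for (Bom bom : candidates) {
            if (bom.matches(head)) {
                return bom;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // FF FE followed by 'a' encoded as UTF-16LE.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'a', 0};
        System.out.println(detect(data, Bom.UTF_16LE, Bom.UTF_16BE)); // prints UTF_16LE
    }
}
```

Only the BOMs passed as candidates are considered, so a UTF-8 BOM goes unreported when the caller asks about the UTF-16 variants alone.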
> BOMInputStream - an InputStream for detecting and optionally excluding an initial Byte Order mark
> -------------------------------------------------------------------------------------------------
>
> Key: IO-178
> URL: https://issues.apache.org/jira/browse/IO-178
> Project: Commons IO
> Issue Type: New Feature
> Components: Streams/Writers
> Affects Versions: 1.4
> Reporter: Keith D Gregory
> Assignee: Niall Pemberton
> Priority: Minor
> Fix For: 2.0
>
> Attachments: BOMExclusionInputStream.java, BOMExclusionInputStream.patch, TestBOMExclusionInputStream.java
>
>
> Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.
> The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
> This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
> The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
[jira] Updated: (IO-178) BOMExclusionInputStream - an InputStream
for UTF-8 data that ignores an initial Byte Order mark
Posted by "Keith D Gregory (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/IO-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith D Gregory updated IO-178:
-------------------------------
Attachment: BOMExclusionInputStream.patch
I guess in retrospect this should have been obvious: add the files before doing a diff.
> BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
> -----------------------------------------------------------------------------------------------
>
> Key: IO-178
> URL: https://issues.apache.org/jira/browse/IO-178
> Project: Commons IO
> Issue Type: New Feature
> Components: Streams/Writers
> Affects Versions: 1.4
> Reporter: Keith D Gregory
> Priority: Minor
> Attachments: BOMExclusionInputStream.java, BOMExclusionInputStream.patch, TestBOMExclusionInputStream.java
>
>
> Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.
> The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
> This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.
> The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.