You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2006/03/30 19:09:07 UTC
writeChars method in IndexOutput
I was reading up on conversion of characters to UTF-8 and I now understand
why it is writing out UTF-8 (to be able to support most of the worlds
languages with minimal space?). But after reading up on the algorithms for
conversion as given below, does the writeChars method not support the
U+10000→U+10FFFF conversions or am I misreading the code?
Character Range
Bit Encoding
U+0000→U+007F
0xxxxxxx
U+0080→U+07FF
110xxxxx 10xxxxxx
U+0800→U+FFFF
1110xxxx 10xxxxxx 10xxxxxx
U+10000→U+10FFFF
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
public void writeChars(String s, int start, int length)
throws IOException {
final int end = start + length;
for (int i = start; i < end; i++) {
final int code = (int)s.charAt(i);
if (code >= 0x01 && code <= 0x7F)
writeByte((byte)code);
else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
writeByte((byte)(0xC0 | (code >> 6)));
writeByte((byte)(0x80 | (code & 0x3F)));
}
else {
writeByte((byte)(0xE0 | (code >>> 12)));
writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
writeByte((byte)(0x80 | (code & 0x3F)));
}
}
}
Re: writeChars method in IndexOutput
Posted by Yonik Seeley <ys...@gmail.com>.
On 3/30/06, Dennis Kubes <nu...@dragonflymc.com> wrote:
> Is this modified UTF-8 such as is found in DataInput interface?
Yes, I believe so.
-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
> -----Original Message-----
> From: Yonik Seeley [mailto:yseeley@gmail.com]
> Sent: Thursday, March 30, 2006 11:56 AM
> To: java-user@lucene.apache.org
> Subject: Re: writeChars method in IndexOutput
>
> Lucene doesn't currently output totally valid UTF-8
> Patches to make it do so are here:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01987.html
>
> Should this be tackled pre or post 2.0?
>
> -Yonik
> http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
>
> On 3/30/06, Dennis Kubes <nu...@dragonflymc.com> wrote:
> > I was reading up on conversion of characters to UTF-8 and I now understand
> > why it is writing out UTF-8 (to be able to support most of the worlds
> > languages with minimal space?). But after reading up on the algorithms for
> > conversion as given below, does the writeChars method not support the
> > U+10000→U+10FFFF conversions or am I misreading the code?
> >
> >
> >
> >
> > Character Range
> >
> > Bit Encoding
> >
> >
> > U+0000→U+007F
> >
> > 0xxxxxxx
> >
> >
> > U+0080→U+07FF
> >
> > 110xxxxx 10xxxxxx
> >
> >
> > U+0800→U+FFFF
> >
> > 1110xxxx 10xxxxxx 10xxxxxx
> >
> >
> > U+10000→U+10FFFF
> >
> > 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> >
> >
> >
> > public void writeChars(String s, int start, int length)
> >
> > throws IOException {
> >
> >
> >
> > final int end = start + length;
> >
> > for (int i = start; i < end; i++) {
> >
> >
> >
> > final int code = (int)s.charAt(i);
> >
> >
> >
> > if (code >= 0x01 && code <= 0x7F)
> >
> > writeByte((byte)code);
> >
> > else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
> >
> > writeByte((byte)(0xC0 | (code >> 6)));
> >
> > writeByte((byte)(0x80 | (code & 0x3F)));
> >
> > }
> >
> > else {
> >
> > writeByte((byte)(0xE0 | (code >>> 12)));
> >
> > writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
> >
> > writeByte((byte)(0x80 | (code & 0x3F)));
> >
> > }
> >
> > }
> >
> > }
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: writeChars method in IndexOutput
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Is this modified UTF-8 such as is found in DataInput interface?
Dennis
-----Original Message-----
From: Yonik Seeley [mailto:yseeley@gmail.com]
Sent: Thursday, March 30, 2006 11:56 AM
To: java-user@lucene.apache.org
Subject: Re: writeChars method in IndexOutput
Lucene doesn't currently output totally valid UTF-8
Patches to make it do so are here:
http://www.mail-archive.com/java-dev@lucene.apache.org/msg01987.html
Should this be tackled pre or post 2.0?
-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
On 3/30/06, Dennis Kubes <nu...@dragonflymc.com> wrote:
> I was reading up on conversion of characters to UTF-8 and I now understand
> why it is writing out UTF-8 (to be able to support most of the worlds
> languages with minimal space?). But after reading up on the algorithms for
> conversion as given below, does the writeChars method not support the
> U+10000→U+10FFFF conversions or am I misreading the code?
>
>
>
>
> Character Range
>
> Bit Encoding
>
>
> U+0000→U+007F
>
> 0xxxxxxx
>
>
> U+0080→U+07FF
>
> 110xxxxx 10xxxxxx
>
>
> U+0800→U+FFFF
>
> 1110xxxx 10xxxxxx 10xxxxxx
>
>
> U+10000→U+10FFFF
>
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>
>
>
> public void writeChars(String s, int start, int length)
>
> throws IOException {
>
>
>
> final int end = start + length;
>
> for (int i = start; i < end; i++) {
>
>
>
> final int code = (int)s.charAt(i);
>
>
>
> if (code >= 0x01 && code <= 0x7F)
>
> writeByte((byte)code);
>
> else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
>
> writeByte((byte)(0xC0 | (code >> 6)));
>
> writeByte((byte)(0x80 | (code & 0x3F)));
>
> }
>
> else {
>
> writeByte((byte)(0xE0 | (code >>> 12)));
>
> writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
>
> writeByte((byte)(0x80 | (code & 0x3F)));
>
> }
>
> }
>
> }
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: writeChars method in IndexOutput
Posted by Yonik Seeley <ys...@gmail.com>.
Lucene doesn't currently output totally valid UTF-8
Patches to make it do so are here:
http://www.mail-archive.com/java-dev@lucene.apache.org/msg01987.html
Should this be tackled pre or post 2.0?
-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
On 3/30/06, Dennis Kubes <nu...@dragonflymc.com> wrote:
> I was reading up on conversion of characters to UTF-8 and I now understand
> why it is writing out UTF-8 (to be able to support most of the worlds
> languages with minimal space?). But after reading up on the algorithms for
> conversion as given below, does the writeChars method not support the
> U+10000→U+10FFFF conversions or am I misreading the code?
>
>
>
>
> Character Range
>
> Bit Encoding
>
>
> U+0000→U+007F
>
> 0xxxxxxx
>
>
> U+0080→U+07FF
>
> 110xxxxx 10xxxxxx
>
>
> U+0800→U+FFFF
>
> 1110xxxx 10xxxxxx 10xxxxxx
>
>
> U+10000→U+10FFFF
>
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>
>
>
> public void writeChars(String s, int start, int length)
>
> throws IOException {
>
>
>
> final int end = start + length;
>
> for (int i = start; i < end; i++) {
>
>
>
> final int code = (int)s.charAt(i);
>
>
>
> if (code >= 0x01 && code <= 0x7F)
>
> writeByte((byte)code);
>
> else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
>
> writeByte((byte)(0xC0 | (code >> 6)));
>
> writeByte((byte)(0x80 | (code & 0x3F)));
>
> }
>
> else {
>
> writeByte((byte)(0xE0 | (code >>> 12)));
>
> writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
>
> writeByte((byte)(0x80 | (code & 0x3F)));
>
> }
>
> }
>
> }
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org