Posted to java-user@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2005/08/27 06:18:21 UTC

Lucene does NOT use UTF-8.

Greets,

[crossposted to java-user@lucene.apache.org and plucene@kasei.com]

I've delved into the matter of Lucene and UTF-8 a little further, and  
I am discouraged by what I believe I've uncovered.

Lucene should not be advertising that it uses "standard UTF-8" -- or  
even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.  The  
two distinguishing characteristics of "Modified UTF-8" are the  
treatment of codepoints above the BMP (which are written as surrogate  
pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather  
than 0000 0000.  Both of these became illegal as of Unicode 3.1  
(IIRC), because they are not shortest-form and non-shortest-form  
UTF-8 presents a security risk.
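
To make the difference concrete, here is a small Java sketch (not part of
the original message) that prints the two encodings side by side.
DataOutputStream.writeUTF() produces the same "Modified UTF-8" byte
sequences described above; the example string is a null character plus
U+1D11E, a supplementary character stored in Java as the surrogate pair
D834 DD1E.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class ModifiedUtf8Demo {
  public static void main(String[] args) throws Exception {
    String s = "\u0000\uD834\uDD1E";

    // standard UTF-8: null -> 00, U+1D11E -> F0 9D 84 9E
    dump("standard UTF-8", s.getBytes("UTF-8"), 0);

    // modified UTF-8: null -> C0 80, and each surrogate is encoded
    // separately as three bytes -> ED A0 B4 ED B4 9E
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    new DataOutputStream(buf).writeUTF(s);
    dump("modified UTF-8", buf.toByteArray(), 2); // skip writeUTF's 2-byte length prefix
  }

  static void dump(String label, byte[] bytes, int offset) {
    System.out.print(label + ":");
    for (int i = offset; i < bytes.length; i++)
      System.out.printf(" %02X", bytes[i]);
    System.out.println();
  }
}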

The documentation should really state that Lucene stores strings in a  
Java-only adulteration of UTF-8, unsuitable for interchange.  Since  
Perl uses true shortest-form UTF-8 as its native encoding, Plucene  
would have to jump through two efficiency-killing hoops in order to  
write files that would not choke Lucene: instead of writing out its  
true, legal UTF-8 directly, it would be necessary to first translate  
to UTF-16, then duplicate the Lucene encoding algorithm from  
OutputStream.  In theory.

Below you will find a simple Perl script which illustrates what  
happens when Perl encounters malformed UTF-8.  Run it (you need Perl  
5.8 or higher) and you will see why even if I thought it was a good  
idea to emulate the Java hack for encoding "Modified UTF-8", trying  
to make it work in practice would be a nightmare.

If Plucene were to write legal UTF-8 strings to its index files, Java  
Lucene would misbehave and possibly blow up any time a string  
contained either a 4-byte character or a null byte.  On the flip  
side, Perl will spew warnings like crazy and possibly blow up  
whenever it encounters a Lucene-encoded null or surrogate pair.  The  
potential blowups are due to the fact that Lucene and Plucene will  
not agree on how many characters a string contains, resulting in  
overruns or underruns.

I am hoping that the answer to this will be a fix to the encoding  
mechanism in Lucene so that it really does use legal UTF-8.  The most  
efficient way to go about this has not yet presented itself.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#----------------------------------------

#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.

my $data = "foo\xC0\x80\n";

open (my $virtual_filehandle, "+<:utf8", \$data);
print <$virtual_filehandle>;




Re: Fwd: Lucene does NOT use UTF-8.

Posted by Daniel Naber <lu...@danielnaber.de>.
On Saturday 27 August 2005 16:05, Marvin Humphrey wrote:

> Lucene should not be advertising that it uses "standard UTF-8" -- or  
> even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.  

For now, I've changed the information about the file format documentation.

Regards
 Daniel

-- 
http://www.danielnaber.de



Fwd: Lucene does NOT use UTF-8.

Posted by Marvin Humphrey <ma...@rectangular.com>.
Greets,

Discussion moved from the users list as per suggestion...

-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey <ma...@rectangular.com>
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, plucene@kasei.com
Subject: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org




RE: Lucene does NOT use UTF-8.

Posted by Robert Engels <re...@ix.netcom.com>.
That method should easily be changed to

public final String readString() throws IOException {
  int length = readVInt();
  return new String(readBytes(length), 0, length, "UTF-8");
}

readBytes() could reuse the same array if it was large enough. Then only the
single char[] is created in the String code.
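
A sketch of that reuse (the buffer field and single-argument readBytes()
helper are hypothetical, not Lucene's actual API) might be:

private byte[] bytes;  // reused across calls, like the existing chars buffer

private byte[] readBytes(int length) throws IOException {
  if (bytes == null || length > bytes.length)
    bytes = new byte[length];     // grow only when needed
  readBytes(bytes, 0, length);    // fill from the underlying stream
  return bytes;
}

Note that once the buffer can be longer than the string, the decode has to
use explicit bounds -- hence the new String(..., 0, length, "UTF-8") form
above.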

-----Original Message-----
From: Yonik Seeley [mailto:yseeley@gmail.com]
Sent: Tuesday, August 30, 2005 11:28 AM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


> How will the difference impact String memory allocations? Looking at the
> String code, I can't see where it would make an impact.


This is from Lucene InputStream:
public final String readString() throws IOException {
  int length = readVInt();
  if (chars == null || length > chars.length)
    chars = new char[length];
  readChars(chars, 0, length);
  return new String(chars, 0, length);
}

If you know the length in bytes, you still have to allocate that many chars
(even though the number of chars may be less than the number of bytes). Not
a big deal IMHO.

A bigger pain is on the writing side, where you can't stream things because
you don't know what the length is going to be (in either bytes *or* UTF-8
chars).

So it turns out that Java's 16 bit chars were just a waste... it's still a
multibyte format *and* it takes up more space. UTF-8 would have been nice -
no conversions necessary.


-Yonik Now hiring -- http://tinyurl.com/7m67g




Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
> How will the difference impact String memory allocations? Looking at the
> String code, I can't see where it would make an impact. 


This is from Lucene InputStream:
public final String readString() throws IOException {
  int length = readVInt();
  if (chars == null || length > chars.length)
    chars = new char[length];
  readChars(chars, 0, length);
  return new String(chars, 0, length);
}

If you know the length in bytes, you still have to allocate that many chars 
(even though the number of chars may be less than the number of bytes). Not 
a big deal IMHO.

A bigger pain is on the writing side, where you can't stream things because 
you don't know what the length is going to be (in either bytes *or* UTF-8 
chars).

So it turns out that Java's 16 bit chars were just a waste... it's still a 
multibyte format *and* it takes up more space. UTF-8 would have been nice - 
no conversions necessary.


-Yonik Now hiring -- http://tinyurl.com/7m67g


Re: Lucene does NOT use UTF-8.

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
>>Where/how is the Lucene ordering of terms used?
> 
> An ordering is necessary to be able to find things in the index.
> For the most part, the ordering doesn't seem to matter... the only query that 
> comes to mind where it does matter is RangeQuery.

For back-compatibility it would be best if the ordering is consistent 
with the current ordering, i.e., lexicographic by character (or code 
point, if you prefer).  Fortunately, UTF-8 makes this easy.

Doug
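
A quick sketch of the subtlety (not from the original message): raw Java
char order diverges from code point order for non-BMP characters, while
UTF-8 byte order does not.

String bmp  = "\uFB01";          // U+FB01, inside the BMP
String supp = "\uD835\uDC00";    // U+1D400, above the BMP (a surrogate pair)

// As UTF-16 code units, FB01 > D835, so compareTo() puts supp first --
// the opposite of code point order (U+FB01 < U+1D400).
System.out.println(bmp.compareTo(supp) > 0);        // true

// As UTF-8 bytes, EF AC 81 < F0 9D 90 80 unsigned, matching code points.
byte[] a = bmp.getBytes("UTF-8");
byte[] b = supp.getBytes("UTF-8");
System.out.println((a[0] & 0xFF) < (b[0] & 0xFF));  // true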



Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
> Where/how is the Lucene ordering of terms used?


An ordering is necessary to be able to find things in the index.
For the most part, the ordering doesn't seem to matter... the only query that 
comes to mind where it does matter is RangeQuery.

For sorting queries, one is able to specify a Locale.
-Yonik 
Now hiring -- http://tinyurl.com/7m67g

Re: Lucene does NOT use UTF-8.

Posted by Ken Krugler <kk...@transpac.com>.
>Yonik Seeley wrote:
>>A related problem exists even if the prefix length vInt is changed 
>>to represent the number of unicode chars (as opposed to number of 
>>java chars), right? The prefix length is no longer the offset into 
>>the char[] to put the suffix.
>
>Yes, I suppose this is a problem too.  Sigh.
>
>>Another approach might be to convert the target to a UTF-8 byte[] 
>>and do all comparisons on byte[]. UTF-8 has some very nice 
>>properties, including that the byte[] representation of UTF-8 
>>strings compare the same as UCS-4 would.
>
>I was not aware of that, but I see you are correct:
>
>    o  The byte-value lexicographic sorting order of UTF-8 strings is the
>       same as if ordered by character numbers.
>
>(From http://www.faqs.org/rfcs/rfc3629.html)
>
>That makes the byte representation much more palatable, since Lucene 
>orders terms lexicographically.

Where/how is the Lucene ordering of terms used?

I'm asking because people often confuse lexicographic order with 
"dictionary" order, whereas in the context of UTF-8 it just means 
"the same order as Unicode code points". And the order of Java chars 
would be the same as for Unicode code points, other than non-BMP 
characters.

Thanks,

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8.

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> A related problem exists even if the prefix length vInt is changed to 
> represent the number of unicode chars (as opposed to number of java chars), 
> right? The prefix length is no longer the offset into the char[] to put the 
> suffix.

Yes, I suppose this is a problem too.  Sigh.

> Another approach might be to convert the target to a UTF-8 byte[] 
> and do all comparisons on byte[]. UTF-8 has some very nice properties, 
> including that the byte[] representation of UTF-8 strings compare the same 
> as UCS-4 would.

I was not aware of that, but I see you are correct:

    o  The byte-value lexicographic sorting order of UTF-8 strings is the
       same as if ordered by character numbers.

(From http://www.faqs.org/rfcs/rfc3629.html)

That makes the byte representation much more palatable, since Lucene 
orders terms lexicographically.

Doug





Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
> The inefficiency would be if prefix were re-converted from UTF-8
> for each term, e.g., in order to compare it to the target.

Ahhh, gotcha.

A related problem exists even if the prefix length vInt is changed to 
represent the number of unicode chars (as opposed to number of java chars), 
right? The prefix length is no longer the offset into the char[] to put the 
suffix.

Another approach might be to convert the target to a UTF-8 byte[] 
and do all comparisons on byte[]. UTF-8 has some very nice properties, 
including that the byte[] representation of UTF-8 strings compare the same 
as UCS-4 would.

As you say, the variations need to be tested.

-Yonik 
Now hiring -- http://tinyurl.com/7m67g

Re: Lucene does NOT use UTF-8.

Posted by Doug Cutting <cu...@apache.org>.
Wolfgang Hoschek wrote:
> I don't know if it matters for Lucene usage. But if using  
> CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a  
> significant problem, it's probably due to startup/init time of these  
> methods for individually converting many small strings, not  inherently 
> due to UTF-8 usage. I'm confident that a custom UTF-8  implementation 
> can almost completely eliminate these issues. I've  done this before for 
> binary XML with great success, and it could  certainly be done for 
> lucene just as well. Bottom line: It's probably  an issue that can be 
> dealt with via proper impl; it probably  shouldn't dictate design 
> directions.

Good point.  Currently Lucene already has its own (buggy) UTF-8 
implementation for performance, so that wouldn't really be a big change.

The big question now seems to be whether the stored character sequence 
lengths should be in bytes or characters.  Bytes might be fast and 
simple (whether we implement our own UTF-8 in Java or not) but are not 
back-compatible.  So do we bite the bullet and make a very incompatible 
change to index formats?  Or do we make these counts be unicode 
characters (which is mostly back-compatible) and make the code a bit 
more awkward?  Some implementations would be nice to see just how 
awkward things get.

Doug



Re: Lucene does NOT use UTF-8.

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:

> Yonik Seeley wrote:
>
>> I've been looking around... do you have a pointer to the source  
>> where just the suffix is converted from UTF-8?
>> I understand the index format, but I'm not sure I understand the  
>> problem that would be posed by the prefix length being a byte count.
>>
>
> TermBuffer.java:66
>
> Things could work fine if the prefix length were a byte count.  A  
> byte buffer could easily be constructed that contains the full byte  
> sequence (prefix + suffix), and then this could be converted to a  
> String.  The inefficiency would be if prefix were re-converted from  
> UTF-8 for each term, e.g., in order to compare it to the target.   
> Prefixes are frequently longer than suffixes, so this could be  
> significant.  Does that make sense?  I don't know whether it would  
> actually be significant, although TermBuffer.java was added  
> recently as a measurable performance enhancement, so this is  
> performance critical code.
>
> We need to stop discussing this in the abstract and start coding  
> alternatives and benchmarking them.  Is  
> java.nio.charset.CharsetEncoder fast enough?  Will moving things  
> through CharBuffer and ByteBuffer be too slow?  Should Lucene keep  
> maintaining its own UTF-8 implementation for performance?  I don't  
> know, only some experiments will tell.
>
> Doug
>

I don't know if it matters for Lucene usage. But if using  
CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a  
significant problem, it's probably due to startup/init time of these  
methods for individually converting many small strings, not  
inherently due to UTF-8 usage. I'm confident that a custom UTF-8  
implementation can almost completely eliminate these issues. I've  
done this before for binary XML with great success, and it could  
certainly be done for lucene just as well. Bottom line: It's probably  
an issue that can be dealt with via proper impl; it probably  
shouldn't dictate design directions.

Wolfgang.



Re: Lucene does NOT use UTF-8.

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> I've been looking around... do you have a pointer to the source where just 
> the suffix is converted from UTF-8?
> 
> I understand the index format, but I'm not sure I understand the problem 
> that would be posed by the prefix length being a byte count.

TermBuffer.java:66

Things could work fine if the prefix length were a byte count.  A byte 
buffer could easily be constructed that contains the full byte sequence 
(prefix + suffix), and then this could be converted to a String.  The 
inefficiency would be if prefix were re-converted from UTF-8 for each 
term, e.g., in order to compare it to the target.  Prefixes are 
frequently longer than suffixes, so this could be significant.  Does 
that make sense?  I don't know whether it would actually be significant, 
although TermBuffer.java was added recently as a measurable performance 
enhancement, so this is performance critical code.
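
For those following along, the logic in question is roughly this (a
simplified paraphrase of TermBuffer.read(), not the exact source; the
growth helper is hypothetical):

int start  = input.readVInt();         // PrefixLength: chars shared with the previous term
int length = input.readVInt();         // SuffixLength
ensureCapacity(start + length);        // grow the reused char[] text if needed
input.readChars(text, start, length);  // decode only the suffix, in place after the prefix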

We need to stop discussing this in the abstract and start coding 
alternatives and benchmarking them.  Is java.nio.charset.CharsetEncoder 
fast enough?  Will moving things through CharBuffer and ByteBuffer be 
too slow?  Should Lucene keep maintaining its own UTF-8 implementation 
for performance?  I don't know, only some experiments will tell.

Doug



Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
I've been looking around... do you have a pointer to the source where just 
the suffix is converted from UTF-8?

I understand the index format, but I'm not sure I understand the problem 
that would be posed by the prefix length being a byte count.

-Yonik Now hiring -- http://tinyurl.com/7m67g

On 8/30/05, Doug Cutting <cu...@apache.org> wrote:
> 
> tjones@apache.org wrote:
> > How will the difference impact String memory allocations? Looking at
> > the String code, I can't see where it would make an impact.
> 
> I spoke a bit too soon. I should have looked at the code first. You're
> right, I don't think it would require more allocations.
> 
> When considering this byte-count versus character-count issue please
> note that it also arises elsewhere. The PrefixLength in the Term
> Dictionary section of the file format document is currently defined as a
> number of characters, not bytes.
> 
> http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary
> 
> Implementing this in terms of bytes may have performance implications,
> since, at first glance, the entire byte sequence would need to be
> converted from UTF-8 into the internal string representation for each
> term, rather than just the suffix. Does anyone see a way around that?
> 
> As for how we got to this point: I wrote Lucene's UTF-8 reading and
> writing code in 1998, back when Unicode still had fewer than 2^16
> characters. It's surprising that it has lasted this long without anyone
> noticing!
> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Lucene does NOT use UTF-8.

Posted by Doug Cutting <cu...@apache.org>.
tjones@apache.org wrote:
> How will the difference impact String memory allocations?  Looking at 
> the String code, I can't see where it would make an impact.

I spoke a bit too soon.  I should have looked at the code first.  You're 
right, I don't think it would require more allocations.

When considering this byte-count versus character-count issue please 
note that it also arises elsewhere.  The PrefixLength in the Term 
Dictionary section of the file format document is currently defined as a 
number of characters, not bytes.

http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary

Implementing this in terms of bytes may have performance implications, 
since, at first glance, the entire byte sequence would need to be 
converted from UTF-8 into the internal string representation for each 
term, rather than just the suffix.  Does anyone see a way around that?

As for how we got to this point: I wrote Lucene's UTF-8 reading and 
writing code in 1998, back when Unicode still had fewer than 2^16 
characters.  It's surprising that it has lasted this long without anyone 
noticing!

Doug



Re: Lucene does NOT use UTF-8.

Posted by tj...@apache.org.
Doug,

How will the difference impact String memory allocations?  Looking at the 
String code, I can't see where it would make an impact.

Tim


>I would argue that the length written be the number of characters in the 
>string, rather than the number of bytes written, since that can minimize 
>string memory allocations.
>
>>I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.
>
>Doug




Re: Lucene does NOT use UTF-8.

Posted by Ken Krugler <kk...@transpac.com>.
>Ken Krugler wrote:
>>The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would 
>be a good start.  This file must be read before any others.  Its 
>current version is -1 and would become -2.  (All positive values are 
>version 0, for back-compatibility.)  Implementations can be modified 
>to pass the version around if they wish to be back-compatible, or 
>they can simply throw exceptions for old format indexes.

After looking at it a bit more, I think there's no problem w/having 
the new code read both UTF-8 and Java modified UTF-8, and always 
write correct UTF-8. So the only compatibility issue would be new 
Lucene indexes w/non-BMP characters being processed by older versions 
of Lucene (or ports that weren't updated).

>I would argue that the length written be the number of characters in 
>the string, rather than the number of bytes written, since that can 
>minimize string memory allocations.

Agreed, though just to clarify, it's the number of UTF-16 code units 
(Java chars), not the number of Unicode code points (Unicode 
characters).

>>I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.

No problem. I was worried that the discussion Marvin & I were having 
was turning into a two person IM chat via email.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
> Sure you can. Do a "tell" to get the position. Write any number.


The representation of the number is variable sized... you can't use a 
placeholder.
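
For reference, a VInt packs seven bits per byte and uses the high bit as a
"more bytes follow" flag, so its width depends on the value -- there is no
fixed-size slot to reserve. A sketch of the writer:

void writeVInt(int i) throws IOException {
  while ((i & ~0x7F) != 0) {                 // more than 7 bits remain
    writeByte((byte) ((i & 0x7F) | 0x80));   // emit 7 bits, set continuation bit
    i >>>= 7;
  }
  writeByte((byte) i);                       // last byte, high bit clear
}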

-Yonik Now hiring -- http://tinyurl.com/7m67g 


Re: Lucene does NOT use UTF-8.

Posted by DM Smith <dm...@gmail.com>.

Ken Krugler wrote:

>> I think the VInt should be the number of bytes to be stored using 
>> the UTF-8
>> encoding.
>>
>> It is trivial to use the String methods identified before to do the
>> conversion. The String(char[]) allocates a new char array.
>>
>> For performance, you can use the actual CharSet encoding classes - 
>> avoiding
>> all of the lookups performed by the String class.
>
>
> Regardless of what underlying support is used, if you want to write 
> out the VInt value as UTF-8 bytes versus Java chars, the Java String 
> has to either be converted to UTF-8 in memory first, or pre-scanned. 
> The first is a memory hit, and the second is a performance hit. I 
> don't know the extent of either, but it's there.
>
> Note that since the VInt is a variable size, you can't write out the 
> bytes first and then fill in the correct value later.

Sure you can. Do a "tell" to get the position. Write any number. Write 
the text. Do another "tell" to note the position. Based on the 
difference between the two "tells", you have the length. Rewind to the 
first "tell" and write out the number. Then advance to the end.

I am not recommending this, but it can be done.

There may be other ways.




RE: Lucene does NOT use UTF-8.

Posted by Robert Engels <re...@ix.netcom.com>.
Since the buffer can be reused, seems that is the proper choice, and the
"increased memory" you cited originally is not an issue.

-----Original Message-----
From: Yonik Seeley [mailto:yseeley@gmail.com]
Sent: Tuesday, August 30, 2005 1:07 PM
To: java-dev@lucene.apache.org; rengels@ix.netcom.com
Subject: Re: Lucene does NOT use UTF-8.


On 8/30/05, Robert Engels <re...@ix.netcom.com> wrote:
>
> Not true. You do not need to pre-scan it.


What I previously wrote, with emphasis on key words added:
"one has to *either* buffer the entire string, *or* pre-scan it."

-Yonik Now hiring -- http://tinyurl.com/7m67g




Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
On 8/30/05, Robert Engels <re...@ix.netcom.com> wrote:
> 
> Not true. You do not need to pre-scan it.


What I previously wrote, with emphasis on key words added:
"one has to *either* buffer the entire string, *or* pre-scan it."

-Yonik Now hiring -- http://tinyurl.com/7m67g

RE: Lucene does NOT use UTF-8.

Posted by Robert Engels <re...@ix.netcom.com>.
Not true. You do not need to pre-scan it.

When you use a CharsetEncoder, it will write the bytes to a buffer (expanding
as needed). At the end of the encoding you can get the actual number of
bytes needed.

The pseudo-code, made concrete (encode() allocates here; reuse a
CharsetEncoder and ByteBuffer in the real thing):

CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(string));  // String -> UTF-8
writeVInt(bytes.remaining());                  // the byte length is now known
writeBytes(bytes.array(), bytes.remaining());  // blast copy of the encoded bytes

better yet, use NIO so you can pass the ByteBuffer directly.


-----Original Message-----
From: Yonik Seeley [mailto:yseeley@gmail.com]
Sent: Tuesday, August 30, 2005 12:56 PM
To: java-dev@lucene.apache.org; rengels@ix.netcom.com
Subject: Re: Lucene does NOT use UTF-8.


> I think you guys are WAY overcomplicating things, or you just don't know
> enough about the Java class libraries.


People were just pointing out that if the vint isn't String.length(), then
one has to either buffer the entire string, or pre-scan it.

It's a valid point, and CharsetEncoder doesn't change that.

 -Yonik Now hiring -- http://tinyurl.com/7m67g




Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
> I think you guys are WAY overcomplicating things, or you just don't know
> enough about the Java class libraries.


People were just pointing out that if the vint isn't String.length(), then 
one has to either buffer the entire string, or pre-scan it.

It's a valid point, and CharsetEncoder doesn't change that.

 -Yonik Now hiring -- http://tinyurl.com/7m67g

RE: Lucene does NOT use UTF-8.

Posted by Robert Engels <re...@ix.netcom.com>.
A bit more clarity...

Using CharBuffer and ByteBuffer allows for easy reuse and expansion. You
also need to use the CharsetDecoder class for the reading side.

-----Original Message-----
From: Robert Engels [mailto:rengels@ix.netcom.com]
Sent: Tuesday, August 30, 2005 12:40 PM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


I think you guys are WAY overcomplicating things, or you just don't know
enough about the Java class libraries.

If you use the java.nio.charset.CharsetEncoder class, then you can reuse the
byte[] array, and then it is a simple write of the length, and a blast copy
of the required number of bytes to the OutputStream (which will either fit
or expand its byte[]). You can perform all of this WITHOUT creating new
byte[] or char[] (as long as the existing one is large enough to fit the
encoded/decoded data).

There is no need to use any sort of file position mark/reset stuff.

R




-----Original Message-----
From: Ken Krugler [mailto:kkrugler@transpac.com]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


>I think the VInt should be the number of bytes to be stored using the UTF-8
>encoding.
>
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>
>For performance, you can use the actual CharSet encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write
out the VInt value as UTF-8 bytes versus Java chars, the Java String
has to either be converted to UTF-8 in memory first, or pre-scanned.
The first is a memory hit, and the second is a performance hit. I
don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the
bytes first and then fill in the correct value later.

-- Ken


>-----Original Message-----
>From: Doug Cutting [mailto:cutting@apache.org]
>Sent: Monday, August 29, 2005 4:24 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Lucene does NOT use UTF-8.
>
>
>Ken Krugler wrote:
>>  The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would be a
>good start.  This file must be read before any others.  Its current
>version is -1 and would become -2.  (All positive values are version 0,
>for back-compatibility.)  Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>
>>  I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.
>
>Doug


--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



RE: Lucene does NOT use UTF-8.

Posted by Robert Engels <re...@ix.netcom.com>.
I think you guys are WAY overcomplicating things, or you just don't know
enough about the Java class libraries.

If you use the java.nio.charset.CharsetEncoder class, then you can reuse the
byte[] array, and then it is a simple write of the length, and a blast copy
of the required number of bytes to the OutputStream (which will either fit
or expand its byte[]). You can perform all of this WITHOUT creating new
byte[] or char[] (as long as the existing one is large enough to fit the
encoded/decoded data).

There is no need to use any sort of file position mark/reset stuff.

R




-----Original Message-----
From: Ken Krugler [mailto:kkrugler@transpac.com]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


>I think the VInt should be the number of bytes to be stored using the UTF-8
>encoding.
>
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>
>For performance, you can use the actual CharSet encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write
out the VInt value as UTF-8 bytes versus Java chars, the Java String
has to either be converted to UTF-8 in memory first, or pre-scanned.
The first is a memory hit, and the second is a performance hit. I
don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the
bytes first and then fill in the correct value later.

-- Ken


>-----Original Message-----
>From: Doug Cutting [mailto:cutting@apache.org]
>Sent: Monday, August 29, 2005 4:24 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Lucene does NOT use UTF-8.
>
>
>Ken Krugler wrote:
>>  The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would be a
>good start.  This file must be read before any others.  Its current
>version is -1 and would become -2.  (All positive values are version 0,
>for back-compatibility.)  Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>
>>  I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.
>
>Doug


--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



RE: Lucene does NOT use UTF-8.

Posted by Ken Krugler <kk...@transpac.com>.
>I think the VInt should be the number of bytes to be stored using the UTF-8
>encoding.
>
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>
>For performance, you can use the actual CharSet encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write 
out the VInt value as UTF-8 bytes versus Java chars, the Java String 
has to either be converted to UTF-8 in memory first, or pre-scanned. 
The first is a memory hit, and the second is a performance hit. I 
don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the 
bytes first and then fill in the correct value later.
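
The pre-scan could look roughly like this (a hypothetical helper, not
existing Lucene code): one pass over the string computes the UTF-8 byte
count without allocating anything.

static int utf8Length(String s) {
  int bytes = 0;
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (c < 0x80)
      bytes += 1;
    else if (c < 0x800)
      bytes += 2;
    else if (Character.isHighSurrogate(c) && i + 1 < s.length()
             && Character.isLowSurrogate(s.charAt(i + 1))) {
      bytes += 4;  // a valid surrogate pair becomes one 4-byte sequence
      i++;         // the low surrogate is consumed too
    } else
      bytes += 3;
  }
  return bytes;
}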

-- Ken


>-----Original Message-----
>From: Doug Cutting [mailto:cutting@apache.org]
>Sent: Monday, August 29, 2005 4:24 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Lucene does NOT use UTF-8.
>
>
>Ken Krugler wrote:
>>  The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would be a
>good start.  This file must be read before any others.  Its current
>version is -1 and would become -2.  (All positive values are version 0,
>for back-compatibility.)  Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>
>>  I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.
>
>Doug


-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



RE: Lucene does NOT use UTF-8.

Posted by Robert Engels <re...@ix.netcom.com>.
I think the VInt should be the number of bytes to be stored using the UTF-8
encoding.

It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.

For performance, you can use the actual CharSet encoding classes - avoiding
all of the lookups performed by the String class.

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler wrote:
> The remaining issue is dealing with old-format indexes.

I think that revving the version number on the segments file would be a
good start.  This file must be read before any others.  Its current
version is -1 and would become -2.  (All positive values are version 0,
for back-compatibility.)  Implementations can be modified to pass the
version around if they wish to be back-compatible, or they can simply
throw exceptions for old format indexes.

I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
string memory allocations.

> I'm going to take this off-list now [ ... ]

Please don't.  It's better to have a record of the discussion.

Doug



Re: Lucene does NOT use UTF-8.

Posted by Yonik Seeley <ys...@gmail.com>.
The temporary char[] buffer is cached per InputStream instance, so the extra 
memory allocation shouldn't be a big deal. One could also use 
String(byte[],offset,len,"UTF-8"), and that creates a char[] that is used 
directly by the string instead of being copied. It remains to be seen how 
fast the native java char converter is though.

I like the idea of the length being the number of bytes... it encapsulates 
the content in case you want to rapidly skip over it (or rapidly copy it). 
It's more future proof w.r.t. alternate encodings (or binary), and if it had 
been number of bytes from the start, it wouldn't have to be changed now.
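
The rapid skip falls out directly once the count is bytes; against Lucene's
InputStream primitives it would be just (a sketch):

int length = readVInt();           // byte length of the string
seek(getFilePointer() + length);   // skip it without decoding a single char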

-Yonik

On 8/29/05, Doug Cutting <cu...@apache.org> wrote:

> I would argue that the length written be the number of characters in the
> string, rather than the number of bytes written, since that can minimize
> string memory allocations.
>

Re: Lucene does NOT use UTF-8.

Posted by Doug Cutting <cu...@apache.org>.
Ken Krugler wrote:
> The remaining issue is dealing with old-format indexes.

I think that revving the version number on the segments file would be a 
good start.  This file must be read before any others.  Its current 
version is -1 and would become -2.  (All positive values are version 0, 
for back-compatibility.)  Implementations can be modified to pass the 
version around if they wish to be back-compatible, or they can simply 
throw exceptions for old format indexes.
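
A sketch of the check Doug describes (variable names hypothetical):

int first = input.readInt();   // first int of the segments file
if (first >= 0) {              // old-style index: no version was written,
  version = 0;                 // and this int is the counter itself
  counter = first;
} else {
  version = -first;            // -1 today, -2 for the new string encoding
  counter = input.readInt();
}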

I would argue that the length written be the number of characters in the 
string, rather than the number of bytes written, since that can minimize 
string memory allocations.

> I'm going to take this off-list now [ ... ]

Please don't.  It's better to have a record of the discussion.

Doug



Re: Lucene does NOT use UTF-8.

Posted by Ken Krugler <kk...@transpac.com>.
Hi Marvin,

Thanks for the detailed response. After spending a bit more time in 
the code, I think you're right - all strings seem to be funnelled 
through IndexOutput. The remaining issue is dealing with old-format 
indexes.

I'm going to take this off-list now, since I'm guessing most list 
readers aren't too interested in the on-going discussion. If anybody 
else would like to be copied, send me an email.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8

Posted by Andi Vajda <an...@osafoundation.org>.
>> If the rest of the world of Lucene ports followed suit with PyLucene  and 
>> did the GCJ/SWIG thing, we'd have no problems :)  What are the 
>> disadvantages to following this model with Plucene?
>> 
>> 
> Some parts of the Lucene API require subclassing (e. g., Analyzer) and SWIG 
> does support cross-language polymorphism only for a few languages, notably 
> Python and Java but not for Perl. Noticing the smiley I won't mention the 
> zillion other reasons not to use the "GCJ/SWIG thing".

Yes, that's true, Java Lucene requires a bunch of subclassing to truly shine 
in any sizable application. I didn't use SWIG's director feature to implement 
extension but a more or less hand-carved SWIG-in-reverse trick that can easily 
be reproduced by other such SWIG-based ports.
See http://svn.osafoundation.org/pylucene/trunk/README for more details...

Andi..




Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
[snip]

>The surrogate pair problem is another matter entirely. First of all, 
>lets see if I do understand the problem correctly: Some unicode 
>characters can be represented by one codepoint outside the BMP (i. 
>e., not with 16 bits) and alternatively with two codepoints, both of 
>them in the 16-bit range.

A Unicode character has a code point, which is a scalar value in the 
range U+0000 to U+10FFFF. The code point for every character in the 
Unicode character set will fall in this range.

There are Unicode encoding schemes, which specify how Unicode code 
point values are serialized. Examples include UTF-8, UTF-16LE, 
UTF-16BE, UTF-32, UTF-7, etc.

The UTF-16 (big or little endian) encoding scheme uses two code units 
(16-bit values) to encode Unicode characters with code point values > 
U+0FFFF.

>According to Marvin's explanations, the Unicode standard requires 
>these characters to be represented as "the one" codepoint in UTF-8, 
>resulting in a 4-, 5-, or 6-byte encoding for that character.

Since the Unicode code point range is constrained to 
U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes.

>But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
>range cannot be represented as chars.  That is, the 
>in-memory-representation still requires the use of the surrogate 
>pairs.  Therefore, writing consists of translating the surrogate 
>pair to the >16bit representation of the same character and then 
>algorithmically encoding that.  Reading is exactly the reverse 
>process.

Yes. Writing requires that you combine the two surrogate characters 
into a Unicode code point, then converting that value into the UTF-8 
4 byte sequence.

>Adding code to handle the 4 to 6 byte encodings to the 
>readChars/writeChars method is simple, but how do you do the mapping 
>from surrogate pairs to the chars they represent? Is there an 
>algorithm for doing that except for table lookups or huge switch 
>statements?

It's easy, since U+D800...U+DBFF is defined as the range for the high 
(most significant) surrogate, and U+DC00...U+DFFF is defined as the 
range for the low (least significant) surrogate.
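
In code the mapping is plain arithmetic (a standalone sketch; Java 5 also
provides Character.toCodePoint() and Character.toChars() for the same job):

// combine a surrogate pair into a code point...
static int toCodePoint(char high, char low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

// ...and split a supplementary code point back into a pair
static char[] toSurrogates(int codePoint) {
  int v = codePoint - 0x10000;
  return new char[] {
    (char) (0xD800 + (v >> 10)),    // high surrogate
    (char) (0xDC00 + (v & 0x3FF))   // low surrogate
  };
}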

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8

Posted by Ronald Dauster <rp...@ronald-dauster.de>.
Erik Hatcher wrote:

> On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>
>>> I'm not familiar with UTF-8 enough to follow the details of this
>>> discussion.  I hope other Lucene developers are, so we can resolve  
>>> this
>>> issue.... anyone raising a hand?
>>>
>>
>> I could, but recent posts makes me think this is heading towards a  
>> religious debate :)
>
>
> Ken - you mentioned taking the discussion off-line in a previous  
> post.  Please don't.  Let's keep it alive on java-dev until we have a  
> resolution to it.
>
I'd also like to follow this thread.

>> I think the following statements are all true:
>>
>> a. Using UTF-8 for strings would make it easier for Lucene indexes  
>> to be used by other implementations besides the reference Java  version.
>>
>> b. It would be easy to tweak Lucene to read/write conformant UTF-8  
>> strings.
>
>
> What, if any, performance impact would changing Java Lucene in this  
> regard have?   (I realize this is rhetorical at this point, until a  
> solution is at hand)
>
Looking at the source of 1.4.3, fixing the NUL character encoding is 
trivial for writing and reading already works for both the standard and 
the java-style encoding. Not much work and absolutely no performance 
impact here.

The surrogate pair problem is another matter entirely. First of all, 
lets see if I do understand the problem correctly: Some unicode 
characters can be represented by one codepoint outside the BMP (i. e., 
not with 16 bits) and alternatively with two codepoints, both of them in 
the 16-bit range. According to Marvin's explanations, the Unicode 
standard requires these characters to be represented as "the one" 
codepoint in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that 
character.

But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
range cannot be represented as chars.  That is, the 
in-memory-representation still requires the use of the surrogate pairs.  
Therefore, writing consists of translating the surrogate pair to the >16-bit
representation of the same character and then algorithmically encoding that.
Reading is exactly the reverse process.

Adding code to handle the 4 to 6 byte encodings to the 
readChars/writeChars method is simple, but how do you do the mapping 
from surrogate pairs to the chars they represent? Is there an algorithm 
for doing that except for table lookups or huge switch statements?

>> c. The hard(er) part would be backwards compatibility with older  
>> indexes. I haven't looked at this enough to really know, but one  
>> example is the compound file (xx.cfs) format...I didn't see a  
>> version number, and it contains strings.
>
>
> I don't know the gory details, but we've made compatibility breaking  
> changes in the past and the current version of Lucene can open older  
> formats, but only write the most current format.  I suspect it could  
> be made to be backwards compatible.  Worst case, we break  
> compatibility in 2.0.
>
I believe backward compatibility is the easy part and comes for free.  
As I mentioned above, reading the "correct" NUL encoding already works 
and the non-BMP characters will have to be represented as surrogate 
pairs internally anyway.  So there is no problem with reading the old 
encoding and there is nothing wrong with still using or reading the 
surrogate pairs, only that they would not be written. Even indices with 
mixed segments are not a problem. 

Given that the CompoundFileReader/Writer use a 
lucene.store.OutputStream/InputStream for their FileEntries, they would 
also be able to read older files but potentially write incompatible 
files.  OTOH, when used inside lucene, the filenames do not contain NULs 
of non-BMP chars.

But: Is the compound file format supposed to be "interoperable"? Which 
formats are?

> [...]
>
>> What's unclear to me (not being a Perl, Python, etc jock) is how  
>> much easier it would be to get these other implementations working  
>> with Lucene, following a change to UTF-8. So I can't comment on the  
>> return on time required to change things.
>>
>> [...]
>
>
> PyLucene is literally the Java version of Lucene underneath (via GCJ/ 
> SWIG), so no worries there.  CLucene would need to be changed, as  
> well as DotLucene and the other ports out there.
>
> If the rest of the world of Lucene ports followed suit with PyLucene  
> and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
> disadvantages to following this model with Plucene?
>
>
Some parts of the Lucene API require subclassing (e. g., Analyzer) and 
SWIG does support cross-language polymorphism only for a few languages, 
notably Python and Java but not for Perl. Noticing the smiley I won't 
mention the zillion other reasons not to use the "GCJ/SWIG thing".

Ronald



Re: Lucene does NOT use UTF-8

Posted by Steven Rowe <sa...@syr.edu>.
DM Smith wrote:
> Daniel Naber wrote:
>> But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to 
>> be the case.
>>
> UTF-16 is a fixed 2 byte/char representation.

Except when it's not.  I.e., above the BMP.

 From the Unicode 4.0 standard 
<http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf>:

    In the UTF-16 encoding form, code points in the
    range U+0000..U+FFFF are represented as a single
    16-bit code unit; code points in the supplementary
    planes, in the range U+10000..U+10FFFF, are
    instead represented as pairs of 16-bit code units.
    These pairs of special code units are known as
    surrogate pairs.



Re: Lucene does NOT use UTF-8

Posted by Tom White <to...@gmail.com>.
On 8/30/05, Ken Krugler <kk...@transpac.com> wrote:
> 
> >Daniel Naber wrote:
> >
> >>On Monday 29 August 2005 19:56, Ken Krugler wrote:
> >>
> >>>"Lucene writes strings as a VInt representing the length of the
> >>>string in Java chars (UTF-16 code units), followed by the character
> >>>data."
> >>>
> >>>
> >>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem
> >>to be the case.
> >>
> >UTF-16 is a fixed 2 byte/char representation.
> 
> I hate to keep beating this horse, but I want to emphasize that it's
> 2 bytes per Java char (or UTF-16 code unit), not Unicode character
> (code point).


There's more horse beating on Java and Unicode 4 in this blog entry: 
http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html.

Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
>Daniel Naber wrote:
>
>>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>>
>>>"Lucene writes strings as a VInt representing the length of the
>>>string in Java chars (UTF-16 code units), followed by the character
>>>data."
>>>   
>>>
>>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem 
>>to be the case.
>>
>UTF-16 is a fixed 2 byte/char representation.

I hate to keep beating this horse, but I want to emphasize that it's 
2 bytes per Java char (or UTF-16 code unit), not Unicode character 
(code point).

>But one cannot equate the character count with the byte count. Each 
>Java char is 2 bytes. I think all that is being said is that the 
>VInt is equal to str.length() as java gives it.
>
>On an unrelated project we are determining whether we should use a 
>denormalized (letter followed by an accents) or a normalized form 
>(letter with accents) of accented characters as we present the text 
>to a GUI. We have found that font support varies but appears to be 
>better for denormalized. This is not an issue for storage, as it can 
>be transformed before it goes to screen. However, it is useful to 
>know which form it is in.
>
>The reason I mention this is that I seem to remember that the length 
>of the java string varies with the representation.

String.length() is the number of Java chars, which always uses 
UTF-16. If you normalize text, then yes that can change the number of 
code units and thus the length of the string, but so can doing any 
kind of text munging (e.g. replacement) operation on characters in 
the string.

>So then the count would not be the number of glyphs that the user 
>sees. Please correct me if I am wrong.

All kinds of mxn mappings (both at the layout engine level, and using 
font tables) are possible when going from Unicode characters to 
display glyphs. Plus zero-width left-kerning glyphs would also alter 
the relationship between # of visual "characters" and backing store 
characters.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by DM Smith <dm...@gmail.com>.
Daniel Naber wrote:

>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>  
>
>>"Lucene writes strings as a VInt representing the length of the
>>string in Java chars (UTF-16 code units), followed by the character
>>data."
>>    
>>
>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the 
>case.
>
UTF-16 is a fixed 2 byte/char representation. But one cannot equate the 
character count with the byte count. Each Java char is 2 bytes. I think 
all that is being said is that the VInt is equal to str.length() as Java 
gives it.

On an unrelated project we are determining whether we should use a 
denormalized (letter followed by accents) or a normalized form 
(letter with accents) of accented characters as we present the text to a 
GUI. We have found that font support varies but appears to be better for 
denormalized. This is not an issue for storage, as it can be transformed 
before it goes to screen. However, it is useful to know which form it is in.

The reason I mention this is that I seem to remember that the length of 
the Java string varies with the representation. So then the count would 
not be the number of glyphs that the user sees. Please correct me if I 
am wrong.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>
>>  "Lucene writes strings as a VInt representing the length of the
>>  string in Java chars (UTF-16 code units), followed by the character
>>  data."
>
>But wouldn't UTF-16 mean 2 bytes per character?

Yes, UTF-16 means two bytes per code unit. A Unicode character (code 
point) is encoded as either one or two UTF-16 code units.

>That doesn't seem to be the
>case.

The case where? You mean in what actually gets written out?

String.length() is the length in terms of Java chars, which means 
UTF-16 code units (well, sort of...see below). Looking at the code, 
IndexOutput.writeString() calls writeVInt() with the string length.

One related note. Java 1.4 supports Unicode 3.0, while Java 5.0 
supports Unicode 4.0. It was in Unicode 3.1 that supplementary 
characters (code points > U+FFFF, i.e. outside of the BMP) were added, 
and the UTF-16 encoding formalized.

So I think the issue of non-BMP characters is currently a bit 
esoteric for Lucene, since I'm guessing there are other places in the 
code (e.g. JDK calls used by Lucene) where non-BMP characters won't 
be properly handled. Though some quick tests indicate that there is 
some knowledge of surrogate pairs in 1.4 (e.g. converting a String 
w/surrogate pairs to UTF-8 does the right thing).
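
For the curious, a quick sketch that checks that last claim (the class
name is illustrative; Character.toChars needs Java 5):

    import java.io.UnsupportedEncodingException;

    public class Utf8RoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String s = new String(Character.toChars(0x1D11E)); // one supplementary char
            byte[] utf8 = s.getBytes("UTF-8");
            // A conformant converter emits one 4-byte sequence: f0 9d 84 9e
            for (int i = 0; i < utf8.length; i++) {
                System.out.print(Integer.toHexString(utf8[i] & 0xFF) + " ");
            }
        }
    }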

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by Daniel Naber <lu...@danielnaber.de>.
On Monday 29 August 2005 19:56, Ken Krugler wrote:

> "Lucene writes strings as a VInt representing the length of the
> string in Java chars (UTF-16 code units), followed by the character
> data."

But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the 
case.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
Erik Hatcher wrote...

> What, if any, performance impact would changing Java Lucene in this  
> regard have?

And Ken Krugler wrote...

> "Lucene writes strings as a VInt representing the length of the  
> string in Java chars (UTF-16 code units), followed by the character  
> data."

I had been working under the assumption that the value of the VInt  
would be changed as well.  It seemed logical that if strings were  
encoded with legal UTF-8, the count at the head should indicate  
either 1) the number of UTF-8 characters in the string, or 2) the  
number of bytes occupied by the encoded string.

Do either of those, and more substantial changes to Java Lucene would  
be required.  I expect that the impact on performance could be made  
negligible for the first option, but the question of backwards  
compatibility would become a lot messier.

It simply had not occurred to me to keep the VInt as is.  If you do  
that, this becomes a much more localized problem.

For Plucene, I'll avoid the gory details and just say that having the  
VInt continue to represent UTF-16 code units limits the availability  
of certain options, but doesn't cause major inefficiencies.  Now that  
we know that's what it does, we can work with it.  A transition to  
always-legal UTF-8 obviates the need to scan for and fix the edge  
cases, and addresses my main concern.
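
For reference, the VInt itself is simple. Here is a standalone sketch
of the encoding as the file formats document describes it -- seven bits
per byte, low-order bits first, high bit set on all but the last byte.
This is a demo, not Lucene's actual OutputStream method:

    import java.io.ByteArrayOutputStream;

    public class VIntDemo {
        static void writeVInt(ByteArrayOutputStream out, int i) {
            while ((i & ~0x7F) != 0) {        // more than 7 bits remain
                out.write((i & 0x7F) | 0x80); // low 7 bits, continuation bit set
                i >>>= 7;
            }
            out.write(i);                     // last byte, high bit clear
        }

        public static void main(String[] args) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeVInt(out, 300);              // 300 encodes as AC 02
            byte[] b = out.toByteArray();
            for (int i = 0; i < b.length; i++) {
                System.out.print(Integer.toHexString(b[i] & 0xFF) + " ");
            }
        }
    }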

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
>On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>>>I'm not familiar with UTF-8 enough to follow the details of this
>>>discussion.  I hope other Lucene developers are, so we can resolve this
>>>issue.... anyone raising a hand?
>>
>>I could, but recent posts make me think this is heading towards a 
>>religious debate :)
>
>Ken - you mentioned taking the discussion off-line in a previous 
>post.  Please don't.  Let's keep it alive on java-dev until we have 
>a resolution to it.
>
>>I think the following statements are all true:
>>
>>a. Using UTF-8 for strings would make it easier for Lucene indexes 
>>to be used by other implementations besides the reference Java 
>>version.
>>
>>b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.
>
>What, if any, performance impact would changing Java Lucene in this 
>regard have?   (I realize this is rhetorical at this point, until a 
>solution is at hand)

Almost zero. A tiny hit when reading/writing surrogate pairs, to 
properly encode them as a 4 byte UTF-8 sequence versus two 3-byte 
sequences.
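
The arithmetic behind that hit is small. A sketch of the
pair-to-code-point step, with constants straight from the UTF-16
definition:

    public class SurrogateMath {
        public static void main(String[] args) {
            char hi = 0xD834, lo = 0xDD1E; // the surrogate pair for U+1D11E
            int cp = ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
            System.out.println(Integer.toHexString(cp)); // 1d11e
            // Legal UTF-8 spends 4 bytes on this code point; writing each
            // surrogate as its own 3-byte sequence ("Modified UTF-8"
            // behavior) spends 6 and produces illegal UTF-8.
        }
    }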

>>c. The hard(er) part would be backwards compatibility with older 
>>indexes. I haven't looked at this enough to really know, but one 
>>example is the compound file (xx.cfs) format...I didn't see a 
>>version number, and it contains strings.
>
>I don't know the gory details, but we've made compatibility breaking 
>changes in the past and the current version of Lucene can open older 
>formats, but only write the most current format.  I suspect it could 
>be made to be backwards compatible.  Worst case, we break 
>compatibility in 2.0.

Ronald is correct in that it would be easy to make the reader handle 
both "Java modified UTF-8" and UTF-8, and the writer always output 
UTF-8. So the only problem would be if older versions of Lucene (or 
maybe CLucene) wound up trying to read strings that contained 4-byte 
UTF-8 sequences, as they wouldn't know how to convert this into two 
UTF-16 Java chars.

Since 4-byte UTF-8 sequences are only for characters outside of the 
BMP, and these are rare, it seems like an OK thing to do, but that's 
just my uninformed view.

>>d. The documentation could be clearer on what is meant by the 
>>"string length", but this is a trivial change.
>
>That change was made by Daniel soon after this discussion began.

Daniel changed the definition of Chars, but the String section still 
needs to be clarified. Currently it says:

"Lucene writes strings as a VInt representing the length, followed by 
the character data".

It should read:

"Lucene writes strings as a VInt representing the length of the 
string in Java chars (UTF-16 code units), followed by the character 
data."

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>> I'm not familiar with UTF-8 enough to follow the details of this
>> discussion.  I hope other Lucene developers are, so we can resolve  
>> this
>> issue.... anyone raising a hand?
>>
>
> I could, but recent posts make me think this is heading towards a  
> religious debate :)

Ken - you mentioned taking the discussion off-line in a previous  
post.  Please don't.  Let's keep it alive on java-dev until we have a  
resolution to it.

> I think the following statements are all true:
>
> a. Using UTF-8 for strings would make it easier for Lucene indexes  
> to be used by other implementations besides the reference Java  
> version.
>
> b. It would be easy to tweak Lucene to read/write conformant UTF-8  
> strings.

What, if any, performance impact would changing Java Lucene in this  
regard have?   (I realize this is rhetorical at this point, until a  
solution is at hand)

> c. The hard(er) part would be backwards compatibility with older  
> indexes. I haven't looked at this enough to really know, but one  
> example is the compound file (xx.cfs) format...I didn't see a  
> version number, and it contains strings.

I don't know the gory details, but we've made compatibility breaking  
changes in the past and the current version of Lucene can open older  
formats, but only write the most current format.  I suspect it could  
be made to be backwards compatible.  Worst case, we break  
compatibility in 2.0.

> d. The documentation could be clearer on what is meant by the  
> "string length", but this is a trivial change.

That change was made by Daniel soon after this discussion began.

> What's unclear to me (not being a Perl, Python, etc jock) is how  
> much easier it would be to get these other implementations working  
> with Lucene, following a change to UTF-8. So I can't comment on the  
> return on time required to change things.
>
> I'm also curious about the existing CLucene & PyLucene ports. Would  
> they also need to be similarly modified, with the proposed changes?

PyLucene is literally the Java version of Lucene underneath (via GCJ/ 
SWIG), so no worries there.  CLucene would need to be changed, as  
well as DotLucene and the other ports out there.

If the rest of the world of Lucene ports followed suit with PyLucene  
and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
disadvantages to following this model with Plucene?

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by Andi Vajda <an...@osafoundation.org>.
> I'm also curious about the existing CLucene & PyLucene ports. Would they also 
> need to be similarly modified, with the proposed changes?

PyLucene is built from the Java Lucene source code, so any change made to Java 
Lucene is getting reflected in PyLucene once it gets refreshed. The next 
refresh is to be done shortly after Java Lucene 1.9 is released.

Andi..


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
>I'm not familiar with UTF-8 enough to follow the details of this
>discussion.  I hope other Lucene developers are, so we can resolve this
>issue.... anyone raising a hand?

I could, but recent posts make me think this is heading towards a 
religious debate :)

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to 
be used by other implementations besides the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

c. The hard(er) part would be backwards compatibility with older 
indexes. I haven't looked at this enough to really know, but one 
example is the compound file (xx.cfs) format...I didn't see a version 
number, and it contains strings.

d. The documentation could be clearer on what is meant by the "string 
length", but this is a trivial change.

What's unclear to me (not being a Perl, Python, etc jock) is how much 
easier it would be to get these other implementations working with 
Lucene, following a change to UTF-8. So I can't comment on the return 
on time required to change things.

I'm also curious about the existing CLucene & PyLucene ports. Would 
they also need to be similarly modified, with the proposed changes?

One final point. I doubt people have been adding strings with 
embedded nulls, and text outside of the Unicode BMP is also very 
rare. So _most_ Lucene indexes only contain valid UTF-8 data. It's 
only the above two edge cases that create an interoperability problem.
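
A sketch of a scan that would flag those two edge cases in a
Lucene-written byte buffer (the method name is illustrative): the
two-byte null is literally C0 80, and a directly encoded surrogate
always begins ED A0..BF, a prefix that legal UTF-8 never produces:

    public class EdgeCaseScan {
        // True if the buffer contains either "Modified UTF-8" quirk.
        static boolean needsFixup(byte[] buf, int len) {
            for (int i = 0; i + 1 < len; i++) {
                int b0 = buf[i] & 0xFF, b1 = buf[i + 1] & 0xFF;
                if (b0 == 0xC0 && b1 == 0x80) return true;          // two-byte null
                if (b0 == 0xED && (b1 & 0xE0) == 0xA0) return true; // encoded surrogate
            }
            return false;
        }

        public static void main(String[] args) {
            byte[] modified = { 'f', 'o', 'o', (byte) 0xC0, (byte) 0x80 };
            System.out.println(needsFixup(modified, modified.length)); // true
        }
    }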

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I'm not familiar with UTF-8 enough to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve this
issue.... anyone raising a hand?

Otis

--- Marvin Humphrey <ma...@rectangular.com> wrote:

> Ken Krugler sent a reply to the user list.  In an effort to keep all
> the developers informed, I'm sending my reply to the developer list
> and including his entire original post below my sig.
> 
> Ken writes...
> 
>  > Since a null in the
>  > middle of a string is rare, as is a character outside of the BMP,
> a
>  > quick scan of the text should be sufficient to determine if it can
> be
>  > written as-is.
> 
> Let's see.  I think we are looking at two scans, (one index(), one
> regex), or a regex that uses alternation.  I strongly suspect two
> scans
> are faster.
> 
>      if (  (index($string, "\xC0\x80") != -1)
>         or ($string =~ /[\xF0-\xF7]/ ) # only exists in 4-byte UTF-8
>      ) {
>          # Process string...
>      }
> 
> That would tell us whether the string needed to be specially encoded
> for
> Java's sake on output.  Yes, I suspect that's considerably more
> efficient than always converting first to UTF-16 and then to
> "Modified
> UTF-8".
> 
> It's also completely unnecessary, as you'll see from the patch below,
> so I'm going to press ahead and make these XS ports of InputStream
> and
> OutputStream work with legal UTF-8.
> 
> It would actually make a lot more sense for Plucene if the integer at
> the head of a string measured *bytes* instead of either Unicode code
> points or Java chars.  Then it's just a straight up copy!  No
> scanning
> OR decoding required.
> 
> (Hmm... I wonder if there's a way to make Lucene work quickly if the
> VInt were redefined to be "length in bytes"...)
> 
> Speaking of which, the Lucene file formats document also says this...
> 
>      "Lucene writes strings as a VInt representing the length,  
> followed by
>      the character data."
> 
> The ambiguity of the word "length" in this sentence left me
> scratching
> my head.  Length in bytes or length in UTF-8 characters?  Of course
> the real answer is... neither. :\
> 
> It's length in Java chars, or, if you prefer to further Sun's
> disinformation campaign, ;) "Modified UTF-8 characters".  If the
> Lucene
> docs had stated "Java chars" explicitly, I would have had a better
> idea
> about why the value of that VInt is what it is -- a Java-specific
> quirk at odds with a widely-accepted standard -- and about what
> it was going to take to adhere to the spec.
> 
>  > I'd need to look at the code more, but using something other than
> the
>  > Java serialized format would probably incur a performance penalty
> for
>  > the Java implementation. Or at least make it harder to handle the
>  > strings using the standard Java serialization support.
> 
> I believe that the following true-UTF-8 replacement for the
> readChars function is at least as fast as the current implementation,
> unless your text contains characters outside the BMP.  It's
> incomplete,
> because my Java expertise is quite limited, but it should be
> conceptually sound.  The algo is adapted from C code supplied by the
> Unicode consortium.
> 
> http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
> 
>    static final byte[] TRAILING_BYTES_FOR_UTF8 = {
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
>        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
>    };
> 
>    public final void readChars(char[] buffer, int start, int length)
>         throws IOException {
>      int end = start + length; // No longer a final int.
>      for (int i = start; i < end; i++) {
>        int b = readByte();   // NOTE: b changed from byte to int.
>        switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
>          case 0:
>            buffer[i] = (char)(b & 0x7F);
>            break;
>          case 1:
>            buffer[i] = (char)(((b & 0x1F) << 6)
>              | (readByte() & 0x3F));
>            break;
>          case 2:
>            buffer[i] = (char)(((b & 0x0F) << 12)
>              | ((readByte() & 0x3F) << 6)
>              |  (readByte() & 0x3F));
>            break;
>          case 3:
>            int utf32 = (((b & 0x07) << 18)  // 4-byte lead byte is 11110xxx
>              | ((readByte() & 0x3F) << 12)
>              | ((readByte() & 0x3F) << 6)
>              |  (readByte() & 0x3F));
>            // These are just for illustration.
>            int firstSurrogate  = (utf32 >> 10) + 0xD7C0;
>            int secondSurrogate = (utf32 & 0x03FF) + 0xDC00;
>            // If the current buffer isn't long enough,
>            // create a new buffer with length one greater than
>            // the current buffer, copy the entire contents,
>            // enter the first surrogate, increment both i and end,
>            // enter the second surrogate.
>            // This is extremely inefficient, but also
>            // likely to be invoked extremely rarely.
>            // Problem: In Perl I'd do this with references, and
>            // in C I'd do it with pointers.  Not sure how to
>            // make it work in Java.
>            break;
>        }
>      }
>    }
> 
> 
> Initial benchmarking experiments appear to indicate negligible impact
> on performance.
> 
>  > So I doubt
>  > this would be a slam-dunk in the Lucene community.
> 
> I appreciate your willingness to at least weigh the matter, and I
> understand the potential reluctance.  Hopefully the comparable
> performance of the standards-compliant code above will render the
> issue
> moot, and the next release of Lucene will use legal UTF-8.
> 
> Best,
> 
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
> 
> ================================================================
> 
> From: Ken Krugler <kk...@transpac.com>
> Date: August 27, 2005 2:11:34 PM PDT
> To: java-user@lucene.apache.org
> Subject: Re: Lucene does NOT use UTF-8.
> Reply-To: java-user@lucene.apache.org
> 
> 
> > I've delved into the matter of Lucene and UTF-8 a little further,  
> > and I am discouraged by what I believe I've uncovered.
> >
> > Lucene should not be advertising that it uses "standard UTF-8" --  
> > or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.
> >
> 
> Unfortunately this is how Sun documents the format they use for  
> serialized strings.
> 
> 
> > The two distinguishing characteristics of "Modified UTF-8" are the 
> 
> > treatment of codepoints above the BMP (which are written as  
> > surrogate pairs), and the encoding of null bytes as 1100 0000 1000 
> 
> > 0000 rather than 0000 0000.  Both of these became illegal as of  
> > Unicode 3.1 (IIRC), because they are not shortest-form and non- 
> > shortest-form UTF-8 presents a security risk.
> >
> 
> For UTF-8 these were always invalid, but the standard wasn't very  
> clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs  
> encouraged some sloppy implementations.
> 
> 
> > The documentation should really state that Lucene stores strings in
>  
> > a Java-only adulteration of UTF-8,
> >
> 
> Yes, good point. I don't know who's in charge of that page, but it  
> should be fixed.
> 
> 
> > unsuitable for interchange.
> >
> 
> Other than as an internal representation for Java serialization.
> 
> 
> > Since Perl uses true shortest-form UTF-8 as its native encoding,  
> > Plucene would have to jump through two efficiency-killing hoops in 
> 
> > order to write files that would not choke Lucene: instead of  
> > writing out its true, legal UTF-8 directly, it would be necessary  
> > to first translate to UTF-16, then duplicate the Lucene encoding  
> > algorithm from OutputStream.  In theory.
> >
> 
> Actually I don't think it would be all that bad. Since a null in the 
> 
> middle of a string is rare, as is a character outside of the BMP, a  
> quick scan of the text should be sufficient to determine if it can be
>  
> written as-is.
> 
> The ICU project has C code that can be used to quickly walk a string.
>  
> I believe these would find/report such invalid code points, if you  
> use the safe (versus faster unsafe) versions.
> 
> 
> > Below you will find a simple Perl script which illustrates what  
> > happens when Perl encounters malformed UTF-8.  Run it (you need  
> > Perl 5.8 or higher) and you will see why even if I thought it was a
>  
> > good idea to emulate the Java hack for encoding "Modified UTF-8",  
> > trying to make it work in practice would be a nightmare.
> >
> > If Plucene were to write legal UTF-8 strings to its index files,  
> > Java Lucene would misbehave and possibly blow up any time a string 
> 
> > contained either a 4-byte character or a null byte.  On the flip  
> > side, Perl will spew warnings like crazy and possibly blow up  
> > whenever it encounters a Lucene-encoded null or surrogate pair.   
> > The potential blowups are due to the fact that Lucene and Plucene  
> > will not agree on how many characters a string contains, resulting 
> 
> > in overruns or underruns.
> >
> > I am hoping that the answer to this will be a fix to the encoding  
> > mechanism in Lucene so that it really does use legal UTF-8.  The  
> > most efficient way to go about this has not yet presented itself.
> >
> 
> I'd need to look at the code more, but using something other than the
>  
> Java serialized format would probably incur a performance penalty for
>  
> the Java implementation. Or at least make it harder to handle the  
> strings using the standard Java serialization support. So I doubt  
> this would be a slam-dunk in the Lucene community.
> 
> -- Ken
> 
> 
> 
> > #----------------------------------------
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> >
> > # illegal_null.plx -- Perl complains about non-shortest-form null.
> >
> > my $data = "foo\xC0\x80\n";
> >
> > open (my $virtual_filehandle, "+<:utf8", \$data);
> > print <$virtual_filehandle>;
> >
> 
> -- 
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by Marvin Humphrey <ma...@rectangular.com>.
Hello, Robert...

On Aug 28, 2005, at 7:50 PM, Robert Engels wrote:

> Sorry, but I think you are barking up the wrong tree... and your  
> tone is
> quite bizarre. My personal OPINION is that your "script" language  
> is an
> abomination, and anyone that develops in it is clearly hurting the
> advancement of all software - but that is another story, and  
> doesn't matter
> much to the discussion - in a similar fashion your choice of words is
> clearly not going to help matters.

My personal perspective is a utilitarian one: languages, platforms,  
they all come and go eventually, and in between a lot of stuff gets  
done.  I enjoy and appreciate Java (what I know of it), and I watched  
the Ruby/Java spat a little while ago with dismay.  The enmity is not  
returned.  :)

> It may be less efficient to decode in other languages, but I don't  
> think the
> original Lucene designers were too worried about the efficiencies  
> of other
> languages/platforms.

That may be the case.  I suppose we're about to find out how  
important the Lucene development community considers interchange.   
The phrase "standard UTF-8" in the documentation led me to believe  
that the intention was to deploy honest-to-goodness UTF-8.  In fact,  
as was pointed out, the early versions of the Unicode standard were  
not very clear.  Lucene was originally begun in 1998, and Unicode  
Corrigendum #1: "UTF-8 Shortest Form" wasn't released until 2001.  My  
best guess is that it was supposed to be legal UTF-8 and that the non- 
conformance is unintentional.

Otis Gospodnetic raised objections when the Plucene project made the  
decision to abandon index compatibility with Java Lucene.  I've been  
arguing that that decision ought to be reconsidered.  It will make it  
easier to achieve this shared goal of interoperability if Plucene  
does not have to go out of its way to defeat measures painstakingly  
put in place by the Perl5Porters team to ensure secure and robust  
Unicode support.

One of the reasons I have placed my own search engine project on hold  
was that I concluded I could not improve in a meaningful way on  
Lucene's file format.  It's really a marvelous piece of work.   
Perhaps it will become the TIFF of inverted index formats.  It seems  
to me that the Lucene project would benefit from having it widely  
adopted.  I'd like to help with that.

> Using String.getBytes("UTF-8"), and String.String(byte[],"UTF-8")  
> is all
> that is needed.

Thank you for the tip.  At first blush, I'm concerned that those may  
be difficult to make work with InputStream's readByte() without  
incurring a performance penalty, but if I'm wrong and it's six-of-one- 
half-dozen-of-another for Java Lucene, then if a change is going to  
be made, I'll argue for that one.  That would harmonize with the way  
binary field data is stored, assuming that I can trust that portion  
of the spec document. ;)

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: Lucene does NOT use UTF-8.

Posted by Robert Engels <re...@ix.netcom.com>.
Sorry, but I think you are barking up the wrong tree... and your tone is
quite bizarre. My personal OPINION is that your "script" language is an
abomination, and anyone that develops in it is clearly hurting the
advancement of all software - but that is another story, and doesn't matter
much to the discussion - in a similar fashion your choice of words is
clearly not going to help matters.

Just because Lucene uses a proprietary encoding that is efficient for Java,
does not make it non-portable. It is certainly not "Java only" by any
stretch - all you need to know is that a Java "character" is always 2 bytes.
It may be less efficient to decode in other languages, but I don't think the
original Lucene designers were too worried about the efficiencies of other
languages/platforms. The API documentation may have been inaccurate, but
there has been no attempt to "hide" the process - it is still completely
"open".

All that being said, it is trivial to make the VInt the number of bytes, and
use the built-in UTF-8 encoders/decoders available in Java.

Using String.getBytes("UTF-8"), and String.String(byte[],"UTF-8") is all
that is needed.
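
A sketch along those lines, under the assumption -- not Lucene's
current format -- that the VInt becomes a byte count; writeVInt is the
usual seven-bits-per-byte encoder:

    import java.io.ByteArrayOutputStream;
    import java.io.UnsupportedEncodingException;

    public class Utf8WriteString {
        static void writeVInt(ByteArrayOutputStream out, int i) {
            while ((i & ~0x7F) != 0) { out.write((i & 0x7F) | 0x80); i >>>= 7; }
            out.write(i);
        }

        // Hypothetical writeString: legal UTF-8 body, byte-count prefix.
        static void writeString(ByteArrayOutputStream out, String s)
                throws UnsupportedEncodingException {
            byte[] utf8 = s.getBytes("UTF-8");
            writeVInt(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeString(out, "abc");
            System.out.println(out.size()); // 4: one length byte plus three data bytes
        }
    }

The symmetric reader would decode the byte run with new String(bytes,
"UTF-8").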

R

-----Original Message-----
From: Marvin Humphrey [mailto:marvin@rectangular.com]
Sent: Sunday, August 28, 2005 8:57 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler sent a reply to the user list.  In an effort to keep all
the developers informed, I'm sending my reply to the developer list
and including his entire original post below my sig.

Ken writes...

 > Since a null in the
 > middle of a string is rare, as is a character outside of the BMP, a
 > quick scan of the text should be sufficient to determine if it can be
 > written as-is.

Let's see.  I think we are looking at two scans, (one index(), one
regex), or a regex that uses alternation.  I strongly suspect two scans
are faster.

     if (  (index($string, "\xC0\x80") != -1)
        or ($string =~ /[\xF0-\xF7]/ ) # only exists in 4-byte UTF-8
     ) {
         # Process string...
     }

That would tell us whether the string needed to be specially encoded for
Java's sake on output.  Yes, I suspect that's considerably more
efficient than always converting first to UTF-16 and then to "Modified
UTF-8".

It's also completely unnecessary, as you'll see from the patch below,
so I'm going to press ahead and make these XS ports of InputStream and
OutputStream work with legal UTF-8.

It would actually make a lot more sense for Plucene if the integer at
the head of a string measured *bytes* instead of either Unicode code
points or Java chars.  Then it's just a straight up copy!  No scanning
OR decoding required.

(Hmm... I wonder if there's a way to make Lucene work quickly if the
VInt were redefined to be "length in bytes"...)

Speaking of which, the Lucene file formats document also says this...

     "Lucene writes strings as a VInt representing the length,
followed by
     the character data."

The ambiguity of the word "length" in this sentence left me scratching
my head.  Length in bytes or length in UTF-8 characters?  Of course
the real answer is... neither. :\

It's length in Java chars, or, if you prefer to further Sun's
disinformation campaign, ;) "Modified UTF-8 characters".  If the Lucene
docs had stated "Java chars" explicitly, I would have had a better idea
about why the value of that VInt is what it is -- a Java-specific
quirk at odds with a widely-accepted standard -- and about what
it was going to take to adhere to the spec.

 > I'd need to look at the code more, but using something other than the
 > Java serialized format would probably incur a performance penalty for
 > the Java implementation. Or at least make it harder to handle the
 > strings using the standard Java serialization support.

I believe that the following true-UTF-8 replacement for the
readChars function is at least as fast as the current implementation,
unless your text contains characters outside the BMP.  It's incomplete,
because my Java expertise is quite limited, but it should be
conceptually sound.  The algo is adapted from C code supplied by the
Unicode consortium.

http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c

   static final byte[] TRAILING_BYTES_FOR_UTF8 = {
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
   };

   public final void readChars(char[] buffer, int start, int length)
        throws IOException {
     int end = start + length; // No longer a final int.
     for (int i = start; i < end; i++) {
       int b = readByte();   // NOTE: b changed from byte to int.
       switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
         case 0:
           buffer[i] = (char)(b & 0x7F);
           break;
         case 1:
           buffer[i] = (char)(((b & 0x1F) << 6)
             | (readByte() & 0x3F));
           break;
         case 2:
           buffer[i] = (char)(((b & 0x0F) << 12)
             | ((readByte() & 0x3F) << 6)
             |  (readByte() & 0x3F));
           break;
         case 3:
           int utf32 = (((b & 0x07) << 18)  // 4-byte lead byte is 11110xxx
             | ((readByte() & 0x3F) << 12)
             | ((readByte() & 0x3F) << 6)
             |  (readByte() & 0x3F));
           // These are just for illustration.
           int firstSurrogate  = (utf32 >> 10) + 0xD7C0;
           int secondSurrogate = (utf32 & 0x03FF) + 0xDC00;
           // If the current buffer isn't long enough,
           // create a new buffer with length one greater than
           // the current buffer, copy the entire contents,
           // enter the first surrogate, increment both i and end,
           // enter the second surrogate.
           // This is extremely inefficient, but also
           // likely to be invoked extremely rarely.
           // Problem: In Perl I'd do this with references, and
           // in C I'd do it with pointers.  Not sure how to
           // make it work in Java.
           break;
       }
     }
   }


Initial benchmarking experiments appear to indicate negligible impact
on performance.

 > So I doubt
 > this would be a slam-dunk in the Lucene community.

I appreciate your willingness to at least weigh the matter, and I
understand the potential reluctance.  Hopefully the comparable
performance of the standards-compliant code above will render the issue
moot, and the next release of Lucene will use legal UTF-8.

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

================================================================

From: Ken Krugler <kk...@transpac.com>
Date: August 27, 2005 2:11:34 PM PDT
To: java-user@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org


> I've delved into the matter of Lucene and UTF-8 a little further,
> and I am discouraged by what I believe I've uncovered.
>
> Lucene should not be advertising that it uses "standard UTF-8" --
> or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.
>

Unfortunately this is how Sun documents the format they use for
serialized strings.


> The two distinguishing characteristics of "Modified UTF-8" are the
> treatment of codepoints above the BMP (which are written as
> surrogate pairs), and the encoding of null bytes as 1100 0000 1000
> 0000 rather than 0000 0000.  Both of these became illegal as of
> Unicode 3.1 (IIRC), because they are not shortest-form and non-
> shortest-form UTF-8 presents a security risk.
>

For UTF-8 these were always invalid, but the standard wasn't very
clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs
encouraged some sloppy implementations.


> The documentation should really state that Lucene stores strings in
> a Java-only adulteration of UTF-8,
>

Yes, good point. I don't know who's in charge of that page, but it
should be fixed.


> unsuitable for interchange.
>

Other than as an internal representation for Java serialization.


> Since Perl uses true shortest-form UTF-8 as its native encoding,
> Plucene would have to jump through two efficiency-killing hoops in
> order to write files that would not choke Lucene: instead of
> writing out its true, legal UTF-8 directly, it would be necessary
> to first translate to UTF-16, then duplicate the Lucene encoding
> algorithm from OutputStream.  In theory.
>

Actually I don't think it would be all that bad. Since a null in the
middle of a string is rare, as is a character outside of the BMP, a
quick scan of the text should be sufficient to determine if it can be
written as-is.

The ICU project has C code that can be used to quickly walk a string.
I believe these would find/report such invalid code points, if you
use the safe (versus faster unsafe) versions.


> Below you will find a simple Perl script which illustrates what
> happens when Perl encounters malformed UTF-8.  Run it (you need
> Perl 5.8 or higher) and you will see why even if I thought it was a
> good idea to emulate the Java hack for encoding "Modified UTF-8",
> trying to make it work in practice would be a nightmare.
>
> If Plucene were to write legal UTF-8 strings to its index files,
> Java Lucene would misbehave and possibly blow up any time a string
> contained either a 4-byte character or a null byte.  On the flip
> side, Perl will spew warnings like crazy and possibly blow up
> whenever it encounters a Lucene-encoded null or surrogate pair.
> The potential blowups are due to the fact that Lucene and Plucene
> will not agree on how many characters a string contains, resulting
> in overruns or underruns.
>
> I am hoping that the answer to this will be a fix to the encoding
> mechanism in Lucene so that it really does use legal UTF-8.  The
> most efficient way to go about this has not yet presented itself.
>

I'd need to look at the code more, but using something other than the
Java serialized format would probably incur a performance penalty for
the Java implementation. Or at least make it harder to handle the
strings using the standard Java serialization support. So I doubt
this would be a slam-dunk in the Lucene community.

-- Ken



> #----------------------------------------
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> # illegal_null.plx -- Perl complains about non-shortest-form null.
>
> my $data = "foo\xC0\x80\n";
>
> open (my $virtual_filehandle, "+<:utf8", \$data);
> print <$virtual_filehandle>;
>

--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by Marvin Humphrey <ma...@rectangular.com>.
Ken Krugler sent a reply to the user list.  In an effort to keep all
the developers informed, I'm sending my reply to the developer list
and including his entire original post below my sig.

Ken writes...

 > Since a null in the
 > middle of a string is rare, as is a character outside of the BMP, a
 > quick scan of the text should be sufficient to determine if it can be
 > written as-is.

Let's see.  I think we are looking at two scans, (one index(), one
regex), or a regex that uses alternation.  I strongly suspect two scans
are faster.

     if (  (index($string, "\xC0\x80") != -1)
        or ($string =~ /[\xF0-\xF7]/ ) # only exists in 4-byte UTF-8
     ) {
         # Process string...
     }

That would tell us whether the string needed to be specially encoded for
Java's sake on output.  Yes, I suspect that's considerably more
efficient than always converting first to UTF-16 and then to "Modified
UTF-8".

It's also completely unnecessary, as you'll see from the patch below,
so I'm going to press ahead and make these XS ports of InputStream and
OutputStream work with legal UTF-8.

It would actually make a lot more sense for Plucene if the integer at
the head of a string measured *bytes* instead of either Unicode code
points or Java chars.  Then it's just a straight up copy!  No scanning
OR decoding required.

(Hmm... I wonder if there's a way to make Lucene work quickly if the
VInt were redefined to be "length in bytes"...)

Speaking of which, the Lucene file formats document also says this...

     "Lucene writes strings as a VInt representing the length,  
followed by
     the character data."

The ambiguity of the word "length" in this sentence left me scratching
my head.  Length in bytes or length in UTF-8 characters?  Of course
the real answer is... neither. :\

It's length in Java chars, or, if you prefer to further Sun's
disinformation campaign, ;) "Modified UTF-8 characters".  If the Lucene
docs had stated "Java chars" explicitly, I would have had a better idea
about why the value of that VInt is what it is -- a Java-specific
quirk at odds with a widely-accepted standard -- and about what
it was going to take to adhere to the spec.

 > I'd need to look at the code more, but using something other than the
 > Java serialized format would probably incur a performance penalty for
 > the Java implementation. Or at least make it harder to handle the
 > strings using the standard Java serialization support.

I believe that the following true-UTF-8 replacement for the
readChars function is at least as fast as the current implementation,
unless your text contains characters outside the BMP.  It's incomplete,
because my Java expertise is quite limited, but it should be
conceptually sound.  The algo is adapted from C code supplied by the
Unicode consortium.

http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c

   static final byte[] TRAILING_BYTES_FOR_UTF8 = {
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
   };

   public final void readChars(char[] buffer, int start, int length)
        throws IOException {
     int end = start + length; // No longer a final int.
     for (int i = start; i < end; i++) {
       int b = readByte();   // NOTE: b changed from byte to int.
       switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
         case 0:
           buffer[i] = (char)(b & 0x7F);
           break;
         case 1:
           buffer[i] = (char)(((b & 0x1F) << 6)
             | (readByte() & 0x3F));
           break;
         case 2:
           buffer[i] = (char)(((b & 0x0F) << 12)
             | ((readByte() & 0x3F) << 6)
             |  (readByte() & 0x3F));
           break;
         case 3:
           int utf32 = (((b & 0x07) << 18)  // 4-byte lead byte is 11110xxx
             | ((readByte() & 0x3F) << 12)
             | ((readByte() & 0x3F) << 6)
             |  (readByte() & 0x3F));
           // These are just for illustration.
           int firstSurrogate  = (utf32 >> 10) + 0xD7C0;
           int secondSurrogate = (utf32 & 0x03FF) + 0xDC00;
           // If the current buffer isn't long enough,
           // create a new buffer with length one greater than
           // the current buffer, copy the entire contents,
           // enter the first surrogate, increment both i and end,
           // enter the second surrogate.
           // This is extremely inefficient, but also
           // likely to be invoked extremely rarely.
           // Problem: In Perl I'd do this with references, and
           // in C I'd do it with pointers.  Not sure how to
           // make it work in Java.
           break;
       }
     }
   }


Initial benchmarking experiments appear to indicate negligible impact
on performance.

 > So I doubt
 > this would be a slam-dunk in the Lucene community.

I appreciate your willingness to at least weigh the matter, and I
understand the potential reluctance.  Hopefully the comparable
performance of the standards-compliant code above will render the issue
moot, and the next release of Lucene will use legal UTF-8.

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

================================================================

From: Ken Krugler <kk...@transpac.com>
Date: August 27, 2005 2:11:34 PM PDT
To: java-user@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org


> I've delved into the matter of Lucene and UTF-8 a little further,  
> and I am discouraged by what I believe I've uncovered.
>
> Lucene should not be advertising that it uses "standard UTF-8" --  
> or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.
>

Unfortunately this is how Sun documents the format they use for  
serialized strings.


> The two distinguishing characteristics of "Modified UTF-8" are the  
> treatment of codepoints above the BMP (which are written as  
> surrogate pairs), and the encoding of null bytes as 1100 0000 1000  
> 0000 rather than 0000 0000.  Both of these became illegal as of  
> Unicode 3.1 (IIRC), because they are not shortest-form and non- 
> shortest-form UTF-8 presents a security risk.
>

For UTF-8 these were always invalid, but the standard wasn't very  
clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs  
encouraged some sloppy implementations.


> The documentation should really state that Lucene stores strings in  
> a Java-only adulteration of UTF-8,
>

Yes, good point. I don't know who's in charge of that page, but it  
should be fixed.


> unsuitable for interchange.
>

Other than as an internal representation for Java serialization.


> Since Perl uses true shortest-form UTF-8 as its native encoding,  
> Plucene would have to jump through two efficiency-killing hoops in  
> order to write files that would not choke Lucene: instead of  
> writing out its true, legal UTF-8 directly, it would be necessary  
> to first translate to UTF-16, then duplicate the Lucene encoding  
> algorithm from OutputStream.  In theory.
>

Actually I don't think it would be all that bad. Since a null in the  
middle of a string is rare, as is a character outside of the BMP, a  
quick scan of the text should be sufficient to determine if it can be  
written as-is.

The ICU project has C code that can be used to quickly walk a string.  
I believe these would find/report such invalid code points, if you  
use the safe (versus faster unsafe) versions.


> Below you will find a simple Perl script which illustrates what  
> happens when Perl encounters malformed UTF-8.  Run it (you need  
> Perl 5.8 or higher) and you will see why even if I thought it was a  
> good idea to emulate the Java hack for encoding "Modified UTF-8",  
> trying to make it work in practice would be a nightmare.
>
> If Plucene were to write legal UTF-8 strings to its index files,  
> Java Lucene would misbehave and possibly blow up any time a string  
> contained either a 4-byte character or a null byte.  On the flip  
> side, Perl will spew warnings like crazy and possibly blow up  
> whenever it encounters a Lucene-encoded null or surrogate pair.   
> The potential blowups are due to the fact that Lucene and Plucene  
> will not agree on how many characters a string contains, resulting  
> in overruns or underruns.
>
> I am hoping that the answer to this will be a fix to the encoding  
> mechanism in Lucene so that it really does use legal UTF-8.  The  
> most efficient way to go about this has not yet presented itself.
>

I'd need to look at the code more, but using something other than the  
Java serialized format would probably incur a performance penalty for  
the Java implementation. Or at least make it harder to handle the  
strings using the standard Java serialization support. So I doubt  
this would be a slam-dunk in the Lucene community.

-- Ken



> #----------------------------------------
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> # illegal_null.plx -- Perl complains about non-shortest-form null.
>
> my $data = "foo\xC0\x80\n";
>
> open (my $virtual_filehandle, "+<:utf8", \$data);
> print <$virtual_filehandle>;
>

-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by Ken Krugler <kk...@transpac.com>.
>I've delved into the matter of Lucene and UTF-8 a little further, 
>and I am discouraged by what I believe I've uncovered.
>
>Lucene should not be advertising that it uses "standard UTF-8" -- or 
>even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.

Unfortunately this is how Sun documents the format they use for 
serialized strings.

>The two distinguishing characteristics of "Modified UTF-8" are the 
>treatment of codepoints above the BMP (which are written as 
>surrogate pairs), and the encoding of null bytes as 1100 0000 1000 
>0000 rather than 0000 0000.  Both of these became illegal as of 
>Unicode 3.1 (IIRC), because they are not shortest-form and 
>non-shortest-form UTF-8 presents a security risk.

For UTF-8 these were always invalid, but the standard wasn't very 
clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs 
encouraged some sloppy implementations.

>The documentation should really state that Lucene stores strings in 
>a Java-only adulteration of UTF-8,

Yes, good point. I don't know who's in charge of that page, but it 
should be fixed.

>unsuitable for interchange.

Other than as an internal representation for Java serialization.

>Since Perl uses true shortest-form UTF-8 as its native encoding, 
>Plucene would have to jump through two efficiency-killing hoops in 
>order to write files that would not choke Lucene: instead of writing 
>out its true, legal UTF-8 directly, it would be necessary to first 
>translate to UTF-16, then duplicate the Lucene encoding algorithm 
>from OutputStream.  In theory.

Actually I don't think it would be all that bad. Since a null in the 
middle of a string is rare, as is a character outside of the BMP, a 
quick scan of the text should be sufficient to determine if it can be 
written as-is.

The ICU project has C code that can be used to quickly walk a string. 
I believe these would find/report such invalid code points, if you 
use the safe (versus faster unsafe) versions.

>Below you will find a simple Perl script which illustrates what 
>happens when Perl encounters malformed UTF-8.  Run it (you need Perl 
>5.8 or higher) and you will see why even if I thought it was a good 
>idea to emulate the Java hack for encoding "Modified UTF-8", trying 
>to make it work in practice would be a nightmare.
>
>If Plucene were to write legal UTF-8 strings to its index files, 
>Java Lucene would misbehave and possibly blow up any time a string 
>contained either a 4-byte character or a null byte.  On the flip 
>side, Perl will spew warnings like crazy and possibly blow up 
>whenever it encounters a Lucene-encoded null or surrogate pair.  The 
>potential blowups are due to the fact that Lucene and Plucene will 
>not agree on how many characters a string contains, resulting in 
>overruns or underruns.
>
>I am hoping that the answer to this will be a fix to the encoding 
>mechanism in Lucene so that it really does use legal UTF-8.  The 
>most efficient way to go about this has not yet presented itself.

I'd need to look at the code more, but using something other than the 
Java serialized format would probably incur a performance penalty for 
the Java implementation. Or at least make it harder to handle the 
strings using the standard Java serialization support. So I doubt 
this would be a slam-dunk in the Lucene community.

-- Ken


>#----------------------------------------
>
>#!/usr/bin/perl
>use strict;
>use warnings;
>
># illegal_null.plx -- Perl complains about non-shortest-form null.
>
>my $data = "foo\xC0\x80\n";
>
>open (my $virtual_filehandle, "+<:utf8", \$data);
>print <$virtual_filehandle>;

-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by jian chen <ch...@gmail.com>.
Hi, Ken,

Thanks for your email. You are right, I meant to propose that Lucene 
switch to using true UTF-8, rather than working around the issue by 
fixing the resulting problems elsewhere. 

Also, conforming to standards like UTF-8 will make the code easier for new 
developers to pick up.

Just my 2 cents.

Thanks,

Jian

On 8/27/05, Ken Krugler <kk...@transpac.com> wrote:
> 
> >On Aug 26, 2005, at 10:14 PM, jian chen wrote:
> >
> >>It seems to me that in theory, Lucene storage code could use true UTF-8 
> to
> >>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
> >>used?
> 
> The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
> aspect of Java serialization of character streams. Java uses what
> they call "a modified version of UTF-8", though that's a really bad
> way to describe it. It's a different Unicode encoding, one that
> resembles UTF-8, but that's it.
> 
> >It's not a matter of a simple switch. The VInt count at the head of
> >a Lucene string is not the number of Unicode code points the string
> >contains. It's the number of Java chars necessary to contain that
> >string. Code points above the BMP require 2 java chars, since they
> >must be represented by surrogate pairs. The same code point must be
> >represented by one character in legal UTF-8.
> >
> >If Plucene counts the number of legal UTF-8 characters and assigns
> >that number as the VInt at the front of a string, when Java Lucene
> >decodes the string it will allocate an array of char which is too
> >small to hold the string.
> 
> I think Jian was proposing that Lucene switch to using a true UTF-8
> encoding, which would make things a bit cleaner. And probably easier
> than changing all references to CESU-8 :)
> 
> And yes, given that the integer count is the number of UTF-16 code
> units required to represent the string, your code will need to do a
> bit more processing when calculating the character count, but that's
> a one-liner, right?
> 
> -- Ken
> --
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
>

Re: Lucene does NOT use UTF-8.

Posted by Ken Krugler <kk...@transpac.com>.
>On Aug 26, 2005, at 10:14 PM, jian chen wrote:
>
>>It seems to me that in theory, Lucene storage code could use true UTF-8 to
>>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
>>used?

The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an 
aspect of Java serialization of character streams. Java uses what 
they call "a modified version of UTF-8", though that's a really bad 
way to describe it. It's a different Unicode encoding, one that 
resembles UTF-8, but that's it.
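
As a concrete illustration -- a minimal sketch, assuming only the 
standard java.io classes and nothing Lucene-specific -- the C0 80 
null is easy to see coming out of DataOutputStream.writeUTF, which 
uses the same encoding Java serialization does:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);

        // writeUTF emits a two-byte length, then "modified UTF-8" bytes.
        out.writeUTF("foo\u0000");

        // Prints "00 05 66 6F 6F C0 80" -- the null is encoded as
        // C0 80, the non-shortest two-byte form, never as a bare 00.
        for (byte b : sink.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}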

>It's not a matter of a simple switch.  The VInt count at the head of 
>a Lucene string is not the number of Unicode code points the string 
>contains.  It's the number of Java chars necessary to contain that 
>string.  Code points above the BMP require 2 java chars, since they 
>must be represented by surrogate pairs.  The same code point must be 
>represented by one character in legal UTF-8.
>
>If Plucene counts the number of legal UTF-8 characters and assigns 
>that number as the VInt at the front of a string, when Java Lucene 
>decodes the string it will allocate an array of char which is too 
>small to hold the string.

I think Jian was proposing that Lucene switch to using a true UTF-8 
encoding, which would make things a bit cleaner. And probably easier 
than changing all references to CEUS-8 :)

And yes, given that the integer count is the number of UTF-16 code 
units required to represent the string, your code will need to do a 
bit more processing when calculating the character count, but that's 
a one-liner, right?
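
Something along these lines, say -- a hypothetical helper sketched in 
Java for clarity, though the Plucene side would do the same 
arithmetic in Perl:

public class Utf16Units {
    // One UTF-16 code unit (Java char) per BMP code point, two for
    // anything above U+FFFF, which must become a surrogate pair.
    static int utf16Units(int[] codePoints) {
        int units = 0;
        for (int cp : codePoints) {
            units += (cp > 0xFFFF) ? 2 : 1;
        }
        return units;
    }

    public static void main(String[] args) {
        // "a" plus MUSICAL SYMBOL G CLEF (U+1D11E): two code points,
        // but three UTF-16 code units. Prints 3.
        System.out.println(utf16Units(new int[] { 0x61, 0x1D11E }));
    }
}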

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Aug 26, 2005, at 10:14 PM, jian chen wrote:

> It seems to me that in theory, Lucene storage code could use true  
> UTF-8 to
> store terms. Maybe it is just a legacy issue that the modified  
> UTF-8 is
> used?

It's not a matter of a simple switch.  The VInt count at the head of  
a Lucene string is not the number of Unicode code points the string  
contains.  It's the number of Java chars necessary to contain that  
string.  Code points above the BMP require 2 java chars, since they  
must be represented by surrogate pairs.  The same code point must be  
represented by one character in legal UTF-8.

If Plucene counts the number of legal UTF-8 characters and assigns  
that number as the VInt at the front of a string, when Java Lucene  
decodes the string it will allocate an array of char which is too  
small to hold the string.
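
The mismatch is easy to demonstrate from the Java side (a small 
sketch, not Lucene code):

public class SurrogateDemo {
    public static void main(String[] args) {
        // MUSICAL SYMBOL G CLEF, U+1D11E -- a single code point above
        // the BMP, held in Java as the surrogate pair D834 DD1E.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());          // 2 Java chars
        System.out.println(clef.codePointCount(0,
                clef.length()));                    // 1 code point
    }
}

A count of 1 written by Plucene would thus leave Java Lucene one char 
short when it sizes its buffer.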

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Aug 26, 2005, at 10:14 PM, jian chen wrote:

> Hi,
>
> It seems to me that in theory, Lucene storage code could use true  
> UTF-8 to
> store terms. Maybe it is just a legacy issue that the modified  
> UTF-8 is
> used?

It has been suggested that this discussion should move to the  
developers' list, so I'm sending my reply there (and cc'ing Jian).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by jian chen <ch...@gmail.com>.
Hi,

It seems to me that in theory, Lucene storage code could use true UTF-8 to 
store terms. Maybe it is just a legacy issue that the modified UTF-8 is 
used?

Cheers,

Jian

On 8/26/05, Marvin Humphrey <ma...@rectangular.com> wrote:
> 
> Greets,
> 
> [crossposted to java-user@lucene.apache.org and plucene@kasei.com]
> 
> I've delved into the matter of Lucene and UTF-8 a little further, and
> I am discouraged by what I believe I've uncovered.
> 
> Lucene should not be advertising that it uses "standard UTF-8" -- or
> even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8. The
> two distinguishing characteristics of "Modified UTF-8" are the
> treatment of codepoints above the BMP (which are written as surrogate
> pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather
> than 0000 0000. Both of these became illegal as of Unicode 3.1
> (IIRC), because they are not shortest-form and non-shortest-form
> UTF-8 presents a security risk.
> 
> The documentation should really state that Lucene stores strings in a
> Java-only adulteration of UTF-8, unsuitable for interchange. Since
> Perl uses true shortest-form UTF-8 as its native encoding, Plucene
> would have to jump through two efficiency-killing hoops in order to
> write files that would not choke Lucene: instead of writing out its
> true, legal UTF-8 directly, it would be necessary to first translate
> to UTF-16, then duplicate the Lucene encoding algorithm from
> OutputStream. In theory.
> 
> Below you will find a simple Perl script which illustrates what
> happens when Perl encounters malformed UTF-8. Run it (you need Perl
> 5.8 or higher) and you will see why even if I thought it was a good
> idea to emulate the Java hack for encoding "Modified UTF-8", trying
> to make it work in practice would be a nightmare.
> 
> If Plucene were to write legal UTF-8 strings to its index files, Java
> Lucene would misbehave and possibly blow up any time a string
> contained either a 4-byte character or a null byte. On the flip
> side, Perl will spew warnings like crazy and possibly blow up
> whenever it encounters a Lucene-encoded null or surrogate pair. The
> potential blowups are due to the fact that Lucene and Plucene will
> not agree on how many characters a string contains, resulting in
> overruns or underruns.
> 
> I am hoping that the answer to this will be a fix to the encoding
> mechanism in Lucene so that it really does use legal UTF-8. The most
> efficient way to go about this has not yet presented itself.
> 
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
> 
> #----------------------------------------
> 
> #!/usr/bin/perl
> use strict;
> use warnings;
> 
> # illegal_null.plx -- Perl complains about non-shortest-form null.
> 
> my $data = "foo\xC0\x80\n";
> 
> open (my $virtual_filehandle, "+<:utf8", \$data);
> print <$virtual_filehandle>;
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
>