Posted to dev@lucene.apache.org by Ken Krugler <kk...@transpac.com> on 2005/08/29 05:42:26 UTC

Re: Lucene does NOT use UTF-8

>I'm not familiar with UTF-8 enough to follow the details of this
>discussion.  I hope other Lucene developers are, so we can resolve this
>issue.... anyone raising a hand?

I could, but recent posts make me think this is heading towards a 
religious debate :)

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to 
be used by other implementations besides the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

c. The hard(er) part would be backwards compatibility with older 
indexes. I haven't looked at this enough to really know, but one 
example is the compound file (xx.cfs) format...I didn't see a version 
number, and it contains strings.

d. The documentation could be clearer on what is meant by the "string 
length", but this is a trivial change.

What's unclear to me (not being a Perl, Python, etc. jock) is how much 
easier it would be to get these other implementations working with 
Lucene, following a change to UTF-8. So I can't comment on the return 
on the time required to change things.

I'm also curious about the existing CLucene & PyLucene ports. Would 
they also need to be similarly modified, with the proposed changes?

One final point. I doubt people have been adding strings with 
embedded nulls, and text outside of the Unicode BMP is also very 
rare. So _most_ Lucene indexes only contain valid UTF-8 data. It's 
only the above two edge cases that create an interoperability problem.
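To make those two edge cases concrete, here's a quick Java sketch (my 
own illustration, not Lucene code; java.io.DataOutputStream.writeUTF 
happens to emit the same "modified UTF-8" under discussion):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws Exception {
            // 'A', an embedded NUL, and U+1D11E (a non-BMP character,
            // held in the String as the surrogate pair D834 DD1E)
            String s = "A\u0000\uD834\uDD1E";

            // Standard UTF-8: 41 00 F0 9D 84 9E
            // (NUL is a single 0x00 byte, U+1D11E one 4-byte sequence)
            byte[] standard = s.getBytes("UTF-8");

            // Modified UTF-8: 41 C0 80 ED A0 B4 ED B4 9E
            // (NUL becomes two bytes, each surrogate its own 3-byte form)
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            new DataOutputStream(out).writeUTF(s);
            byte[] modified = out.toByteArray();

            System.out.println(standard.length); // 6
            System.out.println(modified.length); // 11 (2-byte length + 9)
        }
    }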

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8

Posted by Andi Vajda <an...@osafoundation.org>.
>> If the rest of the world of Lucene ports followed suit with PyLucene  and 
>> did the GCJ/SWIG thing, we'd have no problems :)  What are the 
>> disadvantages to following this model with Plucene?
>> 
>> 
> Some parts of the Lucene API require subclassing (e.g., Analyzer), and SWIG 
> supports cross-language polymorphism only for a few languages, notably 
> Python and Java but not Perl. Noticing the smiley, I won't mention the 
> zillion other reasons not to use the "GCJ/SWIG thing".

Yes, that's true, Java Lucene requires a bunch of subclassing to truly shine 
in any sizable application. I didn't use SWIG's director feature to implement 
extension, but a more or less hand-carved SWIG-in-reverse trick that can 
easily be reproduced by other such SWIG-based ports.
See http://svn.osafoundation.org/pylucene/trunk/README for more details...

Andi..




Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
[snip]

>The surrogate pair problem is another matter entirely. First of all, 
>let's see if I understand the problem correctly: Some Unicode 
>characters can be represented by one codepoint outside the BMP 
>(i.e., not with 16 bits) and alternatively with two codepoints, both 
>of them in the 16-bit range.

A Unicode character has a code point, which is a scalar value in the 
range U+0000 to U+10FFFF. The code point for every character in the 
Unicode character set will fall in this range.

There are Unicode encoding schemes, which specify how Unicode code 
point values are serialized. Examples include UTF-8, UTF-16LE, 
UTF-16BE, UTF-32, UTF-7, etc.

The UTF-16 (big or little endian) encoding scheme uses two code units 
(16-bit values) to encode Unicode characters with code point values > 
U+FFFF.
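In code form (my own sketch of the standard arithmetic, not anything 
from the Lucene tree), splitting a supplementary code point into its 
two code units looks like this:

    // Assumes cp is in the range U+10000..U+10FFFF.
    static char[] toSurrogatePair(int cp) {
        int v = cp - 0x10000;                       // 20 significant bits left
        char high = (char) (0xD800 + (v >> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }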

>According to Marvin's explanations, the Unicode standard requires 
>these characters to be represented as "the one" codepoint in UTF-8, 
>resulting in a 4-, 5-, or 6-byte encoding for that character.

Since the Unicode code point range is constrained to 
U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes.

>But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
>range cannot be represented as chars.  That is, the in-memory 
>representation still requires the use of the surrogate pairs.  
>Therefore, writing consists of translating the surrogate pair to 
>the >16-bit representation of the same character and then 
>algorithmically encoding that.  Reading is exactly the reverse 
>process.

Yes. Writing requires that you combine the two surrogate characters 
into a Unicode code point, then convert that value into the 4-byte 
UTF-8 sequence.

>Adding code to handle the 4- to 6-byte encodings to the 
>readChars/writeChars methods is simple, but how do you do the mapping 
>from surrogate pairs to the chars they represent? Is there an 
>algorithm for doing that other than table lookups or huge switch 
>statements?

It's easy, since U+D800...U+DBFF is defined as the range for the high 
(most significant) surrogate, and U+DC00...U+DFFF is defined as the 
range for the low (least significant) surrogate.
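In other words (again my sketch, just shifts and masks -- no tables or 
switch statements needed):

    // Assumes high is in U+D800..U+DBFF and low is in U+DC00..U+DFFF.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // The result is in U+10000..U+10FFFF, so it always takes the
    // 4-byte UTF-8 form.
    static byte[] toUtf8(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)),
        };
    }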

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8

Posted by Ronald Dauster <rp...@ronald-dauster.de>.
Erik Hatcher wrote:

> On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>
>>> I'm not familiar with UTF-8 enough to follow the details of this
>>> discussion.  I hope other Lucene developers are, so we can resolve  
>>> this
>>> issue.... anyone raising a hand?
>>>
>>
>> I could, but recent posts make me think this is heading towards a 
>> religious debate :)
>
>
> Ken - you mentioned taking the discussion off-line in a previous  
> post.  Please don't.  Let's keep it alive on java-dev until we have a  
> resolution to it.
>
I'd also like to follow this thread.

>> I think the following statements are all true:
>>
>> a. Using UTF-8 for strings would make it easier for Lucene indexes  
>> to be used by other implementations besides the reference Java  version.
>>
>> b. It would be easy to tweak Lucene to read/write conformant UTF-8  
>> strings.
>
>
> What, if any, performance impact would changing Java Lucene in this  
> regard have?   (I realize this is rhetorical at this point, until a  
> solution is at hand)
>
Looking at the source of 1.4.3, fixing the NUL character encoding is 
trivial for writing, and reading already works for both the standard and 
the Java-style encoding. Not much work and absolutely no performance 
impact here.
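To illustrate why reading is a non-issue (a simplified sketch of the 
decoder's shape, not the actual 1.4.3 source): a bare 0x00 byte and the 
two-byte C0 80 form both decode to '\u0000':

    static char readChar(java.io.DataInput in) throws java.io.IOException {
        int b = in.readUnsignedByte();
        if ((b & 0x80) == 0)                        // 0xxxxxxx: 1 byte
            return (char) b;                        // a bare 0x00 lands here
        if ((b & 0xE0) == 0xC0)                     // 110xxxxx: 2 bytes
            return (char) (((b & 0x1F) << 6)        // C0 80 also yields '\u0000'
                    | (in.readUnsignedByte() & 0x3F));
        return (char) (((b & 0x0F) << 12)           // 1110xxxx: 3 bytes
                | ((in.readUnsignedByte() & 0x3F) << 6)
                | (in.readUnsignedByte() & 0x3F));
    }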

The surrogate pair problem is another matter entirely. First of all, 
let's see if I understand the problem correctly: Some Unicode 
characters can be represented by one codepoint outside the BMP (i.e., 
not with 16 bits) and alternatively with two codepoints, both of them in 
the 16-bit range. According to Marvin's explanations, the Unicode 
standard requires these characters to be represented as "the one" 
codepoint in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that 
character.

But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
range cannot be represented as chars.  That is, the in-memory 
representation still requires the use of the surrogate pairs.  
Therefore, writing consists of translating the surrogate pair to 
the >16-bit representation of the same character and then 
algorithmically encoding that.  Reading is exactly the reverse process.

Adding code to handle the 4- to 6-byte encodings to the 
readChars/writeChars methods is simple, but how do you do the mapping 
from surrogate pairs to the chars they represent? Is there an algorithm 
for doing that other than table lookups or huge switch statements?

>> c. The hard(er) part would be backwards compatibility with older  
>> indexes. I haven't looked at this enough to really know, but one  
>> example is the compound file (xx.cfs) format...I didn't see a  
>> version number, and it contains strings.
>
>
> I don't know the gory details, but we've made compatibility breaking  
> changes in the past and the current version of Lucene can open older  
> formats, but only write the most current format.  I suspect it could  
> be made to be backwards compatible.  Worst case, we break  
> compatibility in 2.0.
>
I believe backward compatibility is the easy part and comes for free.  
As I mentioned above, reading the "correct" NUL encoding already works 
and the non-BMP characters will have to be represented as surrogate 
pairs internally anyway.  So there is no problem with reading the old 
encoding and there is nothing wrong with still using or reading the 
surrogate pairs, only that they would not be written. Even indices with 
mixed segments are not a problem. 

Given that the CompoundFileReader/Writer use a 
lucene.store.OutputStream/InputStream for their FileEntries, they would 
also be able to read older files but potentially write incompatible 
files.  OTOH, when used inside lucene, the filenames do not contain NULs 
of non-BMP chars.

But: Is the compound file format supposed to be "interoperable"? Which 
formats are?

> [...]
>
>> What's unclear to me (not being a Perl, Python, etc jock) is how  
>> much easier it would be to get these other implementations working  
>> with Lucene, following a change to UTF-8. So I can't comment on the  
>> return on time required to change things.
>>
>> [...]
>
>
> PyLucene is literally the Java version of Lucene underneath (via GCJ/ 
> SWIG), so no worries there.  CLucene would need to be changed, as  
> well as DotLucene and the other ports out there.
>
> If the rest of the world of Lucene ports followed suit with PyLucene  
> and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
> disadvantages to following this model with Plucene?
>
>
Some parts of the Lucene API require subclassing (e.g., Analyzer), and 
SWIG supports cross-language polymorphism only for a few languages, 
notably Python and Java but not Perl. Noticing the smiley, I won't 
mention the zillion other reasons not to use the "GCJ/SWIG thing".

Ronald



Re: Lucene does NOT use UTF-8

Posted by Steven Rowe <sa...@syr.edu>.
DM Smith wrote:
> Daniel Naber wrote:
>> But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to 
>> be the case.
>>
> UTF-16 is a fixed 2 byte/char representation.

Except when it's not.  I.e., above the BMP.

From the Unicode 4.0 standard 
<http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf>:

    In the UTF-16 encoding form, code points in the
    range U+0000..U+FFFF are represented as a single
    16-bit code unit; code points in the supplementary
    planes, in the range U+10000..U+10FFFF, are
    instead represented as pairs of 16-bit code units.
    These pairs of special code units are known as
    surrogate pairs.



Re: Lucene does NOT use UTF-8

Posted by Tom White <to...@gmail.com>.
On 8/30/05, Ken Krugler <kk...@transpac.com> wrote:
> 
> >Daniel Naber wrote:
> >
> >>On Monday 29 August 2005 19:56, Ken Krugler wrote:
> >>
> >>>"Lucene writes strings as a VInt representing the length of the
> >>>string in Java chars (UTF-16 code units), followed by the character
> >>>data."
> >>>
> >>>
> >>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem
> >>to be the case.
> >>
> >UTF-16 is a fixed 2 byte/char representation.
> 
> I hate to keep beating this horse, but I want to emphasize that it's
> 2 bytes per Java char (or UTF-16 code unit), not Unicode character
> (code point).


There's more horse beating on Java and Unicode 4 in this blog entry: 
http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html.

Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
>Daniel Naber wrote:
>
>>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>>
>>>"Lucene writes strings as a VInt representing the length of the
>>>string in Java chars (UTF-16 code units), followed by the character
>>>data."
>>>   
>>>
>>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem 
>>to be the case.
>>
>UTF-16 is a fixed 2 byte/char representation.

I hate to keep beating this horse, but I want to emphasize that it's 
2 bytes per Java char (or UTF-16 code unit), not Unicode character 
(code point).

>But one cannot equate the character count with the byte count. Each 
>Java char is 2 bytes. I think all that is being said is that the 
>VInt is equal to str.length() as Java gives it.
>
>On an unrelated project we are determining whether we should use a 
>denormalized (letter followed by accents) or a normalized form 
>(letter with accents) of accented characters as we present the text 
>to a GUI. We have found that font support varies but appears to be 
>better for denormalized. This is not an issue for storage, as it can 
>be transformed before it goes to screen. However, it is useful to 
>know which form it is in.
>
>The reason I mention this is that I seem to remember that the length 
>of the Java string varies with the representation.

String.length() is the number of Java chars, which are always UTF-16 
code units. If you normalize text, then yes, that can change the number 
of code units and thus the length of the string, but so can doing any 
kind of text munging (e.g. replacement) operation on characters in 
the string.
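For example (a sketch; java.text.Normalizer only appeared in later 
JDKs -- at the time you'd reach for ICU4J -- but the effect is the same):

    import java.text.Normalizer;

    public class NormalizedLengthDemo {
        public static void main(String[] args) {
            String nfc = "\u00E9";  // precomposed "é": one char
            String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

            System.out.println(nfc.length());  // 1
            System.out.println(nfd.length());  // 2: 'e' + combining U+0301
        }
    }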

>So then the count would not be the number of glyphs that the user 
>sees. Please correct me if I am wrong.

All kinds of m-to-n mappings (both at the layout engine level, and using 
font tables) are possible when going from Unicode characters to 
display glyphs. Plus zero-width left-kerning glyphs would also alter 
the relationship between the number of visual "characters" and backing 
store characters.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8

Posted by DM Smith <dm...@gmail.com>.
Daniel Naber wrote:

>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>  
>
>>"Lucene writes strings as a VInt representing the length of the
>>string in Java chars (UTF-16 code units), followed by the character
>>data."
>>    
>>
>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the 
>case.
>
UTF-16 is a fixed 2 byte/char representation. But one cannot equate the 
character count with the byte count. Each Java char is 2 bytes. I think 
all that is being said is that the VInt is equal to str.length() as Java 
gives it.

On an unrelated project we are determining whether we should use a 
denormalized (letter followed by accents) or a normalized form 
(letter with accents) of accented characters as we present the text to a 
GUI. We have found that font support varies but appears to be better for 
denormalized. This is not an issue for storage, as it can be transformed 
before it goes to screen. However, it is useful to know which form it is in.

The reason I mention this is that I seem to remember that the length of 
the Java string varies with the representation. So then the count would 
not be the number of glyphs that the user sees. Please correct me if I 
am wrong.



Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>
>>  "Lucene writes strings as a VInt representing the length of the
>>  string in Java chars (UTF-16 code units), followed by the character
>>  data."
>
>But wouldn't UTF-16 mean 2 bytes per character?

Yes, UTF-16 means two bytes per code unit. A Unicode character (code 
point) is encoded as either one or two UTF-16 code units.

>That doesn't seem to be the
>case.

The case where? You mean in what actually gets written out?

String.length() is the length in terms of Java chars, which means 
UTF-16 code units (well, sort of...see below). Looking at the code, 
IndexOutput.writeString() calls writeVInt() with the string length.

One related note. Java 1.4 supports Unicode 3.0, while Java 5.0 
supports Unicode 4.0. It was in Unicode 3.1 that supplementary 
characters (code points > U+FFFF, i.e. outside of the BMP) were added, 
and the UTF-16 encoding formalized.

So I think the issue of non-BMP characters is currently a bit 
esoteric for Lucene, since I'm guessing there are other places in the 
code (e.g. JDK calls used by Lucene) where non-BMP characters won't 
be properly handled. Though some quick tests indicate that there is 
some knowledge of surrogate pairs in 1.4 (e.g. converting a String 
w/surrogate pairs to UTF-8 does the right thing).
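For instance, this little test (mine, not from the Lucene tree) shows 
both the code unit count and the UTF-8 conversion at once:

    public class SurrogateDemo {
        public static void main(String[] args) throws Exception {
            String s = "\uD834\uDD1E";  // U+1D11E as a surrogate pair

            System.out.println(s.length());                  // 2 code units
            System.out.println(s.getBytes("UTF-8").length);  // 4: one sequence
        }
    }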

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8

Posted by Daniel Naber <lu...@danielnaber.de>.
On Monday 29 August 2005 19:56, Ken Krugler wrote:

> "Lucene writes strings as a VInt representing the length of the
> string in Java chars (UTF-16 code units), followed by the character
> data."

But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the 
case.

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: Lucene does NOT use UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
Erik Hatcher wrote...

> What, if any, performance impact would changing Java Lucene in this  
> regard have?

And Ken Krugler wrote...

> "Lucene writes strings as a VInt representing the length of the  
> string in Java chars (UTF-16 code units), followed by the character  
> data."

I had been working under the assumption that the value of the VInt 
would be changed as well.  It seemed logical that if strings were 
encoded with legal UTF-8, the count at the head should indicate 
either 1) the number of Unicode characters (code points) in the 
string, or 2) the number of bytes occupied by the encoded string.

Do either of those, and more substantial changes to Java Lucene would 
be required.  I expect that the impact on performance could be made 
negligible for the first option, but the question of backwards 
compatibility would become a lot messier.

It simply had not occurred to me to keep the VInt as is.  If you do  
that, this becomes a much more localized problem.

For Plucene, I'll avoid the gory details and just say that having the  
VInt continue to represent UTF-16 code units limits the availability  
of certain options, but doesn't cause major inefficiencies.  Now that  
we know that's what it does, we can work with it.  A transition to  
always-legal UTF-8 obviates the need to scan for and fix the edge  
cases, and addresses my main concern.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Lucene does NOT use UTF-8

Posted by Ken Krugler <kk...@transpac.com>.
>On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>>>I'm not familiar with UTF-8 enough to follow the details of this
>>>discussion.  I hope other Lucene developers are, so we can resolve this
>>>issue.... anyone raising a hand?
>>
>>I could, but recent posts make me think this is heading towards a 
>>religious debate :)
>
>Ken - you mentioned taking the discussion off-line in a previous 
>post.  Please don't.  Let's keep it alive on java-dev until we have 
>a resolution to it.
>
>>I think the following statements are all true:
>>
>>a. Using UTF-8 for strings would make it easier for Lucene indexes 
>>to be used by other implementations besides the reference Java 
>>version.
>>
>>b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.
>
>What, if any, performance impact would changing Java Lucene in this 
>regard have?   (I realize this is rhetorical at this point, until a 
>solution is at hand)

Almost zero. A tiny hit when reading/writing surrogate pairs, to 
properly encode them as a 4-byte UTF-8 sequence versus two 3-byte 
sequences.

>>c. The hard(er) part would be backwards compatibility with older 
>>indexes. I haven't looked at this enough to really know, but one 
>>example is the compound file (xx.cfs) format...I didn't see a 
>>version number, and it contains strings.
>
>I don't know the gory details, but we've made compatibility breaking 
>changes in the past and the current version of Lucene can open older 
>formats, but only write the most current format.  I suspect it could 
>be made to be backwards compatible.  Worst case, we break 
>compatibility in 2.0.

Ronald is correct in that it would be easy to make the reader handle 
both "Java modified UTF-8" and UTF-8, and the writer always output 
UTF-8. So the only problem would be if older versions of Lucene (or 
maybe CLucene) wound up trying to read strings that contained 4-byte 
UTF-8 sequences, as they wouldn't know how to convert this into two 
UTF-16 Java chars.

Since 4-byte UTF-8 sequences are only for characters outside of the 
BMP, and these are rare, it seems like an OK thing to do, but that's 
just my uninformed view.
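For what it's worth, the conversion a newer reader has to do is tiny 
(my sketch of the standard arithmetic, not proposed Lucene code):

    // Decode one 4-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx
    // 10xxxxxx) into the surrogate pair Java needs.
    static char[] decode4Bytes(byte b0, byte b1, byte b2, byte b3) {
        int cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
               | ((b2 & 0x3F) << 6)  |  (b3 & 0x3F);
        int v = cp - 0x10000;
        return new char[] { (char) (0xD800 + (v >> 10)),
                            (char) (0xDC00 + (v & 0x3FF)) };
    }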

>>d. The documentation could be clearer on what is meant by the 
>>"string length", but this is a trivial change.
>
>That change was made by Daniel soon after this discussion began.

Daniel changed the definition of Chars, but the String section still 
needs to be clarified. Currently it says:

"Lucene writes strings as a VInt representing the length, followed by 
the character data".

It should read:

"Lucene writes strings as a VInt representing the length of the 
string in Java chars (UTF-16 code units), followed by the character 
data."

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Lucene does NOT use UTF-8

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>> I'm not familiar with UTF-8 enough to follow the details of this
>> discussion.  I hope other Lucene developers are, so we can resolve  
>> this
>> issue.... anyone raising a hand?
>>
>
> I could, but recent posts make me think this is heading towards a 
> religious debate :)

Ken - you mentioned taking the discussion off-line in a previous  
post.  Please don't.  Let's keep it alive on java-dev until we have a  
resolution to it.

> I think the following statements are all true:
>
> a. Using UTF-8 for strings would make it easier for Lucene indexes  
> to be used by other implementations besides the reference Java  
> version.
>
> b. It would be easy to tweak Lucene to read/write conformant UTF-8  
> strings.

What, if any, performance impact would changing Java Lucene in this  
regard have?   (I realize this is rhetorical at this point, until a  
solution is at hand)

> c. The hard(er) part would be backwards compatibility with older  
> indexes. I haven't looked at this enough to really know, but one  
> example is the compound file (xx.cfs) format...I didn't see a  
> version number, and it contains strings.

I don't know the gory details, but we've made compatibility breaking  
changes in the past and the current version of Lucene can open older  
formats, but only write the most current format.  I suspect it could  
be made to be backwards compatible.  Worst case, we break  
compatibility in 2.0.

> d. The documentation could be clearer on what is meant by the  
> "string length", but this is a trivial change.

That change was made by Daniel soon after this discussion began.

> What's unclear to me (not being a Perl, Python, etc jock) is how  
> much easier it would be to get these other implementations working  
> with Lucene, following a change to UTF-8. So I can't comment on the  
> return on time required to change things.
>
> I'm also curious about the existing CLucene & PyLucene ports. Would  
> they also need to be similarly modified, with the proposed changes?

PyLucene is literally the Java version of Lucene underneath (via GCJ/ 
SWIG), so no worries there.  CLucene would need to be changed, as  
well as DotLucene and the other ports out there.

If the rest of the world of Lucene ports followed suit with PyLucene  
and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
disadvantages to following this model with Plucene?

     Erik




Re: Lucene does NOT use UTF-8

Posted by Andi Vajda <an...@osafoundation.org>.
> I'm also curious about the existing CLucene & PyLucene ports. Would they also 
> need to be similarly modified, with the proposed changes?

PyLucene is built from the Java Lucene source code, so any change made to Java 
Lucene gets reflected in PyLucene once it is refreshed. The next 
refresh will be done shortly after Java Lucene 1.9 is released.

Andi..

