Posted to dev@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2005/10/30 23:39:31 UTC
bytecount as String and prefix length
Greets,
I've been experimenting with using the UTF-8 bytecount as the VInt
count at the top of Lucene's string format, as was discussed back in
the "Lucene does NOT use UTF-8" thread. Changes were made to
IndexInput and IndexOutput as per some of Robert Engels's
suggestions. Here's the implementation of writeString, which chooses
the memory hit of buffering over the performance hit of pre-scanning:
  public void writeString(String s) throws IOException {
    utf8Bytes = s.getBytes("UTF-8");
    int length = utf8Bytes.length;
    writeVInt(length);
    writeBytes(utf8Bytes, length);
  }
That, in conjunction with a similar implementation of readString and
the UTF-8-clean implementation of readChars I submitted a while back,
executes 2-3% slower than the current implementation against my
index-1000-wikipedia-articles benchmarker.
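For anyone following along without the patch applied, here is a self-contained sketch (not from the patch; class and method names are mine) of the proposed string format: a VInt byte count followed by the raw UTF-8 bytes. The VInt scheme follows Lucene's encoding, 7 bits per byte with the high bit set on all but the last byte, and the known byte count lets readString allocate once.

```java
import java.io.*;

public class StringFormatDemo {
  static void writeVInt(OutputStream out, int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);  // low 7 bits, continuation bit set
      i >>>= 7;
    }
    out.write(i);
  }

  static int readVInt(InputStream in) throws IOException {
    int b = in.read();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.read();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  static void writeString(OutputStream out, String s) throws IOException {
    byte[] utf8 = s.getBytes("UTF-8");
    writeVInt(out, utf8.length);     // byte count, not char count
    out.write(utf8);
  }

  static String readString(InputStream in) throws IOException {
    int length = readVInt(in);
    byte[] utf8 = new byte[length];
    in.read(utf8, 0, length);        // single allocation: length known up front
    return new String(utf8, 0, length, "UTF-8");
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeString(out, "s\u00f8ren"); // 5 chars, 6 UTF-8 bytes
    System.out.println(readString(new ByteArrayInputStream(out.toByteArray())));
  }
}
```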
I also had a hack at TermBuffer, TermInfosWriter, and StringHelper,
in an attempt to convert the prefix length for the Term Dictionary
from chars to UTF-8 bytes. (Note that I've left off changing
TermVectors for now.)
A major change has to be inflicted on TermBuffer in order for it to
work with a bytecount: the char[] "text" buffer must be swapped out
for a byte[] "utf8Bytes" buffer. With that change in place,
TermBuffer's 'read' method can utilize readBytes instead of readChars:
  public final void read(IndexInput input, FieldInfos fieldInfos)
       throws IOException {
    this.term = null;                           // invalidate cache
    int start = input.readVInt();
    int length = input.readVInt();
    int totalLength = start + length;
    setutf8Length(totalLength);
    input.readBytes(this.utf8Bytes, start, length);
    this.field = fieldInfos.fieldName(input.readVInt());
  }
I'd thought this might speed up scanning through a SegmentTermEnum,
since the intermediate step of converting UTF-8 bytes to chars was
deferred until toTerm gets called. However, a ScanEnum benchmarker I
cooked up, which calls next() a whole bunch, indicated a wash.
Concerns were raised before about whether it would be necessary to
convert all strings to UTF-8 to calculate prefix length for the Term
Dictionary. Yes, it's necessary to convert them -- though the same
copy can be used as a byte buffer which obviates the need to call
writeChars.
  private final void writeTerm(Term term)
       throws IOException {
    byte[] bytes = term.text().getBytes("UTF-8");
    int totalLength = bytes.length;

    int start = StringHelper.bytesDifference(lastBytes, bytes);
    int length = totalLength - start;

    output.writeVInt(start);                    // write shared prefix length
    output.writeVInt(length);                   // write delta length
    for (int i = start; i < totalLength; i++) {
      output.writeByte(bytes[i]);               // write delta UTF-8 bytes
    }
    output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num

    lastTerm = term;
    lastBytes = bytes;
  }
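To see what bytesDifference is doing in isolation, here is a small standalone sketch (the helper is re-declared locally; the real one lives in StringHelper per the patch below): convert each term to UTF-8 once, count the shared leading bytes, and only the delta bytes hit the output.

```java
import java.io.UnsupportedEncodingException;

public class PrefixDemo {
  // Count of leading bytes common to both arrays, as in the patch's
  // StringHelper.bytesDifference.
  static int bytesDifference(byte[] bytes1, byte[] bytes2) {
    int len = Math.min(bytes1.length, bytes2.length);
    for (int i = 0; i < len; i++) {
      if (bytes1[i] != bytes2[i])
        return i;
    }
    return len;
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    byte[] last = "apple".getBytes("UTF-8");
    byte[] next = "apply".getBytes("UTF-8");
    int start = bytesDifference(last, next);    // 4 shared bytes: "appl"
    int delta = next.length - start;            // 1 delta byte:   "y"
    System.out.println(start + " shared, " + delta + " delta");
  }
}
```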
Unfortunately, once the changes to TermBuffer, TermInfosWriter, and
StringHelper are applied, execution speed at index-time suffers a
slowdown of about 20%. Perhaps this can be blamed on all the calls
to getBytes("UTF-8") in TermInfosWriter? Maybe alternative
implementations using ByteBuffer, CharsetDecoder, and CharsetEncoder
are possible that can mitigate the problem?
A patch for all 5 files against repository revision 329640 is below;
to apply it, cd to the trunk directory first.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermBuffer.java src/java/org/apache/lucene/index/TermBuffer.java
--- src_old/java/org/apache/lucene/index/TermBuffer.java	2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermBuffer.java	2005-10-30 13:27:47.000000000 -0800
@@ -20,40 +20,38 @@
import org.apache.lucene.store.IndexInput;
final class TermBuffer implements Cloneable {
- private static final char[] NO_CHARS = new char[0];
+ private static final byte[] NO_BYTES = new byte[0];
private String field;
- private char[] text = NO_CHARS;
- private int textLength;
+ private byte[] utf8Bytes = NO_BYTES;
+ private int utf8Length;
private Term term; // cached
public final int compareTo(TermBuffer other) {
if (field == other.field) // fields are interned
-      return compareChars(text, textLength, other.text, other.textLength);
+ return compareUTF8(utf8Bytes, utf8Length,
+ other.utf8Bytes, other.utf8Length);
else
return field.compareTo(other.field);
}
- private static final int compareChars(char[] v1, int len1,
- char[] v2, int len2) {
+ private static final int compareUTF8(byte[] v1, int len1,
+ byte[] v2, int len2) {
int end = Math.min(len1, len2);
for (int k = 0; k < end; k++) {
- char c1 = v1[k];
- char c2 = v2[k];
- if (c1 != c2) {
- return c1 - c2;
- }
+ if (v1[k] != v2[k])
+ return v1[k] - v2[k];
}
return len1 - len2;
}
- private final void setTextLength(int newLength) {
- if (text.length < newLength) {
- char[] newText = new char[newLength];
- System.arraycopy(text, 0, newText, 0, textLength);
- text = newText;
+ private final void setutf8Length(int newLength) {
+ if (utf8Bytes.length < newLength) {
+ byte[] newBytes = new byte[newLength];
+ System.arraycopy(utf8Bytes, 0, newBytes, 0, utf8Length);
+ utf8Bytes = newBytes;
}
- textLength = newLength;
+ utf8Length = newLength;
}
public final void read(IndexInput input, FieldInfos fieldInfos)
@@ -62,28 +60,30 @@
int start = input.readVInt();
int length = input.readVInt();
int totalLength = start + length;
- setTextLength(totalLength);
- input.readChars(this.text, start, length);
+ setutf8Length(totalLength);
+ input.readBytes(this.utf8Bytes, start, length);
this.field = fieldInfos.fieldName(input.readVInt());
}
- public final void set(Term term) {
- if (term == null) {
+ public final void set(Term t) {
+ if (t == null) {
reset();
return;
}
- // copy text into the buffer
- setTextLength(term.text().length());
- term.text().getChars(0, term.text().length(), text, 0);
+ // convert chars into UTF-8 bytes, store in buffer
+ try {
+ utf8Bytes = t.text().getBytes("UTF-8");
+ } catch (java.io.UnsupportedEncodingException e) { }
+ setutf8Length(utf8Bytes.length);
- this.field = term.field();
- this.term = term;
+ this.field = t.field();
+ this.term = t;
}
public final void set(TermBuffer other) {
- setTextLength(other.textLength);
- System.arraycopy(other.text, 0, text, 0, textLength);
+ setutf8Length(other.utf8Length);
+ System.arraycopy(other.utf8Bytes, 0, utf8Bytes, 0, utf8Length);
this.field = other.field;
this.term = other.term;
@@ -91,7 +91,7 @@
public void reset() {
this.field = null;
- this.textLength = 0;
+ this.utf8Length = 0;
this.term = null;
}
@@ -100,7 +100,9 @@
return null;
if (term == null)
- term = new Term(field, new String(text, 0, textLength), false);
+ try {
+        term = new Term(field, new String(utf8Bytes, 0, utf8Length, "UTF-8"), false);
+ } catch (java.io.UnsupportedEncodingException e) { }
return term;
}
@@ -111,8 +113,8 @@
clone = (TermBuffer)super.clone();
} catch (CloneNotSupportedException e) {}
- clone.text = new char[text.length];
- System.arraycopy(text, 0, clone.text, 0, textLength);
+ clone.utf8Bytes = new byte[utf8Bytes.length];
+ System.arraycopy(utf8Bytes, 0, clone.utf8Bytes, 0, utf8Length);
return clone;
}
diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermInfosWriter.java src/java/org/apache/lucene/index/TermInfosWriter.java
--- src_old/java/org/apache/lucene/index/TermInfosWriter.java	2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermInfosWriter.java	2005-10-30 13:47:43.000000000 -0800
@@ -33,6 +33,8 @@
private IndexOutput output;
private Term lastTerm = new Term("", "");
private TermInfo lastTi = new TermInfo();
+ private byte[] NO_BYTES = new byte[0];
+ private byte[] lastBytes = NO_BYTES;
private long size = 0;
  // TODO: the default values for these two parameters should be settable from
@@ -91,8 +93,10 @@
     TermInfo pointers must be positive and greater than all previous.*/
final void add(Term term, TermInfo ti)
throws IOException {
- if (!isIndex && term.compareTo(lastTerm) <= 0)
- throw new IOException("term out of order");
+ if (!isIndex && term.compareTo(lastTerm) <= 0) {
+ throw new IOException("term out of order" + lastTerm.field +
+ "--" + lastTerm.text + "--" + term.field + "--" + term.text);
+ }
if (ti.freqPointer < lastTi.freqPointer)
throw new IOException("freqPointer out of order");
if (ti.proxPointer < lastTi.proxPointer)
@@ -121,16 +125,22 @@
private final void writeTerm(Term term)
throws IOException {
-    int start = StringHelper.stringDifference(lastTerm.text, term.text);
- int length = term.text.length() - start;
+ byte[] bytes = term.text().getBytes("UTF-8");
+ int totalLength = bytes.length;
+
+ int start = StringHelper.bytesDifference(lastBytes, bytes);
+ int length = totalLength - start;
     output.writeVInt(start);                       // write shared prefix length
output.writeVInt(length); // write delta length
- output.writeChars(term.text, start, length); // write delta chars
+    for (int i = start ; i < totalLength; i++) {
+      output.writeByte(bytes[i]);                  // write delta UTF-8 bytes
+    }
     output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num
lastTerm = term;
+ lastBytes = bytes;
}
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexInput.java src/java/org/apache/lucene/store/IndexInput.java
--- src_old/java/org/apache/lucene/store/IndexInput.java	2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexInput.java	2005-10-30 13:27:47.000000000 -0800
@@ -23,7 +23,7 @@
* @see Directory
*/
public abstract class IndexInput implements Cloneable {
-  private char[] chars;                          // used by readString()
+  private byte[] utf8Bytes;                      // used by readString()
/** Reads and returns a single byte.
* @see IndexOutput#writeByte(byte)
@@ -87,32 +87,66 @@
*/
public String readString() throws IOException {
int length = readVInt();
- if (chars == null || length > chars.length)
- chars = new char[length];
- readChars(chars, 0, length);
- return new String(chars, 0, length);
- }
-
- /** Reads UTF-8 encoded characters into an array.
+ if (utf8Bytes == null || length > utf8Bytes.length)
+ utf8Bytes = new byte[length];
+ readBytes(utf8Bytes, 0, length);
+ return new String(utf8Bytes, 0, length, "UTF-8");
+ }
+
+ static final byte[] TRAILING_BYTES_FOR_UTF8 = {
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+ 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
+ 3,3,3,3,3,3,3,3
+ };
+
+ /** Reads UTF-8 encoded characters into an array.
* @param buffer the array to read characters into
* @param start the offset in the array to start storing characters
* @param length the number of characters to read
* @see IndexOutput#writeChars(String,int,int)
*/
- public void readChars(char[] buffer, int start, int length)
+ public void readChars(char[] buffer, int start, int length)
throws IOException {
- final int end = start + length;
+ final int end = start + length;
for (int i = start; i < end; i++) {
- byte b = readByte();
- if ((b & 0x80) == 0)
- buffer[i] = (char)(b & 0x7F);
- else if ((b & 0xE0) != 0xE0) {
- buffer[i] = (char)(((b & 0x1F) << 6)
- | (readByte() & 0x3F));
- } else
- buffer[i] = (char)(((b & 0x0F) << 12)
- | ((readByte() & 0x3F) << 6)
- | (readByte() & 0x3F));
+ byte b = readByte();
+ switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
+ case 0:
+ buffer[i] = (char)(b & 0x7F);
+ break;
+ case 1:
+ buffer[i] = (char)(((b & 0x1F) << 6)
+ | (readByte() & 0x3F));
+ break;
+ case 2:
+ buffer[i] = (char)(((b & 0x0F) << 12)
+ | ((readByte() & 0x3F) << 6)
+ | (readByte() & 0x3F));
+ break;
+ case 3:
+ int utf32 = (((b & 0x0F) << 18)
+ | ((readByte() & 0x3F) << 12)
+ | ((readByte() & 0x3F) << 6)
+ | (readByte() & 0x3F));
+ buffer[i] = (char)((utf32 >> 10) + 0xD7C0);
+ i++;
+ buffer[i] = (char)((utf32 & 0x03FF) + 0xDC00);
+ break;
+ }
}
}
@@ -148,7 +182,7 @@
clone = (IndexInput)super.clone();
} catch (CloneNotSupportedException e) {}
- clone.chars = null;
+ clone.utf8Bytes = null;
return clone;
}
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexOutput.java src/java/org/apache/lucene/store/IndexOutput.java
--- src_old/java/org/apache/lucene/store/IndexOutput.java	2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexOutput.java	2005-10-30 13:27:47.000000000 -0800
@@ -25,6 +25,8 @@
*/
public abstract class IndexOutput {
+  private byte[] utf8Bytes;                      // used by writeString()
+
/** Writes a single byte.
* @see IndexInput#readByte()
*/
@@ -85,9 +87,11 @@
* @see IndexInput#readString()
*/
public void writeString(String s) throws IOException {
- int length = s.length();
+ utf8Bytes = s.getBytes("UTF-8");
+
+ int length = utf8Bytes.length;
writeVInt(length);
- writeChars(s, 0, length);
+ writeBytes(utf8Bytes, length);
}
/** Writes a sequence of UTF-8 encoded characters from a string.
diff -u -r -X../excludefile src_old/java/org/apache/lucene/util/StringHelper.java src/java/org/apache/lucene/util/StringHelper.java
--- src_old/java/org/apache/lucene/util/StringHelper.java	2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/util/StringHelper.java	2005-10-30 13:48:07.000000000 -0800
@@ -44,7 +44,26 @@
return len;
}
-
+ /**
+ * Compares two byte[] arrays, element by element, and returns the
+ * number of elements common to both arrays.
+ *
+ * @param bytes1 The first byte[] to compare
+ * @param bytes2 The second byte[] to compare
+ * @return The number of common elements.
+ */
+  public static final int bytesDifference(byte[] bytes1, byte[] bytes2) {
+ int len1 = bytes1.length;
+ int len2 = bytes2.length;
+ int len = len1 < len2 ? len1 : len2;
+ for (int i = 0; i < len; i++) {
+ if (bytes1[i] != bytes2[i]) {
+ return i;
+ }
+ }
+ return len;
+ }
+
private StringHelper() {
}
}
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: bytecount as String and prefix length
Posted by Doug Cutting <cu...@apache.org>.
Marvin Humphrey wrote:
> I think it's time to throw in the towel.
Please don't give up. I think you're quite close.
I would be careful using CharBuffer instead of char[] unless you're sure
all methods you call are very efficient. You could try avoiding
CharBuffer by adding something (ugly) like an int[] charLength parameter
where the length is stored, then simply return the char[], in case you
had to re-allocate.
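A rough sketch of that out-parameter idea (mine, not Doug's; names are made up, and the decode loop is ASCII-only for brevity): the caller hands in a one-element int[] for the length, and the method returns the char[], which may be a fresh allocation if the one passed in was too small.

```java
public class OutParamDemo {
  // Returns a char[] holding the decoded text; charLength[0] reports how
  // many chars are valid. The returned array may differ from `buffer`
  // when a re-allocation was needed.
  static char[] toChars(byte[] utf8, char[] buffer, int[] charLength) {
    if (buffer.length < utf8.length)      // worst case: one char per byte
      buffer = new char[utf8.length];
    int j = 0;
    for (int i = 0; i < utf8.length; i++) // ASCII-only decode, for brevity
      buffer[j++] = (char)(utf8[i] & 0x7F);
    charLength[0] = j;
    return buffer;
  }

  public static void main(String[] args) throws Exception {
    int[] len = new int[1];
    char[] buf = toChars("hello".getBytes("UTF-8"), new char[0], len);
    System.out.println(new String(buf, 0, len[0]));
  }
}
```

No exceptions drive the control flow here; growth is handled by the ordinary length check up front.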
Using an exception for control flow is not only considered bad style,
but is also a performance pitfall. A catch that is unused is free, but
catching and throwing exceptions is expensive.
A complete diff would make it easier to see if there's something else
which may be slowing things, e.g., inadvertently allocating something
for each call. Also, someone else could try to tune it more.
Have you profiled this? Sun's built-in profiler can be useful. It can
give you allocation counts, so you can see if you're allocating
significantly more objects. It also gives call counts (in the old 'java
-prof' mode) which are useful. It slows things so much that I find
relative performance of methods to be nearly worthless, but call and
allocation counts can point out problems.
Cheers,
Doug
Re: bytecount as String and prefix length
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Nov 1, 2005, at 9:51 AM, Doug Cutting wrote:
> Another approach might be to, instead of converting UTF-8 to
> strings right away, change things to convert lazily, if at all.
> During index merging such conversion should never be needed.
!!
There ought to be some gains possible there, then. No predictions as
to how much, though.
> You needn't do this systematically throughout Lucene, but only
> where it makes a big difference. For example, if you could avoid
> strings in SegmentMerger.mergeTermInfos() it might make a huge
> difference. This might be as simple as changing SegmentMergeInfo
> to use a TermBuffer instead of a Term. Does that make sense?
Abundant sense. I'm not as familiar with SegmentMerger as I am with
other parts of the org.apache.lucene.index package, because I haven't
ported it yet. But conceptually I understand exactly why this should
require fewer resources.
I'll take a swing at SegmentMerger and submit a comprehensive diff.
Thanks for the suggestions,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: bytecount as String and prefix length
Posted by Doug Cutting <cu...@apache.org>.
Another approach might be to, instead of converting UTF-8 to strings
right away, change things to convert lazily, if at all. During index
merging such conversion should never be needed. You needn't do this
systematically throughout Lucene, but only where it makes a big
difference. For example, if you could avoid strings in
SegmentMerger.mergeTermInfos() it might make a huge difference. This
might be as simple as changing SegmentMergeInfo to use a TermBuffer
instead of a Term. Does that make sense?
Doug
Re: bytecount as String and prefix length
Posted by Yonik Seeley <ys...@gmail.com>.
Thanks for looking into this Marvin... very interesting stuff!
I haven't had a chance to review it in detail, but my gut tells me
that it should be able to be faster.
-Yonik
Now hiring -- http://forms.cnet.com/slink?231706
Re: bytecount as String and prefix length
Posted by Marvin Humphrey <ma...@rectangular.com>.
I wrote:
> I've got one more idea... time to try overriding readString and
> writeString in BufferedIndexInput and BufferedIndexOutput, to take
> advantage of buffers that are already there.
Too complicated to be worthwhile, it turns out. I think it's time to
throw in the towel. Frustrating, because I was shooting for a
win-win solution (bytecounts and valid UTF-8 both make things a lot
easier on the Perl side). My valid-UTF-8 patches from a month ago
perform better than the current Java implementation, but bytecounts
make for slower indexing. Time to re-rethink the goal of
index-compatibility. How about, compatible for ascii-only indexes? :|
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: bytecount as String and prefix length
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:
> All of the JDK source is available via download from Sun.
Thanks. I believe the UTF-8 coding algos can be found in...
j2se > src > share > classes > sun > nio > cs > UTF_8.java
It looks like the translator methods have fairly high loop overheads,
since they have to keep track of the member variables of ByteBuffer
and CharBuffer objects and prepare to return result objects on each
loop iter. Also, they have robust error-checking for malformed
source data, which Lucene traditionally has not. The algo below my
sig should be faster.
I wrote...
> So my next step is to write a utf8ToString method that's as efficient
> as I can make it.
Ok, this time we made a little headway. We're down from 20% slower
to around 10% slower indexing than the current implementation. But I
conditional in FieldsReader that can be simplified.
There's another downside to the way I'm implementing this right now.
The byteBuf and charBuf have to be kept somewhere. Currently, I'm
allocating a ByteBuffer for each TermInfosWriter and a charBuf for
each TermBuffer. That's something of a memory hit, though it's hard
to say exactly how much. IndexInput and IndexOutput are still using
the Sun methods -- when I gave them Buffers, they slowed down.
I've got one more idea... time to try overriding readString and
writeString in BufferedIndexInput and BufferedIndexOutput, to take
advantage of buffers that are already there.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
//----------------------------------------------------------------
public static final CharBuffer utf8ToChars(
    byte[] bytes, int start, int length, CharBuffer charBuf) {
  int i = start;
  int j = 0;
  final int end = start + length;
  char[] chars = charBuf.array();
  try {
    while (i < end) {
      byte b = bytes[i++];
      switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
        case 0:
          chars[j++] = (char)(b & 0x7F);
          break;
        case 1:
          chars[j++] = (char)(((b & 0x1F) << 6)
                             | (bytes[i++] & 0x3F));
          break;
        case 2:
          chars[j++] = (char)(((b & 0x0F) << 12)
                             | ((bytes[i++] & 0x3F) << 6)
                             | (bytes[i++] & 0x3F));
          break;
        case 3:
          int utf32 = (((b & 0x0F) << 18)
                      | ((bytes[i++] & 0x3F) << 12)
                      | ((bytes[i++] & 0x3F) << 6)
                      | (bytes[i++] & 0x3F));
          chars[j++] = (char)((utf32 >> 10) + 0xD7C0);
          chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00);
          break;
      }
    }
  }
  catch (ArrayIndexOutOfBoundsException e) {
    // guess how many more chars it will take, plus 10%
    float bytesProcessed = (float)(i - start);
    float charsPerByte = (j / bytesProcessed) * 1.1f;
    float bytesLeft = length - bytesProcessed;
    float targetSize = (float)chars.length + charsPerByte * bytesLeft + 1.0f;
    return utf8ToChars(bytes, start, length,
                       CharBuffer.allocate((int)targetSize));
  }
  charBuf.position(j);
  return charBuf;
}
RE: bytecount as String and prefix length
Posted by Robert Engels <re...@ix.netcom.com>.
All of the JDK source is available via download from Sun.
-----Original Message-----
From: Marvin Humphrey [mailto:marvin@rectangular.com]
Sent: Monday, October 31, 2005 6:31 PM
To: java-dev@lucene.apache.org
Subject: Re: bytecount as String and prefix length
Re: bytecount as String and prefix length
Posted by Marvin Humphrey <ma...@rectangular.com>.
I wrote...
> I think I'll take a crack at a custom charsToUTF8 converter algo.
Still no luck. Still 20% slower than the current implementation.
The algo is below, for reference.
It's entirely possible that my patches are doing something dumb
that's causing this, given my limited experience with Java. But if
that's not the case, I can think of two other explanations.
One is that the passage of the text through an intermediate buffer
before blasting it out is considerably more expensive than anticipated.
The other is that the pre-allocation of a char[] array based on the
length VInt yields a significant benefit over the standard techniques
for reading in UTF-8. That wouldn't be hard to believe. Without
that number, there's a lot of guesswork involved. English requires
about 1.1 bytes per UTF-8 code point; Japanese, 3. Multiple memory
allocation ops may be required as bytes get read in, especially if
the final String object kicked out HAS to use the bare minimum amount
of memory. I don't suppose there's any way for me to snoop just
what's happening under the hood in these CharsetDecoder classes or
String constructors, is there?
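The bytes-per-char figures above are easy to verify directly (illustrative snippet, not from the thread; the 1.1 figure for English reflects occasional punctuation and accented characters in real text, while pure ASCII is exactly 1.0):

```java
import java.io.UnsupportedEncodingException;

public class BytesPerCharDemo {
  static double bytesPerChar(String s) throws UnsupportedEncodingException {
    return (double)s.getBytes("UTF-8").length / s.length();
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    // ASCII: one byte per char
    System.out.println(bytesPerChar("plain ascii text"));
    // Japanese (BMP, U+0800..U+FFFF): three bytes per char
    System.out.println(bytesPerChar("\u65e5\u672c\u8a9e\u306e\u6587\u5b57"));
  }
}
```

Without the byte count in the file, a reader has to guess between those ratios when sizing its buffers.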
Scanning through a SegmentTermEnum with next() doesn't seem to be any
slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs
benchmarker doesn't slow down that much when IndexInput is changed to
use a String constructor that accepts UTF-8 bytes rather than chars.
However, it's possible that the modified toTerm method of TermBuffer
is a bottleneck, as it also uses the UTF-8 String constructor. It
doesn't get exercised under SegmentTermEnum.next(), but during
merging of segments I believe it sees plenty of action -- maybe a lot
more than IndexInput's readString.
So my next step is to write a utf8ToString method that's as efficient
as I can make it. After that... I dunno, I'm running out of ideas.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
public static final ByteBuffer stringToUTF8(
    String s, int start, int length, ByteBuffer byteBuf) {
  byteBuf.clear();
  int i = start;
  int j = 0;
  try {
    final int end = start + length;
    byte[] bytes = byteBuf.array();
    for ( ; i < end; i++) {
      final int code = (int)s.charAt(i);
      if (code < 0x80)
        bytes[j++] = (byte)code;
      else if (code < 0x800) {
        bytes[j++] = (byte)(0xC0 | (code >> 6));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      } else if (code < 0xD800 || code > 0xDFFF) {
        bytes[j++] = (byte)(0xE0 | (code >>> 12));
        bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      } else {
        // surrogate pair
        int utf32;
        // confirm valid high surrogate
        if (code < 0xDC00 && (i < end - 1)) {
          utf32 = ((int)s.charAt(i + 1));
          // confirm valid low surrogate and write pair
          if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
            utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
            i++;
            bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
            bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
            continue;
          }
        }
        // replace unpaired surrogate or out-of-order low surrogate
        // with substitution character
        bytes[j++] = (byte)0xEF;
        bytes[j++] = (byte)0xBF;
        bytes[j++] = (byte)0xBD;
      }
    }
  }
  catch (ArrayIndexOutOfBoundsException e) {
    // guess how many more bytes it will take, plus 10%
    float charsProcessed = (float)(i - start);
    float bytesPerChar = (j / charsProcessed) * 1.1f;
    float charsLeft = length - charsProcessed;
    float targetSize = (float)byteBuf.capacity() + bytesPerChar * charsLeft + 1.0f;
    return stringToUTF8(s, start, length, ByteBuffer.allocate((int)targetSize));
  }
  byteBuf.position(j);
  return byteBuf;
}
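One way to sanity-check the surrogate-pair branch of an encoder like the one above (my check, not part of the thread): Java's own UTF-8 encoder should produce the same 4-byte sequence for a supplementary character. U+1D11E (musical G clef) is the pair "\uD834\uDD1E" in Java chars:

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class SurrogateDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    String clef = "\uD834\uDD1E";          // one code point, two Java chars
    byte[] expected = { (byte)0xF0, (byte)0x9D, (byte)0x84, (byte)0x9E };
    System.out.println(Arrays.equals(clef.getBytes("UTF-8"), expected));
  }
}
```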
Re: bytecount as String and prefix length
Posted by Marvin Humphrey <ma...@rectangular.com>.
I wrote...
> Unfortunately, once the changes to TermBuffer, TermInfosWriter, and
> StringHelper are applied, execution speed at index-time suffers a
> slowdown of about 20%. Perhaps this can be blamed on all the calls
> to getBytes("UTF-8") in TermInfosWriter? Maybe alternative
> implementations using ByteBuffer, CharsetDecoder, and
> CharsetEncoder are possible that can mitigate the problem?
Nope.
The version of writeTerm below is about the same speed as the one
with the calls to getBytes("UTF-8").
I think I'll take a crack at a custom charsToUTF8 converter algo.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
//-------------------------------------------------------------------------
private final void writeTerm(Term term)
     throws IOException {
  byteBuf.clear();
  while (true) {
    CoderResult status = utf8Encoder.encode(CharBuffer.wrap(term.text()),
                                            byteBuf, false);
    if (status.isOverflow()) {
      bufSize += 32;
      byteBuf = ByteBuffer.allocate(bufSize);
      utf8Encoder.reset();
    }
    else {
      break;
    }
  }
  int totalLength = byteBuf.position();
  int start = StringHelper.bytesDifference(lastByteBuf, byteBuf);
  int length = totalLength - start;

  output.writeVInt(start);                    // write shared prefix length
  output.writeVInt(length);                   // write delta length
  byte[] bytes = byteBuf.array();
  for (int i = start; i < totalLength; i++) {
    output.writeByte(bytes[i]);               // write delta UTF-8 bytes
  }
  output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num

  lastTerm = term;

  // swap byteBuf and lastByteBuf
  scratchByteBuf = lastByteBuf;
  lastByteBuf = byteBuf;
  byteBuf = scratchByteBuf;
}