Posted to dev@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2005/10/30 23:39:31 UTC

bytecount as String and prefix length

Greets,

I've been experimenting with using the UTF-8 bytecount as the VInt  
count at the top of Lucene's string format, as was discussed back in  
the "Lucene does NOT use UTF-8" thread.  Changes were made to  
IndexInput and IndexOutput as per some of Robert Engels's  
suggestions.  Here's the implementation of writeString, which chooses  
the memory hit of buffering over the performance hit of pre-scanning:

   public void writeString(String s) throws IOException {
     utf8Bytes = s.getBytes("UTF-8");
     int length = utf8Bytes.length;
     writeVInt(length);
     writeBytes(utf8Bytes, length);
   }

That, in conjunction with a similar implementation of readString and  
the UTF-8-clean implementation of readChars I submitted a while back,  
executes 2-3% slower than the current implementation against my  
index-1000-wikipedia-articles benchmarker.
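
For reference, the matching readString (it shows up again in the  
IndexInput diff below) boils down to:

   public String readString() throws IOException {
     int length = readVInt();
     if (utf8Bytes == null || length > utf8Bytes.length)
       utf8Bytes = new byte[length];
     readBytes(utf8Bytes, 0, length);
     return new String(utf8Bytes, 0, length, "UTF-8");
   }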

I also had a hack at TermBuffer, TermInfosWriter, and StringHelper,  
in an attempt to convert the prefix length for the Term Dictionary  
from chars to UTF-8 bytes.  (Note that I've left off changing  
TermVectors for now.)

A major change has to be inflicted on TermBuffer in order for it to  
work with a bytecount: the char[] "text" buffer must be swapped out  
for a byte[] "utf8Bytes" buffer.  With that change in place,  
TermBuffer's 'read' method can utilize readBytes instead of readChars:

   public final void read(IndexInput input, FieldInfos fieldInfos)
     throws IOException {
     this.term = null;                           // invalidate cache
     int start = input.readVInt();
     int length = input.readVInt();
     int totalLength = start + length;
     setutf8Length(totalLength);
     input.readBytes(this.utf8Bytes, start, length);
     this.field = fieldInfos.fieldName(input.readVInt());
   }

I'd thought this might speed up scanning through a SegmentTermEnum,  
since the intermediate step of converting UTF-8 bytes to chars was  
deferred until toTerm gets called (see the sketch below).  However, a  
ScanEnum benchmarker I cooked up, which calls next() a whole bunch,  
indicated a wash.
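
The deferred conversion lives in toTerm; pieced together from the  
TermBuffer diff below, it comes out roughly like this (sketch only --  
the hunk doesn't show all the unchanged context lines):

   public Term toTerm() {
     if (field == null)                        // unset
       return null;
     if (term == null) {
       try {
         term = new Term(field,
           new String(utf8Bytes, 0, utf8Length, "UTF-8"), false);
       } catch (java.io.UnsupportedEncodingException e) { }
     }
     return term;
   }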

Concerns were raised before about whether it would be necessary to  
convert all strings to UTF-8 to calculate prefix length for the Term  
Dictionary.  Yes, it's necessary to convert them -- though the same  
copy can be used as a byte buffer, which obviates the need to call  
writeChars.

   private final void writeTerm(Term term)
        throws IOException {
     byte[] bytes = term.text().getBytes("UTF-8");
     int totalLength = bytes.length;

     int start = StringHelper.bytesDifference(lastBytes, bytes);
     int length = totalLength - start;

     output.writeVInt(start);                   // write shared prefix length
     output.writeVInt(length);                  // write delta length
     for (int i = start ; i < totalLength; i++) {
       output.writeByte(bytes[i]);              // write delta UTF-8 bytes
     }

     output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num

     lastTerm = term;
     lastBytes = bytes;
   }


Unfortunately, once the changes to TermBuffer, TermInfosWriter, and  
StringHelper are applied, execution speed at index-time suffers a  
slowdown of about 20%.  Perhaps this can be blamed on all the calls  
to getBytes("UTF-8") in TermInfosWriter?  Maybe alternative  
implementations using ByteBuffer, CharsetDecoder, and CharsetEncoder  
are possible that can mitigate the problem?

A patch for all 5 files against repository revision 329640 is below;  
to apply it, cd to the trunk directory first.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermBuffer.java src/java/org/apache/lucene/index/TermBuffer.java
--- src_old/java/org/apache/lucene/index/TermBuffer.java    2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermBuffer.java    2005-10-30 13:27:47.000000000 -0800
@@ -20,40 +20,38 @@
import org.apache.lucene.store.IndexInput;
final class TermBuffer implements Cloneable {
-  private static final char[] NO_CHARS = new char[0];
+  private static final byte[] NO_BYTES = new byte[0];
    private String field;
-  private char[] text = NO_CHARS;
-  private int textLength;
+  private byte[] utf8Bytes = NO_BYTES;
+  private int utf8Length;
    private Term term;                            // cached
    public final int compareTo(TermBuffer other) {
      if (field == other.field)              // fields are interned
-      return compareChars(text, textLength, other.text, other.textLength);
+      return compareUTF8(utf8Bytes, utf8Length,
+        other.utf8Bytes, other.utf8Length);
      else
        return field.compareTo(other.field);
    }
-  private static final int compareChars(char[] v1, int len1,
-                                        char[] v2, int len2) {
+  private static final int compareUTF8(byte[] v1, int len1,
+                                       byte[] v2, int len2) {
      int end = Math.min(len1, len2);
      for (int k = 0; k < end; k++) {
-      char c1 = v1[k];
-      char c2 = v2[k];
-      if (c1 != c2) {
-        return c1 - c2;
-      }
+      if (v1[k] != v2[k])
+        return v1[k] - v2[k];
      }
      return len1 - len2;
    }
-  private final void setTextLength(int newLength) {
-    if (text.length < newLength) {
-      char[] newText = new char[newLength];
-      System.arraycopy(text, 0, newText, 0, textLength);
-      text = newText;
+  private final void setutf8Length(int newLength) {
+    if (utf8Bytes.length < newLength) {
+      byte[] newBytes = new byte[newLength];
+      System.arraycopy(utf8Bytes, 0, newBytes, 0, utf8Length);
+      utf8Bytes = newBytes;
      }
-    textLength = newLength;
+    utf8Length = newLength;
    }
    public final void read(IndexInput input, FieldInfos fieldInfos)
@@ -62,28 +60,30 @@
      int start = input.readVInt();
      int length = input.readVInt();
      int totalLength = start + length;
-    setTextLength(totalLength);
-    input.readChars(this.text, start, length);
+    setutf8Length(totalLength);
+    input.readBytes(this.utf8Bytes, start, length);
      this.field = fieldInfos.fieldName(input.readVInt());
    }
-  public final void set(Term term) {
-    if (term == null) {
+  public final void set(Term t)  {
+    if (t == null) {
        reset();
        return;
      }
-    // copy text into the buffer
-    setTextLength(term.text().length());
-    term.text().getChars(0, term.text().length(), text, 0);
+    // convert chars into UTF-8 bytes, store in buffer
+    try {
+        utf8Bytes = t.text().getBytes("UTF-8");
+    } catch (java.io.UnsupportedEncodingException e) { }
+    setutf8Length(utf8Bytes.length);
-    this.field = term.field();
-    this.term = term;
+    this.field = t.field();
+    this.term = t;
    }
    public final void set(TermBuffer other) {
-    setTextLength(other.textLength);
-    System.arraycopy(other.text, 0, text, 0, textLength);
+    setutf8Length(other.utf8Length);
+    System.arraycopy(other.utf8Bytes, 0, utf8Bytes, 0, utf8Length);
      this.field = other.field;
      this.term = other.term;
@@ -91,7 +91,7 @@
    public void reset() {
      this.field = null;
-    this.textLength = 0;
+    this.utf8Length = 0;
      this.term = null;
    }
@@ -100,7 +100,9 @@
        return null;
      if (term == null)
-      term = new Term(field, new String(text, 0, textLength), false);
+    try {
+      term = new Term(field, new String(utf8Bytes, 0, utf8Length, "UTF-8"), false);
+    } catch (java.io.UnsupportedEncodingException e) { }
      return term;
    }
@@ -111,8 +113,8 @@
        clone = (TermBuffer)super.clone();
      } catch (CloneNotSupportedException e) {}
-    clone.text = new char[text.length];
-    System.arraycopy(text, 0, clone.text, 0, textLength);
+    clone.utf8Bytes = new byte[utf8Bytes.length];
+    System.arraycopy(utf8Bytes, 0, clone.utf8Bytes, 0, utf8Length);
      return clone;
    }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermInfosWriter.java src/java/org/apache/lucene/index/TermInfosWriter.java
--- src_old/java/org/apache/lucene/index/TermInfosWriter.java    2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermInfosWriter.java    2005-10-30 13:47:43.000000000 -0800
@@ -33,6 +33,8 @@
    private IndexOutput output;
    private Term lastTerm = new Term("", "");
    private TermInfo lastTi = new TermInfo();
+  private byte[] NO_BYTES = new byte[0];
+  private byte[] lastBytes = NO_BYTES;
    private long size = 0;
  // TODO: the default values for these two parameters should be settable from
@@ -91,8 +93,10 @@
     TermInfo pointers must be positive and greater than all previous.*/
    final void add(Term term, TermInfo ti)
         throws IOException {
-    if (!isIndex && term.compareTo(lastTerm) <= 0)
-      throw new IOException("term out of order");
+    if (!isIndex && term.compareTo(lastTerm) <= 0) {
+      throw new IOException("term out of order" + lastTerm.field +
+      "--" + lastTerm.text + "--" + term.field + "--" + term.text);
+    }
      if (ti.freqPointer < lastTi.freqPointer)
        throw new IOException("freqPointer out of order");
      if (ti.proxPointer < lastTi.proxPointer)
@@ -121,16 +125,22 @@
    private final void writeTerm(Term term)
         throws IOException {
-    int start = StringHelper.stringDifference(lastTerm.text, term.text);
-    int length = term.text.length() - start;
+    byte[] bytes = term.text().getBytes("UTF-8");
+    int totalLength = bytes.length;
+
+    int start = StringHelper.bytesDifference(lastBytes, bytes);
+    int length = totalLength - start;
      output.writeVInt(start);                   // write shared prefix length
      output.writeVInt(length);                  // write delta length
-    output.writeChars(term.text, start, length);  // write delta chars
+    for (int i = start ; i < totalLength; i++) {
+      output.writeByte(bytes[i]);              // write delta UTF-8 bytes
+    }
      output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num
      lastTerm = term;
+    lastBytes = bytes;
    }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexInput.java src/java/org/apache/lucene/store/IndexInput.java
--- src_old/java/org/apache/lucene/store/IndexInput.java    2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexInput.java    2005-10-30 13:27:47.000000000 -0800
@@ -23,7 +23,7 @@
   * @see Directory
   */
public abstract class IndexInput implements Cloneable {
-  private char[] chars;                           // used by readString()
+  private byte[] utf8Bytes;                        // used by readString()
    /** Reads and returns a single byte.
     * @see IndexOutput#writeByte(byte)
@@ -87,32 +87,66 @@
     */
    public String readString() throws IOException {
      int length = readVInt();
-    if (chars == null || length > chars.length)
-      chars = new char[length];
-    readChars(chars, 0, length);
-    return new String(chars, 0, length);
-  }
-
-  /** Reads UTF-8 encoded characters into an array.
+    if (utf8Bytes == null || length > utf8Bytes.length)
+      utf8Bytes = new byte[length];
+    readBytes(utf8Bytes, 0, length);
+    return new String(utf8Bytes, 0, length, "UTF-8");
+  }
+
+  static final byte[] TRAILING_BYTES_FOR_UTF8 = {
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
+    3,3,3,3,3,3,3,3
+  };
+
+ /** Reads UTF-8 encoded characters into an array.
     * @param buffer the array to read characters into
     * @param start the offset in the array to start storing characters
     * @param length the number of characters to read
     * @see IndexOutput#writeChars(String,int,int)
     */
-  public void readChars(char[] buffer, int start, int length)
+  public void readChars(char[] buffer, int start, int length)
         throws IOException {
-    final int end = start + length;
+    final int end = start + length;
      for (int i = start; i < end; i++) {
-      byte b = readByte();
-      if ((b & 0x80) == 0)
-    buffer[i] = (char)(b & 0x7F);
-      else if ((b & 0xE0) != 0xE0) {
-    buffer[i] = (char)(((b & 0x1F) << 6)
-         | (readByte() & 0x3F));
-      } else
-    buffer[i] = (char)(((b & 0x0F) << 12)
-        | ((readByte() & 0x3F) << 6)
-            |  (readByte() & 0x3F));
+      byte b = readByte();
+      switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
+        case 0:
+          buffer[i] = (char)(b & 0x7F);
+          break;
+        case 1:
+          buffer[i] = (char)(((b & 0x1F) << 6)
+            | (readByte() & 0x3F));
+          break;
+        case 2:
+          buffer[i] = (char)(((b & 0x0F) << 12)
+            | ((readByte() & 0x3F) << 6)
+            |  (readByte() & 0x3F));
+          break;
+        case 3:
+          int utf32 = (((b & 0x0F) << 18)
+            | ((readByte() & 0x3F) << 12)
+            | ((readByte() & 0x3F) << 6)
+            |  (readByte() & 0x3F));
+          buffer[i] = (char)((utf32 >> 10) + 0xD7C0);
+          i++;
+          buffer[i] = (char)((utf32 & 0x03FF) + 0xDC00);
+          break;
+      }
      }
    }
@@ -148,7 +182,7 @@
        clone = (IndexInput)super.clone();
      } catch (CloneNotSupportedException e) {}
-    clone.chars = null;
+    clone.utf8Bytes = null;
      return clone;
    }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexOutput.java src/java/org/apache/lucene/store/IndexOutput.java
--- src_old/java/org/apache/lucene/store/IndexOutput.java    2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexOutput.java    2005-10-30 13:27:47.000000000 -0800
@@ -25,6 +25,8 @@
   */
public abstract class IndexOutput {
+  private byte[] utf8Bytes;                        // used by writeString()
+
    /** Writes a single byte.
     * @see IndexInput#readByte()
     */
@@ -85,9 +87,11 @@
     * @see IndexInput#readString()
     */
    public void writeString(String s) throws IOException {
-    int length = s.length();
+    utf8Bytes = s.getBytes("UTF-8");
+
+    int length = utf8Bytes.length;
      writeVInt(length);
-    writeChars(s, 0, length);
+    writeBytes(utf8Bytes, length);
    }
    /** Writes a sequence of UTF-8 encoded characters from a string.
diff -u -r -X../excludefile src_old/java/org/apache/lucene/util/StringHelper.java src/java/org/apache/lucene/util/StringHelper.java
--- src_old/java/org/apache/lucene/util/StringHelper.java    2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/util/StringHelper.java    2005-10-30 13:48:07.000000000 -0800
@@ -44,7 +44,26 @@
      return len;
    }
-
+  /**
+   * Compares two byte[] arrays, element by element, and returns the
+   * number of elements common to both arrays.
+   *
+   * @param bytes1 The first byte[] to compare
+   * @param bytes2 The second byte[] to compare
+   * @return The number of common elements.
+   */
+  public static final int bytesDifference(byte[] bytes1, byte[] bytes2) {
+    int len1 = bytes1.length;
+    int len2 = bytes2.length;
+    int len = len1 < len2 ? len1 : len2;
+    for (int i = 0; i < len; i++) {
+      if (bytes1[i] != bytes2[i]) {
+          return i;
+      }
+    }
+    return len;
+  }
+
    private StringHelper() {
    }
}




Re: bytecount as String and prefix length

Posted by Doug Cutting <cu...@apache.org>.
Marvin Humphrey wrote:
> I think it's time to throw in the towel.

Please don't give up.  I think you're quite close.

I would be careful using CharBuffer instead of char[] unless you're sure 
all methods you call are very efficient.  You could try avoiding 
CharBuffer by adding something (ugly) like an int[] charLength parameter 
where the length is stored, then simply return the char[], in case you 
had to re-allocate.
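
A rough sketch of that shape (every name here is made up for  
illustration, not taken from your patch):

   public static char[] utf8ToChars(byte[] bytes, int byteLen,
                                    char[] chars, int[] charLength) {
     // UTF-8 never decodes to more chars than it has bytes, so a
     // single up-front check guarantees capacity for the whole loop.
     if (chars == null || chars.length < byteLen)
       chars = new char[byteLen];
     int j = 0;
     for (int i = 0; i < byteLen; ) {
       int b = bytes[i++] & 0xFF;
       if (b < 0x80) {                          // 1-byte sequence
         chars[j++] = (char)b;
       } else if (b < 0xE0) {                   // 2-byte sequence
         chars[j++] = (char)(((b & 0x1F) << 6) | (bytes[i++] & 0x3F));
       } else if (b < 0xF0) {                   // 3-byte sequence
         chars[j++] = (char)(((b & 0x0F) << 12)
           | ((bytes[i++] & 0x3F) << 6)
           |  (bytes[i++] & 0x3F));
       } else {                                 // 4-byte -> surrogate pair
         int utf32 = ((b & 0x07) << 18)
           | ((bytes[i++] & 0x3F) << 12)
           | ((bytes[i++] & 0x3F) << 6)
           |  (bytes[i++] & 0x3F);
         chars[j++] = (char)((utf32 >> 10) + 0xD7C0);    // high surrogate
         chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00); // low surrogate
       }
     }
     charLength[0] = j;     // decoded length goes out via the int[]
     return chars;          // possibly a new, larger array
   }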

Using an exception for control flow is not only considered bad style, 
but is also a performance pitfall.  A catch that is unused is free, but 
catching and throwing exceptions is expensive.
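
For instance, a worst-case capacity check up front costs one comparison 
per call and avoids catching ArrayIndexOutOfBoundsException on the hot 
path entirely (3 bytes per Java char is a safe bound for UTF-8, since a 
surrogate pair -- two chars -- encodes to four bytes):

   // hypothetical fragment; byteBuf and charCount stand in for
   // whatever buffer and length the caller actually has
   if (byteBuf.length < 3 * charCount)
     byteBuf = new byte[3 * charCount];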

A complete diff would make it easier to see if there's something else 
which may be slowing things, e.g., inadvertently allocating something 
for each call.  Also, someone else could try to tune it more.

Have you profiled this?  Sun's built-in profiler can be useful.  It can 
give you allocation counts, so you can see if you're allocating 
significantly more objects.  It also gives call counts (in the old 'java 
-prof' mode) which are useful.  It slows things so much that I find 
relative performance of methods to be nearly worthless, but call and 
allocation counts can point out problems.
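
For example (the exact flags vary by JVM release; these are for a Sun 
1.4/1.5 VM, and the class name is just a placeholder):

   java -Xrunhprof:cpu=times YourBenchmarker    # call counts
   java -Xrunhprof:heap=sites YourBenchmarker   # allocation sites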

Cheers,

Doug



Re: bytecount as String and prefix length

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Nov 1, 2005, at 9:51 AM, Doug Cutting wrote:

> Another approach might be to, instead of converting UTF-8 to  
> strings right away, change things to convert lazily, if at all.
> During index merging such conversion should never be needed.

!!

There ought to be some gains possible there, then.  No predictions as  
to how much, though.

> You needn't do this systematically throughout Lucene, but only  
> where it makes a big difference.  For example, if you could avoid  
> strings in SegmentMerger.mergeTermInfos() it might make a huge  
> difference.  This might be as simple as changing SegmentMergeInfo  
> to use a TermBuffer instead of a Term.  Does that make sense?

Abundant sense.  I'm not as familiar with SegmentMerger as I am with  
other parts of the org.apache.lucene.index package, because I haven't  
ported it yet.  But conceptually I understand exactly why this should  
require fewer resources.

I'll take a swing at SegmentMerger and submit a comprehensive diff.

Thanks for the suggestions,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: bytecount as String and prefix length

Posted by Doug Cutting <cu...@apache.org>.
Another approach might be to, instead of converting UTF-8 to strings 
right away, change things to convert lazily, if at all.  During index 
merging such conversion should never be needed.  You needn't do this 
systematically throughout Lucene, but only where it makes a big 
difference.  For example, if you could avoid strings in 
SegmentMerger.mergeTermInfos() it might make a huge difference.  This 
might be as simple as changing SegmentMergeInfo to use a TermBuffer 
instead of a Term.  Does that make sense?
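
Loosely, something like this (sketch only -- SegmentTermEnum has no 
public accessor for its TermBuffer today, so the buffer() call below is 
assumed, not real):

   // SegmentMergeInfo holds a TermBuffer instead of a Term, so merge
   // comparisons run on the raw UTF-8 buffers and a String is only
   // built if toTerm() actually gets called.
   final class SegmentMergeInfo {
     TermBuffer termBuffer;                  // was: Term term;
     SegmentTermEnum termEnum;

     final boolean next() throws java.io.IOException {
       if (termEnum.next()) {
         termBuffer = termEnum.buffer();     // assumed accessor
         return true;
       }
       termBuffer = null;
       return false;
     }
   }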

Doug



Re: bytecount as String and prefix length

Posted by Yonik Seeley <ys...@gmail.com>.
Thanks for looking into this Marvin... very interesting stuff!
I haven't had a chance to review it in detail, but my gut tells me
that it should be able to be faster.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706



Re: bytecount as String and prefix length

Posted by Marvin Humphrey <ma...@rectangular.com>.
I wrote:

> I've got one more idea... time to try overriding readString and  
> writeString in BufferedIndexInput and BufferedIndexOutput, to take  
> advantage of buffers that are already there.

Too complicated to be worthwhile, it turns out.  I think it's time to  
throw in the towel.  Frustrating, because I was shooting for a  
win-win solution (bytecounts and valid UTF-8 both make things a lot  
easier on the Perl side).  My valid-UTF-8 patches from a month ago  
perform better than the current Java implementation, but bytecounts  
make for slower indexing.  Time to re-rethink the goal of  
index-compatibility.  How about, compatible for ascii-only indexes?  :|

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: bytecount as String and prefix length

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:

> All of the JDK source is available via download from Sun.

Thanks.  I believe the UTF-8 coding algos can be found in...

j2se > src > share > classes > sun > nio > cs > UTF_8.java

It looks like the translator methods have fairly high loop overheads,  
since they have to keep track of the member variables of ByteBuffer  
and CharBuffer objects and prepare to return result objects on each  
loop iter.  Also, they have robust error-checking for malformed  
source data, which Lucene traditionally has not.  The algo below my  
sig should be faster.

I wrote...

> So my next step is to write a utf8ToString method that's as efficient
> as I can make it.

Ok, this time we made a little headway.  We're down from 20% slower  
to around 10% slower indexing than current implementation.  But I  
don't see how I'm going to get it any faster.  There's maybe one  
conditional in FieldsReader that can be simplified.

There's another downside to the way I'm implementing this right now.   
The byteBuf and charBuf have to be kept somewhere.  Currently, I'm  
allocating a ByteBuffer for each TermInfosWriter and a charBuf for  
each TermBuffer.  That's something of a memory hit, though it's hard  
to say exactly how much.  IndexInput and IndexOutput are still using  
the Sun methods -- when I gave them Buffers, they slowed down.

I've got one more idea... time to try overriding readString and  
writeString in BufferedIndexInput and BufferedIndexOutput, to take  
advantage of buffers that are already there.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

//----------------------------------------------------------------

   public static final CharBuffer utf8ToChars (
         byte[] bytes, int start, int length, CharBuffer charBuf) {
     int i = start;
     int j = 0;
     final int end = start + length;
     char[] chars = charBuf.array();
     try {
       while (i < end) {
         byte b = bytes[i++];
         switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
           case 0:
             chars[j++] = (char)(b & 0x7F);
             break;
           case 1:
             chars[j++] = (char)(((b & 0x1F) << 6)
               | (bytes[i++] & 0x3F));
             break;
           case 2:
             chars[j++] = (char)(((b & 0x0F) << 12)
               | ((bytes[i++] & 0x3F) << 6)
               |  (bytes[i++] & 0x3F));
             break;
           case 3:
             int utf32 = (((b & 0x0F) << 18)
               | ((bytes[i++] & 0x3F) << 12)
               | ((bytes[i++] & 0x3F) << 6)
               |  (bytes[i++] & 0x3F));
             chars[j++] = (char)((utf32 >> 10) + 0xD7C0);   // high surrogate
             chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00); // low surrogate
             break;
         }
       }
     }
     catch (ArrayIndexOutOfBoundsException e) {
       // charBuf was too small: guess how many more chars it will take, +10%
       float bytesProcessed = (float)(i - start);
       float charsPerByte = (j / bytesProcessed) * 1.1f;

       float bytesLeft = length - bytesProcessed;
       float targetSize = (float)chars.length
         + charsPerByte * bytesLeft + 1.0f;
       return utf8ToChars(bytes, start, length,
         CharBuffer.allocate((int)targetSize));
     }
     charBuf.position(j);
     return charBuf;
   }
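
A call site then looks something like this (hypothetical fragment; in 
my patches charBuf is a member variable that gets reused):

   charBuf = utf8ToChars(utf8Bytes, 0, utf8Length, charBuf);
   String text = new String(charBuf.array(), 0, charBuf.position());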





RE: bytecount as String and prefix length

Posted by Robert Engels <re...@ix.netcom.com>.
All of the JDK source is available via download from Sun.



Re: bytecount as String and prefix length

Posted by Marvin Humphrey <ma...@rectangular.com>.
I wrote...

> I think I'll take a crack at a custom charsToUTF8 converter algo.

Still no luck.  Still 20% slower than the current implementation.   
The algo is below, for reference.

It's entirely possible that my patches are doing something dumb  
that's causing this, given my limited experience with Java.  But if  
that's not the case, I can think of two other explanations.

One is that the passage of the text through an intermediate buffer  
before blasting it out is considerably more expensive than anticipated.

The other is that the pre-allocation of a char[] array based on the  
length VInt yields a significant benefit over the standard techniques  
for reading in UTF-8.  That wouldn't be hard to believe.  Without  
that number, there's a lot of guesswork involved.  English requires  
about 1.1 bytes per UTF-8 code point; Japanese, 3.  Multiple memory  
allocation ops may be required as bytes get read in, especially if  
the final String object kicked out HAS to use the bare minimum amount  
of memory.  I don't suppose there's any way for me to snoop just  
what's happening under the hood in these CharsetDecoder classes or  
String constructors, is there?
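
(Lucene's current readString shows what a difference the count makes: 
the VInt is a char count, so the buffer gets sized exactly, once, with 
no guesswork:

   int length = readVInt();              // char count, known up front
   if (chars == null || length > chars.length)
     chars = new char[length];           // one exact allocation
   readChars(chars, 0, length);

A bytecount would still give an upper bound -- UTF-8 never decodes to 
more chars than it has bytes -- but the Sun decoders and String 
constructors never get to see that number.)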

Scanning through a SegmentTermEnum with next() doesn't seem to be any  
slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs  
benchmarker doesn't slow down that much when IndexInput is changed to  
use a String constructor that accepts UTF-8 bytes rather than chars.   
However, it's possible that the modified toTerm method of TermBuffer  
is a bottleneck, as it also uses the UTF-8 String constructor.  It  
doesn't get exercised under SegmentTermEnum.next(), but during  
merging of segments I believe it sees plenty of action -- maybe a lot  
more than IndexInput's readString.

So my next step is to write a utf8ToString method that's as efficient  
as I can make it.  After that... I dunno, I'm running out of ideas.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


   public static final ByteBuffer stringToUTF8(
         String s, int start, int length, ByteBuffer byteBuf) {
     byteBuf.clear();
     int i = start;
     int j = 0;
     try {
       final int end = start + length;
       byte[] bytes = byteBuf.array();
       for ( ; i < end; i++) {
         final int code = (int)s.charAt(i);
         if (code < 0x80)
           bytes[j++] = (byte)code;
         else if (code < 0x800) {
           bytes[j++] = (byte)(0xC0 | (code >> 6));
           bytes[j++] = (byte)(0x80 | (code & 0x3F));
         } else if (code < 0xD800 || code > 0xDFFF) {
           bytes[j++] = (byte)(0xE0 | (code >>> 12));
           bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
           bytes[j++] = (byte)(0x80 | (code & 0x3F));
         } else {
           // surrogate pair
           int utf32;
           // confirm valid high surrogate
           if (code < 0xDC00 && (i < end-1)) {
             utf32 = ((int)s.charAt(i+1));
             // confirm valid low surrogate and write pair
             if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
               utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
               i++;
               bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
               bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3f));
               bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
               bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
               continue;
             }
           }
           // replace unpaired surrogate or out-of-order low surrogate
           // with substitution character
           bytes[j++] = (byte)0xEF;
           bytes[j++] = (byte)0xBF;
           bytes[j++] = (byte)0xBD;
         }
       }
     }
     catch (ArrayIndexOutOfBoundsException e) {
       // guess how many more bytes it will take, plus 10%
       float charsProcessed = (float)(i - start);
       float bytesPerChar = (j / charsProcessed) * 1.1f;

       float charsLeft = length - charsProcessed;
       float targetSize
         = (float)byteBuf.capacity() + bytesPerChar * charsLeft + 1.0f;

       return stringToUTF8(s, start, length,
         ByteBuffer.allocate((int)targetSize));
     }
     byteBuf.position(j);
     return byteBuf;
   }







Re: bytecount as String and prefix length

Posted by Marvin Humphrey <ma...@rectangular.com>.
I wrote...

> Unfortunately, once the changes to TermBuffer, TermInfosWriter, and  
> StringHelper are applied, execution speed at index-time suffers a  
> slowdown of about 20%.  Perhaps this can be blamed on all the calls  
> to getBytes("UTF-8") in TermInfosWriter?  Maybe alternative  
> implementations using ByteBuffer, CharsetDecoder, and  
> CharsetEncoder are possible that can mitigate the problem?

Nope.

The version of writeTerm below is about the same speed as the one  
with the calls to getBytes("UTF-8").

I think I'll take a crack at a custom charsToUTF8 converter algo.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

//-------------------------------------------------------------------------

   private final void writeTerm(Term term)
        throws IOException {
     byteBuf.clear();
     while (true) {
       CoderResult status = utf8Encoder.encode(
         CharBuffer.wrap(term.text()), byteBuf, false);
       if (status.isOverflow()) {
         bufSize += 32;
         byteBuf = ByteBuffer.allocate(bufSize);
         utf8Encoder.reset();
       }
       else {
         break;
       }
     }
     int totalLength = byteBuf.position();
     int start = StringHelper.bytesDifference(lastByteBuf, byteBuf);
     int length = totalLength - start;

     output.writeVInt(start);                   // write shared prefix length
     output.writeVInt(length);                  // write delta length

     byte[] bytes = byteBuf.array();
     for (int i = start ; i < totalLength; i++) {
       output.writeByte(bytes[i]);              // write delta UTF-8 bytes
     }
     output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num

     lastTerm = term;
     // swap byteBuf and lastByteBuf
     scratchByteBuf = lastByteBuf;
     lastByteBuf = byteBuf;
     byteBuf = scratchByteBuf;
   }


