Posted to dev@lucene.apache.org by Yonik Seeley <yo...@apache.org> on 2007/06/13 04:07:23 UTC

start + end for TermDocs.read()

If I know the docfreq of a term is 1000, I'd like to be able to
allocate an int[1000] and grab all the ids via TermDocs.read().  But
because there is no offset, MultiTermDocs returns a list from each
sub-segment, forcing me to copy each partial int[] filled into my
full int[].

If one could specify a start/end into the array, this copying could be
avoided (see partial and untested patch below).
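
To make the copying concrete, here is a minimal self-contained sketch.
It does not use Lucene classes: SegmentDocsSketch and CollectSketch are
hypothetical stand-ins, frequencies are left out, and the three-argument
read only mirrors the start/end idea rather than any existing API.  It
also assumes docFreq equals the total number of ids across segments.

```java
/** Hypothetical stand-in for one segment's postings (not a Lucene class). */
final class SegmentDocsSketch {
    private final int[] ids; // segment-local doc ids
    private int pos = 0;     // next unread position in ids

    SegmentDocsSketch(int... ids) { this.ids = ids; }

    /** Existing style: fills docs from index 0 and returns the count read. */
    int read(int[] docs) {
        int n = Math.min(docs.length, ids.length - pos);
        System.arraycopy(ids, pos, docs, 0, n);
        pos += n;
        return n;
    }

    /** Proposed style: fills docs[start..end) and returns the new start. */
    int read(int[] docs, int start, int end) {
        int n = Math.min(end - start, ids.length - pos);
        System.arraycopy(ids, pos, docs, start, n);
        pos += n;
        return start + n;
    }
}

final class CollectSketch {
    /** Without an offset: read into a scratch buffer, then copy into place. */
    static int[] withCopy(int docFreq, SegmentDocsSketch[] segs, int[] bases) {
        int[] all = new int[docFreq];  // assumes docFreq == total id count
        int[] scratch = new int[32];
        int filled = 0;
        for (int s = 0; s < segs.length; s++) {
            int n;
            while ((n = segs[s].read(scratch)) > 0) {
                for (int i = 0; i < n; i++) {
                    all[filled++] = scratch[i] + bases[s]; // the extra copy
                }
            }
        }
        return all;
    }

    /** With start/end: each segment writes straight into the full array. */
    static int[] withOffsets(int docFreq, SegmentDocsSketch[] segs, int[] bases) {
        int[] all = new int[docFreq];
        int start = 0;
        for (int s = 0; s < segs.length && start < docFreq; s++) {
            int newStart = segs[s].read(all, start, docFreq);
            for (int i = start; i < newStart; i++) {
                all[i] += bases[s]; // segment-local id -> index-wide id
            }
            start = newStart;
        }
        return all;
    }
}
```

Both methods produce the same array; the second one simply never needs
the scratch buffer.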

My feeling is that this is probably too specialized to warrant adding
an additional method on the interface, so I won't open a JIRA for it.
I brought it up in case anyone cared to argue otherwise.

-Yonik


Index: src/java/org/apache/lucene/index/MultiReader.java
===================================================================
--- src/java/org/apache/lucene/index/MultiReader.java	(revision 545184)
+++ src/java/org/apache/lucene/index/MultiReader.java	(working copy)
@@ -384,27 +384,33 @@
     }
   }

+  public int read(final int[] docs, final int[] freqs) throws IOException {
+    return read(docs, freqs, 0, docs.length);
+  }
+
   /** Optimized implementation. */
-  public int read(final int[] docs, final int[] freqs) throws IOException {
-    while (true) {
+  public int read(final int[] docs, final int[] freqs, int start, int end) throws IOException {
+    while (start < end) {
       while (current == null) {
         if (pointer < readers.length) {      // try next segment
           base = starts[pointer];
           current = termDocs(pointer++);
         } else {
-          return 0;
+          return start;
         }
       }
-      int end = current.read(docs, freqs);
-      if (end == 0) {          // none left in segment
+      int newStart = current.read(docs, freqs, start, end);
+      if (newStart == start) {
         current = null;
       } else {            // got some
         final int b = base;        // adjust doc numbers
-        for (int i = 0; i < end; i++)
+        for (int i = start; i < newStart; i++) {
          docs[i] += b;
-        return end;
+        }
+        start = newStart;
       }
     }
+    return start;
   }

  /* A Possible future optimization could skip entire segments */
Index: src/java/org/apache/lucene/index/SegmentTermDocs.java
===================================================================
--- src/java/org/apache/lucene/index/SegmentTermDocs.java	(revision 545184)
+++ src/java/org/apache/lucene/index/SegmentTermDocs.java	(working copy)
@@ -122,11 +122,15 @@
     return true;
   }

+  public int read(final int[] docs, final int[] freqs) throws IOException {
+    return read(docs, freqs, 0, docs.length);
+  }
+
   /** Optimized implementation. */
-  public int read(final int[] docs, final int[] freqs)
+  public int read(final int[] docs, final int[] freqs, int start, int end)
           throws IOException {
-    final int length = docs.length;
-    int i = 0;
+    final int length = end;
+    int i = start;
     while (i < length && count < df) {

       // manually inlined call to next() for speed
Index: src/java/org/apache/lucene/index/TermDocs.java
===================================================================
--- src/java/org/apache/lucene/index/TermDocs.java	(revision 545184)
+++ src/java/org/apache/lucene/index/TermDocs.java	(working copy)
@@ -60,6 +60,8 @@
    * stream has been exhausted.  */
   int read(int[] docs, int[] freqs) throws IOException;

+  int read(int[] docs, int[] freqs, int start, int end) throws IOException;
+
   /** Skips entries to the first beyond the current whose document number is
    * greater than or equal to <i>target</i>. <p>Returns true iff there is such
    * an entry.  <p>Behaves as if written: <pre>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: start + end for TermDocs.read()

Posted by Yonik Seeley <yo...@apache.org>.

On 6/13/07, Chris Hostetter <ho...@fucit.org> wrote:
> : and grab all the ids via TermDocs.read().  But because there is no
> : offset, MultiTermDocs returns a list from each sub-segment, forcing me
> : to copy each partial int[] filled into my full int[].
>
> it seems like at the very least, MultiTermDocs.read(int[],int[]) could
> allocate new arrays to pass to current.read and do the array copying for
> you.

That would slow down the other use case though... someone "streaming"
or incrementally handling batches of ids (like TermScorer does...)

> : If one could specify a start/end into the array, this copying could be
> : avoided (see partial and untested patch below).
>
> what use case are you thinking of where specifying the end would be handy?

I didn't have one in mind, except it's pretty much free.

> Am I being naive, or would it also be useful to pass "base" to
> current.read(int[],int[],...) so it's added to the docIds before they are
> put in the array and we don't have to iterate over them again?

Yeah, that occurred to me also... but I decided to take baby steps :-)

The other thing that occurred to me is to also have a reverse of
the interface, a la HitCollector. The JVM would probably need to
inline the call at runtime for this to be competitive though.
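
One possible shape for such a "reversed" interface is sketched below.
DocFreqCollectorSketch and PushPostingsSketch are hypothetical names
invented for illustration, loosely modeled on the callback style of
HitCollector; nothing here is an existing Lucene API, and plain arrays
stand in for the per-segment postings.

```java
/** Hypothetical callback, loosely modeled on HitCollector; not a Lucene API. */
interface DocFreqCollectorSketch {
    void collect(int doc, int freq);
}

final class PushPostingsSketch {
    /**
     * Pushes every (doc, freq) pair of each segment to the collector,
     * adding the segment's base so the collector sees index-wide ids.
     */
    static void forEach(int[][] segDocs, int[][] segFreqs, int[] bases,
                        DocFreqCollectorSketch collector) {
        for (int s = 0; s < segDocs.length; s++) {
            for (int i = 0; i < segDocs[s].length; i++) {
                collector.collect(segDocs[s][i] + bases[s], segFreqs[s][i]);
            }
        }
    }
}
```

With this shape the caller decides where each id lands (its own int[],
a bitset, a counter), so no offset parameter is needed, at the cost of
one call per posting unless the JVM inlines it.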

> : My feeling is that this is probably too specialized to warrant adding
> : an additional method on the interface, so I won't open a JIRA for it.
> : I brought it up in case anyone cared
> : to argue otherwise.
>
> I say benchmark it ... if it has any serious benefits absolutely add it to
> the interface.  If it means revving up to Lucene 3.0 so be it.

The catch is that normal lucene searching would probably not be
sped up at all... only custom code that needs to use TermDocs.read()
in a specific way.

Just a general interface reminder to everyone... passing offset + len
along with array parameters is generally a very good thing.
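
The JDK's InputStream.read(byte[], int, int) is a familiar instance of
this convention.  The sketch below (ReadFullySketch is a hypothetical
helper name) shows why it matters: because the callee accepts an offset
and a length, a caller can keep filling one buffer across several calls
without any intermediate copy.

```java
import java.io.IOException;
import java.io.InputStream;

final class ReadFullySketch {
    /**
     * Fills buf[off..off+len) from the stream, looping because a single
     * read call may return fewer bytes than requested. Returns the number
     * of bytes actually read, which is less than len only at end of stream.
     */
    static int readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }
}
```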

-Yonik


Re: start + end for TermDocs.read()

Posted by Chris Hostetter <ho...@fucit.org>.

: and grab all the ids via TermDocs.read().  But because there is no
: offset, MultiTermDocs returns a list from each sub-segment, forcing me
: to copy each partial int[] filled into my full int[].

it seems like at the very least, MultiTermDocs.read(int[],int[]) could
allocate new arrays to pass to current.read and do the array copying for
you.

: If one could specify a start/end into the array, this copying could be
: avoided (see partial and untested patch below).

what use case are you thinking of where specifying the end would be handy?
wouldn't you always want it to read as much as possible as long as the
arrays aren't full?

Am I being naive, or would it also be useful to pass "base" to
current.read(int[],int[],...) so it's added to the docIds before they are
put in the array and we don't have to iterate over them again?
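
A rough sketch of that idea follows.  BaseAwareSegmentSketch is a
hypothetical stand-in for a segment reader, not a Lucene class, and
the four-argument read is an assumed signature that omits freqs for
brevity.  The base is added as each id is written, so the caller has
no second loop over the array.

```java
/** Hypothetical segment reader that applies the doc base while filling. */
final class BaseAwareSegmentSketch {
    private final int[] ids; // segment-local doc ids
    private int pos = 0;     // next unread position in ids

    BaseAwareSegmentSketch(int... ids) { this.ids = ids; }

    /**
     * Fills docs[start..end) with (segment-local id + base) and returns
     * the new start, i.e. the index just past the last slot written.
     */
    int read(int[] docs, int start, int end, int base) {
        int i = start;
        while (i < end && pos < ids.length) {
            docs[i++] = ids[pos++] + base;
        }
        return i;
    }
}
```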

: My feeling is that this is probably too specialized to warrant adding
: an additional method on the interface, so I won't open a JIRA for it.
: I brought it up in case anyone cared
: to argue otherwise.

I say benchmark it ... if it has any serious benefits absolutely add it to
the interface.  If it means revving up to Lucene 3.0 so be it.

(for something like SortComparatorSource, or FieldSelector where we expect
clients who use the interfaces to *implement* the interface I might be
more hesitant, but I'm guessing the number of people who write their own
TermDocs impls is pretty small, and they would probably be understanding of a
new method in the API if it allowed for serious performance benefits.)




-Hoss

