Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2021/03/10 13:50:36 UTC

[GitHub] [solr] magibney opened a new pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

magibney opened a new pull request #2:
URL: https://github.com/apache/solr/pull/2


   See also: [SOLR-14185](https://issues.apache.org/jira/browse/SOLR-14185)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-821772007


   I did some benchmarking, finally.  The new implementation appears 8% faster overall, excluding the optimization I added.  The data set was a million docs and a field with a gaussian distribution of terms.  The queries had a filter query with a term against that field, and it resulted in a SortedIntDocSet the vast majority of the time.  The main "q" query was a randomly produced phrase query that would always match some subphrase of a sentence found in many documents.  The benchmark produced 2000 consistently random queries and I re-ran this about 10 times and took the average of the fastest 3 runs.
   
   I added a small optimization to short-circuit intersect(DocSet) when the intersection would not change the set.  With it, the improvement moved to ~11%.  In my benchmark, there was another filter query that matched everything, and SolrIndexSearcher.getProcessedFilter would intersect that with the cached SortedIntDocSet, producing a new SortedIntDocSet every time, so there was _never_ any cache re-use of cachedOrdIdxMap.  Of course this is highly dependent on the benchmarking scenario.  It really emphasizes how "YMMV" applies to benchmarking this stuff, because it's so dependent on what the app's usage pattern looks like.  I think in some future JIRA issue, getProcessedFilter would be better off not intersecting any SortedIntDocSets; they could simply be added as separate filters in a BooleanQuery (dependent on SOLR-14166).
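   
   As a rough illustration of the short-circuit described above (not the committed patch; the helper name is made up, but `size()`, `intersectionSize(DocSet)` and `intersection(DocSet)` are existing DocSet methods):
   
   ```java
   // Hedged sketch: if intersecting with `other` would remove nothing, return the
   // cached DocSet unchanged, so any lazily built per-segment state (e.g. a
   // SortedIntDocSet's cachedOrdIdxMap) survives and can be reused.
   static DocSet intersectPreservingCache(DocSet cached, DocSet other) {
     if (cached.intersectionSize(other) == cached.size()) {
       return cached;                    // nothing removed: keep the cached instance
     }
     return cached.intersection(other);  // otherwise build a new DocSet as before
   }
   ```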
   
   So I think this is ready to merge!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593312504



##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;
+
+  /**
+   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
+   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
+   * in consumers afforded by returning <code>null</code> when there are no docs for a given
+   * segment).
+   */
+  private int getFloorDoc(final LeafReaderContext ctx) {
+    assert !ctx.isTopLevel;
+    final int[] floorDocs;
+    final int setMax = bits.length();
+    if (cachedFloorDocs != null) {
+      floorDocs = cachedFloorDocs;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      floorDocs = new int[leaves.size()];
+      int idx = 0;
+      int nextFloorDoc = -1;
+      for (LeafReaderContext c : leaves) {
+        final int base = c.docBase;
+        final int max = base + c.reader().maxDoc();
+        final int recordFloorDoc;
+        if (nextFloorDoc >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else if (nextFloorDoc >= base) {
+          recordFloorDoc = nextFloorDoc;
+        } else if (setMax <= base || (nextFloorDoc = bits.nextSetBit(base)) >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else {
+          recordFloorDoc = nextFloorDoc;
+        }
+        floorDocs[idx++] = recordFloorDoc;
+      }
+
+      cachedFloorDocs = floorDocs;
+    }
+    return floorDocs[ctx.ord];
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      switch (size) {
+        case 0:
+          return null;
+        default:
+          // we have an explicit size; use it
+          return new BitSetIterator(bits, size);
+        case -1:
+          // size has not been computed; use bits.length() as an upper bound on cost
+          final int maxSize = bits.length();
+          if (maxSize < 1) {
+            return null;
+          } else {
+            return new BitSetIterator(bits, maxSize);
+          }
+      }
+    }
+
+    final int maxDoc = context.reader().maxDoc();
+    if (maxDoc < 1) {
+      // entirely empty segment; verified this actually happens
+      return null;
+    }
+
+    final int firstDocId = getFloorDoc(context);

Review comment:
       True, but `docBase` and `reader.maxDoc()` refer to the entire segment. What we're caching here is the first doc in this segment that's actually present in this DocSet, which supports returning `null` in place of empty iterators for segments that contain docs, but none that are included in this DocSet. (The main goal isn't avoiding construction of the iterator _per se_, but enabling empty-iterator optimizations in consumers.)
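   
   A minimal sketch of the consumer-side optimization this enables (hypothetical consumer code, not from the PR; `docSet` and `searcher` are assumed to be in scope):
   
   ```java
   // A null return means this DocSet contributes no docs to the segment, so the
   // consumer can skip the segment entirely instead of driving an empty iterator.
   for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
     DocIdSetIterator it = docSet.iterator(ctx);
     if (it == null) {
       continue; // no docs from this DocSet in this segment
     }
     for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
       // per-document work here
     }
   }
   ```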




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-822540537


   Sounds good! Thanks for tackling the benchmarking side of this; and +1 to SOLR-14166.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] dsmiley commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-797744614


   Maybe the cached floor stuff can be done in a separate issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593335767



##########
File path: solr/core/src/java/org/apache/solr/search/DocSet.java
##########
@@ -131,6 +140,15 @@ public int andNotSize(DocSet other) {
    */
   public abstract Bits getBits();
 
+  /**
+   * A per-segment {@link Bits} instance that has fast random access (as is generally
+   * required of Bits). In contrast with {@link #getBits()}, only trivial work should
+   * be done to generate a return value (i.e., if the underlying set natively supports
+   * random access). This method should return <code>null</code> for sets that do not
+   * support random access.
+   */
+  public abstract Bits getBits(LeafReaderContext context);

Review comment:
       The only reason this was introduced was to continue to support explicit random access from [IntervalFacets.getCountString()](https://github.com/apache/solr/blob/7ada4032180b516548fc0263f42da6a7a917f92b/solr/core/src/java/org/apache/solr/request/IntervalFacets.java#L295-L304).
   But as you point out [here](https://github.com/apache/solr/pull/2#discussion_r593178464), I think you're right that [it's redundant](https://github.com/apache/solr/blob/7ada4032180b516548fc0263f42da6a7a917f92b/solr/core/src/java/org/apache/solr/request/IntervalFacets.java#L406-L409), since the `bits` instance is derived from the same source as the `disi`, so `bits.get(doc)` will _never_ be `false`.
   Nice, this can be removed then!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593337561



##########
File path: solr/core/src/java/org/apache/solr/request/IntervalFacets.java
##########
@@ -292,16 +285,16 @@ private void getCountString() throws IOException {
           final SortedDocValues singleton = DocValues.unwrapSingleton(sub);
           if (singleton != null) {
             // some codecs may optimize SORTED_SET storage for single-valued fields
-            accumIntervalsSingle(singleton, disi, dis.bits());
+            accumIntervalsSingle(singleton, disi, docs.getBits(leaf));

Review comment:
       Yes, you're right! See [this comment](https://github.com/apache/solr/pull/2#discussion_r593335767).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney edited a comment on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney edited a comment on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-802891894


   > Do you have something on-hand to perf-test this?
   
   I don't really have anything on-hand, unfortunately. Some indecision here too (hence my delayed response), because I'm having a hard time brainstorming anything that I'd expect to make a difference (this being close to a refactoring in essence); but what you've suggested would serve well as a sanity check. How large would you consider a "large index" to be, for this purpose?
   
   Worst-case would probably involve combinations of `fq`, to generate uncached intersections. Because the "work" of determining leaf boundaries on SortedIntDocSet is now cached in the DocSet itself, as opposed to in the Filter, I'd expect performance of iterators over _cached_ DocSets to be better (assuming any perceptible difference at all). Are there any cases where iterators would be requested, e.g., for the first leaf but not later leaves (some kind of short-circuit -- maybe `facet.exists`?)? That could also make a difference, since the boundaries for all leaves are now calculated up-front when the first iterator is requested.
   
   tbh I'd be surprised if any of this made a difference (improvement or regression) ... but of course nobody ever _expects_ to be surprised :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593295065



##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +679,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;

Review comment:
       Ah, I had thought you meant to effectively build this "`ordIdxMap`" _in_ the ctor, as opposed to providing it as an explicit arg to the `SortedIntDocSet` ctor! I'll look into what it would take to do the latter.
   The idea of building this during collection probably changes things, but I went the "cached" route initially (knowing that it could be moved into the ctor) to avoid doing extra work in the case of DocSets that may only be used at the global level, and are never asked to supply any `DocIdSetIterator` (e.g., in [JoinQuery.getDocSetEnumerate()](https://github.com/apache/solr/blob/7ada4032180b516548fc0263f42da6a7a917f92b/solr/core/src/java/org/apache/solr/search/JoinQuery.java#L454-L457)).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r595183891



##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +673,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
+
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
+
+    final int mapOrdIdx = context.ord << 1;
+    final int startIdx = ordIdxMap[mapOrdIdx];
+    final int endIdx = ordIdxMap[mapOrdIdx + 1];
+
+    if (startIdx > endIdx) {
+      return null; // verified this does happen

Review comment:
       Agreed, will do. I initially left this because I was trying to change as little as possible...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r591555139



##########
File path: solr/core/src/java/org/apache/solr/search/join/GraphQuery.java
##########
@@ -268,13 +266,12 @@ private Automaton buildAutomaton(BytesRefHash termBytesHash) {
     
     @Override
     public Scorer scorer(LeafReaderContext context) throws IOException {
-      if (filter == null) {
+      if (resultSet == null) {
         resultSet = getDocSet();
-        filter = resultSet.getTopFilter();
       }
-      DocIdSet readerSet = filter.getDocIdSet(context,context.reader().getLiveDocs());
+      DocIdSetIterator disi = resultSet.iterator(context);

Review comment:
       there was previously an `acceptDocs` filter here for `context.reader().getLiveDocs()` -- probably superfluous?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r595182025



##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,68 +235,15 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
-  @Override
-  public Bits getBits(LeafReaderContext context) {
-    if (context.isTopLevel) {
-      return bits;
-    }
-
-    final int base = context.docBase;
-    final int length = context.reader().maxDoc();
-    final FixedBitSet bs = bits;
-
-    return new Bits() {
-      @Override
-      public boolean get(int index) {
-        return bs.get(index + base);
-      }
-
-      @Override
-      public int length() {
-        return length;
-      }
-    };
-  }
-
-  private static final int NO_DOCS_THIS_SEGMENT = -1;
-  private int[] cachedFloorDocs;
-
-  /**
-   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
-   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
-   * in consumers afforded by returning <code>null</code> when there are no docs for a given
-   * segment).
-   */
   private int getFloorDoc(final LeafReaderContext ctx) {

Review comment:
       Ah, I wasn't sure whether you were taking issue with the "caching" aspect of this optimization, or the entire idea of front-loading the first "seek" on the bits.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-801172368


   >Another option would be to make it final and eagerly compute it in the constructor
   
   Hmm, on second thought I think there's another practical problem with any approach that eagerly builds `cachedOrdIdxMap`, either as an arg to the ctor or within the ctor: there are quite a few invocations of the existing ctors, many of which don't appear to have a reference to the IndexReader (and its "leaves") readily available.
   
   And even aside from that, if we think of this as an optimization, at construction time it's often difficult (perhaps impossible?) to determine a priori which DocSets are temporary/transient and should thus _not_ have this array eagerly constructed.
   
   So it feels like, with `volatile` making this thread-safe, the lazy build/cache is not so much an optimization as "the only practical way to go"? Perhaps I'm still missing something, though ...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593167365



##########
File path: solr/core/src/java/org/apache/solr/search/join/GraphQuery.java
##########
@@ -268,13 +266,12 @@ private Automaton buildAutomaton(BytesRefHash termBytesHash) {
     
     @Override
     public Scorer scorer(LeafReaderContext context) throws IOException {
-      if (filter == null) {
+      if (resultSet == null) {
         resultSet = getDocSet();
-        filter = resultSet.getTopFilter();
       }
-      DocIdSet readerSet = filter.getDocIdSet(context,context.reader().getLiveDocs());
+      DocIdSetIterator disi = resultSet.iterator(context);

Review comment:
       Yes; there was a change in approach over the last decade as to where filters are applied.  It used to be on top, then it was "pushed down" (in the form of acceptDocs here), and then it was reverted back up top.  Meanwhile this Filter tech-debt didn't receive attention.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593320574



##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;

Review comment:
       fwiw, I did intend this code to be (and am pretty sure it is) thread-safe. Because `cachedFloorDocs` is privately built in its entirety by a calling thread and set atomically, the worst-case scenario is that multiple threads end up building the same array, and the array that ends up being cached is arbitrary.
   Analogous to `cachedOrdIdxMap` in `SortedIntDocSet`, there are four options:
   1. accumulate the relevant information during collection and supply it as an arg to the ctor (feasible? given the different ways in which DocSets might be built?)
   2. eagerly compute the relevant information in the ctor (unnecessary work for DocSets that are only used at the global level -- i.e., never need to supply a per-segment iterator) 
   3. compute (and cache) the relevant information on-demand the first time `DocSet.iterator(LeafReaderContext)` is called
   4. do nothing (for `BitDocSet` anyway)
   But given that DocSets are cached, and the optimization of returning `null` instead of empty iterators can be valuable to consumers, I do think it's worth the extra effort to do _something_ here ...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-801119305


   Made cachedOrdIdxMap `volatile` in 8007e5c. Another option would be to make it `final` and eagerly compute it in the constructor (trading `volatile` overhead for some unnecessary work on transient DocSets that are never asked to supply an iterator).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney closed pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney closed pull request #2:
URL: https://github.com/apache/solr/pull/2


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-797670423


   I filed a separate issue for a DocSetQuery https://issues.apache.org/jira/browse/SOLR-15257 (linked as requiring this issue).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593343380



##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +679,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
 
-        final int endIdx = eidx;
-        lastEndIdx = endIdx;
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
 
+    final int mapOrdIdx = context.ord << 1;
+    final int startIdx = ordIdxMap[mapOrdIdx];
+    final int endIdx = ordIdxMap[mapOrdIdx + 1];
+
+    if (startIdx > endIdx) {
+      return null; // verified this does happen
+    }
+
+    return new DocIdSetIterator() {
+      int idx = startIdx;
+      int adjustedDoc = -1;
+
+      @Override
+      public int docID() {
+        return adjustedDoc;
+      }
+
+      @Override
+      public int nextDoc() {
+        return adjustedDoc = (idx > endIdx) ? NO_MORE_DOCS : (docs[idx++] - base);
+      }
+
+      @Override
+      public int advance(int target) {
+        if (idx > endIdx || target==NO_MORE_DOCS) return adjustedDoc=NO_MORE_DOCS;
+        target += base;
+
+        // probe next
+        int rawDoc = docs[idx++];
+        if (rawDoc >= target) return adjustedDoc=rawDoc-base;
+
+        int high = endIdx;
+
+        // TODO: probe more before resorting to binary search?
+
+        // binary search
+        while (idx <= high) {
+          int mid = (idx+high) >>> 1;
+          rawDoc = docs[mid];
+
+          if (rawDoc < target) {
+            idx = mid+1;
+          }
+          else if (rawDoc > target) {
+            high = mid-1;
+          }
+          else {
+            idx=mid+1;
+            return adjustedDoc=rawDoc - base;
+          }
+        }
+
+        // low is on the insertion point...
+        if (idx <= endIdx) {
+          return adjustedDoc = docs[idx++] - base;
+        } else {
+          return adjustedDoc=NO_MORE_DOCS;
+        }
+      }
+
+      @Override
+      public long cost() {
+        return docs.length;

Review comment:
       Indeed, will fix: `endIdx - startIdx + 1`.
   (I initially just left that unchanged from the [analogous implementation in the main branch](https://github.com/apache/solr/blob/7ada4032180b516548fc0263f42da6a7a917f92b/solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java#L771-L773).)
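   
   A minimal sketch of the agreed fix (with the inclusive `endIdx` in use at this point in the discussion):
   
   ```java
   @Override
   public long cost() {
     // upper bound on the docs this per-segment iterator can return;
     // endIdx is inclusive here, hence the +1
     return endIdx - startIdx + 1;
   }
   ```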




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-801360299


   I think all my concerns have been addressed.
   
   A meaningful perf test would probably be some "q" query that matches a lot of documents plus a cached "fq" that matches very little (thus uses SortedIntDocSet).  And on a large index; not "optimized" (i.e., not force-merged down to a single segment).  Do you have something on-hand to perf-test this?  I'll ask a colleague of mine who may have something.
   
   @chatman has been building https://github.com/SearchScale/solr-bench and I've briefly looked at it in the past but not lately.  Ishan, can you give some hints as to what's needed to benchmark this in solr-bench?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-800752941


   6dbc647 addresses the potential read race condition on checking the top-level non-volatile `cachedOrdIdxMap`; but iiuc you're also concerned about the possibility that writes to the _elements_ of the array could be reordered to _after_ the write of the array reference to the shared field? Uwe's answer on the mailing list implicitly seems (?) to suggest that might not be a concern, and I was unable (in a trivial test application) to produce errors of that kind, but I still think the concern is well-founded. In any event, 6dbc647 should be an improvement ...
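   
   As a rough sketch of the memory-model concern (not the actual patch; `buildOrdIdxMap` is a hypothetical helper): a `volatile` publish -- the approach ultimately taken in 8007e5c, per the later comment -- guarantees that a reader observing the non-null array reference also observes the element writes that preceded it.
   
   ```java
   private volatile int[] cachedOrdIdxMap;

   private int[] getOrdIdxMap(LeafReaderContext ctx) {
     int[] local = cachedOrdIdxMap;   // single read of the shared (volatile) field
     if (local == null) {
       local = buildOrdIdxMap(ctx);   // hypothetical helper: build privately, fully populated
       cachedOrdIdxMap = local;       // volatile write publishes the completed array;
     }                                // worst case, racing threads build it redundantly
     return local;
   }
   ```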


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593330235



##########
File path: solr/core/src/java/org/apache/solr/query/SolrRangeQuery.java
##########
@@ -320,16 +319,25 @@ public TermsEnum getTermsEnum(LeafReaderContext ctx) throws IOException {
 
   private static class SegState {
     final Weight weight;
-    final DocIdSet set;
+    final DocSet docs;
+    final DocIdSet docIdSet;

Review comment:
       I initially did that, actually! But afaict there's something a little weird going on here: if the "DISI supplier" is built by the "builder" at the end of `getSegState(LeafReaderContext)`, it's actually building a seg-specific set that's ultimately retrieved via `setStates[context.ord]` and passed a somewhat redundant `LeafReaderContext`. So I _think_ that sets manually built in this way would either need to be built as a single DocSet across all segs, or as a DocSet that pertains to a single segment only (if that's even a thing?). Both of those seem awkward, which is why I ended up sticking with the DocIdSet for this case. It's definitely a bit weird though, so I'm curious what you think of any alternatives.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r594859845



##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +673,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
+
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
+
+    final int mapOrdIdx = context.ord << 1;

Review comment:
       Why the left shift (times two)?  (deserves a comment)
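   
   For reference, a hedged reading of the packing (taken from the assignments in the diff above): the array holds two ints per leaf, so `ord << 1` is just `ord * 2`, the offset of that leaf's pair:
   
   ```java
   // ret is packed as [start0, end0, start1, end1, ...]: one (startIdx, endIdx)
   // pair per leaf; ord << 1 == ord * 2 is the offset of this leaf's pair
   final int mapOrdIdx = lrc.ord << 1;
   ret[mapOrdIdx] = sidx;        // start index into docs[] for this leaf
   ret[mapOrdIdx + 1] = eidx;    // (inclusive) end index into docs[] for this leaf
   ```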

##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +673,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
+
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
+
+    final int mapOrdIdx = context.ord << 1;
+    final int startIdx = ordIdxMap[mapOrdIdx];
+    final int endIdx = ordIdxMap[mapOrdIdx + 1];
+
+    if (startIdx > endIdx) {
+      return null; // verified this does happen

Review comment:
       Wha?!  Oh, is endIdx not +1 beyond?  For consistency with most of these algorithms, let's have endIdx be one beyond.
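   
   (For illustration, a sketch of the suggested convention, assuming the surrounding variable names:)
   
   ```java
   // With an exclusive endIdx (one past the last matching index), "no docs in this
   // segment" becomes simply startIdx == endIdx, and iteration is a half-open loop:
   for (int i = startIdx; i < endIdx; i++) {
     int doc = docs[i] - base; // segment-local doc id
     // ...
   }
   ```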

##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +679,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;

Review comment:
       Achieving thread-safety without 'volatile', 'synchronized', or some higher level things that tend to use them is tricky.  I see that this array ought to always have the same content.  If this were an `int` or other 4-byte primitive or a final/immutable object then I'd be comfortable.  But arrays... I'm not sure.  It'd be good to get another opinion.
   
   I don't think computing this up-front is always a waste of time.  But I can see that asking the caller to do this is not as clean as keeping this matter internal; moreover, it's the places other than DocSetBuilder that are harder to figure out.  So it's not important to me provided the caching works.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r595447714



##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +673,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
+
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
+
+    final int mapOrdIdx = context.ord << 1;
+    final int startIdx = ordIdxMap[mapOrdIdx];
+    final int endIdx = ordIdxMap[mapOrdIdx + 1];
+
+    if (startIdx > endIdx) {
+      return null; // verified this does happen

Review comment:
       This ended up being a little more involved than I'd anticipated (f8ba35129b) -- and also addressed the "left-shift" comment above.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593304113



##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;
+
+  /**
+   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
+   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
+   * in consumers afforded by returning <code>null</code> when there are no docs for a given
+   * segment).
+   */
+  private int getFloorDoc(final LeafReaderContext ctx) {
+    assert !ctx.isTopLevel;
+    final int[] floorDocs;
+    final int setMax = bits.length();
+    if (cachedFloorDocs != null) {
+      floorDocs = cachedFloorDocs;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      floorDocs = new int[leaves.size()];
+      int idx = 0;
+      int nextFloorDoc = -1;
+      for (LeafReaderContext c : leaves) {
+        final int base = c.docBase;
+        final int max = base + c.reader().maxDoc();
+        final int recordFloorDoc;
+        if (nextFloorDoc >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else if (nextFloorDoc >= base) {
+          recordFloorDoc = nextFloorDoc;
+        } else if (setMax <= base || (nextFloorDoc = bits.nextSetBit(base)) >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else {
+          recordFloorDoc = nextFloorDoc;
+        }
+        floorDocs[idx++] = recordFloorDoc;
+      }
+
+      cachedFloorDocs = floorDocs;
+    }
+    return floorDocs[ctx.ord];
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      switch (size) {
+        case 0:
+          return null;
+        default:
+          // we have an explicit size; use it
+          return new BitSetIterator(bits, size);

Review comment:
       This is only within the `if (context.isTopLevel)` block -- my assumption was that makes this ok, no? I think the equivalent functionality on `Filter` in current master branch effectively [makes the same assumption](https://github.com/apache/solr/blob/7ada4032180b516548fc0263f42da6a7a917f92b/solr/core/src/java/org/apache/solr/search/BitDocSet.java#L254-L256)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley merged pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley merged pull request #2:
URL: https://github.com/apache/solr/pull/2


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593335767



##########
File path: solr/core/src/java/org/apache/solr/search/DocSet.java
##########
@@ -131,6 +140,15 @@ public int andNotSize(DocSet other) {
    */
   public abstract Bits getBits();
 
+  /**
+   * A per-segment {@link Bits} instance that has fast random access (as is generally
+   * required of Bits). In contrast with {@link #getBits()}, only trivial work should
+   * be done to generate a return value (i.e., if the underlying set natively supports
+   * random access). This method should return <code>null</code> for sets that do not
+   * support random access.
+   */
+  public abstract Bits getBits(LeafReaderContext context);

Review comment:
       The only reason this is needed is to continue to support explicit random access from [IntervalFacets.getCountString()](https://github.com/apache/solr/blob/7ada4032180b516548fc0263f42da6a7a917f92b/solr/core/src/java/org/apache/solr/request/IntervalFacets.java#L295-L304).
   But as you point out [here](https://github.com/apache/solr/pull/2#discussion_r593178464), I think you're right that [it's redundant](https://github.com/apache/solr/blob/7ada4032180b516548fc0263f42da6a7a917f92b/solr/core/src/java/org/apache/solr/request/IntervalFacets.java#L406-L409), since the `bits` instance is derived from the same source as the `disi`, so `bits.get(doc)` will _never_ be `false`.
   Nice, this can be removed then!
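   To illustrate the point (a sketch with assumed names, not IntervalFacets code): any doc the per-leaf iterator returns is, by construction, set in the same DocSet's Bits view, so the `bits.get(doc)` guard there can never be false:
   
   ```java
   // Sketch only; `docs` is the DocSet and `leaf` its LeafReaderContext, both assumed in scope.
   static void sanityCheck(DocSet docs, LeafReaderContext leaf) throws IOException {
     DocIdSetIterator disi = docs.iterator(leaf);
     Bits bits = docs.getBits(leaf);
     if (disi == null) {
       return; // no matching docs in this leaf
     }
     for (int doc = disi.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = disi.nextDoc()) {
       assert bits == null || bits.get(doc); // the redundant guard: never false for this disi
     }
   }
   ```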




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593348695



##########
File path: solr/core/src/java/org/apache/solr/search/facet/SweepCountAware.java
##########
@@ -165,7 +165,10 @@ public final void incrementCount(int segOrd, int inc, int maxIdx) {
      */
     public void register(CountSlotAcc[] countAccs, LongValues toGlobal, int maxSegOrd) {
       int segOrd = maxSegOrd;
-      final int maxIdx = countAccs.length - 1;
+      // NOTE: the `countAccs` array may be oversized (e.g., in the event that one or more of the domains supplies a <code>null</code>

Review comment:
       :-) hehe -- and it's a detailed comment on a latent bug that I introduced. Hopefully I caught it while still entirely latent!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593413834



##########
File path: solr/core/src/java/org/apache/solr/query/SolrRangeQuery.java
##########
@@ -320,16 +319,25 @@ public TermsEnum getTermsEnum(LeafReaderContext ctx) throws IOException {
 
   private static class SegState {
     final Weight weight;
-    final DocIdSet set;
+    final DocSet docs;
+    final DocIdSet docIdSet;

Review comment:
       I see that this class is heavily borrowing from MultiTermQueryConstantScoreWrapper as its own comments say.  In that case, let's go the other way and embrace DocIdSet in this class; don't add DocSet to SegState.  You could add a trivial DocIdSet (via anonymous class?) that simply returns the iterator from the DocSet.
   
   > or as a DocSet that pertains to a single segment only (if that's even a thing?)
   
   No; by definition a DocSet is global.  It's not generic like Bits.
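   Something like this minimal sketch is what I mean (the helper name is made up for illustration; it just defers to the new `DocSet.iterator(LeafReaderContext)` introduced by this PR):
   
   ```java
   import java.io.IOException;
   
   import org.apache.lucene.index.LeafReaderContext;
   import org.apache.lucene.search.DocIdSet;
   import org.apache.lucene.search.DocIdSetIterator;
   import org.apache.solr.search.DocSet;
   
   final class DocSetDocIdSet {
     // Hypothetical helper: wrap the DocSet's per-leaf iterator in a trivial DocIdSet
     // so SegState can keep using Lucene's DocIdSet abstraction.
     static DocIdSet wrap(DocSet docs, LeafReaderContext context) {
       return new DocIdSet() {
         @Override
         public DocIdSetIterator iterator() throws IOException {
           return docs.iterator(context); // may be null when this leaf has no matching docs
         }
   
         @Override
         public long ramBytesUsed() {
           return 0L; // the wrapper itself holds nothing beyond the captured references
         }
       };
     }
   }
   ```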

##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;

Review comment:
       Let's scope this to BitDocSet (not SortedIntDocSet).  If you are thinking of this as a new optimization... then I'm not sure it's worth it.  Maybe?  I guess a benchmark will say.  It adds complexity.  If this is about an optimization, then for BitDocSet I don't think we need the cached floor docs; maybe we just need a leaf-ord-based bitmap to indicate which leaves are empty.
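   Roughly what I mean, as a sketch (the class and method names here are invented for illustration, not proposed code):
   
   ```java
   import java.util.List;
   
   import org.apache.lucene.index.LeafReaderContext;
   import org.apache.lucene.search.DocIdSetIterator;
   import org.apache.lucene.util.FixedBitSet;
   
   final class LeafOccupancy {
     // Hypothetical helper: one bit per leaf ord, set when the leaf contains at least one match.
     static FixedBitSet nonEmptyLeaves(FixedBitSet bits, List<LeafReaderContext> leaves) {
       FixedBitSet nonEmpty = new FixedBitSet(leaves.size());
       for (LeafReaderContext c : leaves) {
         int base = c.docBase;
         int max = base + c.reader().maxDoc();
         int next = base < bits.length() ? bits.nextSetBit(base) : DocIdSetIterator.NO_MORE_DOCS;
         if (next < max) {
           nonEmpty.set(c.ord); // this leaf has at least one matching doc
         }
       }
       return nonEmpty;
     }
   }
   ```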

##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;
+
+  /**
+   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
+   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
+   * in consumers afforded by returning <code>null</code> when there are no docs for a given
+   * segment).
+   */
+  private int getFloorDoc(final LeafReaderContext ctx) {
+    assert !ctx.isTopLevel;
+    final int[] floorDocs;
+    final int setMax = bits.length();
+    if (cachedFloorDocs != null) {
+      floorDocs = cachedFloorDocs;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      floorDocs = new int[leaves.size()];
+      int idx = 0;
+      int nextFloorDoc = -1;
+      for (LeafReaderContext c : leaves) {
+        final int base = c.docBase;
+        final int max = base + c.reader().maxDoc();
+        final int recordFloorDoc;
+        if (nextFloorDoc >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else if (nextFloorDoc >= base) {
+          recordFloorDoc = nextFloorDoc;
+        } else if (setMax <= base || (nextFloorDoc = bits.nextSetBit(base)) >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else {
+          recordFloorDoc = nextFloorDoc;
+        }
+        floorDocs[idx++] = recordFloorDoc;
+      }
+
+      cachedFloorDocs = floorDocs;
+    }
+    return floorDocs[ctx.ord];
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      switch (size) {
+        case 0:
+          return null;
+        default:
+          // we have an explicit size; use it
+          return new BitSetIterator(bits, size);
+        case -1:
+          // size has not been computed; use bits.length() as an upper bound on cost
+          final int maxSize = bits.length();
+          if (maxSize < 1) {
+            return null;
+          } else {
+            return new BitSetIterator(bits, maxSize);
+          }
+      }
+    }
+
+    final int maxDoc = context.reader().maxDoc();
+    if (maxDoc < 1) {
+      // entirely empty segment; verified this actually happens
+      return null;
+    }
+
+    final int firstDocId = getFloorDoc(context);

Review comment:
       Okay; you are adding complexity for an optimization (see my previous comment).  I wonder if it pays off?

##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;
+
+  /**
+   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
+   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
+   * in consumers afforded by returning <code>null</code> when there are no docs for a given
+   * segment).
+   */
+  private int getFloorDoc(final LeafReaderContext ctx) {
+    assert !ctx.isTopLevel;
+    final int[] floorDocs;
+    final int setMax = bits.length();
+    if (cachedFloorDocs != null) {
+      floorDocs = cachedFloorDocs;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      floorDocs = new int[leaves.size()];
+      int idx = 0;
+      int nextFloorDoc = -1;
+      for (LeafReaderContext c : leaves) {
+        final int base = c.docBase;
+        final int max = base + c.reader().maxDoc();
+        final int recordFloorDoc;
+        if (nextFloorDoc >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else if (nextFloorDoc >= base) {
+          recordFloorDoc = nextFloorDoc;
+        } else if (setMax <= base || (nextFloorDoc = bits.nextSetBit(base)) >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else {
+          recordFloorDoc = nextFloorDoc;
+        }
+        floorDocs[idx++] = recordFloorDoc;
+      }
+
+      cachedFloorDocs = floorDocs;
+    }
+    return floorDocs[ctx.ord];
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      switch (size) {
+        case 0:
+          return null;
+        default:
+          // we have an explicit size; use it
+          return new BitSetIterator(bits, size);

Review comment:
       oooooh, gotcha!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-796964155


   Restored branch, reopened PR. Sorry for the spam; I accidentally deleted the branch (which closed the PR).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r594847721



##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,68 +235,15 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
-  @Override
-  public Bits getBits(LeafReaderContext context) {
-    if (context.isTopLevel) {
-      return bits;
-    }
-
-    final int base = context.docBase;
-    final int length = context.reader().maxDoc();
-    final FixedBitSet bs = bits;
-
-    return new Bits() {
-      @Override
-      public boolean get(int index) {
-        return bs.get(index + base);
-      }
-
-      @Override
-      public int length() {
-        return length;
-      }
-    };
-  }
-
-  private static final int NO_DOCS_THIS_SEGMENT = -1;
-  private int[] cachedFloorDocs;
-
-  /**
-   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
-   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
-   * in consumers afforded by returning <code>null</code> when there are no docs for a given
-   * segment).
-   */
   private int getFloorDoc(final LeafReaderContext ctx) {

Review comment:
       Okay, but I still don't think you need this method (getFloorDoc) at all.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #2:
URL: https://github.com/apache/solr/pull/2#issuecomment-803123778


   I have a Solr benchmarking tool my colleagues built that I will try.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] magibney commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593278705



##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +679,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
 
-        final int endIdx = eidx;
-        lastEndIdx = endIdx;
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
 
+    final int mapOrdIdx = context.ord << 1;
+    final int startIdx = ordIdxMap[mapOrdIdx];
+    final int endIdx = ordIdxMap[mapOrdIdx + 1];
+
+    if (startIdx > endIdx) {
+      return null; // verified this does happen
+    }
+
+    return new DocIdSetIterator() {
+      int idx = startIdx;
+      int adjustedDoc = -1;
+
+      @Override
+      public int docID() {
+        return adjustedDoc;
+      }
+
+      @Override
+      public int nextDoc() {
+        return adjustedDoc = (idx > endIdx) ? NO_MORE_DOCS : (docs[idx++] - base);
+      }
+
+      @Override
+      public int advance(int target) {
+        if (idx > endIdx || target==NO_MORE_DOCS) return adjustedDoc=NO_MORE_DOCS;
+        target += base;
+
+        // probe next
+        int rawDoc = docs[idx++];
+        if (rawDoc >= target) return adjustedDoc=rawDoc-base;
+
+        int high = endIdx;
+
+        // TODO: probe more before resorting to binary search?
+
+        // binary search
+        while (idx <= high) {
+          int mid = (idx+high) >>> 1;
+          rawDoc = docs[mid];
+
+          if (rawDoc < target) {
+            idx = mid+1;
+          }
+          else if (rawDoc > target) {
+            high = mid-1;
+          }
+          else {
+            idx=mid+1;
+            return adjustedDoc=rawDoc - base;
+          }
+        }
+
+        // low is on the insertion point...
+        if (idx <= endIdx) {
+          return adjustedDoc = docs[idx++] - base;
+        } else {
+          return adjustedDoc=NO_MORE_DOCS;
+        }
+      }
+
+      @Override
+      public long cost() {
+        return docs.length;
+      }
+    };
+  }
+
+  @Override
+  public Filter getTopFilter() {

Review comment:
       I would have liked to remove this entirely, but there are still a number of cases where the `Filter` returned from `DocSet.getTopFilter()` is used as a `Query`, rather than directly used to retrieve a per-segment `DocIdSetIterator`. Some of these cases go pretty deep, so I considered that use case to be a separate question, and left it alone for now.
   It's true that this might be more straightforward than I initially perceived it to be; but even if straightforward, I think it might involve a significant amount of refactoring (however rote), so I guess my inclination would be to make the "Filter used as Query" case its own issue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [solr] dsmiley commented on a change in pull request #2: SOLR-14185: introduce DocSet.iterator(LeafReaderContext), replacing Filter where possible

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #2:
URL: https://github.com/apache/solr/pull/2#discussion_r593178464



##########
File path: solr/core/src/java/org/apache/solr/request/IntervalFacets.java
##########
@@ -292,16 +285,16 @@ private void getCountString() throws IOException {
           final SortedDocValues singleton = DocValues.unwrapSingleton(sub);
           if (singleton != null) {
             // some codecs may optimize SORTED_SET storage for single-valued fields
-            accumIntervalsSingle(singleton, disi, dis.bits());
+            accumIntervalsSingle(singleton, disi, docs.getBits(leaf));

Review comment:
       Something seems suspicious to me around here... I think IntervalFacets may be doing more work than necessary.  We shouldn't need both Bits & DISI from the same DocSet -- pick one.  Can you check?

##########
File path: solr/core/src/java/org/apache/solr/query/SolrRangeQuery.java
##########
@@ -320,16 +319,25 @@ public TermsEnum getTermsEnum(LeafReaderContext ctx) throws IOException {
 
   private static class SegState {
     final Weight weight;
-    final DocIdSet set;
+    final DocSet docs;
+    final DocIdSet docIdSet;

Review comment:
       Can you easily remove SolrRangeQuery's use of "DocIdSet"?  On the Solr end of things, it's associated with the deprecated Filter that you are removing.  It's a Lucene abstraction I'd prefer to avoid if we can easily do so.

##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;

Review comment:
       Caching this concerns me because DocSets are themselves cached and could be accessed concurrently.  I'm sure there's an answer involving volatile or other constructs, but my preference would be not to bother -- either pre-compute or avoid it altogether; don't cache.
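   The kind of construct I have in mind would be roughly this (a sketch with assumed names; I'd still rather we avoid the cache entirely):
   
   ```java
   // Publish a fully built, effectively immutable array through a volatile field; racing
   // threads may build it twice, but each build is idempotent so the result is the same.
   private volatile int[] cachedFloorDocs;
   
   private int[] floorDocs(List<LeafReaderContext> leaves) {
     int[] local = cachedFloorDocs;
     if (local == null) {
       local = buildFloorDocs(leaves); // hypothetical builder: the per-leaf scan as in this PR
       cachedFloorDocs = local;        // safe publication of the completed array
     }
     return local;
   }
   ```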

##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +679,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;

Review comment:
       What I was thinking, as expressed in the JIRA issue (maybe I didn't say it clearly), was that this would be calculated at collection time, and thus would become an argument to create a SortedIntDocSet.  It wouldn't be "cached".

##########
File path: solr/core/src/java/org/apache/solr/search/DocSet.java
##########
@@ -63,6 +65,13 @@ public static DocSet empty() {
   //TODO switch to DocIdSetIterator in Solr 9?
   public abstract DocIterator iterator();
 
+  /**
+   * Returns an ordered iterator of the documents in the set for the specified {@link LeafReaderContext}.
+   * <b>NOTE:</b> <code>null</code> should be returned if the filter doesn't accept any documents otherwise

Review comment:
       What "filter"?  Anyway, I'd word this simply as "May return null if there are no matching documents for this leaf"

##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +679,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
 
-        final int endIdx = eidx;
-        lastEndIdx = endIdx;
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
 
+    final int mapOrdIdx = context.ord << 1;
+    final int startIdx = ordIdxMap[mapOrdIdx];
+    final int endIdx = ordIdxMap[mapOrdIdx + 1];
+
+    if (startIdx > endIdx) {
+      return null; // verified this does happen
+    }
+
+    return new DocIdSetIterator() {
+      int idx = startIdx;
+      int adjustedDoc = -1;
+
+      @Override
+      public int docID() {
+        return adjustedDoc;
+      }
+
+      @Override
+      public int nextDoc() {
+        return adjustedDoc = (idx > endIdx) ? NO_MORE_DOCS : (docs[idx++] - base);
+      }
+
+      @Override
+      public int advance(int target) {
+        if (idx > endIdx || target==NO_MORE_DOCS) return adjustedDoc=NO_MORE_DOCS;
+        target += base;
+
+        // probe next
+        int rawDoc = docs[idx++];
+        if (rawDoc >= target) return adjustedDoc=rawDoc-base;
+
+        int high = endIdx;
+
+        // TODO: probe more before resorting to binary search?
+
+        // binary search
+        while (idx <= high) {
+          int mid = (idx+high) >>> 1;
+          rawDoc = docs[mid];
+
+          if (rawDoc < target) {
+            idx = mid+1;
+          }
+          else if (rawDoc > target) {
+            high = mid-1;
+          }
+          else {
+            idx=mid+1;
+            return adjustedDoc=rawDoc - base;
+          }
+        }
+
+        // low is on the insertion point...
+        if (idx <= endIdx) {
+          return adjustedDoc = docs[idx++] - base;
+        } else {
+          return adjustedDoc=NO_MORE_DOCS;
+        }
+      }
+
+      @Override
+      public long cost() {
+        return docs.length;

Review comment:
       Can you provide a more accurate cost since we know exactly how many docs are in this leaf segment?
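   For example, something along these lines inside the anonymous iterator (just a sketch; both indexes are inclusive in this diff):
   
   ```java
   @Override
   public long cost() {
     // number of docs from this DocSet that fall within the current leaf
     return endIdx - startIdx + 1;
   }
   ```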

##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;
+
+  /**
+   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
+   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
+   * in consumers afforded by returning <code>null</code> when there are no docs for a given
+   * segment).
+   */
+  private int getFloorDoc(final LeafReaderContext ctx) {
+    assert !ctx.isTopLevel;
+    final int[] floorDocs;
+    final int setMax = bits.length();
+    if (cachedFloorDocs != null) {
+      floorDocs = cachedFloorDocs;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      floorDocs = new int[leaves.size()];
+      int idx = 0;
+      int nextFloorDoc = -1;
+      for (LeafReaderContext c : leaves) {
+        final int base = c.docBase;
+        final int max = base + c.reader().maxDoc();
+        final int recordFloorDoc;
+        if (nextFloorDoc >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else if (nextFloorDoc >= base) {
+          recordFloorDoc = nextFloorDoc;
+        } else if (setMax <= base || (nextFloorDoc = bits.nextSetBit(base)) >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else {
+          recordFloorDoc = nextFloorDoc;
+        }
+        floorDocs[idx++] = recordFloorDoc;
+      }
+
+      cachedFloorDocs = floorDocs;
+    }
+    return floorDocs[ctx.ord];
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      switch (size) {
+        case 0:
+          return null;
+        default:
+          // we have an explicit size; use it
+          return new BitSetIterator(bits, size);

Review comment:
       I am very suspicious that this is correct.  If the context's first doc is 0 then it works.  If the context points to a later segment, this needs to be shifted; no?  Same for the "-1" size case just below.

##########
File path: solr/core/src/java/org/apache/solr/search/SortedIntDocSet.java
##########
@@ -671,107 +679,141 @@ public DocSet union(DocSet other) {
     return new BitDocSet(newbits);
   }
 
-  @Override
-  public Filter getTopFilter() {
-    return new Filter() {
+  private int[] cachedOrdIdxMap;
+  private int[] getOrdIdxMap(LeafReaderContext ctx) {
+    final int[] ret;
+    if (ctx.isTopLevel) {
+      // don't bother caching this?
+      ret = new int[] {0, docs.length - 1};
+    } else if (cachedOrdIdxMap != null) {
+      ret = cachedOrdIdxMap;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      ret = new int[leaves.size() << 1];
       int lastEndIdx = 0;
-
-      @Override
-      public DocIdSet getDocIdSet(final LeafReaderContext context, final Bits acceptDocs) {
-        LeafReader reader = context.reader();
-        // all Solr DocSets that are used as filters only include live docs
-        final Bits acceptDocs2 = acceptDocs == null ? null : (reader.getLiveDocs() == acceptDocs ? null : acceptDocs);
-
-        final int base = context.docBase;
-        final int maxDoc = reader.maxDoc();
+      for (LeafReaderContext lrc : leaves) {
+        final int base = lrc.docBase;
+        final int maxDoc = lrc.reader().maxDoc();
         final int max = base + maxDoc;   // one past the max doc in this segment.
         int sidx = Math.max(0,lastEndIdx);
 
-        if (sidx > 0 && docs[sidx-1] >= base) {
-          // oops, the lastEndIdx isn't correct... we must have been used
-          // in a multi-threaded context, or the indexreaders are being
-          // used out-of-order.  start at 0.
-          sidx = 0;
-        }
         if (sidx < docs.length && docs[sidx] < base) {
           // if docs[sidx] is < base, we need to seek to find the real start.
           sidx = findIndex(docs, base, sidx, docs.length-1);
         }
 
-        final int startIdx = sidx;
-
         // Largest possible end index is limited to the start index
         // plus the number of docs contained in the segment.  Subtract 1 since
         // the end index is inclusive.
-        int eidx = Math.min(docs.length, startIdx + maxDoc) - 1;
+        int eidx = Math.min(docs.length, sidx + maxDoc) - 1;
 
         // find the real end
-        eidx = findIndex(docs, max, startIdx, eidx) - 1;
+        eidx = findIndex(docs, max, sidx, eidx) - 1;
+
+        final int mapOrdIdx = lrc.ord << 1;
+        ret[mapOrdIdx] = sidx;
+        ret[mapOrdIdx + 1] = eidx;
+        lastEndIdx = eidx;
+      }
+      cachedOrdIdxMap = ret; // replace atomically after building
+    }
+    return ret;
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+
+    if (docs.length == 0 || context.reader().maxDoc() < 1) {
+      // empty docset or entirely empty segment (verified that the latter actually happens)
+      // NOTE: wrt the "empty docset" case, this is not just an optimization; this shortcircuits also
+      // to prevent the static DocSet.EmptyLazyHolder.INSTANCE from having cachedOrdIdxMap initiated
+      // across different IndexReaders.
+      return null;
+    }
 
-        final int endIdx = eidx;
-        lastEndIdx = endIdx;
+    int[] ordIdxMap = getOrdIdxMap(context);
+    final int base = context.docBase;
 
+    final int mapOrdIdx = context.ord << 1;
+    final int startIdx = ordIdxMap[mapOrdIdx];
+    final int endIdx = ordIdxMap[mapOrdIdx + 1];
+
+    if (startIdx > endIdx) {
+      return null; // verified this does happen
+    }
+
+    return new DocIdSetIterator() {
+      int idx = startIdx;
+      int adjustedDoc = -1;
+
+      @Override
+      public int docID() {
+        return adjustedDoc;
+      }
+
+      @Override
+      public int nextDoc() {
+        return adjustedDoc = (idx > endIdx) ? NO_MORE_DOCS : (docs[idx++] - base);
+      }
+
+      @Override
+      public int advance(int target) {
+        if (idx > endIdx || target==NO_MORE_DOCS) return adjustedDoc=NO_MORE_DOCS;
+        target += base;
+
+        // probe next
+        int rawDoc = docs[idx++];
+        if (rawDoc >= target) return adjustedDoc=rawDoc-base;
+
+        int high = endIdx;
+
+        // TODO: probe more before resorting to binary search?
+
+        // binary search
+        while (idx <= high) {
+          int mid = (idx+high) >>> 1;
+          rawDoc = docs[mid];
+
+          if (rawDoc < target) {
+            idx = mid+1;
+          }
+          else if (rawDoc > target) {
+            high = mid-1;
+          }
+          else {
+            idx=mid+1;
+            return adjustedDoc=rawDoc - base;
+          }
+        }
+
+        // low is on the insertion point...
+        if (idx <= endIdx) {
+          return adjustedDoc = docs[idx++] - base;
+        } else {
+          return adjustedDoc=NO_MORE_DOCS;
+        }
+      }
+
+      @Override
+      public long cost() {
+        return docs.length;
+      }
+    };
+  }
+
+  @Override
+  public Filter getTopFilter() {

Review comment:
       I thought the scope of this PR included no longer returning Filter?  No worries either way; I'll be making it go away if you don't.  We needn't scope-creep here; I'm bad at that.

##########
File path: solr/core/src/java/org/apache/solr/search/facet/SweepCountAware.java
##########
@@ -165,7 +165,10 @@ public final void incrementCount(int segOrd, int inc, int maxIdx) {
      */
     public void register(CountSlotAcc[] countAccs, LongValues toGlobal, int maxSegOrd) {
       int segOrd = maxSegOrd;
-      final int maxIdx = countAccs.length - 1;
+      // NOTE: the `countAccs` array may be oversized (e.g., in the event that one or more of the domains supplies a <code>null</code>

Review comment:
       I love seeing detailed comments like this

##########
File path: solr/core/src/java/org/apache/solr/search/BitDocSet.java
##########
@@ -237,6 +237,150 @@ public BitDocSet clone() {
     return new BitDocSet(bits.clone(), size);
   }
 
+  @Override
+  public Bits getBits(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      return bits;
+    }
+
+    final int base = context.docBase;
+    final int length = context.reader().maxDoc();
+    final FixedBitSet bs = bits;
+
+    return new Bits() {
+      @Override
+      public boolean get(int index) {
+        return bs.get(index + base);
+      }
+
+      @Override
+      public int length() {
+        return length;
+      }
+    };
+  }
+
+  private static final int NO_DOCS_THIS_SEGMENT = -1;
+  private int[] cachedFloorDocs;
+
+  /**
+   * Because `bits.nextSetBit(int)` (called in `nextDoc()`) has no upper limit, lazily cache
+   * floorDocs for each segment to avoid duplicate scanning of bits (and to enable optimization
+   * in consumers afforded by returning <code>null</code> when there are no docs for a given
+   * segment).
+   */
+  private int getFloorDoc(final LeafReaderContext ctx) {
+    assert !ctx.isTopLevel;
+    final int[] floorDocs;
+    final int setMax = bits.length();
+    if (cachedFloorDocs != null) {
+      floorDocs = cachedFloorDocs;
+    } else {
+      List<LeafReaderContext> leaves = ReaderUtil.getTopLevelContext(ctx).leaves();
+      floorDocs = new int[leaves.size()];
+      int idx = 0;
+      int nextFloorDoc = -1;
+      for (LeafReaderContext c : leaves) {
+        final int base = c.docBase;
+        final int max = base + c.reader().maxDoc();
+        final int recordFloorDoc;
+        if (nextFloorDoc >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else if (nextFloorDoc >= base) {
+          recordFloorDoc = nextFloorDoc;
+        } else if (setMax <= base || (nextFloorDoc = bits.nextSetBit(base)) >= max) {
+          recordFloorDoc = NO_DOCS_THIS_SEGMENT;
+        } else {
+          recordFloorDoc = nextFloorDoc;
+        }
+        floorDocs[idx++] = recordFloorDoc;
+      }
+
+      cachedFloorDocs = floorDocs;
+    }
+    return floorDocs[ctx.ord];
+  }
+
+  @Override
+  public DocIdSetIterator iterator(LeafReaderContext context) {
+    if (context.isTopLevel) {
+      switch (size) {
+        case 0:
+          return null;
+        default:
+          // we have an explicit size; use it
+          return new BitSetIterator(bits, size);
+        case -1:
+          // size has not been computed; use bits.length() as an upper bound on cost
+          final int maxSize = bits.length();
+          if (maxSize < 1) {
+            return null;
+          } else {
+            return new BitSetIterator(bits, maxSize);
+          }
+      }
+    }
+
+    final int maxDoc = context.reader().maxDoc();
+    if (maxDoc < 1) {
+      // entirely empty segment; verified this actually happens
+      return null;
+    }
+
+    final int firstDocId = getFloorDoc(context);

Review comment:
       I'm suspicious that it's necessary to cache and use this information.  The LeafReaderContext itself has docBase (which I see you are aware of) and reader.maxDoc.

##########
File path: solr/core/src/java/org/apache/solr/search/DocSet.java
##########
@@ -131,6 +140,15 @@ public int andNotSize(DocSet other) {
    */
   public abstract Bits getBits();
 
+  /**
+   * A per-segment {@link Bits} instance that has fast random access (as is generally
+   * required of Bits). In contrast with {@link #getBits()}, only trivial work should
+   * be done to generate a return value (i.e., if the underlying set natively supports
+   * random access). This method should return <code>null</code> for sets that do not
+   * support random access.
+   */
+  public abstract Bits getBits(LeafReaderContext context);

Review comment:
       Hmm; I wonder if this is needed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org