Posted to issues@lucene.apache.org by "jpountz (via GitHub)" <gi...@apache.org> on 2023/01/27 09:30:00 UTC

[GitHub] [lucene] jpountz opened a new pull request, #12114: Use radix sort to sort postings when index sorting is enabled.

jpountz opened a new pull request, #12114:
URL: https://github.com/apache/lucene/pull/12114

   This switches from TimSorter to LSBRadixSorter for sorting postings whose index options are `DOCS`. On a synthetic benchmark, this made barely any difference when the index order is the same as the sort order or its reverse, but gave an almost 3x speedup for writing postings when the index order is mostly random.
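
   For readers unfamiliar with the technique, here is a minimal, self-contained sketch of a least-significant-digit radix sort over non-negative ints. It is not the code from this PR and it does not use Lucene's `LSBRadixSorter`; the class name `LsbRadixSketch`, the 8-bit digit size, and the `numBits` parameter are assumptions made for illustration only.

   ```java
   import java.util.Arrays;

   // Illustrative sketch only; not Lucene's LSBRadixSorter.
   class LsbRadixSketch {

     /** Sorts non-negative ints in place, one 8-bit digit per pass, least significant digit first. */
     static void sort(int[] values, int numBits) {
       int[] src = values;
       int[] dst = new int[values.length];
       for (int shift = 0; shift < numBits; shift += 8) {
         // counts[d + 1] first holds how many values have digit d.
         int[] counts = new int[257];
         for (int v : src) {
           counts[((v >>> shift) & 0xFF) + 1]++;
         }
         // After the prefix sum, counts[d] is the first output slot for digit d.
         for (int i = 0; i < 256; i++) {
           counts[i + 1] += counts[i];
         }
         // Stable scatter: values with equal digits keep their relative order.
         for (int v : src) {
           dst[counts[(v >>> shift) & 0xFF]++] = v;
         }
         int[] tmp = src;
         src = dst;
         dst = tmp;
       }
       // After an odd number of passes the sorted data sits in the scratch array.
       if (src != values) {
         System.arraycopy(src, 0, values, 0, values.length);
       }
     }

     public static void main(String[] args) {
       int[] docs = {42, 7, 300, 0, 65_536, 7, 129};
       sort(docs, 17); // 17 bits are enough to represent 65_536
       System.out.println(Arrays.toString(docs)); // prints [0, 7, 7, 42, 129, 300, 65536]
     }
   }
   ```

   Each pass is linear in the number of values and the number of passes depends only on how many bits the largest value needs, so the cost does not depend on how ordered the input already is. That fits the benchmark observation that the gain shows up mostly when the index order is close to random.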
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12114: Use radix sort to sort postings when index sorting is enabled.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12114:
URL: https://github.com/apache/lucene/pull/12114#issuecomment-1469650532

   I purposely introduced a bug to see what would fail. Only high-level tests that check early query termination or dynamic pruning failed, so I added lower-level tests that make sure postings get reordered correctly.
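
   As a rough idea of what such a lower-level check verifies (this is a hypothetical sketch, not one of the tests added in the PR), the snippet below remaps a term's doc IDs through an old-to-new doc map and asserts that the result comes out in new doc ID order. The class `ReorderCheckSketch`, the `reorder` helper, and the plain `int[]` doc map are all assumptions for illustration.

   ```java
   import java.util.Arrays;

   // Hypothetical illustration; the real tests exercise Lucene's own classes.
   class ReorderCheckSketch {

     /** Remaps each old doc ID through oldToNew, then sorts so postings are in new doc ID order. */
     static int[] reorder(int[] postings, int[] oldToNew) {
       int[] remapped = new int[postings.length];
       for (int i = 0; i < postings.length; i++) {
         remapped[i] = oldToNew[postings[i]];
       }
       Arrays.sort(remapped);
       return remapped;
     }

     public static void main(String[] args) {
       // A segment of 5 docs whose index sort reverses the indexing order.
       int[] oldToNew = {4, 3, 2, 1, 0};
       // Old doc IDs that contain the term.
       int[] postings = {0, 1, 4};
       int[] actual = reorder(postings, oldToNew);
       int[] expected = {0, 3, 4};
       if (!Arrays.equals(actual, expected)) {
         throw new AssertionError("unexpected order: " + Arrays.toString(actual));
       }
       System.out.println("postings reordered correctly: " + Arrays.toString(actual));
     }
   }
   ```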




[GitHub] [lucene] jpountz commented on a diff in pull request #12114: Use radix sort to sort postings when index sorting is enabled.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on code in PR #12114:
URL: https://github.com/apache/lucene/pull/12114#discussion_r1092226923


##########
lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriter.java:
##########
@@ -379,27 +272,24 @@ public int advance(final int target) throws IOException {
 
     @Override
     public int docID() {
-      return docIt < 0 ? -1 : docIt >= upto ? NO_MORE_DOCS : docs[docIt];
+      return docIt < 0 ? -1 : docs[docIt];
     }
 
     @Override
-    public int freq() throws IOException {
-      return withFreqs && docIt < upto ? freqs[docIt] : 1;
+    public int nextDoc() throws IOException {
+      return docs[++docIt];
     }
 
     @Override
-    public int nextDoc() throws IOException {
-      if (++docIt >= upto) return NO_MORE_DOCS;
-      return docs[docIt];
+    public long cost() {
+      return upTo;
     }
 
-    /** Returns the wrapped {@link PostingsEnum}. */
-    PostingsEnum getWrapped() {
-      return in;
+    @Override
+    public int freq() throws IOException {

Review Comment:
   With this change, fields that have frequencies are now handled by `SortingPostingsEnum` while `SortingDocsEnum` focuses on fields that only index docs.
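
   The diff above drops the `docIt >= upto` bounds checks from `docID()` and `nextDoc()`. One common way to make that safe, and presumably what is relied on here (an assumption, since the rest of the class is not shown in this hunk), is to store a `NO_MORE_DOCS` sentinel right after the last doc ID. A minimal standalone sketch, with the hypothetical class name `SentinelDocsSketch`:

   ```java
   import java.util.Arrays;

   // Illustrative sketch only; not the SortingDocsEnum from the PR.
   class SentinelDocsSketch {
     // Same value as Lucene's DocIdSetIterator.NO_MORE_DOCS.
     static final int NO_MORE_DOCS = Integer.MAX_VALUE;

     private final int[] docs; // sorted doc IDs followed by one sentinel entry
     private final int upTo;   // number of real doc IDs
     private int docIt = -1;

     SentinelDocsSketch(int[] sortedDocs) {
       upTo = sortedDocs.length;
       docs = Arrays.copyOf(sortedDocs, upTo + 1);
       docs[upTo] = NO_MORE_DOCS; // the sentinel replaces the per-call bounds check
     }

     int docID() {
       return docIt < 0 ? -1 : docs[docIt];
     }

     // Callers must stop once NO_MORE_DOCS has been returned;
     // a further call would read past the sentinel.
     int nextDoc() {
       return docs[++docIt];
     }

     long cost() {
       return upTo;
     }

     public static void main(String[] args) {
       SentinelDocsSketch it = new SentinelDocsSketch(new int[] {2, 5, 9});
       for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
         System.out.println(doc);
       }
     }
   }
   ```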





[GitHub] [lucene] zhaih commented on a diff in pull request #12114: Use radix sort to sort postings when index sorting is enabled.

Posted by "zhaih (via GitHub)" <gi...@apache.org>.
zhaih commented on code in PR #12114:
URL: https://github.com/apache/lucene/pull/12114#discussion_r1091567169


##########
lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriter.java:
##########
@@ -379,27 +272,24 @@ public int advance(final int target) throws IOException {
 
     @Override
     public int docID() {
-      return docIt < 0 ? -1 : docIt >= upto ? NO_MORE_DOCS : docs[docIt];
+      return docIt < 0 ? -1 : docs[docIt];
     }
 
     @Override
-    public int freq() throws IOException {
-      return withFreqs && docIt < upto ? freqs[docIt] : 1;
+    public int nextDoc() throws IOException {
+      return docs[++docIt];
     }
 
     @Override
-    public int nextDoc() throws IOException {
-      if (++docIt >= upto) return NO_MORE_DOCS;
-      return docs[docIt];
+    public long cost() {
+      return upTo;
     }
 
-    /** Returns the wrapped {@link PostingsEnum}. */
-    PostingsEnum getWrapped() {
-      return in;
+    @Override
+    public int freq() throws IOException {

Review Comment:
   So we're removing `freq` support because no one is really using it?





[GitHub] [lucene] jpountz merged pull request #12114: Use radix sort to sort postings when index sorting is enabled.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz merged PR #12114:
URL: https://github.com/apache/lucene/pull/12114




[GitHub] [lucene] jpountz commented on pull request #12114: Use radix sort to sort postings when index sorting is enabled.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12114:
URL: https://github.com/apache/lucene/pull/12114#issuecomment-1406248901

   Here is the synthetic benchmark that I used, in case someone is interested in reproducing the results:
   
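   ```java
   import java.io.IOException;
   import java.nio.file.Paths;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field.Store;
   import org.apache.lucene.document.LongField;
   import org.apache.lucene.document.StringField;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.search.Sort;
   import org.apache.lucene.search.SortedNumericSelector;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;
   import org.apache.lucene.util.PrintStreamInfoStream;
   
   // The class name is arbitrary; the original snippet omitted the enclosing class and imports.
   public class PostingsSortBenchmark {
     enum Order {
       RANDOM,
       ASC,
       DESC;
     }
   
     public static void main(String[] args) throws IOException {
       // Switch to ASC or DESC to compare the different index orders.
       Order order = Order.RANDOM;
       Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
       IndexWriterConfig cfg = new IndexWriterConfig(null);
       // The info stream logs flush details to stdout.
       cfg.setInfoStream(new PrintStreamInfoStream(System.out));
       // Flush every 100k docs rather than based on RAM usage.
       cfg.setMaxBufferedDocs(100_000);
       cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
       cfg.setIndexSort(new Sort(LongField.newSortField("sort_field", false, SortedNumericSelector.Type.MIN)));
       IndexWriter w = new IndexWriter(dir, cfg);
       // A single Document is reused; only the field values change per iteration.
       Document doc = new Document();
       LongField sortField = new LongField("sort_field", 0);
       doc.add(sortField);
       StringField stringField1 = new StringField("string_field", "", Store.NO);
       doc.add(stringField1);
       StringField stringField2 = new StringField("string_field", "", Store.NO);
       doc.add(stringField2);
       StringField stringField3 = new StringField("string_field", "", Store.NO);
       doc.add(stringField3);
       for (int i = 0; i < 5_000_000; ++i) {
         long sortValue = switch (order) {
         // Not truly random: values cycle through 0..14, so sort order interleaves with indexing order.
         case RANDOM -> i % 15;
         case ASC -> i;
         case DESC -> -i;
         };
         sortField.setLongValue(sortValue);
         stringField1.setStringValue(Integer.toBinaryString(i % 10));
         stringField2.setStringValue(Integer.toBinaryString(i % 100));
         stringField3.setStringValue(Integer.toBinaryString(i % 1000));
         w.addDocument(doc);
       }
     }
   }
   ```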
   
   And flush times for postings:
   
   | Scenario                             | Main | Patch |
   | ------------------------------------ | ---- | ----- |
   | Index sort matches indexing order    | 6    | 7     |
   | Index sort is reverse indexing order | 7    | 8     |
   | Random sort                          | 27   | 10    |

