You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/09/22 13:28:43 UTC

[GitHub] [lucene-solr] s1monw opened a new pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

s1monw opened a new pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909


   SortingCodecReader keeps all docvalues in memory that are loaded from this reader.
   Yet, this reader should only be used for merging which happens sequentially. This makes
   caching docvalues unnecessary.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jpountz commented on a change in pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909#discussion_r493279172



##########
File path: lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java
##########
@@ -510,4 +457,52 @@ public LeafMetaData getMetaData() {
     return metaData;
   }
 
+  // we try to cache the last used DV or Norms instance since during merge
+  // this instance is used more than once. We could in addition to this single instance
+  // also cache the fields that are used for sorting since we do the work twice for these fields
+  private String cachedField;
+  private Object cachedObject;
+  private boolean cacheIsNorms;
+
+  private <T> T getOrCreateNorms(String field, IOSupplier<T> supplier) throws IOException {
+    return getOrCreate(field, true, supplier);
+  }
+
+  @SuppressWarnings("unchecked")
+  private synchronized  <T> T getOrCreate(String field, boolean norms, IOSupplier<T> supplier) throws IOException {
+    if ((field.equals(cachedField) && cacheIsNorms == norms) == false) {
+      assert assertCreatedOnlyOnce(field, norms);
+      cachedObject = supplier.get();
+      cachedField = field;
+      cacheIsNorms = norms;
+
+    }
+    assert cachedObject != null;
+    return (T) cachedObject;
+  }
+
+  private final Map<String, Integer> cacheStats = new HashMap<>(); // only with assertions enabled
+  private boolean assertCreatedOnlyOnce(String field, boolean norms) {
+    assert Thread.holdsLock(this);
+    // this is mainly there to make sure we change anything in the way we merge we realize it early
+    Integer timesCached = cacheStats.compute(field + "N:" + norms, (s, i) -> i == null ? 1 : i.intValue() + 1);
+    if (timesCached > 1) {
+      assert norms == false :"[" + field + "] norms must not be cached twice";

Review comment:
       I think we might cache norms twice if full-text is indexed, as we'd pull norms once for merging norms, and another time to index impacts in postings for the same field.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] s1monw commented on a change in pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
s1monw commented on a change in pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909#discussion_r493482012



##########
File path: lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java
##########
@@ -510,4 +457,52 @@ public LeafMetaData getMetaData() {
     return metaData;
   }
 
+  // we try to cache the last used DV or Norms instance since during merge
+  // this instance is used more than once. We could in addition to this single instance
+  // also cache the fields that are used for sorting since we do the work twice for these fields
+  private String cachedField;
+  private Object cachedObject;
+  private boolean cacheIsNorms;
+
+  private <T> T getOrCreateNorms(String field, IOSupplier<T> supplier) throws IOException {
+    return getOrCreate(field, true, supplier);
+  }
+
+  @SuppressWarnings("unchecked")
+  private synchronized  <T> T getOrCreate(String field, boolean norms, IOSupplier<T> supplier) throws IOException {
+    if ((field.equals(cachedField) && cacheIsNorms == norms) == false) {
+      assert assertCreatedOnlyOnce(field, norms);
+      cachedObject = supplier.get();
+      cachedField = field;
+      cacheIsNorms = norms;
+
+    }
+    assert cachedObject != null;
+    return (T) cachedObject;
+  }
+
+  private final Map<String, Integer> cacheStats = new HashMap<>(); // only with assertions enabled
+  private boolean assertCreatedOnlyOnce(String field, boolean norms) {
+    assert Thread.holdsLock(this);
+    // this is mainly there to make sure we change anything in the way we merge we realize it early
+    Integer timesCached = cacheStats.compute(field + "N:" + norms, (s, i) -> i == null ? 1 : i.intValue() + 1);
+    if (timesCached > 1) {
+      assert norms == false :"[" + field + "] norms must not be cached twice";

Review comment:
       I think what we do here is we pull the already merged norms instance from disk instead of the one from the source reader. Is that what you mean in `PushPostingsWriterBase`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jimczi commented on a change in pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
jimczi commented on a change in pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909#discussion_r492921270



##########
File path: lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java
##########
@@ -41,21 +41,13 @@
  * {@link Sort}. This can be used to re-sort and index after it's been created by wrapping all
  * readers of the index with this reader and adding it to a fresh IndexWriter via
  * {@link IndexWriter#addIndexes(CodecReader...)}.
+ * NOTE: This reader should only be used for merging. Pulling fields from this ready might be very costly and memory

Review comment:
       nit: 
   ```suggestion
    * NOTE: This reader should only be used for merging. Pulling fields from this reader might be very costly and memory
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jimczi commented on a change in pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
jimczi commented on a change in pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909#discussion_r492921270



##########
File path: lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java
##########
@@ -41,21 +41,13 @@
  * {@link Sort}. This can be used to re-sort and index after it's been created by wrapping all
  * readers of the index with this reader and adding it to a fresh IndexWriter via
  * {@link IndexWriter#addIndexes(CodecReader...)}.
+ * NOTE: This reader should only be used for merging. Pulling fields from this ready might be very costly and memory

Review comment:
       nit: 
   ```suggestion
    * NOTE: This reader should only be used for merging. Pulling fields from this reader might be very costly and memory
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] s1monw merged pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
s1monw merged pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jpountz commented on a change in pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909#discussion_r493490751



##########
File path: lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java
##########
@@ -510,4 +457,52 @@ public LeafMetaData getMetaData() {
     return metaData;
   }
 
+  // we try to cache the last used DV or Norms instance since during merge
+  // this instance is used more than once. We could in addition to this single instance
+  // also cache the fields that are used for sorting since we do the work twice for these fields
+  private String cachedField;
+  private Object cachedObject;
+  private boolean cacheIsNorms;
+
+  private <T> T getOrCreateNorms(String field, IOSupplier<T> supplier) throws IOException {
+    return getOrCreate(field, true, supplier);
+  }
+
+  @SuppressWarnings("unchecked")
+  private synchronized  <T> T getOrCreate(String field, boolean norms, IOSupplier<T> supplier) throws IOException {
+    if ((field.equals(cachedField) && cacheIsNorms == norms) == false) {
+      assert assertCreatedOnlyOnce(field, norms);
+      cachedObject = supplier.get();
+      cachedField = field;
+      cacheIsNorms = norms;
+
+    }
+    assert cachedObject != null;
+    return (T) cachedObject;
+  }
+
+  private final Map<String, Integer> cacheStats = new HashMap<>(); // only with assertions enabled
+  private boolean assertCreatedOnlyOnce(String field, boolean norms) {
+    assert Thread.holdsLock(this);
+    // this is mainly there to make sure we change anything in the way we merge we realize it early
+    Integer timesCached = cacheStats.compute(field + "N:" + norms, (s, i) -> i == null ? 1 : i.intValue() + 1);
+    if (timesCached > 1) {
+      assert norms == false :"[" + field + "] norms must not be cached twice";

Review comment:
       Ah I had forgotten we were doing things this way. Then ignore my comment!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] s1monw commented on pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
s1monw commented on pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909#issuecomment-697184182


   > I think our current DV format pulls doc values for a single field several times when flushing/merging, e.g. first to figure out whether the field is single-valued and how many bits per value are needed, and a second time to actually write data. Should we at least cache the last DVs that got pulled so that the second time you pull them, we don't re-do a lot of work?
   
   that's correct. some of them are pulled like 5 times. I added a very simple cache and assertions that makes sure we can reuse the same instance if it's pulled more than once.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] s1monw commented on a change in pull request #1909: LUCENE-9539: Remove caches from SortingCodecReader

Posted by GitBox <gi...@apache.org>.
s1monw commented on a change in pull request #1909:
URL: https://github.com/apache/lucene-solr/pull/1909#discussion_r493477140



##########
File path: lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java
##########
@@ -510,4 +457,52 @@ public LeafMetaData getMetaData() {
     return metaData;
   }
 
+  // we try to cache the last used DV or Norms instance since during merge
+  // this instance is used more than once. We could in addition to this single instance
+  // also cache the fields that are used for sorting since we do the work twice for these fields
+  private String cachedField;
+  private Object cachedObject;
+  private boolean cacheIsNorms;
+
+  private <T> T getOrCreateNorms(String field, IOSupplier<T> supplier) throws IOException {
+    return getOrCreate(field, true, supplier);
+  }
+
+  @SuppressWarnings("unchecked")
+  private synchronized  <T> T getOrCreate(String field, boolean norms, IOSupplier<T> supplier) throws IOException {
+    if ((field.equals(cachedField) && cacheIsNorms == norms) == false) {
+      assert assertCreatedOnlyOnce(field, norms);
+      cachedObject = supplier.get();
+      cachedField = field;
+      cacheIsNorms = norms;
+
+    }
+    assert cachedObject != null;
+    return (T) cachedObject;
+  }
+
+  private final Map<String, Integer> cacheStats = new HashMap<>(); // only with assertions enabled
+  private boolean assertCreatedOnlyOnce(String field, boolean norms) {
+    assert Thread.holdsLock(this);
+    // this is mainly there to make sure we change anything in the way we merge we realize it early
+    Integer timesCached = cacheStats.compute(field + "N:" + norms, (s, i) -> i == null ? 1 : i.intValue() + 1);
+    if (timesCached > 1) {
+      assert norms == false :"[" + field + "] norms must not be cached twice";

Review comment:
       can you point me to the place where we do this? If that is the case our tests are not good enough here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org