You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/24 15:30:59 UTC

[GitHub] [lucene-solr] pawel-bugalski-dynatrace opened a new pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

pawel-bugalski-dynatrace opened a new pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429


   # Description
   
   Replaces `scratch1` field in `BytesRefHash` with `scratch` local variable. As a result
   it is now possible to call `BytesRefHash#find` concurrently as long as there are
   no concurrent modifications to BytesRefHash instance and it is correctly published.
   
   This addresses the concurrency issue with Monitor (aka Luwak) since it
   is using `BytesRefHash#find` concurrently without additional synchronization.
   
   While in theory creating a new `BytesRef` in `equals` method means additional allocation,
   in practice this allocation is removed by C2 compiler when inlining find method.
   
   # Solution
   
   Remove unnecessary state from `BytesRefHash` so that non-mutating find method behaves similar
   to `HashMap`'s get/containsKey.
   
   # Tests
   
   Added a new unit test that reproduces a problem.
   Checked that C2 compiler correctly removes the new allocation.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r583527169



##########
File path: lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java
##########
@@ -31,18 +31,21 @@
  * to the id is encapsulated inside {@link BytesRefHash} and is guaranteed to be increased for each
  * added {@link BytesRef}.
  *
+ * <p><strong>Note that this implementation is not synchronized.</strong> If multiple threads access
+ * a {@link BytesRefHash} instance concurrently, and at least one of the threads modifies it
+ * structurally, it <i>must</i> be synchronized externally. (A structural modification is any
+ * operation on the map except operations explicitly listed in {@link UnmodifiableBytesRefHash}
+ * interface).
+ *
  * <p>Note: The maximum capacity {@link BytesRef} instance passed to {@link #add(BytesRef)} must not
  * be longer than {@link ByteBlockPool#BYTE_BLOCK_SIZE}-2. The internal storage is limited to 2GB
  * total byte storage.
  *
  * @lucene.internal
  */
-public final class BytesRefHash implements Accountable {
+public final class BytesRefHash implements Accountable, UnmodifiableBytesRefHash {

Review comment:
       Definitely. I've replaced interface with a class that wraps BytesRefHash. For me it looks much better now, but I wander what you think about it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] uschindler commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
uschindler commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790589517


   Hi,
   I agree with Mike. I like the equals() method to be thread safe. That was my original proposal.
   Generally: BytesRefHash is my favourite class if you need a Set<String>. Although it's marked internal, I prefer to use it. Especially if you need a set of millions of strings, this is fast and does not produce millions of Strings. I personally used it only single threaded, but in all cases a method called equals should never ever change state. Sorry!
   
   +1 for the fix
   -1 to add the unmodifiable interface. That's over-engineered.
   
   Uwe
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r583526175



##########
File path: lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java
##########
@@ -31,18 +31,21 @@
  * to the id is encapsulated inside {@link BytesRefHash} and is guaranteed to be increased for each
  * added {@link BytesRef}.
  *
+ * <p><strong>Note that this implementation is not synchronized.</strong> If multiple threads access
+ * a {@link BytesRefHash} instance concurrently, and at least one of the threads modifies it
+ * structurally, it <i>must</i> be synchronized externally. (A structural modification is any
+ * operation on the map except operations explicitly listed in {@link UnmodifiableBytesRefHash}

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] uschindler edited a comment on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
uschindler edited a comment on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790589517


   Hi,
   I agree with Mike. I like the equals() method to be thread safe. That was my original proposal.
   Generally: BytesRefHash is my favourite class if you need a `Set<String>`. Although it's marked internal, I prefer to use it. Especially if you need a set of millions of strings, this is fast and does not produce millions of Strings. I personally used it only single threaded, but in all cases a method called equals should never ever change state. Sorry!
   
   +1 for the fix
   -1 to add the unmodifiable interface. That's over-engineered.
   
   Uwe
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790583058


   +1 for changing `equals` to not require allocation, enabling us to remove the thread-unsafe shared `BytesRef scratch1`!  This makes `find` thread-safe (as long as no other threads are making structural changes), and would suffice to fix `Luwak`'s usage, right?  This is a nice improvement by itself!
   
   I'm also not a fan of adding the `UnmodifiableBytesRefHash` wrapper -- this is indeed an `@lucene.internal` API, not a generic JDK Collections class.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r587450225



##########
File path: lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java
##########
@@ -31,18 +31,21 @@
  * to the id is encapsulated inside {@link BytesRefHash} and is guaranteed to be increased for each
  * added {@link BytesRef}.
  *
+ * <p><strong>Note that this implementation is not synchronized.</strong> If multiple threads access
+ * a {@link BytesRefHash} instance concurrently, and at least one of the threads modifies it
+ * structurally, it <i>must</i> be synchronized externally. (A structural modification is any
+ * operation on the map except operations explicitly listed in {@link UnmodifiableBytesRefHash}
+ * interface).
+ *
  * <p>Note: The maximum capacity {@link BytesRef} instance passed to {@link #add(BytesRef)} must not
  * be longer than {@link ByteBlockPool#BYTE_BLOCK_SIZE}-2. The internal storage is limited to 2GB
  * total byte storage.
  *
  * @lucene.internal
  */
-public final class BytesRefHash implements Accountable {
+public final class BytesRefHash implements Accountable, UnmodifiableBytesRefHash {

Review comment:
       Based on comments I'm going to remove UnmodifiableBytesRefHash altogether




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace closed pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace closed pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
rmuir commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790554086


   Just to emphasize it even more, this class is marked `@lucene.internal`. The class shouldn't even be exposed to the outside in the public API to start with, so let's please not increase the exposure.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-786053617


   I've implemented some of the ideas discussed here. What do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r583564032



##########
File path: lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
##########
@@ -267,6 +271,71 @@ public void testFind() throws Exception {
     }
   }
 
+  @Test
+  public void testConcurrentAccessToUnmodifiableBytesRefHash() throws Exception {
+    int num = atLeast(2);
+    for (int j = 0; j < num; j++) {
+      int numStrings = 797;
+      List<String> strings = new ArrayList<>(numStrings);
+      for (int i = 0; i < numStrings; i++) {
+        final String str = TestUtil.randomRealisticUnicodeString(random(), 1, 1000);
+        hash.add(new BytesRef(str));
+        assertTrue(strings.add(str));
+      }
+      int hashSize = hash.size();
+
+      UnmodifiableBytesRefHash unmodifiableHash = hash;
+
+      AtomicInteger notFound = new AtomicInteger();
+      AtomicInteger notEquals = new AtomicInteger();
+      AtomicInteger wrongSize = new AtomicInteger();
+      int numThreads = 10;
+      CountDownLatch latch = new CountDownLatch(numThreads);
+      Thread[] threads = new Thread[numThreads];
+      for (int i = 0; i < threads.length; i++) {
+        int loops = atLeast(100);
+        threads[i] =
+            new Thread(
+                () -> {
+                  BytesRef scratch = new BytesRef();
+                  latch.countDown();
+                  try {
+                    latch.await();
+                  } catch (InterruptedException e) {

Review comment:
       Yes. My mistake. Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r583677955



##########
File path: lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
##########
@@ -267,6 +271,71 @@ public void testFind() throws Exception {
     }
   }
 
+  @Test
+  public void testConcurrentAccessToUnmodifiableBytesRefHash() throws Exception {
+    int num = atLeast(2);
+    for (int j = 0; j < num; j++) {
+      int numStrings = 797;
+      List<String> strings = new ArrayList<>(numStrings);
+      for (int i = 0; i < numStrings; i++) {
+        final String str = TestUtil.randomRealisticUnicodeString(random(), 1, 1000);
+        hash.add(new BytesRef(str));
+        assertTrue(strings.add(str));
+      }
+      int hashSize = hash.size();
+
+      UnmodifiableBytesRefHash unmodifiableHash = new UnmodifiableBytesRefHash(hash);
+
+      AtomicInteger notFound = new AtomicInteger();
+      AtomicInteger notEquals = new AtomicInteger();
+      AtomicInteger wrongSize = new AtomicInteger();
+      int numThreads = 10;

Review comment:
       Can we tone this down to use less threads (at least locally, it is ok to increase for NIGHTLY). Using many threads causes tests like this to run extremely slow on wimpier machine (e.g. i have 2 cores)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r585459927



##########
File path: lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
##########
@@ -267,6 +271,71 @@ public void testFind() throws Exception {
     }
   }
 
+  @Test
+  public void testConcurrentAccessToUnmodifiableBytesRefHash() throws Exception {
+    int num = atLeast(2);
+    for (int j = 0; j < num; j++) {
+      int numStrings = 797;
+      List<String> strings = new ArrayList<>(numStrings);
+      for (int i = 0; i < numStrings; i++) {
+        final String str = TestUtil.randomRealisticUnicodeString(random(), 1, 1000);
+        hash.add(new BytesRef(str));
+        assertTrue(strings.add(str));
+      }
+      int hashSize = hash.size();
+
+      UnmodifiableBytesRefHash unmodifiableHash = new UnmodifiableBytesRefHash(hash);
+
+      AtomicInteger notFound = new AtomicInteger();
+      AtomicInteger notEquals = new AtomicInteger();
+      AtomicInteger wrongSize = new AtomicInteger();
+      int numThreads = 10;

Review comment:
       ```suggestion
         int numThreads = atLeast(3);
   ```
   Sure. I've copy&pasted it from another test. I think atLeast(3) should be fine. It's fast on smaller number of CPUs and grows to larger number of threads when nightly tests are enabled. What you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790467163


   @rmuir @madrob what do you think about current state of this PR? Any more comments? What else needs to be done to merge it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-795758595


   I'm closing this pull request as it was replaced by https://github.com/apache/lucene/pull/8 that is targeting a new repository.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
rmuir commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-785256914


   > > As a result
   > > it is now possible to call BytesRefHash#find concurrently as long as there are
   > > no concurrent modifications to BytesRefHash instance and it is correctly published.
   > 
   > This feels like a trap waiting for somebody to fall in. I think a better solution would be a pool of buffers and proper concurrency controls around them, but don't know what the performance would look like.
   > 
   > At a minimum, we need to update the method javadoc to explain what is going on and what proper usage looks like.
   
   sorry I don't think it is a trap really. Everything else using this thing is using it in a per-thread way. A lot of lucene works this way, it keeps things fast.
   
   I don't think we should slow down all the correct users and add complex concurrency, just because we have one module abusing it. This would be equivalent to making all java code use HashTable/StringBuffer when most code only needs HashMap/StringBuilder. We could easily slow down important things such as indexing. See my comments on the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
rmuir commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-785989874


   > In that case we can easily implement equals without allocation.
   
   +1 for this idea: just inline the ByteBlockPool.setBytesRef and simply avoid the allocation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] madrob commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r583176395



##########
File path: lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java
##########
@@ -31,18 +31,21 @@
  * to the id is encapsulated inside {@link BytesRefHash} and is guaranteed to be increased for each
  * added {@link BytesRef}.
  *
+ * <p><strong>Note that this implementation is not synchronized.</strong> If multiple threads access
+ * a {@link BytesRefHash} instance concurrently, and at least one of the threads modifies it
+ * structurally, it <i>must</i> be synchronized externally. (A structural modification is any
+ * operation on the map except operations explicitly listed in {@link UnmodifiableBytesRefHash}

Review comment:
       not a map

##########
File path: lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
##########
@@ -267,6 +271,71 @@ public void testFind() throws Exception {
     }
   }
 
+  @Test
+  public void testConcurrentAccessToUnmodifiableBytesRefHash() throws Exception {
+    int num = atLeast(2);
+    for (int j = 0; j < num; j++) {
+      int numStrings = 797;
+      List<String> strings = new ArrayList<>(numStrings);
+      for (int i = 0; i < numStrings; i++) {
+        final String str = TestUtil.randomRealisticUnicodeString(random(), 1, 1000);
+        hash.add(new BytesRef(str));
+        assertTrue(strings.add(str));
+      }
+      int hashSize = hash.size();
+
+      UnmodifiableBytesRefHash unmodifiableHash = hash;
+
+      AtomicInteger notFound = new AtomicInteger();
+      AtomicInteger notEquals = new AtomicInteger();
+      AtomicInteger wrongSize = new AtomicInteger();
+      int numThreads = 10;
+      CountDownLatch latch = new CountDownLatch(numThreads);
+      Thread[] threads = new Thread[numThreads];
+      for (int i = 0; i < threads.length; i++) {
+        int loops = atLeast(100);
+        threads[i] =
+            new Thread(
+                () -> {
+                  BytesRef scratch = new BytesRef();
+                  latch.countDown();
+                  try {
+                    latch.await();
+                  } catch (InterruptedException e) {

Review comment:
       This should fail the test?

##########
File path: lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java
##########
@@ -31,18 +31,21 @@
  * to the id is encapsulated inside {@link BytesRefHash} and is guaranteed to be increased for each
  * added {@link BytesRef}.
  *
+ * <p><strong>Note that this implementation is not synchronized.</strong> If multiple threads access
+ * a {@link BytesRefHash} instance concurrently, and at least one of the threads modifies it
+ * structurally, it <i>must</i> be synchronized externally. (A structural modification is any
+ * operation on the map except operations explicitly listed in {@link UnmodifiableBytesRefHash}
+ * interface).
+ *
  * <p>Note: The maximum capacity {@link BytesRef} instance passed to {@link #add(BytesRef)} must not
  * be longer than {@link ByteBlockPool#BYTE_BLOCK_SIZE}-2. The internal storage is limited to 2GB
  * total byte storage.
  *
  * @lucene.internal
  */
-public final class BytesRefHash implements Accountable {
+public final class BytesRefHash implements Accountable, UnmodifiableBytesRefHash {

Review comment:
       this feels weird




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
rmuir commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-785265843


   I think for this issue, if we are not going to fix the bad consumer of the API, we should consider separate method (maybe deprecated) for it to use.
   
   The problem is this change doesn't just impact find() like the description states, but also other things like add() getting called by TermsHashPerField, a hot spot of indexer. I don't think we should rely on oracle's C2 EA to not create monster amounts of garbage, this is too important of a spot.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] madrob commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
madrob commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-785259427


   I think we agree with each other, Robert. I'll move further discussion back to the JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] madrob commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r582073943



##########
File path: lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
##########
@@ -267,6 +270,44 @@ public void testFind() throws Exception {
     }
   }
 
+  @Test
+  public void testConcurrentFind() throws Exception {
+    int num = atLeast(2);
+    for (int j = 0; j < num; j++) {
+      List<String> strings = new ArrayList<>();
+      for (int i = 0; i < 797; i++) {
+        final String str = TestUtil.randomRealisticUnicodeString(random(), 1, 1000);
+        hash.add(new BytesRef(str));
+        assertTrue(strings.add(str));
+      }
+
+      AtomicInteger miss = new AtomicInteger();
+      Thread[] threads = new Thread[10];
+      for (int i = 0; i < threads.length; i++) {
+        int loops = atLeast(100);
+        threads[i] =
+            new Thread("t" + i) {

Review comment:
       Better practice to pass a Runnable instead of overriding Thread.run directly.

##########
File path: lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
##########
@@ -267,6 +270,44 @@ public void testFind() throws Exception {
     }
   }
 
+  @Test
+  public void testConcurrentFind() throws Exception {
+    int num = atLeast(2);
+    for (int j = 0; j < num; j++) {
+      List<String> strings = new ArrayList<>();

Review comment:
       Can init to the proper size at construction time.

##########
File path: lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
##########
@@ -267,6 +270,44 @@ public void testFind() throws Exception {
     }
   }
 
+  @Test
+  public void testConcurrentFind() throws Exception {
+    int num = atLeast(2);
+    for (int j = 0; j < num; j++) {
+      List<String> strings = new ArrayList<>();
+      for (int i = 0; i < 797; i++) {
+        final String str = TestUtil.randomRealisticUnicodeString(random(), 1, 1000);
+        hash.add(new BytesRef(str));
+        assertTrue(strings.add(str));
+      }
+
+      AtomicInteger miss = new AtomicInteger();
+      Thread[] threads = new Thread[10];
+      for (int i = 0; i < threads.length; i++) {
+        int loops = atLeast(100);
+        threads[i] =
+            new Thread("t" + i) {
+              public void run() {

Review comment:
       I suspect you need a latch here or something so that the threads will all run concurrently, otherwise they are likely to start and run in series.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
pawel-bugalski-dynatrace commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-785282649


   > The problem is this change doesn't just impact find() like the description states, but also other things like add() getting called by TermsHashPerField, a hot spot of indexer. I don't think we should rely on oracle's C2 EA to not create monster amounts of garbage, this is too important of a spot.
   
   In that case we can easily implement `equals` without allocation.
   ```
   private boolean equals(int id, BytesRef b) {
     final int textStart = bytesStart[id];
     final byte[] bytes = pool.buffers[textStart >> BYTE_BLOCK_SHIFT];
     int pos = textStart & BYTE_BLOCK_MASK;
     final int length;
     final int offset;
     if ((bytes[pos] & 0x80) == 0) {
       // length is 1 byte
       length = bytes[pos];
       offset = pos + 1;
     } else {
       // length is 2 bytes
       length = (bytes[pos] & 0x7f) + ((bytes[pos + 1] & 0xff) << 7);
       offset = pos + 2;
     }
     return Arrays.equals(
         bytes,
         offset,
         offset + length,
         b.bytes,
         b.offset,
         b.offset + b.length);
   } 
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

Posted by GitBox <gi...@apache.org>.
rmuir commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790551382


   I still don't like the unmodifiable-interface. Sorry, I disagree with exposing thread-safe methods officially in the API for class that should only be used by one thread, just because one user of the class did it in a wrong way.
   
   It was my understanding that the problem is being solved this way because its "too hard" to fix lucene-monitor to instead do things correctly: I'll accept that we should do a "quick fix" to workaround its bugginess, but we should ultimately file JIRA issue to fix it (it should not use such a class with multiple threads).
   
   We shouldn't expose what we have done in public apis, it is just a temporary solution. If someone wants such a genpurpose hashtable they can use `HashMap` from their jdk, we aren't a hashtable library.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org