You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "dantuzi (via GitHub)" <gi...@apache.org> on 2023/05/24 13:36:26 UTC

[GitHub] [lucene] dantuzi opened a new pull request, #12246: Set word2vec getSynonyms method thread-safe

dantuzi opened a new pull request, #12246:
URL: https://github.com/apache/lucene/pull/12246

   ### Description
   This is a quick mitigation to avoid unexpected behavior if multiple queries are executed concurrently.
   
   This commit wants to address some comments received in PR [12169](https://github.com/apache/lucene/pull/12169).
   
   As @jimczi pointed out, the `OnHeapHnswGraph` is not thread-safe so we can avoid concurrent access to that object synchronizing the requests. Indeed, the graph has been reset by the searcher that calls the method `seek` [before each search](https://github.com/apache/lucene/blob/776149f0f6964bbc72ad2d292d1bfe770f82ba45/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L255).
   Talking about performance, we will attach the result of some benchmarks we executed.
   
   @zhaih noticed that the number of nodes that can be visited during the HnswGraph search was limited so I increased that limit to the max integer. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jimczi commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "jimczi (via GitHub)" <gi...@apache.org>.
jimczi commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1179707991


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -64,7 +65,15 @@ public Word2VecSynonymProvider(Word2VecModel model) throws IOException {
     this.hnswGraph = builder.build(word2VecModel.copy());
   }
 
-  public List<TermAndBoost> getSynonyms(
+  /**
+   * Returns the list of synonyms of a provided term. This method is synchronized because it uses
+   * the {@link org.apache.lucene.util.hnsw.OnHeapHnswGraph} that is not thread-safe.
+   *
+   * @param term term to search to find synonyms
+   * @param maxSynonymsPerTerm limit of synonyms returned
+   * @param minAcceptedSimilarity lower similarity threshold to consider another term as synonym
+   */
+  public synchronized List<TermAndBoost> getSynonyms(

Review Comment:
   My concern was more that the feature is exposed in the analysis-common package and considering the current limitations it doesn't feel common to me. I'd be less pedantic, if that's the good term, if it was added in the sandbox as an experimentation where we can iterate more freely. In any case don't consider my comment as a veto to merge, please proceed with what you think is best as long as the issue around multi-threading is solved. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "zhaih (via GitHub)" <gi...@apache.org>.
zhaih commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1181173636


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -64,7 +65,15 @@ public Word2VecSynonymProvider(Word2VecModel model) throws IOException {
     this.hnswGraph = builder.build(word2VecModel.copy());
   }
 
-  public List<TermAndBoost> getSynonyms(
+  /**
+   * Returns the list of synonyms of a provided term. This method is synchronized because it uses
+   * the {@link org.apache.lucene.util.hnsw.OnHeapHnswGraph} that is not thread-safe.
+   *
+   * @param term term to search to find synonyms
+   * @param maxSynonymsPerTerm limit of synonyms returned
+   * @param minAcceptedSimilarity lower similarity threshold to consider another term as synonym
+   */
+  public synchronized List<TermAndBoost> getSynonyms(

Review Comment:
   > Maybe adding this synchronized block on the graph, may do the trick?
   
   I don't think this will work.. since the part that the onHeapHnswGraph violates the thread-safety is that it uses internal `cur` and `upto` to remember the seek state ([here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L126)). So even add an sync block inside won't really help, because one thread can still set `cur` to some node while the other set it to another value after the first thread exit the sync block.
   
   But if I didn't miss anything, I think make the graph thread safe to search is not hard, the current thread-unsafe code is:
   ```
     @Override
     public void seek(int level, int targetNode) {
       cur = getNeighbors(level, targetNode);
       upto = -1;
     }
   
     @Override
     public int nextNeighbor() {
       if (++upto < cur.size()) {
         return cur.node[upto];
       }
       return NO_MORE_DOCS;
     }
   ```
   What we need to do probably is just simply not using those two member methods of the graph but mimic the behavior in the search process. (Basically using some external local variable instead of `cur` and `upto`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "zhaih (via GitHub)" <gi...@apache.org>.
zhaih commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1199956128


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -42,6 +42,7 @@ public class Word2VecSynonymProvider {
   private static final VectorSimilarityFunction SIMILARITY_FUNCTION =
       VectorSimilarityFunction.DOT_PRODUCT;
   private static final VectorEncoding VECTOR_ENCODING = VectorEncoding.FLOAT32;
+  private static final int NO_LIMIT_ON_VISITED_NODES = Integer.MAX_VALUE;
   private final Word2VecModel word2VecModel;
   private final HnswGraph hnswGraph;

Review Comment:
   I merged https://github.com/apache/lucene/pull/12257, but I think we still need to make a change here to declare it as `OnHeapHnswGraph` to make it work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jimczi commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "jimczi (via GitHub)" <gi...@apache.org>.
jimczi commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1179349185


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -64,7 +65,15 @@ public Word2VecSynonymProvider(Word2VecModel model) throws IOException {
     this.hnswGraph = builder.build(word2VecModel.copy());
   }
 
-  public List<TermAndBoost> getSynonyms(
+  /**
+   * Returns the list of synonyms of a provided term. This method is synchronized because it uses
+   * the {@link org.apache.lucene.util.hnsw.OnHeapHnswGraph} that is not thread-safe.
+   *
+   * @param term term to search to find synonyms
+   * @param maxSynonymsPerTerm limit of synonyms returned
+   * @param minAcceptedSimilarity lower similarity threshold to consider another term as synonym
+   */
+  public synchronized List<TermAndBoost> getSynonyms(

Review Comment:
   Sure that works but I wouldn't call that a solution to be honest. I am again raising the usability here. Currently this filter has a very high loading cost and this change would add thread contentions on a non-trivial task (searching the graph). As I said in the previous PR, it would be preferable to add an offline path that builds a static graph. It's also possible to generate the synonym rules offline by performing the knn search on the entire vocabulary. The advantage of such approach is that the existing synonym filter could be reuse. 
   I don't see how the current approach would be usable in a real world scenario.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "zhaih (via GitHub)" <gi...@apache.org>.
zhaih commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1194674783


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -85,7 +86,7 @@ public List<TermAndBoost> getSynonyms(
               SIMILARITY_FUNCTION,
               hnswGraph,
               null,
-              word2VecModel.size());
+              NO_LIMIT_ON_VISITED_NODES);

Review Comment:
   FYI I applied the same change to https://github.com/apache/lucene/pull/12235 (to unblock that PR) without having this as a static final, you can still make it static here if you want tho :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "alessandrobenedetti (via GitHub)" <gi...@apache.org>.
alessandrobenedetti commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1194809968


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -85,7 +86,7 @@ public List<TermAndBoost> getSynonyms(
               SIMILARITY_FUNCTION,
               hnswGraph,
               null,
-              word2VecModel.size());
+              NO_LIMIT_ON_VISITED_NODES);

Review Comment:
   Thanks @zhaih ! I would suggest you close this PR as soon as the other two are merged.
   
   The one about concurrent problems, to be honest I would not wait further and proceed merging, for https://github.com/apache/lucene/pull/12235 I am not sure if you need some additional review, let me know!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dantuzi commented on pull request #12246: Set word2vec getSynonyms method thread-safe

Posted by "dantuzi (via GitHub)" <gi...@apache.org>.
dantuzi commented on PR #12246:
URL: https://github.com/apache/lucene/pull/12246#issuecomment-1561173810

   @zhaih thank you for your contribution. I updated this PR changing the `HnswGraph` type to `OnHeapHnswGraph` and I verified that the runtime support uses the correct `HnswGraphSearcher.search` method


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] alessandrobenedetti merged pull request #12246: Set word2vec getSynonyms method thread-safe

Posted by "alessandrobenedetti (via GitHub)" <gi...@apache.org>.
alessandrobenedetti merged PR #12246:
URL: https://github.com/apache/lucene/pull/12246


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "zhaih (via GitHub)" <gi...@apache.org>.
zhaih commented on PR #12246:
URL: https://github.com/apache/lucene/pull/12246#issuecomment-1529144023

   I opened a PR (https://github.com/apache/lucene/pull/12257) of  thread-safe searcher that should address the problem here I believe


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dantuzi closed pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "dantuzi (via GitHub)" <gi...@apache.org>.
dantuzi closed pull request #12246: Set word2vec getSynonyms method synchronized
URL: https://github.com/apache/lucene/pull/12246


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "alessandrobenedetti (via GitHub)" <gi...@apache.org>.
alessandrobenedetti commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1180201182


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -64,7 +65,15 @@ public Word2VecSynonymProvider(Word2VecModel model) throws IOException {
     this.hnswGraph = builder.build(word2VecModel.copy());
   }
 
-  public List<TermAndBoost> getSynonyms(
+  /**
+   * Returns the list of synonyms of a provided term. This method is synchronized because it uses
+   * the {@link org.apache.lucene.util.hnsw.OnHeapHnswGraph} that is not thread-safe.
+   *
+   * @param term term to search to find synonyms
+   * @param maxSynonymsPerTerm limit of synonyms returned
+   * @param minAcceptedSimilarity lower similarity threshold to consider another term as synonym
+   */
+  public synchronized List<TermAndBoost> getSynonyms(

Review Comment:
   So, I spent some time this morning trying to think about a more elegant way to communicate that currently, the OnHeap graph is not thread-safe.
   The latest commit is just tentative, to discuss with you, if it's more complicated than expected I can just revert it and leave the mitigation in the Word2Vec stuff.
   
   My main intent is to automatically get a benefit for the Word2Vec token filter if anyone in the future contributes a thread-safe version of the graph.
   If we put the synchronized in the word2vec and someone fixes the thread-safety problem, we may never realize it, and word2bec remain unnecessarily slower than it should be.
   
   Maybe adding this synchronized block on the graph, may do the trick?
   So if someone implements thread safety, then he/she changes that block?
   I know synchronized may slow down a bit the things, so to be honest I am looking for ideas :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "zhaih (via GitHub)" <gi...@apache.org>.
zhaih commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1194674783


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -85,7 +86,7 @@ public List<TermAndBoost> getSynonyms(
               SIMILARITY_FUNCTION,
               hnswGraph,
               null,
-              word2VecModel.size());
+              NO_LIMIT_ON_VISITED_NODES);

Review Comment:
   FYI I applied the same change to https://github.com/apache/lucene/pull/12235 (to unblock that change) without having this as a static final, you can still make it static here if you want tho :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12246: Set word2vec getSynonyms method synchronized

Posted by "alessandrobenedetti (via GitHub)" <gi...@apache.org>.
alessandrobenedetti commented on code in PR #12246:
URL: https://github.com/apache/lucene/pull/12246#discussion_r1179445835


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##########
@@ -64,7 +65,15 @@ public Word2VecSynonymProvider(Word2VecModel model) throws IOException {
     this.hnswGraph = builder.build(word2VecModel.copy());
   }
 
-  public List<TermAndBoost> getSynonyms(
+  /**
+   * Returns the list of synonyms of a provided term. This method is synchronized because it uses
+   * the {@link org.apache.lucene.util.hnsw.OnHeapHnswGraph} that is not thread-safe.
+   *
+   * @param term term to search to find synonyms
+   * @param maxSynonymsPerTerm limit of synonyms returned
+   * @param minAcceptedSimilarity lower similarity threshold to consider another term as synonym
+   */
+  public synchronized List<TermAndBoost> getSynonyms(

Review Comment:
   Hi @jimczi, as someone used to say "Perfect is the enemy of good" :)
   I agree with you, there's definitely space for improvement here, but I believe it is a nice feature that someone may consider useful.
   
   If it is too taxing in terms of resources, we can leave users to decide, mark it as experimental, and let people try and gather more insights.
   Not all users may be interested in million-sized dictionaries.
   In this way, we can gather users, insights, and potential new contributions on the matter.
   We have various benchmarks we collected back in the days, @dantuzi can provide them and we'll post them, as far as I remember they looked decent.
   
   We spent a considerable amount of time(=money) in my company to just donate this feature to the community, and keep maintaining it on our own with all the main branch divergence is not sustainable anymore.
   
   With this mitigation I believe is a good enough first step that may get attention, new contributors and potentially new fundings to address all your concern.
   
   On the other hand, if you believe this can harm the project in any way, I am happy to revert, but I can't guarantee we'll have any time soon to change it and contribute it back :)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org