Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/10/13 16:01:21 UTC

[GitHub] [lucene] spyk opened a new pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

spyk opened a new pull request #380:
URL: https://github.com/apache/lucene/pull/380


   # Description
   
   When a lemmas.txt dictionary file is provided, OpenNLPLemmatizerFilterFactory internally caches only the String representation of the dictionary, not the DictionaryLemmatizer object. As a result, a new DictionaryLemmatizer is parsed and created every time OpenNLPLemmatizerFilterFactory.create() is called.
   
   # Solution
   
   Switch the internal cache to hold the parsed DictionaryLemmatizer object instead of its String representation.
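   
   For illustration only, here is a rough sketch of the direction of the change (the method and map names follow the existing `OpenNLPOpsFactory` code, but the details below are illustrative rather than the exact patch):
   
   ```java
   // Sketch: cache the parsed DictionaryLemmatizer per dictionary file so that
   // repeated create() calls reuse it instead of re-parsing the dictionary.
   private static final Map<String, DictionaryLemmatizer> lemmaDictionaries =
       new ConcurrentHashMap<>();
   
   public static DictionaryLemmatizer getLemmatizerDictionary(
       String dictionaryFile, ResourceLoader loader) throws IOException {
     DictionaryLemmatizer lemmatizer = lemmaDictionaries.get(dictionaryFile);
     if (lemmatizer == null) {
       try (InputStream in = loader.openResource(dictionaryFile)) {
         // Parse once and cache the parsed object rather than the raw String.
         lemmatizer = new DictionaryLemmatizer(in);
       }
       lemmaDictionaries.put(dictionaryFile, lemmatizer);
     }
     return lemmatizer;
   }
   ```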
   
   # Tests
   
   No additional tests were added, but the existing ones pass.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request title.
   - [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `main` branch.
   - [X] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   




[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1060886236


   Yes, essentially. Perhaps something like this?:
   ```
   * LUCENE-10171: OpenNLPOpsFactory.getLemmatizerDictionary(String, ResourceLoader) now returns a
     DictionaryLemmatizer object instead of a raw String serialization of the dictionary.
     (Spyros Kapnissis via Michael Gibney, Alessandro Benedetti)
   ```
   I'd be willing to take care of adding it, if you'd prefer. The CHANGES.txt file changes often and gets conflicts which, although trivial, can be kind of annoying. At this point it's possible we may have missed the boat for 9.1, so that might also be an argument for me figuring out the right place to put the CHANGES.txt entry, if you're ok with that.




[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r751397674



##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       True, good catch! Now that you mention it though, I think the original implementation was already vulnerable to the same issue:
   ```java
   dictionaryInputStream = new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8))
   ```
   `dictionaryInputStream` is encoded as UTF-8, but then passed to the `DictionaryLemmatizer` ctor, which reads it using the system default charset. So in essence you have the same situation: in both cases it's assumed/required that the input file is UTF-8, but `DictionaryLemmatizer` parses it according to the system default charset.
   
   This could probably be addressed with something like:
   ```java
       InputStream rawIn = loader.openResource(dictionaryFile);
       InputStream in;
       if (Charset.defaultCharset() == StandardCharsets.UTF_8) {
         in = rawIn;
       } else {
         Reader r = new InputStreamReader(rawIn, StandardCharsets.UTF_8);
         in = new ReaderInputStream(r, Charset.defaultCharset());
       }
   ```
   ...though we would probably need to manually convert the `Reader` to a differently-encoded `InputStream`, since Apache commons-io is not (I think?) on the classpath.
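   
   For the sake of illustration, a manual version (just a sketch: it buffers the whole dictionary on the heap, which is exactly the kind of overhead we'd rather avoid, so treat it as a starting point only) might look like:
   ```java
   InputStream rawIn = loader.openResource(dictionaryFile);
   byte[] reEncoded;
   try (Reader r = new InputStreamReader(rawIn, StandardCharsets.UTF_8)) {
     StringBuilder sb = new StringBuilder();
     char[] buf = new char[4096];
     int n;
     while ((n = r.read(buf)) != -1) {
       sb.append(buf, 0, n);
     }
     // Re-encode in the charset DictionaryLemmatizer will actually use to decode.
     reEncoded = sb.toString().getBytes(Charset.defaultCharset());
   }
   InputStream in = new ByteArrayInputStream(reEncoded);
   ```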






[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1062061030


   Merged to `main` and backported to `branch_9x`; this should be available as of release 9.1.
   Thanks @spyk!




[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752448693



##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       Yes, sorry -- thanks for the clarification. What I mentioned was tangential, narrowly focused on (reading between the lines of) one minor aspect of the current implementation. I didn't mean to imply that the String/re-parsing _per se_ was the main issue. The hashmaps are surely more important, as you say.






[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-988928174


   Thanks for the nudge, @fmmoret.
   
   I think if we're introducing this change, we should really avoid [needlessly building and throwing away](https://github.com/apache/lucene/pull/380#discussion_r750515187) the stringified dictionary. @spyk is this something you'd be interested in pursuing (i.e., pushing a new commit to your PR branch)? Let me know if not and I'll try (or Alessandro, per his earlier comment?) to move it along.
   
   >Ideally, opennlp would have a DictionaryLemmatizer ctor that accepts a Reader directly -- I can't imagine that would be a controversial upstream PR?
   
   I don't think concerns over the default character encoding issue should hold things up. We're not making anything worse wrt the default encoding assumption. A simple `TODO` comment should suffice. I think we should circle back (I should be able to find the time for this if nobody else steps forward) to actually address such a `TODO` as a separate issue/PR, following something like the `InputStreamReader` approach I mentioned above (trusting someone will contradict me if they disagree with this proposed approach!).




[GitHub] [lucene] magibney merged pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney merged pull request #380:
URL: https://github.com/apache/lucene/pull/380


   




[GitHub] [lucene] fmmoret commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
fmmoret commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-988465154


   How far is this from making it into a release? We have been dealing with this exact issue and have considered a local patch.




[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1005559984


   @magibney @alessandrobenedetti could you please check and review the above? 
   I'm also wondering whether it could/should be backported to the 8.x versions; the same patch should work almost as is.
   Thanks!




[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-989601844


   Thanks @magibney! Yes, makes sense, I'll add a commit for the redundant parsing code removal and add a TODO comment as suggested.




[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1005887111


   Apologies for the delay, and thanks for bearing with me, @spyk. I'm inclined to err on the cautious side with this, since I'm not as familiar with this part of the codebase or the OpenNLP community. That said, this seems really straightforward to me, and clearly an improvement. (I considered, but decided against, suggesting to adopt `computeIfAbsent()` in place of `cached = map.get(...); if (cached==null) map.put(...)` ... the former is "newer" and semantically clearer, but the latter is more idiomatic to this part of the codebase).
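   
   (For the curious, a hypothetical side-by-side of the two idioms, with `loadLemmatizer` standing in for whatever actually builds and parses the object:)
   ```java
   // Idiom currently used in this part of the codebase: explicit get / null-check / put.
   DictionaryLemmatizer cached = lemmaDictionaries.get(dictionaryFile);
   if (cached == null) {
     cached = loadLemmatizer(dictionaryFile, loader); // hypothetical helper
     lemmaDictionaries.put(dictionaryFile, cached);
   }
   
   // The alternative I decided against suggesting:
   DictionaryLemmatizer viaCompute =
       lemmaDictionaries.computeIfAbsent(dictionaryFile, f -> loadLemmatizer(f, loader));
   ```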
   
   The only thing giving me pause now is that I notice we're changing the return type of a _public_ method. If there are third-party extensions that rely on the existing return type of this method, they will break. An easy fix, but still ... I'm additionally chastened to see that the issue introducing this code, [LUCENE-2899](https://issues.apache.org/jira/browse/LUCENE-2899), has 36 "votes" and 68 "watchers" (!) -- so the chances of this being a breaking change for some third-party extension are not insignificant (FWIW I'd be surprised if third parties actually called this method in practice, but I'm not sure I have the perspective to judge :slightly_smiling_face:)
   
   I dislike the idea of maintaining backward compatibility "just because", when this seems like such a clear improvement, and when I suspect that the `public` access for these static methods may not necessarily represent an explicit design choice (?); and with the 9.0 release still quite fresh, arguably now would not be the worst time to break backcompat (esp. in such a minor way). But I'm afraid I really would like another committer (ideally, @sarowe?) to weigh in on this. Thank you for your patience!




[GitHub] [lucene] alessandrobenedetti commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
alessandrobenedetti commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1011243486


   I agree. I am a supporter of the https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it philosophy, so if we are not breaking any tests in Lucene (and potentially in Solr, though that is no longer strictly necessary as they are separate projects now), I suggest we go with the change.
   Better to have an improvement now and an unlikely third party complaining in the future than something not working well now and an unlikely third party happy about backcompat :)




[GitHub] [lucene] epugh commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
epugh commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1060934405


   CHANGES.txt does change a lot ;-). Nice patch.




[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752359873



##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       Oh, you're right, it's the exact same issue with all `DictionaryLemmatizer` constructors. So I guess it can be replaced with `loader.openResource(dictionaryFile)` anyway, for the sake of simplicity. I was not aware of the `ReaderInputStream` solution, btw, but yes, it looks like it's not on the classpath.






[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1006905936


   Thank you @magibney for sharing your thoughts and also for taking the cautious approach! I think your concerns make sense; to be honest, I had not thought of the public API change (I had that particular class pegged as a helper class in my mind). But I don't have the complete picture either, since I'm coming mostly from a Solr-based usage perspective.
   
   I do think, however, that at some point we should at least communicate the drawbacks/dangers of using this particular implementation, as it can severely impact performance in unexpected ways. I'm not sure what the best way to do that would be, though: e.g. deprecate the API, or add an explicit warning in the documentation?
   
   Looking forward to other suggestions from other committers, and once more, thank you for all the help!




[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1006945362


   That definitely makes sense. Despite all my caution, my own inclination honestly is essentially to "just do it". For all the reasons you've identified and as we've discussed further, this _really_ seems like the right thing to do, and the odds of users having:
   1. written extensions to call this method, and
   2. already upgraded to 9.0 and washed their hands of the upgrade
   
   ... yeah, the odds seem pretty low of this being a practical problem.
   
   @alessandrobenedetti what do you think? Any further sense of who takes a special interest in this part of the code, and/or might be more keyed into the ways in which the community may have extended this (or not)?




[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r750515187



##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       If introducing this change (and it seems reasonable to me?) couldn't we just pass `loader.openResource(dictionaryFile)` as the arg to the `DictionaryLemmatizer` ctor? The only reason it was manually read into a String before was because the String _per se_ was being stored. With this change we're no longer doing that, so the String should be unnecessary.
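   
   i.e., something roughly like this (just a sketch; we'd still want to make sure the stream gets closed):
   ```java
   try (InputStream in = loader.openResource(dictionaryFile)) {
     dictionaryLemmatizer = new DictionaryLemmatizer(in);
   }
   ```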






[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752385177



##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       All `ReaderInputStream` (from Apache commons-io) does is re-encode a `Reader` as an input-stream in some specified encoding. It would be relatively straightforward to do this manually, but again is orthogonal to the change you're looking to introduce. Ideally, opennlp would have a `DictionaryLemmatizer` ctor that accepts a `Reader` directly -- I can't imagine that would be a controversial upstream PR?
   
   >for the sake of simplicity
   
   Yes; I think in its current state this PR would make things better in some respects, even if leaving some problems unchanged.
   
   fwiw, I'm not necessarily inclined to assume a logical motivation in the current recreation-from-String of `DictionaryLemmatizer` for every call to `.create()`. Even if there _had_ been concerns for thread safety or whatever, that wouldn't explain why the dictionary would be stored as a `String` as opposed to a `byte[]`, given that as a consequence every invocation of `.create()` superfluously re-encodes the String as UTF-8 :-/
   
   Ultimately it looks like the heap requirement here is currently ~3x what it could be, given the on-heap String, encoded to a transient monolithic `byte[]`, parsed into a `DictionaryLemmatizer` ... 






[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1059723009


   Thank you @magibney for getting back on this!
   
   > @spyk thanks for your patience, and for keeping the PR up-to-date. If you can add a CHANGES.txt entry, that's probably warranted here (if only because of the change to the public API).
   
   I'm not sure of the exact process for adding an entry to CHANGES.txt. Should I add an entry with the JIRA ticket and the public API changes under Lucene 9.1.0?




[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-970573111


   I'm definitely not the ideal reviewer for this either, but fwiw: the only potential issue I can see here is the sharing of this component (and ultimately the underlying opennlp `DictionaryLemmatizer`) across multiple threads. IIUC, the initial approach (repeatedly recreating a new `DictionaryLemmatizer` from the raw dictionary string for each call to `FilterFactory.create()`) dodges that issue, because TokenStreams only support single-threaded access.
   
   That said, nothing in the underlying implementation and expected use of `DictionaryLemmatizer` looks like it would be an issue in practice for multi-threaded access. The only thing that seems a little weird to me, even for a single-threaded case, is that `DictionaryLemmatizer` [directly exposes](https://github.com/apache/opennlp/blob/c88f57814c0af0dccf471b895a35981ecdac2e7a/opennlp-tools/src/main/java/opennlp/tools/lemmatizer/DictionaryLemmatizer.java#L84-L86) its internal backing HashMap via a public getter, so it could in principle be modified by any caller. But it doesn't look like DictionaryLemmatizer leaks from `NLPLemmatizerOp`, so I guess that's ok?




[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r751061365



##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       Thank you @magibney, that's a great point. One concern, however, is that `DictionaryLemmatizer` does not specify the UTF-8 encoding when reading the InputStream, but uses the platform default instead, so couldn't that lead to encoding errors?






[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752443863



##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       Just to clarify: the main issue with the String being cached instead of the DictionaryLemmatizer (aside from any thread-safety concerns) is that a new DictionaryLemmatizer, with its own HashMap, is parsed and created each time `create()` is called. So, in our case, with a 5MB lemmas.txt file used across several fields, it crashed a 64GB cluster with an OOM; the heap contained several hundred DictionaryLemmatizer instances, each with its own ~80MB internal HashMap.






[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1050227228


   This patch applies cleanly and all tests pass. I plan to commit this within the next few days, because I think it does improve things (targeting the 9.1 release).
   
   But I want to go on the record mentioning that there are a number of things I've noticed in the process of looking at this code (completely orthogonal to this PR) that make me a bit uncomfortable:
   
   1. All the objects retrieved through OpenNLPOpsFactory are cached in [static maps](https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java#L41-L48) that are [never cleared](https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java#L191-L199).
   2. These maps are populated in a way where there's a race condition (could end up with multiple copies of these objects being held external to this class).
   
   wrt this PR in particular, I'm nervous about the fact that everything _but_ the DictionaryLemmatizers has long been cached as objects, which makes me wonder whether there was a reason that, from this class's inception, the dictionaryLemmatizers have been cached as String versions of the raw dictionary. But I've tried to chase down this line of reasoning, and still can't find any obvious problem with this change.
   
   Analogous to not "letting the perfect be the enemy of the good", I'm going to not "let the 'not-so-good' be the enemy of the 'probably worse'". I hope this is the correct decision.
   
   @spyk thanks for your patience, and for keeping the PR up-to-date. If you can add a CHANGES.txt entry, that's probably warranted here (if only because of the change to the public API).




[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Posted by GitBox <gi...@apache.org>.
spyk commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1061732060


   > I'd be willing to take care of adding it, if you'd prefer. The CHANGES.txt file changes often and gets conflicts which, although trivial, can be kind of annoying. At this point it's possible we may have missed the boat for 9.1, so that might also be an argument for me figuring out the right place to put the CHANGES.txt entry, if you're ok with that.
   
   Thanks @magibney, yes please, that'd be great if you could. I'd hate to take up your valuable time while I figure the whole process out :) But please let me know if you need anything from me.

