You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/02/24 09:27:51 UTC

[GitHub] [lucene-solr] bruno-roustant opened a new pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

bruno-roustant opened a new pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383289307
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
 ##########
 @@ -54,18 +56,20 @@
   private final boolean finite;
   // array of sorted transitions for each state, indexed by state number
   private final Automaton automaton;
-  // for path tracking: each long records gen when we last
+  // for path tracking: each short records gen when we last
   // visited the state; we use gens to avoid having to clear
-  private final long[] visited;
 
 Review comment:
   visited-state-tracking is only needed when the automaton accepts an infinite language. We use it for loop detection. I think before we get too fancy with how we clear it, we should first stop being stupid about it?
   
   So it is wasteful that we do this stuff when `finite == true` (example: fuzzy query) because we will never even look for a loop. its just that the current code unconditionally records states that it visited.
   
   I think first, in the ctor when `finite == true`, `visited[]` can be initialized to `null` or `new long[0]` or something, and we change this line:
   ```
   visited[state] = curGen;
   ```
   to something like this:
   ```
   if (!finite)
     visited[state] = curGen;
   ```
   
   I agree we should separately avoid tracking 64 bits per state when only 1 is needed. But before optimizing the storage, first lets avoid doing this stuff at all for ones like complex fuzzy queries?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383282576
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
 ##########
 @@ -188,7 +188,11 @@ private boolean nextString() {
     savedStates.setIntAt(0, 0);
     
     while (true) {
-      curGen++;
+      if (++curGen == 0) {
+        // Clear the visited states every time curGen overflows (so very infrequently to not impact average perf).
+        curGen++;
 
 Review comment:
   Can we remove this unnecessary increment. Also i'd change the comment from `overflows` to `wraps`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383260144
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
 ##########
 @@ -160,17 +161,18 @@ private void setLinear(int position) {
     if (maxInterval != 0xff)
       maxInterval++;
     int length = position + 1; /* position + maxTransition */
-    if (linearUpperBound.bytes.length < length)
-      linearUpperBound.bytes = new byte[length];
+    if (linearUpperBound == null) {
+      linearUpperBound = new BytesRef(ArrayUtil.oversize(Math.max(length, 16), Byte.BYTES));
+    } else if (linearUpperBound.bytes.length < length) {
+      linearUpperBound.bytes = new byte[ArrayUtil.oversize(length, Byte.BYTES)];
 
 Review comment:
   +1 thanks

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant closed pull request #1281: LUCENE-9245: Reduce AutomatonTermsEnum memory usage.

Posted by GitBox <gi...@apache.org>.
bruno-roustant closed pull request #1281: LUCENE-9245: Reduce AutomatonTermsEnum memory usage.
URL: https://github.com/apache/lucene-solr/pull/1281
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383259436
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
 ##########
 @@ -160,17 +161,18 @@ private void setLinear(int position) {
     if (maxInterval != 0xff)
       maxInterval++;
     int length = position + 1; /* position + maxTransition */
-    if (linearUpperBound.bytes.length < length)
-      linearUpperBound.bytes = new byte[length];
+    if (linearUpperBound == null) {
+      linearUpperBound = new BytesRef(ArrayUtil.oversize(Math.max(length, 16), Byte.BYTES));
+    } else if (linearUpperBound.bytes.length < length) {
+      linearUpperBound.bytes = new byte[ArrayUtil.oversize(length, Byte.BYTES)];
 
 Review comment:
   I don't think we should have the additional null check path here. It is not worth it to save 10 bytes :). Let's make linearUpperBound final again. Better to initialize it to `new BytesRef()` if you really want to save 10 bytes for the case that its not used, does not require additional branches in the code, it will just get extended by the length check.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383154930
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
 ##########
 @@ -54,18 +56,20 @@
   private final boolean finite;
   // array of sorted transitions for each state, indexed by state number
   private final Automaton automaton;
-  // for path tracking: each long records gen when we last
+  // for path tracking: each short records gen when we last
   // visited the state; we use gens to avoid having to clear
-  private final long[] visited;
 
 Review comment:
   Main change is replacing the long[] by a short[] to reduce memory usage (the size is the number of automaton states). I also tried a FixedBitSet but it impacts perf.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383154138
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
 ##########
 @@ -1091,25 +1091,33 @@ public static String getCommonPrefix(Automaton a) {
    * @return common prefix, which can be an empty (length 0) BytesRef (never null)
    */
   public static BytesRef getCommonPrefixBytesRef(Automaton a) {
 
 Review comment:
   Lazy create structures (WildcardQuery may not have a first single-path state), and only one visited.add() call. This speeds up mainly WildcardQuery (a couple % in my luceneutil measures).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383351481
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
 ##########
 @@ -54,18 +56,20 @@
   private final boolean finite;
   // array of sorted transitions for each state, indexed by state number
   private final Automaton automaton;
-  // for path tracking: each long records gen when we last
+  // for path tracking: each short records gen when we last
   // visited the state; we use gens to avoid having to clear
-  private final long[] visited;
 
 Review comment:
   Good catch.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383377648
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
 ##########
 @@ -1091,25 +1091,33 @@ public static String getCommonPrefix(Automaton a) {
    * @return common prefix, which can be an empty (length 0) BytesRef (never null)
    */
   public static BytesRef getCommonPrefixBytesRef(Automaton a) {
 
 Review comment:
   Ok, removed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #1281: LUCENE-9245: Optimize AutomatonTermsEnum memory and automaton Operations.getCommonPrefixBytesRef.
URL: https://github.com/apache/lucene-solr/pull/1281#discussion_r383312415
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
 ##########
 @@ -1091,25 +1091,33 @@ public static String getCommonPrefix(Automaton a) {
    * @return common prefix, which can be an empty (length 0) BytesRef (never null)
    */
   public static BytesRef getCommonPrefixBytesRef(Automaton a) {
 
 Review comment:
   I don't think this is the right tradeoff. It makes the code more complex, saving the cost of creating a few simple ordinary objects. I hate to say i don't trust your benchmark, but I don't trust your benchmark represents a typical case here. We should keep this code simple, there are other ways it can be improved.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org