You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "lukas-vlcek (via GitHub)" <gi...@apache.org> on 2023/12/04 15:28:56 UTC

[PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

lukas-vlcek opened a new pull request, #12875:
URL: https://github.com/apache/lucene/pull/12875

   ### Description
   
   Incrementing position attribute for each token in both `PathHieararchyTokenizer` and `ReversePathHieararchyTokenizer`.
   This change makes it possible to use both tokenizers in `BaseTokenStreamTestCase.assertAnalyzesTo()` method (test cases extended to demonstrate this).
   
   This PR solves and surpasses https://github.com/apache/lucene/pull/12750.
   
   @msfroh I included your original commit from your Lucene fork to keep proper commit attribution.
   
   Shall we squash all commit?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand merged PR #12875:
URL: https://github.com/apache/lucene/pull/12875


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12875:
URL: https://github.com/apache/lucene/pull/12875#issuecomment-1860545292

   Thanks @lukas-vlcek -- looks great -- no need to squash, I'll take care during merge.  Exciting improvement, finally!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "lukas-vlcek (via GitHub)" <gi...@apache.org>.
lukas-vlcek commented on PR #12875:
URL: https://github.com/apache/lucene/pull/12875#issuecomment-1847241112

   @mikemccand Do you think you can give me some hint about?
   
   > (e.g. `UnifiedHighlighter`, in certain modes)
   
   I am looking at `TestUnifiedHighlighter*` tests. Does it mean that I need to use specific fieldType? Can I use any fieldType(s) from existing `UHTestHelper.parametersFactoryList()`? Such as `postingsType` and `postingsWithTvType`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12875:
URL: https://github.com/apache/lucene/pull/12875#issuecomment-1856089585

   > I am looking at `TestUnifiedHighlighter*` tests. Does it mean that I need to use specific fieldType? Can I use any fieldType(s) from existing `UHTestHelper.parametersFactoryList()`? Such as `postingsType` and `postingsWithTvType`?
   
   Hi @lukas-vlcek -- I don't think we need to add a specific unit test to `UnifiedHighlighter` tests for this.  I think it's fine to say something general in the `MIGRATE.txt` like "`(Reverse)PathHieararchyTokenizer` now produces sequential (instead of overlapping) tokens with accurate offsets, making positional queries and highlighters possible for fields tokenized with this tokenizer" or so?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "lukas-vlcek (via GitHub)" <gi...@apache.org>.
lukas-vlcek commented on PR #12875:
URL: https://github.com/apache/lucene/pull/12875#issuecomment-1845620998

   > Note that this should make highlighting based on postings offsets (e.g. UnifiedHighlighter, in certain modes) work on such fields when it does not today.
   
   Ture... so it sounds like we shall add a test for this as well if we want to claim it in the migration notes!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "lukas-vlcek (via GitHub)" <gi...@apache.org>.
lukas-vlcek commented on PR #12875:
URL: https://github.com/apache/lucene/pull/12875#issuecomment-1860204767

   @mikemccand Thanks for review. Done!
   Shall I squash all commits are will you do it when merging in GH?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12875:
URL: https://github.com/apache/lucene/pull/12875#discussion_r1418918221


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/path/PathHierarchyTokenizer.java:
##########
@@ -112,11 +115,8 @@ public PathHierarchyTokenizer(
   public final boolean incrementToken() throws IOException {
     clearAttributes();
     termAtt.append(resultToken);
-    if (resultToken.length() == 0) {
-      posAtt.setPositionIncrement(1);
-    } else {
-      posAtt.setPositionIncrement(0);
-    }
+    posIncAtt.setPositionIncrement(1);
+    posLenAtt.setPositionLength(1);

Review Comment:
   Note that `1` is already the default for `PositionLengthAttribute` so you could skip pulling/setting it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12875:
URL: https://github.com/apache/lucene/pull/12875#discussion_r1418918500


##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/path/ReversePathHierarchyTokenizer.java:
##########
@@ -158,10 +161,9 @@ public final boolean incrementToken() throws IOException {
         endPosition = delimiterPositions.get(idx);
       }
       finalOffset = correctOffset(length);
-      posAtt.setPositionIncrement(1);
-    } else {
-      posAtt.setPositionIncrement(0);
     }
+    posIncAtt.setPositionIncrement(1);
+    posLenAtt.setPositionLength(1);

Review Comment:
   Same here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12875:
URL: https://github.com/apache/lucene/pull/12875#issuecomment-1845294522

   Thanks for tackling this @lukas-vlcek and @msfroh!  I left a couple small comments, but otherwise it looks great.
   
   Given that this alters the indexed tokens (makes them non-overlapping), I think this should be a 10.0 only change?  It's highly unlikely any users are relying on how `PhraseQuery` or synonyms, etc. (any positional tokenfilters / queries) behave on fieleds analyzed with these tokenizers, but, still, could you add an entry in `MIGRATE.txt` explaining this change?  Note that this should make highlighting based on postings offsets (e.g. `UnifiedHighlighter`, in certain modes) work on such fields when it does not today.
   
   Also please add a `CHANGES.txt` entry under 10.0 as well.  Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org