You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/06/14 07:59:30 UTC

[GitHub] [lucene] zacharymorn opened a new pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

zacharymorn opened a new pull request #180:
URL: https://github.com/apache/lucene/pull/180


   *This is a WIP PR for discussion only*
   
   # Description
   This PR is to try out the idea proposed in https://github.com/apache/lucene/pull/137#issuecomment-840111367, and is for discussion only / in WIP state.
   
   The commit that contains the most meaningful changes is https://github.com/apache/lucene/commit/5062e4d69938f104b461004022e19c10a65960a5
   
   # Solution
   Add non thread local based API for term vector reader usage.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r654154850



##########
File path: lucene/core/src/java/org/apache/lucene/index/TermVectors.java
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.Closeable;
+import java.io.IOException;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+
+/** Index API to access TermVectors */
+public abstract class TermVectors implements Cloneable, Closeable {

Review comment:
       Ok I was a bit unsure on these earlier as well. 
   
   For `Cloneable`, I feel it may still be needed / recommended to have on `TermVectorsReader`, if we would like to invoke `core.termVectorsReader.clone()` directly from inside `SegmentReader#getTermVectorsReader`? 
   
   For `Closeable`, I added it here originally since I thought we may need to close the resources used by the cloned readers (such as `vectorsStream`) from this new API. However, after reading your comment and more digging, I also saw this java doc in `IndexInput`:
   https://github.com/apache/lucene/blob/065026b74e60301e09f409e67d41ff42ebc36bb2/lucene/core/src/java/org/apache/lucene/store/IndexInput.java#L32-L33 . So I guess closing here is not actually needed. I've moved back the `Closeable` interface to `TermVectorsReader`.

##########
File path: lucene/core/src/java/org/apache/lucene/index/FilterCodecReader.java
##########
@@ -152,6 +152,11 @@ public Bits getLiveDocs() {
         return liveDocs;
       }
 
+      @Override
+      public TermVectorsReader getTermVectorsReader() {
+        return null;

Review comment:
       Ops sorry I missed this one. This is redundant as the forwarding / delegation was already handled in `FilterCodecReader` class above. I've removed it.

##########
File path: lucene/core/src/java/org/apache/lucene/index/SegmentReader.java
##########
@@ -310,6 +304,12 @@ public Directory directory() {
 
   private final Set<ClosedListener> readerClosedListeners = new CopyOnWriteArraySet<>();
 
+  @Override
+  public TermVectorsReader getTermVectorsReader() {
+    ensureOpen();
+    return core.termVectorsReader;

Review comment:
       Done. 

##########
File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
##########
@@ -615,11 +607,10 @@ protected FieldInfo getFieldInfo(String field) {
           if (content == null) {
             continue;
           }
-          IndexReader indexReader =
-              (fieldHighlighter.getOffsetSource() == OffsetSource.TERM_VECTORS
-                      && indexReaderWithTermVecCache != null)
-                  ? indexReaderWithTermVecCache
-                  : searcher.getIndexReader();
+          // nocommit TermVectorReusingLeafReader is no longer valid given

Review comment:
       Thanks for the detailed information and suggestion. I have done some searches, and it looks like these are two places that need to be changed:
   
   https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/PostingsWithTermVectorsOffsetStrategy.java#L38
   
   https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/TermVectorOffsetStrategy.java#L43
   
   I'll work on adding this caching back via the new API.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-864358727


   I have fixed a few more things and think this PR should be in good shape now. Please let me know if it looks ready. 
   
   Note that I haven't added any new tests, as the existing ones seems to be providing good coverage for the changes already. Please also let me know if more tests are needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dsmiley commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-873774406


   This PR introduces a "TermVectors" class.  Shouldn't the renaming happen _before_ this PR so that this PR can be updated to use the name "TermVectorsReader" instead?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-861193051


   Thanks @jpountz for the review and suggestions! I have implemented the `TermVectors` idea in https://github.com/apache/lucene/pull/180/commits/0f8b0f49faab675b735c2c2f9713acdf17cfd7be and https://github.com/apache/lucene/pull/180/commits/c1cb5eb0753b4138ef0e4805db05510fae465ac0. 
   
   I noticed that upgrading the return type of the new method in `CodecReader` would fail a test case `TestFilterCodecReader#implTestDeclaredMethodsOverridden`, specifically at https://github.com/apache/lucene/blob/50607e0fb9090a1321093aca4117b6560c511af0/lucene/core/src/test/org/apache/lucene/index/TestFilterCodecReader.java#L59-L62. The reason being that the call `superClass.getDeclaredMethods` in that test would return both methods with different return types, so one of them would fail the equality check with the current method's return type. I think this issue is mostly about the test implementation though as it doesn't support Java's method return type change during inheritance, and I can look for a work around next.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-863775298






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-873862517


   > This PR introduces a "TermVectors" class. Shouldn't the renaming happen _before_ this PR so that this PR can be updated to use the name "TermVectorsReader" instead?
   
   Oh sorry @dsmiley I misread your suggestion earlier! I've opened a new PR for the renaming here https://github.com/apache/lucene/pull/205


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-877676142


   > Yes; javadocs will need to warn people. This is _already_ a trap, not a new one. The method `org.apache.lucene.index.IndexReader#getTermVector` is tempting but can be bad for performance unless you only ever need one field's TV Terms.
   
   Makes sense. Thanks for the approval! I'll wait for a few more days before merging, in case other folks may have further feedback on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-877530400


   > I think my preference is for you to commit this PR as-is after all. It introduces a new TermVectors interface (not great usability) but I think in LUCENE-10018 it would be reasonable to have TermVectors abstraction change to _be_ what Fields is today for the TV use-case, with some modifications. TV would look like:
   > 
   > ```
   > interface TermVectors {
   >   Collection<String> termVectorFields(int docId) throws IOException;
   >   Terms termVectorTerms(int docId, String fieldName) throws IOException;
   > }
   > ```
   > 
   > Since TermVectorsReader will be a thread-safe clone of the original (unlike 8x), it's then reasonable for it to cache the current TV info for the most recently accessed doc. So if you get the Terms for all fields for the current doc, it won't have to re-read anything. WDYT?
   
   I think this is a great idea! I like that it removes the need of having an extra class, and also mirrors `Fields` 's APIs and inheritance relationship with `FieldsProducer` for TV usage. In terms of avoiding re-read, in addition to caching, I guess we can also add some java doc to recommend calling `termVectorTerms` immediately after `termVectorFields` for the same doc id before jumping to the next, so that users / applications won't accidentally call them in the wrong way and encounter performance issue?  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-873345151


   Hi @jpountz @rmuir, just want to check back to see if the current changes look good to you? I plan to create another PR to do the `TermVectorsReader` renaming as per https://github.com/apache/lucene/pull/180#issuecomment-871922823 after this PR is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dsmiley commented on a change in pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r655067190



##########
File path: lucene/core/src/java/org/apache/lucene/index/TermVectors.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+
+/** Index API to access TermVectors */
+public abstract class TermVectors {

Review comment:
       I agree with Rob's point of view.  In my mind it's a matter of taste.  Given that Java only lets you have one inheritor, I think of that particular inheritor as forming an important part of an identify of new class choosing to subclass it.  It's like implicit documentation to consider what a class subclasses as being more important than the interfaces it implements.  If someone is going to write a TermVectors impl, then surely that's an important part of its identity.  Perhaps it might implement Closeable or other interfaces but those won't be as important to understanding what this class is/does, not as much as it being a TermVectors subclass.  Interfaces are good for generic abstractions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r654350144



##########
File path: lucene/core/src/java/org/apache/lucene/index/TermVectors.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+
+/** Index API to access TermVectors */
+public abstract class TermVectors {

Review comment:
       -1 to make this interface. Please don't make interfaces just because you can. That's bad API design. Make interfaces only when you *must*




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn merged pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn merged pull request #180:
URL: https://github.com/apache/lucene/pull/180


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r650714447



##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -118,6 +121,65 @@ public final Fields getTermVectors(int docID) throws IOException {
     return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
   }
 
+  private class CompositeTermVectorsReader extends TermVectorsReader {
+    List<TermVectorsReader> termVectorsReaders;
+
+    public CompositeTermVectorsReader(List<TermVectorsReader> termVectorsReaders) {
+      this.termVectorsReaders = termVectorsReaders;
+    }
+
+    @Override
+    public Fields get(int doc) throws IOException {
+      ensureOpen();
+      final int i = readerIndex(doc); // find subreader num
+      return termVectorsReaders.get(i).get(doc - starts[i]); // dispatch to subreader
+    }
+
+    @Override
+    public void checkIntegrity() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.checkIntegrity();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });
+    }
+
+    @Override
+    public TermVectorsReader clone() {
+      List<TermVectorsReader> newTermVectorReaders =
+          termVectorsReaders.stream().map(r -> r.clone()).collect(Collectors.toList());

Review comment:
       nit: I like method refs better when applicable
   
   ```suggestion
             termVectorsReaders.stream().map(TermVectorsReader::clone).collect(Collectors.toList());
   ```

##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -118,6 +121,65 @@ public final Fields getTermVectors(int docID) throws IOException {
     return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
   }
 
+  private class CompositeTermVectorsReader extends TermVectorsReader {
+    List<TermVectorsReader> termVectorsReaders;
+
+    public CompositeTermVectorsReader(List<TermVectorsReader> termVectorsReaders) {
+      this.termVectorsReaders = termVectorsReaders;
+    }
+
+    @Override
+    public Fields get(int doc) throws IOException {
+      ensureOpen();
+      final int i = readerIndex(doc); // find subreader num
+      return termVectorsReaders.get(i).get(doc - starts[i]); // dispatch to subreader
+    }
+
+    @Override
+    public void checkIntegrity() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.checkIntegrity();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });

Review comment:
       nit: a `for` loop would be simpler than streams here? :)

##########
File path: lucene/core/src/java/org/apache/lucene/index/IndexReader.java
##########
@@ -310,6 +311,9 @@ public final int hashCode() {
    */
   public abstract Fields getTermVectors(int docID) throws IOException;
 
+  /** Get TermVectorsReader from this index. */
+  public abstract TermVectorsReader getTermVectorsReaderNonThreadLocal();

Review comment:
       We shouldn't expose `TermVectorsReader` directly but instead create a new class that has a `public Fields get(int doc) throws IOException;` method like `TermVectorsReader` but none of the `clone`/`getMergeInstance`/`checkIntegrity` logic, which only belongs to codec APIs.
   
   One way to do this would be to create a class e.g. called `TermVectors` that only has `public Fields get(int doc) throws IOException;`, use this class in `IndexReader`, make `TermVectorsReader` extend `TermVectors`, and then upgrade the return type of `getTermVectorsReaderNonThreadLocal` to `TermVectorsReader` in `CodecReader`.
   
   

##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -118,6 +121,65 @@ public final Fields getTermVectors(int docID) throws IOException {
     return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
   }
 
+  private class CompositeTermVectorsReader extends TermVectorsReader {
+    List<TermVectorsReader> termVectorsReaders;
+
+    public CompositeTermVectorsReader(List<TermVectorsReader> termVectorsReaders) {
+      this.termVectorsReaders = termVectorsReaders;
+    }
+
+    @Override
+    public Fields get(int doc) throws IOException {
+      ensureOpen();
+      final int i = readerIndex(doc); // find subreader num
+      return termVectorsReaders.get(i).get(doc - starts[i]); // dispatch to subreader
+    }
+
+    @Override
+    public void checkIntegrity() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.checkIntegrity();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });
+    }
+
+    @Override
+    public TermVectorsReader clone() {
+      List<TermVectorsReader> newTermVectorReaders =
+          termVectorsReaders.stream().map(r -> r.clone()).collect(Collectors.toList());
+
+      return new CompositeTermVectorsReader(newTermVectorReaders);
+    }
+
+    @Override
+    public void close() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.close();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });

Review comment:
       we should use `IOUtils#close` for this, which will make sure to keep closing other resources if one of them throws an IOException upon close




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-861482554


   +1 to fix the test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-872734403


   > I agree that replacing `Fields` for a similar class `TermVectors` should be its own issue.
   > I love your naming proposals: TermVectorsReader and TermVectorsReaderBase!
   > Renaming TermVectorsReader to TermVectorsReaderBase should probably be it's own commit (PR) done first so that Git doesn't get confused.
   
   Sounds good! I've opened an issue for this https://issues.apache.org/jira/browse/LUCENE-10018
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r650443353



##########
File path: lucene/core/src/java/org/apache/lucene/index/IndexReader.java
##########
@@ -310,6 +311,9 @@ public final int hashCode() {
    */
   public abstract Fields getTermVectors(int docID) throws IOException;
 
+  /** Get TermVectorsReader from this index. */
+  public abstract TermVectorsReader getTermVectorsReaderNonThreadLocal();

Review comment:
       This method name is temporary for ease of differentiation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r651468446



##########
File path: lucene/core/src/java/org/apache/lucene/index/IndexReader.java
##########
@@ -310,6 +311,9 @@ public final int hashCode() {
    */
   public abstract Fields getTermVectors(int docID) throws IOException;
 
+  /** Get TermVectorsReader from this index. */
+  public abstract TermVectorsReader getTermVectorsReaderNonThreadLocal();

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on a change in pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r661242673



##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -112,10 +112,29 @@ protected BaseCompositeReader(R[] subReaders, Comparator<R> subReadersSorter) th
   }
 
   @Override
-  public final Fields getTermVectors(int docID) throws IOException {
-    ensureOpen();
-    final int i = readerIndex(docID); // find subreader num
-    return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
+  public final TermVectors getTermVectorsReader() {
+    TermVectors[] termVectors = new TermVectors[subReaders.length];
+
+    // subReaders is a collection of segmentReaders
+    for (int i = 0; i < subReaders.length; i++) {
+      // the getTermVectorsReader would clone a new instance, hence saving it into an array
+      // to avoid re-cloning from direct subReaders[i].getTermVectorsReader() call
+      termVectors[i] = subReaders[i].getTermVectorsReader();
+    }
+
+    return new TermVectors() {

Review comment:
       I've implemented the lazy-init approach in the latest commit and it passed the existing tests (in particular, `TestMultiThreadTermVectors`), so I think it should be safe for threaded situation? One downside from this approach I see though is that since `null` was used to indicate both TV not available and not initialized, lazy-init will require repeated `subReaders[i].getTermVectorsReader()` calls to get TV reader in the same segment if that segment simply doesn't contain any TV index.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r652536636



##########
File path: lucene/core/src/java/org/apache/lucene/index/IndexReader.java
##########
@@ -310,6 +310,9 @@ public final int hashCode() {
    */
   public abstract Fields getTermVectors(int docID) throws IOException;

Review comment:
       Let's make this one final, deprecated and delegate to the below method?

##########
File path: lucene/core/src/java/org/apache/lucene/index/IndexReader.java
##########
@@ -310,6 +310,9 @@ public final int hashCode() {
    */
   public abstract Fields getTermVectors(int docID) throws IOException;
 
+  /** Get TermVectors from this index. */
+  public abstract TermVectors getTermVectorsNonThreadLocal();

Review comment:
       Let's rename to `getTermVectorsReader`, the same name it has on `CodecReader`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r654536493



##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -112,10 +112,29 @@ protected BaseCompositeReader(R[] subReaders, Comparator<R> subReadersSorter) th
   }
 
   @Override
-  public final Fields getTermVectors(int docID) throws IOException {
-    ensureOpen();
-    final int i = readerIndex(docID); // find subreader num
-    return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
+  public final TermVectors getTermVectorsReader() {
+    TermVectors[] termVectors = new TermVectors[subReaders.length];
+
+    // subReaders is a collection of segmentReaders
+    for (int i = 0; i < subReaders.length; i++) {
+      // the getTermVectorsReader would clone a new instance, hence saving it into an array
+      // to avoid re-cloning from direct subReaders[i].getTermVectorsReader() call
+      termVectors[i] = subReaders[i].getTermVectorsReader();
+    }
+
+    return new TermVectors() {

Review comment:
       The general rule is that we try very hard to avoid allocating on a per-document basis, but that it is fine on a per-reader or per-segment basis. Since this one is per-reader, I'm not concerned.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dsmiley commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-864720190


   > If I understanding the above correctly, eventually we would like to also have TermVectors to be similar to Fields, and act like the abstraction API for term vectors index data and not just an indirection, and Fields should be an internal data structure reserved for posting?
   
   Yes!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz edited a comment on pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
jpountz edited a comment on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-861482554


   > +1 to fix the test
   
   I've come up with a simple fix in https://github.com/apache/lucene/pull/180/commits/5520c5d0aed30fdb3231f920881cd2c685153267. Please let me know if it looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dsmiley commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-876482149


   I think my preference is for you to commit this PR as-is after all.  It introduces a new TermVectors interface (not great usability) but I think in LUCENE-10018 it would be reasonable to have TermVectors abstraction change to *be* what Fields is today for the TV use-case, with some modifications.  TV would look like:
   ````
   interface TermVectors {
     Collection<String> termVectorFields(int docId) throws IOException;
     Terms termVectorTerms(int docId, String fieldName) throws IOException;
   }
   ````
   Since TermVectorsReader will be a thread-safe clone of the original (unlike 8x), it's then reasonable for it to cache the current TV info for the most recently accessed doc.  So if you get the Terms for all fields for the current doc, it won't have to re-read anything.  WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dsmiley commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
dsmiley commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-871922823


   I agree that replacing `Fields` for a similar class `TermVectors` should be its own issue.  
   I love your naming proposals: TermVectorsReader and TermVectorsReaderBase!
   Renaming TermVectorsReader to TermVectorsReaderBase should probably be it's own commit (PR) done first so that Git doesn't get confused.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r651468301



##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -118,6 +121,65 @@ public final Fields getTermVectors(int docID) throws IOException {
     return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
   }
 
+  private class CompositeTermVectorsReader extends TermVectorsReader {
+    List<TermVectorsReader> termVectorsReaders;
+
+    public CompositeTermVectorsReader(List<TermVectorsReader> termVectorsReaders) {
+      this.termVectorsReaders = termVectorsReaders;
+    }
+
+    @Override
+    public Fields get(int doc) throws IOException {
+      ensureOpen();
+      final int i = readerIndex(doc); // find subreader num
+      return termVectorsReaders.get(i).get(doc - starts[i]); // dispatch to subreader
+    }
+
+    @Override
+    public void checkIntegrity() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.checkIntegrity();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });

Review comment:
       Yeah I think so. Sorry was too used to typing stream these days :D .

##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -118,6 +121,65 @@ public final Fields getTermVectors(int docID) throws IOException {
     return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
   }
 
+  private class CompositeTermVectorsReader extends TermVectorsReader {
+    List<TermVectorsReader> termVectorsReaders;
+
+    public CompositeTermVectorsReader(List<TermVectorsReader> termVectorsReaders) {
+      this.termVectorsReaders = termVectorsReaders;
+    }
+
+    @Override
+    public Fields get(int doc) throws IOException {
+      ensureOpen();
+      final int i = readerIndex(doc); // find subreader num
+      return termVectorsReaders.get(i).get(doc - starts[i]); // dispatch to subreader
+    }
+
+    @Override
+    public void checkIntegrity() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.checkIntegrity();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });
+    }
+
+    @Override
+    public TermVectorsReader clone() {
+      List<TermVectorsReader> newTermVectorReaders =
+          termVectorsReaders.stream().map(r -> r.clone()).collect(Collectors.toList());

Review comment:
       I updated this section to use `for` loop now, but will prefer to use method references in later changes.

##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -118,6 +121,65 @@ public final Fields getTermVectors(int docID) throws IOException {
     return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
   }
 
+  private class CompositeTermVectorsReader extends TermVectorsReader {
+    List<TermVectorsReader> termVectorsReaders;
+
+    public CompositeTermVectorsReader(List<TermVectorsReader> termVectorsReaders) {
+      this.termVectorsReaders = termVectorsReaders;
+    }
+
+    @Override
+    public Fields get(int doc) throws IOException {
+      ensureOpen();
+      final int i = readerIndex(doc); // find subreader num
+      return termVectorsReaders.get(i).get(doc - starts[i]); // dispatch to subreader
+    }
+
+    @Override
+    public void checkIntegrity() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.checkIntegrity();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });
+    }
+
+    @Override
+    public TermVectorsReader clone() {
+      List<TermVectorsReader> newTermVectorReaders =
+          termVectorsReaders.stream().map(r -> r.clone()).collect(Collectors.toList());
+
+      return new CompositeTermVectorsReader(newTermVectorReaders);
+    }
+
+    @Override
+    public void close() throws IOException {
+      termVectorsReaders.stream()
+          .forEach(
+              r -> {
+                try {
+                  r.close();
+                } catch (IOException e) {
+                  throw new UncheckedIOException(e);
+                }
+              });

Review comment:
       Ah yes. This part is no longer needed in latest changes, but will keep it in mind for future changes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dsmiley commented on a change in pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r655070117



##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -112,10 +112,29 @@ protected BaseCompositeReader(R[] subReaders, Comparator<R> subReadersSorter) th
   }
 
   @Override
-  public final Fields getTermVectors(int docID) throws IOException {
-    ensureOpen();
-    final int i = readerIndex(docID); // find subreader num
-    return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
+  public final TermVectors getTermVectorsReader() {
+    TermVectors[] termVectors = new TermVectors[subReaders.length];
+
+    // subReaders is a collection of segmentReaders
+    for (int i = 0; i < subReaders.length; i++) {
+      // the getTermVectorsReader would clone a new instance, hence saving it into an array
+      // to avoid re-cloning from direct subReaders[i].getTermVectorsReader() call
+      termVectors[i] = subReaders[i].getTermVectorsReader();
+    }
+
+    return new TermVectors() {

Review comment:
       Maybe the above array should act as a cache (lazy init) instead of forcing eager load for all segments?  There may be a thread-safety concern there but maybe not since the only risk is doing more work on a race.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #180:
URL: https://github.com/apache/lucene/pull/180#issuecomment-871155896


   > > If I understanding the above correctly, eventually we would like to also have TermVectors to be similar to Fields, and act like the abstraction API for term vectors index data and not just an indirection, and Fields should be an internal data structure reserved for posting?
   >
   > Yes!
   
   > FYI Years ago Fields was even more prevalent but I reduced it further -- https://issues.apache.org/jira/browse/LUCENE-7500 (read the description). TVs were noted as the last place where a Lucene user sees them. A class "TermVectors" would be the perfect named substitute for Fields for the TV use-case.
   
   Thanks for the confirmation, and it definitely makes sense! I feel maybe we should rename `TermVectors` and `TermVectorsReader` to something else if we want to go in that direction? Maybe something like: 
   
   *TermVectors -> TermVectorsReader*, and it has the following API:
   ```
   public abstract class TermVectorsReader {
     public abstract TermVectors get(int doc) throws IOException;
   }
   ```
   
   *TermVectorsReader -> TermVectorsReaderBase*, and it has the same APIs as before:
   ```
   public abstract class TermVectorsReaderBase extends TermVectorsReader implements Cloneable, Closeable {
     
     public abstract void checkIntegrity() throws IOException;
   
     @Override
     public abstract TermVectorsReader clone();
   
     public TermVectorsReader getMergeInstance() {
       return this;
     }
   }
   ```
   
   as such, `TermVectors` can have similar APIs as `Fields`, for example:
   
   ```
   public abstract class TermVectors implements Iterable<String> {
     protected TermVectors() {}
   
     @Override
     public abstract Iterator<String> iterator();
   
     public abstract Terms terms(String field) throws IOException;
   
     public abstract int size();
   
     public static final TermVectors[] EMPTY_ARRAY = new TermVectors[0];
   }
   ```
   
   What do you think about these APIs?
   
   In addition, I also feel that given the extra discussion needed, we should probably track this in a different issue?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zacharymorn commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r654748459



##########
File path: lucene/core/src/java/org/apache/lucene/index/IndexReader.java
##########
@@ -307,8 +307,21 @@ public final int hashCode() {
   /**
    * Retrieve term vectors for this document, or null if term vectors were not indexed. The returned
    * Fields instance acts like a single-document inverted index (the docID will be 0).
+   *
+   * @deprecated Use {@link IndexReader#getTermVectorsReader} instead.
    */
-  public abstract Fields getTermVectors(int docID) throws IOException;
+  @Deprecated
+  public final Fields getTermVectors(int docID) throws IOException {
+    TermVectors termVectors = getTermVectorsReader();
+    if (termVectors != null) {
+      return termVectors.get(docID);
+    }
+    return null;
+  }
+  ;
+
+  /** Get TermVectors from this index, or null if term vectors were not indexed. */
+  public abstract TermVectors getTermVectorsReader();

Review comment:
       This (not returning `TermVectorsReader` here) was done on purpose to avoid leaking codec API into index https://github.com/apache/lucene/pull/180#discussion_r650720327, and renaming the method to `getTermVectors` may make it not consistent with the rest of reader methods in `CodecReader`. I guess this is a consequence of having this API at the Index level, in order to get rid of ThreadLocal? Please let me know if you have other suggestions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] madrob commented on a change in pull request #180: LUCENE-9959: [WIP] Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r654162699



##########
File path: lucene/core/src/java/org/apache/lucene/index/IndexReader.java
##########
@@ -307,8 +307,21 @@ public final int hashCode() {
   /**
    * Retrieve term vectors for this document, or null if term vectors were not indexed. The returned
    * Fields instance acts like a single-document inverted index (the docID will be 0).
+   *
+   * @deprecated Use {@link IndexReader#getTermVectorsReader} instead.
    */
-  public abstract Fields getTermVectors(int docID) throws IOException;
+  @Deprecated
+  public final Fields getTermVectors(int docID) throws IOException {
+    TermVectors termVectors = getTermVectorsReader();
+    if (termVectors != null) {
+      return termVectors.get(docID);
+    }
+    return null;
+  }
+  ;
+
+  /** Get TermVectors from this index, or null if term vectors were not indexed. */
+  public abstract TermVectors getTermVectorsReader();

Review comment:
       This naming feels weird to me that it returns TV not TVR.

##########
File path: lucene/core/src/java/org/apache/lucene/index/TermVectors.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+
+/** Index API to access TermVectors */
+public abstract class TermVectors {

Review comment:
       This should probably be an interface?

##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -112,10 +112,29 @@ protected BaseCompositeReader(R[] subReaders, Comparator<R> subReadersSorter) th
   }
 
   @Override
-  public final Fields getTermVectors(int docID) throws IOException {
-    ensureOpen();
-    final int i = readerIndex(docID); // find subreader num
-    return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
+  public final TermVectors getTermVectorsReader() {
+    TermVectors[] termVectors = new TermVectors[subReaders.length];
+
+    // subReaders is a collection of segmentReaders
+    for (int i = 0; i < subReaders.length; i++) {
+      // the getTermVectorsReader would clone a new instance, hence saving it into an array
+      // to avoid re-cloning from direct subReaders[i].getTermVectorsReader() call
+      termVectors[i] = subReaders[i].getTermVectorsReader();
+    }
+
+    return new TermVectors() {

Review comment:
       I'm not sure why but this scares me a little bit. Feels like there's a potential for more garbage than we currently get on the heap. I haven't measured anything though, so maybe it's ok?

##########
File path: lucene/core/src/java/org/apache/lucene/index/TermVectors.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+
+/** Index API to access TermVectors */
+public abstract class TermVectors {

Review comment:
       I’m really curious about the disadvantages you see with an interface versus an class with every method abstract. Can you give some hints so that I can do further research?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dsmiley commented on a change in pull request #180: LUCENE-9959: Add non thread local based API for term vector reader usage

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #180:
URL: https://github.com/apache/lucene/pull/180#discussion_r661978375



##########
File path: lucene/core/src/java/org/apache/lucene/index/BaseCompositeReader.java
##########
@@ -112,10 +112,29 @@ protected BaseCompositeReader(R[] subReaders, Comparator<R> subReadersSorter) th
   }
 
   @Override
-  public final Fields getTermVectors(int docID) throws IOException {
-    ensureOpen();
-    final int i = readerIndex(docID); // find subreader num
-    return subReaders[i].getTermVectors(docID - starts[i]); // dispatch to subreader
+  public final TermVectors getTermVectorsReader() {
+    TermVectors[] termVectors = new TermVectors[subReaders.length];
+
+    // subReaders is a collection of segmentReaders
+    for (int i = 0; i < subReaders.length; i++) {
+      // the getTermVectorsReader would clone a new instance, hence saving it into an array
+      // to avoid re-cloning from direct subReaders[i].getTermVectorsReader() call
+      termVectors[i] = subReaders[i].getTermVectorsReader();
+    }
+
+    return new TermVectors() {

Review comment:
       Looks good!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org