You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/01 09:50:26 UTC

[GitHub] [lucene] jpountz opened a new pull request, #12054: Introduce a new `KeywordField`.

jpountz opened a new pull request, #12054:
URL: https://github.com/apache/lucene/pull/12054

   `KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`, similarly to how `LongField` is a combination of `LongPoint` and `SortedNumericDocValuesField`. This makes it easier for users to create fields that can be used for filtering, sorting and faceting.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1097736939


##########
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.util.Collection;
+import java.util.Objects;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.ConstantScoreQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.search.SortedSetSelector;
+import org.apache.lucene.search.SortedSetSortField;
+import org.apache.lucene.search.TermInSetQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Field that indexes a per-document String or {@link BytesRef} into an inverted index for fast
+ * filtering, stores values in a columnar fashion using {@link DocValuesType#SORTED_SET} doc values
+ * for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
+ * This field does not support scoring: queries produce constant scores. If you also need to store

Review Comment:
   We can nuke this sentence about "if you also need to store the value" now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1089154108


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   I tried to think more about this and opened https://github.com/apache/lucene/pull/12116 as a possible way forward.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1370089520

   +1 to adding this new field definition. Looks like the new test failed in the precommit checks?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1385952712

   Somewhat related to this PR, I've been experimenting with the idea of a "self optimizing" `TermInSetQuery` implementation that toggles between using postings and doc values based on index statistics, etc. I wanted to link that idea here as it's a bit related (requires indexing both postings and dv, which this PR makes easy). This is just an early idea, but I'll link an early draft here in case anyone is curious or has thoughts: #12089


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068105620


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   as far as the LongField etc, we should deal with that in another issue. I agree it should support Field.Store.YES/NO. The things are you speak of are not barriers to that. They are self-created problems that we can fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068156278


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   and obviously, fixing this can be a followup to this PR. but we should do it before releasing the new APIs. These new fields are supposed to be easy to use, so they should support storing as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068103957


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   Whoah... this field is for strings. Kill the BytesRef constructor. It is enough to pass a String. and you can support Field.Store.YES/NO



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1371192608

   Thanks for looking @gsmiller. The test that fails is because this PR relies on behavior introduced by https://github.com/apache/lucene/pull/12053. I'll rebase when this other PR is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12054: Introduce a new `KeywordField`.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1421143189

   > You can make it new IndexOrDocValuesQuery(new TermInSetQuery, SortedSetDocValuesField.newSlowSetQuery()) right now and it performs better than what is on that PR.
   
   +1 to using `IndexOrDocValues` for now. Given the feedback in #12089, I'm going to see if I can come up with a way to help `IndexOrDocValuesQuery` make a better decision between postings/doc values for the case it currently doesn't handle well, as opposed to changing the guts of `TermInSetQuery`. I'll benchmark a couple different ideas for that and post a separate PR with what I'm able to come up with. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz merged pull request #12054: Introduce a new `KeywordField`.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz merged PR #12054:
URL: https://github.com/apache/lucene/pull/12054


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12054: Introduce a new `KeywordField`.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1420805369

   >     * added a `newSetQuery` that creates a `TermInSetQuery` and hopefully soon benefits from @gsmiller 's optimization
   
   You can make it `new IndexOrDocValuesQuery(new TermInSetQuery, SortedSetDocValuesField.newSlowSetQuery())` right now and it performs better than what is on that PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
jpountz commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068141063


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   I don't mind killing the BytesRef ctor, but I don't think it would be enough. We need this field to implement `binaryValue()` so that doc values can be indexed. But then stored fields are going to see a field where both `stringValue()` and `binaryValue()` return non-null, and it would be a problem for the current stored fields format which checks the binary value first, so this `KeywordField` would be considered as a binary field by stored fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12054: Introduce a new `KeywordField`.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1419458041

   I updated this PR to
    - add a `Field.Store` parameter to the constructor that does not rely on Field's guessing
    - update the demo to pass Field.Store.YES as a value for this parameter instead of adding a separate StoredField
    - added a `newSetQuery` that creates a `TermInSetQuery` and hopefully soon benefits from @gsmiller 's optimization


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068154625


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   look at current stored fields writer as example. All these damn java abstractions, yet our codec writer is doing **TYPE-GUESSING?**. Let's add a new method, so the codec knows the type and never guesses. This "guessing" belongs as an impl detail behind a new method in Field.java IMO (because Field.java tries to be a superhero and support all types). For structured types like KeywordField it should just be `return STRING`.
   
   ```
       Number number = field.numericValue();
       if (number != null) {
         if (number instanceof Byte || number instanceof Short || number instanceof Integer) {
           bits = NUMERIC_INT;
         } else if (number instanceof Long) {
           bits = NUMERIC_LONG;
         } else if (number instanceof Float) {
           bits = NUMERIC_FLOAT;
         } else if (number instanceof Double) {
           bits = NUMERIC_DOUBLE;
         } else {
           throw new IllegalArgumentException("cannot store numeric type " + number.getClass());
         }
         string = null;
         bytes = null;
       } else {
         bytes = field.binaryValue();
         if (bytes != null) {
           bits = BYTE_ARR;
           string = null;
         } else {
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068144382


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   Right, its a self-created problem though, because we know it should go into storedfields as a string.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068145212


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   by self-created i mean, too many java abstractions / java abstractions are the things causing the issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068060646


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   > If you also need to store the value, you should add a separate {@link StoredField} instance.
   
   Let's rethink this for the new fields we are adding. I think storing is quite common and offering Field.Store.YES/NO is the best choice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by GitBox <gi...@apache.org>.
jpountz commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1068102375


##########
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##########
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti
       // field that is indexed (i.e. searchable), but don't tokenize
       // the field into separate words and don't index term frequency
       // or positional information:
-      Field pathField = new StringField("path", file.toString(), Field.Store.YES);
-      doc.add(pathField);
+      doc.add(new KeywordField("path", file.toString()));
+      doc.add(new StoredField("path", file.toString()));

Review Comment:
   Right. The challenge I'm seeing is that to index both points and doc values on numeric fields, we're creating a field that produces a binaryValue() consumed by points, as well as a numeric value consumed by doc values. But stored fields can store both binary and numeric data, so how should they know which value they should look at?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1097738909


##########
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.util.Collection;
+import java.util.Objects;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.ConstantScoreQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.search.SortedSetSelector;
+import org.apache.lucene.search.SortedSetSortField;
+import org.apache.lucene.search.TermInSetQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Field that indexes a per-document String or {@link BytesRef} into an inverted index for fast
+ * filtering, stores values in a columnar fashion using {@link DocValuesType#SORTED_SET} doc values
+ * for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
+ * This field does not support scoring: queries produce constant scores. If you also need to store
+ * the value, you should add a separate {@link StoredField} instance. If you need more fine-grained
+ * control you can use {@link StringField}, {@link SortedDocValuesField} or {@link
+ * SortedSetDocValuesField}, and {@link StoredField}.
+ *
+ * <p>This field defines static factory methods for creating common query objects:
+ *
+ * <ul>
+ *   <li>{@link #newExactQuery} for matching a value.
+ *   <li>{@link #newSetQuery} for matching any of the values coming from a set.
+ *   <li>{@link #newSortField} for matching a value.
+ * </ul>
+ */
+public class KeywordField extends Field {
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+  private static final FieldType FIELD_TYPE_STORED;
+
+  static {
+    FIELD_TYPE.setIndexOptions(IndexOptions.DOCS);
+    FIELD_TYPE.setOmitNorms(true);
+    FIELD_TYPE.setTokenized(false);
+    FIELD_TYPE.setDocValuesType(DocValuesType.SORTED_SET);
+    FIELD_TYPE.freeze();
+
+    FIELD_TYPE_STORED = new FieldType(FIELD_TYPE);
+    FIELD_TYPE_STORED.setStored(true);
+    FIELD_TYPE_STORED.freeze();
+  }
+
+  private final StoredValue storedValue;
+
+  /**
+   * Creates a new KeywordField.
+   *
+   * @param name field name
+   * @param value the BytesRef value
+   * @param stored whether to store the field
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public KeywordField(String name, BytesRef value, Store stored) {
+    super(name, value, stored == Field.Store.YES ? FIELD_TYPE_STORED : FIELD_TYPE);
+    if (stored == Store.YES) {
+      storedValue = new StoredValue(value);
+    } else {
+      storedValue = null;
+    }
+  }
+
+  /**
+   * Creates a new KeywordField from a String value, by indexing its UTF-8 representation.
+   *
+   * @param name field name
+   * @param value the BytesRef value
+   * @param stored whether to store the field
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public KeywordField(String name, String value, Store stored) {
+    super(name, value, stored == Field.Store.YES ? FIELD_TYPE_STORED : FIELD_TYPE);
+    if (stored == Store.YES) {
+      storedValue = new StoredValue(value);
+    } else {
+      storedValue = null;
+    }
+  }
+
+  @Override
+  public BytesRef binaryValue() {
+    BytesRef binaryValue = super.binaryValue();
+    if (binaryValue != null) {
+      return binaryValue;
+    } else {
+      return new BytesRef(stringValue());
+    }
+  }
+
+  @Override
+  public void setStringValue(String value) {
+    super.setStringValue(value);
+    if (storedValue != null) {
+      storedValue.setStringValue(value);
+    }
+  }
+
+  @Override
+  public void setBytesValue(BytesRef value) {
+    super.setBytesValue(value);
+    if (storedValue != null) {
+      storedValue.setBinaryValue(value);
+    }
+  }
+
+  @Override
+  public StoredValue storedValue() {
+    return storedValue;
+  }
+
+  /**
+   * Create a query for matching an exact {@link BytesRef} value.
+   *
+   * @param field field name. must not be {@code null}.
+   * @param value exact value
+   * @throws NullPointerException if {@code field} is null.
+   * @return a query matching documents with this exact value
+   */
+  public static Query newExactQuery(String field, BytesRef value) {
+    Objects.requireNonNull(field, "field must not be null");
+    Objects.requireNonNull(value, "value must not be null");
+    return new ConstantScoreQuery(new TermQuery(new Term(field, value)));
+  }
+
+  /**
+   * Create a query for matching an exact {@link String} value.
+   *
+   * @param field field name. must not be {@code null}.
+   * @param value exact value
+   * @throws NullPointerException if {@code field} is null.
+   * @return a query matching documents with this exact value
+   */
+  public static Query newExactQuery(String field, String value) {
+    Objects.requireNonNull(value, "value must not be null");
+    return newExactQuery(field, new BytesRef(value));
+  }
+
+  /**
+   * Create a query for matching any of a set of provided {@link BytesRef} values.
+   *
+   * @param field field name. must not be {@code null}.
+   * @param values the set of values to match
+   * @throws NullPointerException if {@code field} is null.
+   * @return a query matching documents with this exact value
+   */
+  public static Query newSetQuery(String field, Collection<BytesRef> values) {

Review Comment:
   Can we expose this as `BytesRef...` instead of collection, consistent with all the other `newSetQuery`'s? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org