Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/21 12:26:36 UTC

[GitHub] [lucene] thongnt99 opened a new issue, #11799: Indexing method for learned sparse retrieval

thongnt99 opened a new issue, #11799:
URL: https://github.com/apache/lucene/issues/11799

   ### Description
   
   Recent learned sparse retrieval methods ([Splade](https://github.com/naver/splade), [uniCOIL](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md)) are trained to generate impact scores directly, replacing tf-idf scores.  
   For each document, they generate a JSON file with terms and weights, e.g. `{";": 80, "the": 161, "of": 85, "and": 27, "to": 24, "was": 47, "as": 27, "their": 96, "what": 40, "over": 123, "only": 123, "important": 186, "project": 208, "success": 215, "meant": 131, "lives": 140, "presence": 180, "scientific": 200, "communication": 235, "thousands": 142, "hundreds": 144, "truly": 170, "hanging": 141, "cloud": 187, "engineers": 127, "achievement": 192, "researchers": 137, "innocent": 181, "manhattan": 244, "impressive": 191, "equally": 163, "##rated": 132, "minds": 137, "atomic": 214, "amid": 201, "##lite": 120, "intellect": 202, "ob": 140}`
   Can we add a feature that indexes this type of document efficiently? 
   The current [workaround](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java) I am aware of is to create a fake document by repeating each term as many times as its weight, e.g. `"the the the the .... of of of of of "`
   However, this becomes inefficient as the impact scores get larger, and it also requires quantizing the impact scores before indexing. 
   I think it would be very useful for many people if we could index the JSON files directly with float impact scores. 
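
   To make the cost of the repetition workaround concrete, here is a minimal sketch in plain Java. The class `RepeatedTermDocument`, its methods, and the scale factor are hypothetical names chosen for illustration; they are not taken from Lucene or Anserini.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical helper for illustration only; not part of Lucene or Anserini.
   final class RepeatedTermDocument {

     // Quantize a float impact into a positive repeat count by scaling and rounding.
     // The scale factor is an assumption; a real pipeline would pick it per model.
     static int quantize(float impact, float scale) {
       return Math.max(1, Math.round(impact * scale));
     }

     // Build the pseudo-document by writing each term `weight` times.
     // The text grows with the sum of all weights, which is why large impacts hurt.
     static String build(Map<String, Integer> weights) {
       StringBuilder sb = new StringBuilder();
       for (Map.Entry<String, Integer> e : weights.entrySet()) {
         for (int i = 0; i < e.getValue(); i++) {
           if (sb.length() > 0) {
             sb.append(' ');
           }
           sb.append(e.getKey());
         }
       }
       return sb.toString();
     }

     public static void main(String[] args) {
       Map<String, Integer> weights = new LinkedHashMap<>();
       weights.put("the", 3);
       weights.put("of", 2);
       System.out.println(build(weights)); // the the the of of
       System.out.println(quantize(1.23f, 100f)); // 123
     }
   }
   ```

   With the weights from the JSON example above, a single document already expands to several thousand tokens, all of which have to pass through the analysis chain.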


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1253945883

   You can use `TermFrequencyAttribute` in the analysis chain to set the frequency directly.




[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
thongnt99 commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256442806

   Yes, I think it would be nicer to have dedicated classes for LSR. Though using FeatureField is efficient, I feel it is still a bit of a hack. 
   If we replace FeatureQuery with `new BoostQuery(new TermQuery(new Term(field, term)), weight)`, it doesn't work. So I think there is some internal difference between the indexes created by this approach and by the repetition approach.  




[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
thongnt99 commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256416719

   I confirmed that @jpountz's approach works. On my dataset, the indexing time goes down from more than 1 hour to ~10 minutes. 
   A small issue: the weight in `FeatureField.newLinearQuery` is constrained to the range (0, 64]. This is not desirable, but it is fine for now if there is no easy fix. 
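
   One possible way to live with the (0, 64] limit is to rescale the query weights on the application side before building the queries. The sketch below is plain Java and only an assumption about how this could be done; `QueryWeightRescaler` is a hypothetical name, and it expects every weight to be positive. Because all weights of a query are multiplied by the same factor, every document score is multiplied by that factor too, so the ranking should be unchanged up to floating-point rounding.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical helper for illustration only; not part of Lucene.
   final class QueryWeightRescaler {

     // Upper bound on the weight mentioned in this thread.
     static final float MAX_WEIGHT = 64f;

     // Scale all weights by one common factor so that the largest becomes <= 64.
     // Assumes all input weights are positive.
     static Map<String, Float> rescale(Map<String, Float> weights) {
       float max = 0f;
       for (float w : weights.values()) {
         max = Math.max(max, w);
       }
       float factor = max > MAX_WEIGHT ? MAX_WEIGHT / max : 1f;
       Map<String, Float> out = new LinkedHashMap<>();
       for (Map.Entry<String, Float> e : weights.entrySet()) {
         // Clamp to guard against rounding pushing a value slightly above 64.
         out.put(e.getKey(), Math.min(MAX_WEIGHT, e.getValue() * factor));
       }
       return out;
     }

     public static void main(String[] args) {
       Map<String, Float> weights = new LinkedHashMap<>();
       weights.put("scientific", 128f);
       weights.put("communication", 32f);
       System.out.println(rescale(weights)); // {scientific=64.0, communication=16.0}
     }
   }
   ```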




[GitHub] [lucene] mocobeta commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
mocobeta commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1253724856

   In general I'm +1 for supporting learned sparse retrieval, though I think it would not be as trivial as it looks.
   
   For a start, perhaps we could use term payloads to tweak the weights instead of modifying the indexing chain... but there may be some overhead in score calculation.




[GitHub] [lucene] msokolov commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
msokolov commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254100366

   If you use `TermFrequencyAttribute` to customize the term frequencies, you can then create a Query in the normal way and compute BM25 with `b == 0`; then I think you will directly control the similarity scores. Or you might want to write a custom Similarity to be a bit more efficient.
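
   To illustrate what `b == 0` changes, here is a minimal sketch of the textbook BM25 term-frequency factor in plain Java. It is not Lucene's `BM25Similarity` implementation, whose exact formula and encoding may differ; `Bm25TermFrequencyFactor` and its parameter names are hypothetical.

   ```java
   // Textbook BM25 term-frequency factor, for illustration only; not Lucene code.
   final class Bm25TermFrequencyFactor {

     // tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
     static double tfFactor(double tf, double k1, double b, double docLen, double avgDocLen) {
       double norm = k1 * (1 - b + b * docLen / avgDocLen);
       return tf * (k1 + 1) / (tf + norm);
     }

     public static void main(String[] args) {
       double shortDoc = tfFactor(200, 1.2, 0.0, 10, 100);
       double longDoc = tfFactor(200, 1.2, 0.0, 1000, 100);
       // With b == 0 the document length drops out of the formula.
       System.out.println(shortDoc == longDoc); // true
       // The factor still grows with tf, but it saturates below k1 + 1.
       System.out.println(tfFactor(100, 1.2, 0.0, 10, 100) < shortDoc); // true
       System.out.println(shortDoc < 1.2 + 1); // true
     }
   }
   ```

   In this textbook form the score is monotonic in the stored frequency but not linear, which is one reason a custom Similarity that uses the frequency as the score may be the closer fit for learned impacts.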




[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
thongnt99 commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254119695

   @jtibshirani The query side is the same as the document side: a dictionary of terms and weights. To make it compatible with Lucene, people just repeat each term according to its weight. This is fine because queries are usually much shorter. 
   Yes, FeatureField is something similar, but we want a single Field containing a list of key-value pairs, or JSON-formatted text. 
   @msokolov @rmuir @mocobeta: I found [this](https://github.com/apache/lucene/blob/475fbd0bdde31c6a2ae62c59505cf9e8becd50e4/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.java), which could more or less achieve what we want. But I think it is not very flexible: we need to turn the JSON file into a token stream formatted as [<term><delimiter><frequency>......] ... I think this step is redundant. Can we just load the JSON file directly? For this, I think we might have to move away from the TokenStream pipeline.  
   What do you think? Your thoughts are very much appreciated, as I am not very familiar with Lucene. 
   
   We can form a group to work on this if you are interested. 
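
   For reference, the conversion step in question could look roughly like the plain-Java sketch below. It assumes the JSON has already been parsed into a map, that tokens are later split on whitespace, and that `|` is the delimiter; `DelimitedTermFrequencyText` is a hypothetical name, not a Lucene class.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.StringJoiner;

   // Hypothetical helper for illustration only; not part of Lucene.
   final class DelimitedTermFrequencyText {

     // Render term/weight pairs as "<term><delimiter><frequency>" tokens separated by spaces.
     // Terms containing whitespace or the delimiter are not handled in this sketch.
     static String toDelimited(Map<String, Integer> weights, char delimiter) {
       StringJoiner joiner = new StringJoiner(" ");
       for (Map.Entry<String, Integer> e : weights.entrySet()) {
         if (e.getValue() <= 0) {
           continue; // skip entries that cannot be expressed as a positive frequency
         }
         joiner.add(e.getKey() + delimiter + e.getValue());
       }
       return joiner.toString();
     }

     public static void main(String[] args) {
       Map<String, Integer> weights = new LinkedHashMap<>();
       weights.put("scientific", 200);
       weights.put("atomic", 214);
       System.out.println(toDelimited(weights, '|')); // scientific|200 atomic|214
     }
   }
   ```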




[GitHub] [lucene] jtibshirani commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254070746

   +1 from me too, it'd be great to think through how to support this. Could you explain how the query side would look? Are the queries also sparse vectors with custom impacts?
   
   As a note, we have a `FeatureField` field type that accepts key-value pairs and stores the value in `TermFrequencyAttribute`. It's designed to help incorporate other scoring signals like popularity, page rank, etc. It may not be exactly what we want for this use case, but it could provide some inspiration.




[GitHub] [lucene] jpountz commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256425435

   This is a good point. This limit was introduced with the idea that `FeatureField` would be used to incorporate features into a BM25/TFIDF/DFR score, where weights higher than 64 would generally be a mistake, but we could lift this limit if it feels like a useful query to use on its own for learned sparse retrieval.




[GitHub] [lucene] jpountz commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254691652

   > we want a single Field containing a list of key-value pairs or a json formatted
   
   Note that you can add one `FeatureField` field to your Lucene document for every key/value pair in your JSON document. The logic of converting from a high-level representation like a JSON map into a low-level representation that Lucene understands feels like something that could be managed on the application side?
   
   Here's a code example that I think does something similar to what you are looking for:
   
   ```java
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.FeatureField;
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.store.ByteBuffersDirectory;
   import org.apache.lucene.store.Directory;
   
   public class LearnedSparseRetrieval {
   
     public static void main(String[] args) throws Exception {
       try (Directory dir = new ByteBuffersDirectory()) {
         try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
           {
             Document doc = new Document();
             doc.add(new FeatureField("my_feature", "scientific", 200));
             doc.add(new FeatureField("my_feature", "intellect", 202));
             doc.add(new FeatureField("my_feature", "communication", 235));
             w.addDocument(doc);
           }
           {
             Document doc = new Document();
             doc.add(new FeatureField("my_feature", "scientific", 100));
             doc.add(new FeatureField("my_feature", "communication", 350));
             doc.add(new FeatureField("my_feature", "project", 80));
             w.addDocument(doc);
           }
         }
   
         try (IndexReader reader = DirectoryReader.open(dir)) {
           IndexSearcher searcher = new IndexSearcher(reader);
           Query query = new BooleanQuery.Builder()
               .add(FeatureField.newLinearQuery("my_feature", "scientific", 24), Occur.SHOULD)
               .add(FeatureField.newLinearQuery("my_feature", "communication", 50), Occur.SHOULD)
               .build();
           System.out.println(searcher.explain(query, 0));
           System.out.println();
           System.out.println(searcher.explain(query, 1));
         }
       }
     }
   
   }
   ```
   
   which outputs
   
   ```
   16550.0 = sum of:
     4800.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
       24.0 = w, weight of this function
       200.0 = S, feature value
     11750.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
       50.0 = w, weight of this function
       235.0 = S, feature value
   
   
   19900.0 = sum of:
     2400.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
       24.0 = w, weight of this function
       100.0 = S, feature value
     17500.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
       50.0 = w, weight of this function
       350.0 = S, feature value
   ```
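
   The totals above are the dot product between the query weights and the stored feature values. As a sanity check, a minimal plain-Java sketch that does not use Lucene reproduces them; `SparseDotProduct` is a hypothetical name used only for this illustration.

   ```java
   import java.util.Map;

   // Plain-Java check of the scores above; for illustration only, not Lucene code.
   final class SparseDotProduct {

     // Sum of queryWeight * documentValue over the terms present in both maps.
     static double score(Map<String, Double> query, Map<String, Double> doc) {
       double sum = 0;
       for (Map.Entry<String, Double> e : query.entrySet()) {
         Double value = doc.get(e.getKey());
         if (value != null) {
           sum += e.getValue() * value;
         }
       }
       return sum;
     }

     public static void main(String[] args) {
       Map<String, Double> query = Map.of("scientific", 24.0, "communication", 50.0);
       Map<String, Double> doc0 =
           Map.of("scientific", 200.0, "intellect", 202.0, "communication", 235.0);
       Map<String, Double> doc1 =
           Map.of("scientific", 100.0, "communication", 350.0, "project", 80.0);
       System.out.println(score(query, doc0)); // 16550.0
       System.out.println(score(query, doc1)); // 19900.0
     }
   }
   ```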




[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

Posted by GitBox <gi...@apache.org>.
thongnt99 commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254781175

   @jpountz Great. Thank you very much. I will try it out and see if there is any difference in the scores. 

