Posted to commits@lucene.apache.org by kr...@apache.org on 2017/01/06 22:03:27 UTC

[1/3] lucene-solr:jira/solr-8593: LUCENE-7576: AutomatonTermsEnum ctor should also insist on a NORMAL CompiledAutomaton in

Repository: lucene-solr
Updated Branches:
  refs/heads/jira/solr-8593 3793eb5ec -> 4b17b82a9


LUCENE-7576: AutomatonTermsEnum ctor should also insist on a NORMAL CompiledAutomaton in


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/ebb5c7e6
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/ebb5c7e6
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/ebb5c7e6

Branch: refs/heads/jira/solr-8593
Commit: ebb5c7e6768c03c83be4aa3abdab22e16cb67c2c
Parents: cd4f908
Author: Mike McCandless <mi...@apache.org>
Authored: Fri Jan 6 14:50:01 2017 -0500
Committer: Mike McCandless <mi...@apache.org>
Committed: Fri Jan 6 14:50:01 2017 -0500

----------------------------------------------------------------------
 .../src/java/org/apache/lucene/index/AutomatonTermsEnum.java | 3 +++
 .../core/src/test/org/apache/lucene/index/TestTermsEnum.java | 8 ++++++++
 2 files changed, 11 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/ebb5c7e6/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
----------------------------------------------------------------------
diff --git a/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java b/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
index 8aa10ec..411a810 100644
--- a/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
+++ b/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
@@ -76,6 +76,9 @@ public class AutomatonTermsEnum extends FilteredTermsEnum {
    */
   public AutomatonTermsEnum(TermsEnum tenum, CompiledAutomaton compiled) {
     super(tenum);
+    if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
+      throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
+    }
     this.finite = compiled.finite;
     this.runAutomaton = compiled.runAutomaton;
     assert this.runAutomaton != null;

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/ebb5c7e6/lucene/core/src/test/org/apache/lucene/index/TestTermsEnum.java
----------------------------------------------------------------------
diff --git a/lucene/core/src/test/org/apache/lucene/index/TestTermsEnum.java b/lucene/core/src/test/org/apache/lucene/index/TestTermsEnum.java
index a388d42..d2df59f 100644
--- a/lucene/core/src/test/org/apache/lucene/index/TestTermsEnum.java
+++ b/lucene/core/src/test/org/apache/lucene/index/TestTermsEnum.java
@@ -1016,4 +1016,12 @@ public class TestTermsEnum extends LuceneTestCase {
     w.close();
     d.close();
   }
+
+  // LUCENE-7576
+  public void testInvalidAutomatonTermsEnum() throws Exception {
+    expectThrows(IllegalArgumentException.class,
+                 () -> {
+                   new AutomatonTermsEnum(TermsEnum.EMPTY, new CompiledAutomaton(Automata.makeString("foo")));
+                 });
+  }
 }


[3/3] lucene-solr:jira/solr-8593: Merge branch 'apache-https-master' into jira/solr-8593

Posted by kr...@apache.org.
Merge branch 'apache-https-master' into jira/solr-8593


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/4b17b82a
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/4b17b82a
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/4b17b82a

Branch: refs/heads/jira/solr-8593
Commit: 4b17b82a919cce9eb5e10f10b03fe6cedff49e1e
Parents: 3793eb5 024c403
Author: Kevin Risden <kr...@apache.org>
Authored: Fri Jan 6 16:03:20 2017 -0600
Committer: Kevin Risden <kr...@apache.org>
Committed: Fri Jan 6 16:03:20 2017 -0600

----------------------------------------------------------------------
 .../apache/lucene/index/AutomatonTermsEnum.java |   3 +
 .../org/apache/lucene/index/TestTermsEnum.java  |   8 ++
 solr/contrib/ltr/example/README.md              | 118 +++++++++++++++----
 solr/contrib/ltr/example/user_queries.txt       |  12 +-
 4 files changed, 112 insertions(+), 29 deletions(-)
----------------------------------------------------------------------



[2/3] lucene-solr:jira/solr-8593: SOLR-8542: expand 'Assemble training data' content in solr/contrib/ltr/README

Posted by kr...@apache.org.
SOLR-8542: expand 'Assemble training data' content in solr/contrib/ltr/README

(Diego Ceccarelli via Christine Poerschke in response to SOLR-9929 enquiry from Jeffery Yuan.)


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/024c4031
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/024c4031
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/024c4031

Branch: refs/heads/jira/solr-8593
Commit: 024c4031e55a998b73288fd276e30ffd626f0b91
Parents: ebb5c7e
Author: Christine Poerschke <cp...@apache.org>
Authored: Fri Jan 6 20:52:21 2017 +0000
Committer: Christine Poerschke <cp...@apache.org>
Committed: Fri Jan 6 20:52:21 2017 +0000

----------------------------------------------------------------------
 solr/contrib/ltr/example/README.md        | 118 ++++++++++++++++++++-----
 solr/contrib/ltr/example/user_queries.txt |  12 +--
 2 files changed, 101 insertions(+), 29 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/024c4031/solr/contrib/ltr/example/README.md
----------------------------------------------------------------------
diff --git a/solr/contrib/ltr/example/README.md b/solr/contrib/ltr/example/README.md
index 1363d5d..7842494 100644
--- a/solr/contrib/ltr/example/README.md
+++ b/solr/contrib/ltr/example/README.md
@@ -28,33 +28,105 @@ Please refer to the Solr Reference Guide's section on [Result Reranking](https:/
 
 4. Search and rerank the results using the trained model
 
-   http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=exampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
+```
+http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=exampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
+```
 
 # Assemble training data
 In order to train a learning to rank model you need training data. Training data is
-what "teaches" the model what the appropriate weight for each feature is. In general
+what *teaches* the model what the appropriate weight for each feature is. In general
 training data is a collection of queries with associated documents and what their ranking/score
 should be. As an example:
 ```
-hard drive|SP2514N|0.6666666|CLICK_LOGS
-hard drive|6H500F0|0.330082034|CLICK_LOGS
+hard drive|SP2514N        |0.6|CLICK_LOGS
+hard drive|6H500F0        |0.3|CLICK_LOGS
 hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
-hard drive|IW-02|0.0|CLICK_LOGS
-
-ipod|MA147LL/A|1.0|EXPLICIT
-ipod|F8V7067-APL-KIT|0.25|EXPLICIT
-ipod|IW-02|0.25|EXPLICIT
-ipod|6H500F0|0.0|EXPLICIT
-```
-In this example the first column indicates the query, the second column indicates a unique id for that doc,
-the third column indicates the relative importance or relevance of that doc, and the fourth column indicates the source.
-There are 2 primary ways you might collect data for use with your machine learning algorithim. The first
-is to collect the clicks of your users given a specific query. There are many ways of preparing this data
-to train a model (http://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf). The general idea
-is that if a user sees multiple documents and clicks the one lower down, that document should be scored higher
-than the one above it. The second way is explicitly through a crowdsourcing platform like Mechanical Turk or
-CrowdFlower. These platforms allow you to show human workers documents associated with a query and have them
-tell you what the correct ranking should be.
-
-At this point you'll need to collect feature vectors for each query document pair. You can use the information
-from the Extract features section above to do this. An example script has been included in example/train_and_upload_demo_model.py.
+hard drive|IW-02          |0.0|CLICK_LOGS
+
+ipod      |MA147LL/A      |1.0|HUMAN_JUDGEMENT
+ipod      |F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
+ipod      |IW-02          |0.5|HUMAN_JUDGEMENT
+ipod      |6H500F0        |0.0|HUMAN_JUDGEMENT
+```
+The columns in the example represent:
+
+  1. the user query;
+
+  2. a unique id for a document in the response;
+
+  3. a score representing the relevance of that document (not necessarily between zero and one);
+
+  4. the source, i.e., whether the training record was produced from interaction data (`CLICK_LOGS`) or from human judgements (`HUMAN_JUDGEMENT`).
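+
+As an illustration, here is a minimal Python sketch (hypothetical code, not
+part of the example scripts) that parses one record of this pipe-separated
+format, assuming exactly the four columns described above:
+
+```
+def parse_record(line):
+    """Split one pipe-separated training record into its four fields."""
+    query, doc_id, score, source = (field.strip() for field in line.split('|'))
+    return query, doc_id, float(score), source
+
+record = parse_record('ipod      |MA147LL/A      |1.0|HUMAN_JUDGEMENT')
+assert record == ('ipod', 'MA147LL/A', 1.0, 'HUMAN_JUDGEMENT')
+```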
+
+## How to produce training data
+
+You might collect data for use with your machine learning algorithm by relying on:
+
+  * **User interactions**: given a specific query, it is possible to log all the user interactions (e.g., clicks, shares on social networks, sends by email, etc.) and then use them as proxies for relevance;
+  * **Human judgements**: a training dataset is produced by explicitly asking some judges to evaluate the relevance of a document given the query.
+
+### How to prepare training data from interaction data
+
+There are many ways of preparing interaction data for training a model, and a complete review of the techniques is outside the scope of this readme. In the following we illustrate a simple way of obtaining training data from raw interaction data.
+
+Simple interaction data will be a log file generated by your application after it
+has talked to Solr. The log will contain two different types of record:
+
+  * **query**: when a user performs a query we have a record with `user-id, query, responses`,
+  where `responses` is the list of unique document ids returned for the query.
+
+**Example:**
+
+```
+diego, hard drive, [SP2514N,6H500F0,F8V7067-APL-KIT,IW-02]
+```
+
+  * **click**: when a user clicks on one of the returned documents we have a record with `user-id, query, document-id`
+
+**Example:**
+```
+christine, hard drive, SP2514N
+diego    , hard drive, SP2514N
+michael  , hard drive, SP2514N
+joshua   , hard drive, IW-02
+```
+
+Given a log composed of records like these, a simple way to produce a training dataset is to group the click records by query
+and assign each document a relevance score equal to the number of clicks it received:
+
+```
+hard drive|SP2514N        |3|CLICK_LOGS
+hard drive|IW-02          |1|CLICK_LOGS
+hard drive|6H500F0        |0|CLICK_LOGS
+hard drive|F8V7067-APL-KIT|0|CLICK_LOGS
+```
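+
+A minimal Python sketch of this grouping step (hypothetical code, not part of
+the example scripts; the log file names are assumptions), using the two record
+formats shown above:
+
+```
+from collections import Counter, defaultdict
+
+# query -> Counter mapping document-id to its click count
+clicks = defaultdict(Counter)
+with open('clicks.log') as f:
+    for line in f:
+        user, query, doc_id = (field.strip() for field in line.split(','))
+        clicks[query][doc_id] += 1
+
+# Emit one training record per document returned for each query,
+# scored by the number of clicks it received (zero if never clicked).
+with open('queries.log') as f:
+    for line in f:
+        user, query, responses = (field.strip() for field in line.split(',', 2))
+        for doc_id in responses.strip('[]').split(','):
+            print('%s|%s|%d|CLICK_LOGS' % (query, doc_id, clicks[query][doc_id]))
+```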
+
+This is a really trivial way to generate a training dataset, and in many settings
+it might not produce great results. Indeed, it is a well-known fact that
+clicks are *biased*: users tend to click on the first
+result proposed for a query, even if it is not relevant. Conversely, a click on a document in position
+five could be considered more important than a click on a document in position one, because
+the user made the effort to browse the results list down to position five.
+
+Some approaches also take into account the time spent on the clicked document (if the user
+spent only two seconds on the document and then clicked on other documents in the list,
+she probably did not find it relevant).
+
+There are many papers proposing techniques for removing the bias or for taking the click positions into account;
+a good survey is [Click Models for Web Search](http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf)
+by Chuklin, Markov and de Rijke.
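+
+As one concrete illustration of weighting clicks by position (an
+inverse-propensity sketch under an assumed 1/rank examination model, not a
+technique prescribed by this readme):
+
+```
+def click_weight(rank, examination_prob=lambda k: 1.0 / k):
+    """Weight a click by the inverse of the assumed probability that the
+    user examined that rank at all; under this model a click at rank 5
+    counts five times as much as a click at rank 1."""
+    return 1.0 / examination_prob(rank)
+
+assert click_weight(1) == 1.0
+assert click_weight(5) == 5.0
+```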
+
+### How to prepare training data from human judgements
+
+Another way to get training data is to ask human judges to label documents.
+Producing human judgements is in general more expensive, but the quality of the
+resulting dataset can be better than one produced from interaction data.
+It is worth noting that human judgements can also be collected through a
+crowdsourcing platform, which lets you show human workers documents associated with a
+query and get back relevance labels.
+Usually a human worker is shown a query together with a list of results, and the task
+consists of assigning a relevance label to each document (e.g., Perfect, Excellent, Good, Fair, Not relevant).
+Training data can then be obtained by translating the labels into numeric scores
+(e.g., Perfect = 4, Excellent = 3, Good = 2, Fair = 1, Not relevant = 0).
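+
+A tiny Python sketch of that translation (label names and scores as in the
+example above):
+
+```
+LABEL_SCORES = {'Perfect': 4, 'Excellent': 3, 'Good': 2,
+                'Fair': 1, 'Not relevant': 0}
+
+def judgement_to_score(label):
+    """Map a human relevance label to the numeric score used in training data."""
+    return LABEL_SCORES[label]
+
+assert judgement_to_score('Excellent') == 3
+```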
+
+

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/024c4031/solr/contrib/ltr/example/user_queries.txt
----------------------------------------------------------------------
diff --git a/solr/contrib/ltr/example/user_queries.txt b/solr/contrib/ltr/example/user_queries.txt
index a3a3455..5e820c7 100644
--- a/solr/contrib/ltr/example/user_queries.txt
+++ b/solr/contrib/ltr/example/user_queries.txt
@@ -1,8 +1,8 @@
-hard drive|SP2514N|0.6666666|CLICK_LOGS
-hard drive|6H500F0|0.330082034|CLICK_LOGS
+hard drive|SP2514N|0.6|CLICK_LOGS
+hard drive|6H500F0|0.3|CLICK_LOGS
 hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
 hard drive|IW-02|0.0|CLICK_LOGS
-ipod|MA147LL/A|1.0|EXPLICIT
-ipod|F8V7067-APL-KIT|0.25|EXPLICIT
-ipod|IW-02|0.25|EXPLICIT
-ipod|6H500F0|0.0|EXPLICIT
+ipod|MA147LL/A|1.0|HUMAN_JUDGEMENT
+ipod|F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
+ipod|IW-02|0.5|HUMAN_JUDGEMENT
+ipod|6H500F0|0.0|HUMAN_JUDGEMENT