Posted to commits@lucene.apache.org by cp...@apache.org on 2017/01/06 21:11:12 UTC

lucene-solr:branch_6x: SOLR-8542: expand 'Assemble training data' content in solr/contrib/ltr/README

Repository: lucene-solr
Updated Branches:
  refs/heads/branch_6x 8e974ecdc -> 88450c70b


SOLR-8542: expand 'Assemble training data' content in solr/contrib/ltr/README

(Diego Ceccarelli via Christine Poerschke in response to SOLR-9929 enquiry from Jeffery Yuan.)


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/88450c70
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/88450c70
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/88450c70

Branch: refs/heads/branch_6x
Commit: 88450c70bb4daa3ca6c4750581bddeaad9bea6f9
Parents: 8e974ec
Author: Christine Poerschke <cp...@apache.org>
Authored: Fri Jan 6 20:52:21 2017 +0000
Committer: Christine Poerschke <cp...@apache.org>
Committed: Fri Jan 6 21:10:45 2017 +0000

----------------------------------------------------------------------
 solr/contrib/ltr/example/README.md        | 118 ++++++++++++++++++++-----
 solr/contrib/ltr/example/user_queries.txt |  12 +--
 2 files changed, 101 insertions(+), 29 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/88450c70/solr/contrib/ltr/example/README.md
----------------------------------------------------------------------
diff --git a/solr/contrib/ltr/example/README.md b/solr/contrib/ltr/example/README.md
index 1363d5d..7842494 100644
--- a/solr/contrib/ltr/example/README.md
+++ b/solr/contrib/ltr/example/README.md
@@ -28,33 +28,105 @@ Please refer to the Solr Reference Guide's section on [Result Reranking](https:/
 
 4. Search and rerank the results using the trained model
 
-   http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=exampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
+```
+http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=exampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
+```
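+
+The same request can also be issued programmatically. Below is a minimal sketch in
+Python: it assumes Solr is running locally with the techproducts example, that the
+`exampleModel` model has already been uploaded, and that the `requests` library is
+installed.
+
+```
+# A minimal sketch: issue the rerank request shown above and print the
+# reranked documents. Assumes Solr at localhost:8983 with the techproducts
+# collection and an uploaded model named 'exampleModel'.
+import requests
+
+params = {
+    "q": "test",
+    "wt": "json",
+    "indent": "on",
+    "fl": "price,score,name",
+    "rq": "{!ltr model=exampleModel reRankDocs=25 efi.user_query='test'}",
+}
+resp = requests.get("http://localhost:8983/solr/techproducts/query", params=params)
+for doc in resp.json()["response"]["docs"]:
+    print(doc)
+```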
 
 # Assemble training data
 In order to train a learning to rank model you need training data. Training data is
-what "teaches" the model what the appropriate weight for each feature is. In general
+what *teaches* the model what the appropriate weight for each feature is. In general
 training data is a collection of queries with associated documents and what their ranking/score
 should be. As an example:
 ```
-hard drive|SP2514N|0.6666666|CLICK_LOGS
-hard drive|6H500F0|0.330082034|CLICK_LOGS
+hard drive|SP2514N        |0.6|CLICK_LOGS
+hard drive|6H500F0        |0.3|CLICK_LOGS
 hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
-hard drive|IW-02|0.0|CLICK_LOGS
-
-ipod|MA147LL/A|1.0|EXPLICIT
-ipod|F8V7067-APL-KIT|0.25|EXPLICIT
-ipod|IW-02|0.25|EXPLICIT
-ipod|6H500F0|0.0|EXPLICIT
-```
-In this example the first column indicates the query, the second column indicates a unique id for that doc,
-the third column indicates the relative importance or relevance of that doc, and the fourth column indicates the source.
-There are 2 primary ways you might collect data for use with your machine learning algorithim. The first
-is to collect the clicks of your users given a specific query. There are many ways of preparing this data
-to train a model (http://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf). The general idea
-is that if a user sees multiple documents and clicks the one lower down, that document should be scored higher
-than the one above it. The second way is explicitly through a crowdsourcing platform like Mechanical Turk or
-CrowdFlower. These platforms allow you to show human workers documents associated with a query and have them
-tell you what the correct ranking should be.
-
-At this point you'll need to collect feature vectors for each query document pair. You can use the information
-from the Extract features section above to do this. An example script has been included in example/train_and_upload_demo_model.py.
+hard drive|IW-02          |0.0|CLICK_LOGS
+
+ipod      |MA147LL/A      |1.0|HUMAN_JUDGEMENT
+ipod      |F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
+ipod      |IW-02          |0.5|HUMAN_JUDGEMENT
+ipod      |6H500F0        |0.0|HUMAN_JUDGEMENT
+```
+The columns in the example represent:
+
+  1. the user query;
+
+  2. a unique id for a document in the response;
+
+  3. a score representing the relevance of that document (not necessarily between zero and one);
+
+  4. the source, i.e., whether the training record was produced from interaction data (`CLICK_LOGS`) or from human judgements (`HUMAN_JUDGEMENT`).
+
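+For illustration, here is a minimal Python sketch that parses records in this format.
+It only assumes the pipe-separated layout described above and reads the
+`user_queries.txt` file shipped in this directory.
+
+```
+# A minimal sketch: parse query|doc-id|score|source training records.
+with open("user_queries.txt") as f:
+    for line in f:
+        line = line.strip()
+        if not line:
+            continue
+        query, doc_id, score, source = (field.strip() for field in line.split("|"))
+        print(query, doc_id, float(score), source)
+```
+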
+## How to produce training data
+
+You might collect data for use with your machine learning algorithm by relying on:
+
+  * **User Interactions**: given a specific query, it is possible to log all the user interactions (e.g., clicks, shares on social networks, sends by email, etc.), and then use them as proxies for relevance;
+  * **Human Judgements**: a training dataset is produced by explicitly asking judges to evaluate the relevance of a document for a given query.
+
+### How to prepare training data from interaction data
+
+There are many ways of preparing interaction data for training a model, and it is outside the scope of this readme to provide a complete review of all the techniques. In the following we illustrate a simple way to obtain training data from simple interaction data.
+
+Simple interaction data will be a log file generated by your application after it
+has talked to Solr. The log will contain two different types of records:
+
+  * **query**: when a user performs a query, we have a record with `user-id, query, responses`,
+  where `responses` is the list of unique document ids returned for that query.
+
+**Example:**
+
+```
+diego, hard drive, [SP2514N,6H500F0,F8V7067-APL-KIT,IW-02]
+```
+
+  * **click**: when a user clicks on one of the returned documents, we have a record with `user-id, query, document-id`
+
+**Example:**
+```
+christine, hard drive, SP2514N
+diego    , hard drive, SP2514N
+michael  , hard drive, SP2514N
+joshua   , hard drive, IW-02
+```
+
+Given a log composed of records like these, a simple way to produce a training dataset is to group on the query field
+and then assign to each document a relevance score equal to its number of clicks:
+
+```
+hard drive|SP2514N        |3|CLICK_LOGS
+hard drive|IW-02          |1|CLICK_LOGS
+hard drive|6H500F0        |0|CLICK_LOGS
+hard drive|F8V7067-APL-KIT|0|CLICK_LOGS
+```
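+
+The sketch below shows one way to compute these counts in Python. It is only an
+illustration: the click log file name (`click_log.txt`) is hypothetical, the code
+expects the comma-separated click records shown above, and it ignores the query
+records (documents with zero clicks would have to be added from those).
+
+```
+# A minimal sketch: count clicks per (query, document) pair from a click log in
+# the 'user-id, query, document-id' format shown above, and write training
+# records in the user_queries.txt format. 'click_log.txt' is a hypothetical name.
+from collections import Counter
+
+clicks = Counter()
+with open("click_log.txt") as f:
+    for line in f:
+        if not line.strip():
+            continue
+        user_id, query, doc_id = (field.strip() for field in line.split(","))
+        clicks[(query, doc_id)] += 1
+
+with open("user_queries.txt", "w") as out:
+    for (query, doc_id), count in clicks.items():
+        out.write("%s|%s|%d|CLICK_LOGS\n" % (query, doc_id, count))
+```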
+
+This is a really trivial way to generate a training dataset, and in many settings
+it might not produce great results. Indeed, it is a well known fact that
+clicks are *biased*: users tend to click on the first
+result proposed for a query, even if it is not relevant. A click on a document in position
+five could be considered more important than a click on a document in position one, because
+the user took the effort to browse the results list down to position five.
+
+Some approaches also take into account the time spent on the clicked document (if the user
+spent only two seconds on the document and then clicked on other documents in the list,
+that document was probably not relevant to her).
+
+There are many papers proposing techniques for removing the bias or for taking
+the click positions into account; a good survey is
+[Click Models for Web Search](http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf)
+by Chuklin, Markov and de Rijke.
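+
+As a purely illustrative sketch (this is not one of the debiasing techniques from the
+literature above), clicks could for instance be weighted by the rank of the clicked
+document, so that deeper clicks contribute more:
+
+```
+# A purely illustrative sketch: weight each click by the (1-based) rank of the
+# clicked document, so clicks deeper in the result list count more. The
+# click_events tuples are hypothetical (query, document-id, rank) data.
+from collections import defaultdict
+
+scores = defaultdict(float)
+click_events = [
+    ("hard drive", "SP2514N", 1),
+    ("hard drive", "IW-02", 5),
+]
+for query, doc_id, rank in click_events:
+    scores[(query, doc_id)] += rank
+
+for (query, doc_id), score in scores.items():
+    print("%s|%s|%.1f|CLICK_LOGS" % (query, doc_id, score))
+```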
+
+### How to prepare training data from human judgements
+
+Another way to get training data is to ask human judges to label the documents.
+Producing human judgements is in general more expensive, but the quality of the
+resulting dataset can be better than one produced from interaction data.
+It is worth noting that human judgements can also be collected through a
+crowdsourcing platform, which allows you to show human workers documents associated with a
+query and to get back relevance labels.
+Usually a human worker is shown a query together with a list of results, and the task
+consists of assigning a relevance label to each document (e.g., Perfect, Excellent, Good, Fair, Not relevant).
+Training data can then be obtained by translating the labels into numeric scores
+(e.g., Perfect = 4, Excellent = 3, Good = 2, Fair = 1, Not relevant = 0).
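+
+For example, a minimal sketch of that translation (the `judgements` tuples are
+hypothetical data; the label-to-score mapping is the one given above):
+
+```
+# A minimal sketch: translate human judgement labels into numeric scores using
+# the mapping above and emit training records. The judgements list is
+# hypothetical example data.
+LABEL_TO_SCORE = {
+    "Perfect": 4,
+    "Excellent": 3,
+    "Good": 2,
+    "Fair": 1,
+    "Not relevant": 0,
+}
+
+judgements = [
+    ("ipod", "MA147LL/A", "Perfect"),
+    ("ipod", "6H500F0", "Not relevant"),
+]
+for query, doc_id, label in judgements:
+    print("%s|%s|%d|HUMAN_JUDGEMENT" % (query, doc_id, LABEL_TO_SCORE[label]))
+```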
+
+

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/88450c70/solr/contrib/ltr/example/user_queries.txt
----------------------------------------------------------------------
diff --git a/solr/contrib/ltr/example/user_queries.txt b/solr/contrib/ltr/example/user_queries.txt
index a3a3455..5e820c7 100644
--- a/solr/contrib/ltr/example/user_queries.txt
+++ b/solr/contrib/ltr/example/user_queries.txt
@@ -1,8 +1,8 @@
-hard drive|SP2514N|0.6666666|CLICK_LOGS
-hard drive|6H500F0|0.330082034|CLICK_LOGS
+hard drive|SP2514N|0.6|CLICK_LOGS
+hard drive|6H500F0|0.3|CLICK_LOGS
 hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
 hard drive|IW-02|0.0|CLICK_LOGS
-ipod|MA147LL/A|1.0|EXPLICIT
-ipod|F8V7067-APL-KIT|0.25|EXPLICIT
-ipod|IW-02|0.25|EXPLICIT
-ipod|6H500F0|0.0|EXPLICIT
+ipod|MA147LL/A|1.0|HUMAN_JUDGEMENT
+ipod|F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
+ipod|IW-02|0.5|HUMAN_JUDGEMENT
+ipod|6H500F0|0.0|HUMAN_JUDGEMENT