Posted to dev@mahout.apache.org by Pat Ferrel <pa...@gmail.com> on 2013/11/06 19:13:01 UTC

Solr-recommender for Mahout 0.9

Trying to integrate the Solr-recommender with the latest Mahout snapshot. The project uses a modified RecommenderJob because it needs SequenceFile output and needs to find the location of the preparePreferenceMatrix directory. If #1 and #2 are addressed I can remove the modified Mahout code from the project and rely on the default implementations in Mahout 0.9. #3 is a longer-term issue related to the creation of a CrossRowSimilarityJob.

I have dropped the modified code from the Solr-recommender project and have a modified build of the current Mahout 0.9 snapshot. If the following changes are made to Mahout I can test and release a Mahout 0.9 version of the Solr-recommender.

1. Option to change RecommenderJob output format

Can someone add an option to output a SequenceFile? I modified the code to do the following; note the SequenceFileOutputFormat.class as the last parameter, but this should really be selectable with an option, I think.

      Job aggregateAndRecommend = prepareJob(
              new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
              PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
              AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
              SequenceFileOutputFormat.class);
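
If such an option were added, the wiring could look like the following stdlib-only sketch. The flag name --outputFormat and its accepted values are assumptions for illustration, not existing Mahout options; a real patch would pass the resolved class to prepareJob.

```java
import java.util.Map;

public class OutputFormatOption {
    // Hypothetical sketch: resolve a command-line flag to the output format
    // class that would be passed as the last argument of prepareJob().
    // "--outputFormat" and its values are assumed names, not real Mahout flags.
    static String resolve(Map<String, String> parsedArgs) {
        String requested = parsedArgs.getOrDefault("--outputFormat", "text");
        switch (requested) {
            case "sequencefile":
                return "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat";
            case "text":
                return "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat";
            default:
                throw new IllegalArgumentException("unknown output format: " + requested);
        }
    }

    public static void main(String[] args) {
        // Prints the fully qualified class name the job would use.
        System.out.println(resolve(Map.of("--outputFormat", "sequencefile")));
    }
}
```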

2. Visibility of preparePreferenceMatrix directory location

The Solr-recommender needs to find where the RecommenderJob is putting its output.

Mahout 0.8 RecommenderJob code was:
    public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";

Mahout 0.9 RecommenderJob code just puts “preparePreferenceMatrix” inline in the code:
    Path prepPath = getTempPath("preparePreferenceMatrix");

This change to Mahout 0.9 works:
    public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
and
    Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);

You could also make this a getter method on the RecommenderJob Class instead of using a public constant.
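
A minimal sketch of the getter variant; the method name getPrepareDirectory is an assumption, not committed Mahout API.

```java
public class RecommenderJobSketch {
    // Restores the 0.8-era constant, but keeps it private so the directory
    // name stays an implementation detail of the job.
    private static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";

    // Hypothetical accessor a caller like the Solr-recommender could use
    // instead of depending on a public constant.
    public String getPrepareDirectory() {
        return DEFAULT_PREPARE_DIR;
    }

    public static void main(String[] args) {
        System.out.println(new RecommenderJobSketch().getPrepareDirectory());
    }
}
```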

3. Downsampling

The downsampling for maximum prefs per user has been moved from PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses matrix math instead of RowSimilarityJob, so it will no longer support downsampling until there is a hypothetical CrossRowSimilarityJob with downsampling in it.
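
For context, the downsampling being discussed caps how many preferences a single user contributes before the similarity computation. A stdlib-only sketch of the idea follows; a real implementation would sample interactions rather than truncate.

```java
import java.util.ArrayList;
import java.util.List;

public class DownsampleSketch {
    // Cap the number of preferences one user contributes. The real jobs
    // sample; truncation is used here only to keep the sketch short.
    static List<Long> capPrefsPerUser(List<Long> itemIds, int maxPrefsPerUser) {
        if (itemIds.size() <= maxPrefsPerUser) {
            return new ArrayList<>(itemIds);
        }
        return new ArrayList<>(itemIds.subList(0, maxPrefsPerUser));
    }

    public static void main(String[] args) {
        List<Long> prefs = List.of(1L, 2L, 3L, 4L, 5L);
        System.out.println(capPrefsPerUser(prefs, 3)); // [1, 2, 3]
    }
}
```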


Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
Yes, you are correct, but my integration framework treats non-text fields as scalars, so it is easier to neuter text than to implement full-text searching on strings. I would do what you suggest if I were using raw Solr. My understanding was that string also does not get TF-IDF applied, which is not what I had intended.

 
On Nov 7, 2013, at 9:57 AM, Andrew Psaltis <An...@Webtrends.com> wrote:

Pat,
Perhaps I am missing something here, but why not use a String field if you
do not need any of the analysis? It seems from your previous email ("The
query is a simple text query made of space delimited video id strings")
that you basically have a keyword-style query, which would fit better with
a String field than a neutered Text field.

Thanks,
Andrew







Re: Solr-recommender for Mahout 0.9

Posted by Andrew Psaltis <An...@Webtrends.com>.
Pat,
Perhaps I am missing something here, but why not use a String field if you
do not need any of the analysis? It seems from your previous email ("The
query is a simple text query made of space delimited video id strings")
that you basically have a keyword-style query, which would fit better with
a String field than a neutered Text field.

Thanks,
Andrew







Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
One difference is that a “text” field has analyzers like Porter stemming applied. I had to take these out of the schema.xml. I think TFIDF is also applied to the terms in “text” but may not be to MV fields. I think TFIDF is good in the application. The idea is that if everyone likes a movie, it isn’t much of a differentiator. Also, changing to MV fields is simply applying a different type to the field in the schema, I think, so it is trivial to try out.

At this point the only test is an eyeball test so measuring differences is problematic. If anyone has intuition fire away.

On Nov 7, 2013, at 9:23 AM, Dominik Hübner <co...@dhuebner.com> wrote:

Does anyone know what the difference is between keeping the ids in a space delimited string and indexing a multivalued field of ids? I recently tried the latter since ... it felt right, however I am not sure which of both has which advantages.

On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:

> I have dismax (not edismax) but am not using it yet; I'm using the default query, which does use ‘AND’. I had much the same thought as I slept on it. Changing to OR is now working much, much better. So obvious it almost bit me; not good in this case...
> 
> With only a trivially small amount of testing I’d say we have a new recommender on the block.
> 
> If anyone would like to help eyeball test the thing let me know off-list. There are a few instructions I’ll need to give. And it can’t handle much load right now due to intentional design limits.
> 
> 
> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com> wrote:
> 
> Pat,
> 
> Can you give us the query it generates when you enter "vampire werewolf zombie", q/qt/defType ?
> 
> My guess is you're using the default query parser with "q.op=AND" , or, you're using dismax/edismax with a high "mm" (min-must-match) value.
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311
> 
> 
> -----Original Message-----
> From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
> Sent: Wednesday, November 06, 2013 5:53 PM
> To: ssc@apache.org Schelter; user@mahout.apache.org
> Subject: Re: Solr-recommender for Mahout 0.9
> 
> Done,
> 
> BTW I have the thing running on a demo site but am getting very poor results that I think are related to the Solr setup. I'd appreciate any ideas.
> 
> The sample data has 27,000 items and something like 4000 users. The preference data is fairly dense since the users are professional reviewers and the items are videos.
> 
> 1) The number of item-item similarities that are kept is 100. Is this a good starting point? Ted, do you recall how many you used before?
> 2) The query is a simple text query made of space delimited video id strings. These are the same ids as are stored in the item-item similarity docs that Solr indexes.
> 
> Hit thumbs up on one video and you get several recommendations. Hit thumbs up on several videos and you get no recs. I'm either using the wrong query type or have it set up to be too restrictive. As I read through the docs, if someone has a suggestion or pointer I'd appreciate it.
> 
> BTW the same sort of thing happens with Title search. Search for "vampire werewolf zombie" you get no results, search for "zombie" you get several.
> 
> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
> Hi Pat,
> 
> can you create issues for 1) and 2) ? Then I will try to get this into
> trunk asap.
> 
> Best,
> Sebastian
> 
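
The query fix discussed in this exchange, switching from the AND default to OR so that matching any one history id is enough, amounts to building a disjunction over the ids. A stdlib-only sketch follows; the field name similar_items is an assumption for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class OrQuerySketch {
    // Join the user's history ids into an explicit OR query. With the AND
    // default, a multi-id history matched nothing, as described in the thread.
    static String buildQuery(String field, List<String> itemIds) {
        return itemIds.stream()
                .map(id -> field + ":" + id)
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        // similar_items is a hypothetical field holding an item's similar-item ids.
        System.out.println(buildQuery("similar_items", List.of("v101", "v202", "v303")));
        // similar_items:v101 OR similar_items:v202 OR similar_items:v303
    }
}
```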



Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
I'm not including Solr in the project, so it should work with any recent version. I’m actually using 4.2.

On Nov 8, 2013, at 9:24 AM, Suneel Marthi <su...@yahoo.com> wrote:

Pat,

Would all of this be on Solr/Lucene 4.5.1?

Mahout is presently at Lucene 4.3.1 (as of 0.8), but we should be moving to the latest stable release at the time of the 0.9 release.







On Thursday, November 7, 2013 10:33 PM, Pat Ferrel <pa...@gmail.com> wrote:

Another approach would be to weight the terms in the docs by their Mahout similarity strength. But that will be for another day. 

My current question is whether Lucene looks at word proximity. I see the query syntax supports proximity, but I don’t see that it is the default, so that’s good.



On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com> wrote:

Best to my knowledge, Lucene does not care about the position of a keyword within a document.

You could bucket the ids into several fields.  Then use a dismax query to boost the top-tier ids more than the second, etc.

A more fine-grained approach would probably involve a custom Similarity class that scales the score based on a term's position in the document.  If we did this, it might be simpler to index as one single-valued field so the ids were position+1 apart rather than position+100, etc.

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
Sent: Thursday, November 07, 2013 1:46 PM
To: user@mahout.apache.org
Subject: Re: Solr-recommender for Mahout 0.9

Interesting to think about ordering and adjacency. The indexed ids are sorted by Mahout strength, so the first id is the most similar to the row key, and so forth. But the query is ordered by recency. In both cases the first id is in some sense the most important. Does Solr/Lucene care about closeness to the top of the doc for queries or indexed docs? I don't recall any mention of this.

However, adjacency has no meaning in recommendations, though I think it's used in default queries, so I may have to account for that.

The object returned is an ordered list of ids. I use only the IDs now but there are cases when the contents are also of interest; shopping cart/watchlist queries for example.

On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com> wrote:

The multivalued field will obey the "positionIncrementGap" value you specify (default=100).  So for querying purposes, those ids will be 100 (or whatever you specified) positions apart.  So a phrase search for adjacent ids would not match unless you set the slop to >= positionIncrementGap.  Other than this, both scenarios index the same.

For stored fields, Solr returns an array of values for multivalued fields, which is convenient when writing a UI.

James Dyer
Ingram Content Group
(615) 213-4311
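
The positionIncrementGap behavior described above can be sketched in plain Java: each new value of a multivalued field starts gap positions after the last token of the previous value, which is why a phrase query cannot span values unless its slop reaches the gap. The bookkeeping below is a simplification of what Lucene actually stores, for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionGapSketch {
    // Assign token positions roughly the way a multivalued field does:
    // the first token of each new value lands `gap` positions after the
    // last token of the previous value (Solr's default gap is 100).
    static Map<String, Integer> assignPositions(List<String> values, int gap) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int pos = 0;
        for (String value : values) {
            for (String token : value.split("\\s+")) {
                positions.put(token, pos++);
            }
            pos += gap - 1; // jump so the next value starts `gap` after the last token
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(assignPositions(List.of("id1", "id2"), 100));
        // {id1=0, id2=100}
    }
}
```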




Re: Solr-recommender for Mahout 0.9

Posted by Suneel Marthi <su...@yahoo.com>.
Pat,

Would all of this be on Solr/Lucene 4.5.1?

Mahout is presently at Lucene 4.3.1 (as of 0.8), but we should be moving to the latest stable release at the time of the 0.9 release.







On Thursday, November 7, 2013 10:33 PM, Pat Ferrel <pa...@gmail.com> wrote:
 
Another approach would be to weight the terms in the docs by there Mahout similarity strength. But that will be for another day. 

My current question is whether Lucene looks at word proximity. I see the query syntax supports proximity but I don’t see that it is default so that’s good.



On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com> wrote:

Best to my knowledge, Lucene does not care about the position of a keyword within a document.

You could bucket the ids into several fields.  Then use a dismax query to boost the top-tier ids more than then second, etc.

A more fine-grained approach would probably involve a custom Similarity class that scales the score based on its position in the document.  If we did this, it might be simpler to index as 1 single-valued field so each id was position+1 rather than position+100, etc.

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
Sent: Thursday, November 07, 2013 1:46 PM
To: user@mahout.apache.org
Subject: Re: Solr-recommender for Mahout 0.9

Interesting to think about ordering and adjacentness. The index ids are sorted by Mahout strength so the first id is the most similar to the row key and so forth. But the query is ordered buy recency. In both cases the first id is in some sense the most important. Does Solr/Lucene care about closeness to the top of doc for queries or indexed docs? I don't recall any mention of this.

However adjacentness has no meaning in recommendations though I think it's used in default queries so I may have to account for that.

The object returned is an ordered list of ids. I use only the IDs now but there are cases when the contents are also of interest; shopping cart/watchlist queries for example.

On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com> wrote:

The multivalued field will obey the "positionIncrementGap" value you specify (default=100).  So for querying purposes, those id's will be 100 (or whatever you specified) positions apart.  So a phrase search for adjacent ids would not match, unless you set the slop for >= positionIncrementGap.  Other than this, both scenarios index the same.

For stored fields, solr returns an array of values for multivalued fields, which is convenient when writing a UI.
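The position bookkeeping James describes can be sketched with a simplified model: tokens within one value of a multivalued field are adjacent, and each new value starts positionIncrementGap beyond the previous value's last token (this glosses over Lucene's actual analysis chain):

```python
def token_positions(values, gap=100):
    """Assign token positions for a multivalued field under a simplified
    positionIncrementGap model: +1 within a value, +gap when crossing
    into the next value. Returns token -> list of positions."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        for j, token in enumerate(value.split()):
            # the gap applies only to the first token of a later value
            inc = gap if (i > 0 and j == 0) else 1
            pos += inc
            positions.setdefault(token, []).append(pos)
    return positions
```

This is why a phrase query across two values needs slop >= the gap: the ids end up 100 positions apart even though they look adjacent in the stored document.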

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Dominik Hübner [mailto:contact@dhuebner.com] 
Sent: Thursday, November 07, 2013 11:23 AM
To: user@mahout.apache.org
Subject: Re: Solr-recommender for Mahout 0.9

Does anyone know what the difference is between keeping the ids in a space delimited string and indexing a multivalued field of ids? I recently tried the latter since ... it felt right, however I am not sure which of both has which advantages.

On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:

> I have dismax (no edismax) but am not using it yet, using the default query, which does use 'AND'. I had much the same thought as I slept on it. Changing to OR is now working much much better. So obvious it almost bit me, not good in this case...
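The AND-to-OR fix Pat describes amounts to joining the user's history ids with explicit ORs, so any overlap with an indexed similarity doc can match. A sketch (the field name is made up for illustration):

```python
def build_recs_query(history_ids, field="item_links"):
    """Join a user's item-history ids with explicit ORs. Under an
    implicit q.op=AND every id would have to match, which usually
    returns nothing; OR lets partial overlap score and rank.
    The field name is hypothetical."""
    clauses = ["{}:{}".format(field, item_id) for item_id in history_ids]
    return " OR ".join(clauses)
```

Solr's relevance scoring then does the ranking: docs sharing more ids with the history score higher.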
> 
> With only a trivially small amount of testing I'd say we have a new recommender on the block.
> 
> If anyone would like to help eyeball test the thing let me know off-list. There are a few instructions I'll need to give. And it can't handle much load right now due to intentional design limits.
> 
> 
> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com> wrote:
> 
> Pat,
> 
> Can you give us the query it generates when you enter "vampire werewolf zombie", q/qt/defType ?
> 
My guess is you're using the default query parser with "q.op=AND", or you're using dismax/edismax with a high "mm" (minimum-should-match) value.
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311
> 
> 
> -----Original Message-----
> From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
> Sent: Wednesday, November 06, 2013 5:53 PM
> To: ssc@apache.org Schelter; user@mahout.apache.org
> Subject: Re: Solr-recommender for Mahout 0.9
> 
> Done,
> 
> BTW I have the thing running on a demo site but am getting very poor results that I think are related to the Solr setup. I'd appreciate any ideas.
> 
> The sample data has 27,000 items and something like 4000 users. The preference data is fairly dense since the users are professional reviewers and the items videos.
> 
> 1) The number of item-item similarities that are kept is 100. Is this a good starting point? Ted, do you recall how many you used before?
> 2) The query is a simple text query made of space delimited video id strings. These are the same ids as are stored in the item-item similarity docs that Solr indexes.
> 
> Hit thumbs up on one video and you get several recommendations. Hit thumbs up on several videos and you get no recs. I'm either using the wrong query type or have it set up to be too restrictive. As I read through the docs, if someone has a suggestion or pointer I'd appreciate it. 
> 
> BTW the same sort of thing happens with Title search. Search for "vampire werewolf zombie" you get no results, search for "zombie" you get several.
> 
> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
> Hi Pat,
> 
> can you create issues for 1) and 2) ? Then I will try to get this into
> trunk asap.
> 
> Best,
> Sebastian
> 
> On 06.11.2013 19:13, Pat Ferrel wrote:
>> Trying to integrate the Solr-recommender with the latest Mahout snapshot. The project uses a modified RecommenderJob because it needs SequenceFile output and to get the location of the preparePreferenceMatrix directory. If #1 and #2 are addressed I can remove the modified Mahout code from the project and rely on the default implementations in Mahout 0.9. #3 is a longer term issue related to the creation of a CrossRowSimilarityJob. 
>> 
>> I have dropped the modified code from the Solr-recommender project and have a modified build of the current Mahout 0.9 snapshot. If the following changes are made to Mahout I can test and release a Mahout 0.9 version of the Solr-recommender.
>> 
>> 1. Option to change RecommenderJob output format
>> 
>> Can someone add an option to output a SequenceFile. I modified the code to do the following, note the SequenceFileOutputFormat.class as the last parameter but this should really be determined with an option I think.
>> 
>>  Job aggregateAndRecommend = prepareJob(
>>          new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>>          PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>>          AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>>          SequenceFileOutputFormat.class);
>> 
>> 2. Visibility of preparePreferenceMatrix directory location
>> 
>> The Solr-recommender needs to find where the RecommenderJob is putting it's output. 
>> 
>> Mahout 0.8 RecommenderJob code was:
>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>> 
>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix" inline in the code:
>> Path prepPath = getTempPath("preparePreferenceMatrix");
>> 
>> This change to Mahout 0.9 works:
>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>> and
>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>> 
>> You could also make this a getter method on the RecommenderJob Class instead of using a public constant.
>> 
>> 3. Downsampling
>> 
>> The downsampling for maximum prefs per user has been moved from PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses matrix math instead of RSJ so it will no longer support downsampling until there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>> 
>> 
> 
> 
> 
> 
> 

Re: Solr-recommender for Mahout 0.9

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Feb 22, 2014 at 4:50 PM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

> *Ted*, if you have any code you could donate for this example from your and
> Ellen's book I'd love to be able to re-use it.
>

I do.  I will try to open up access to that sometime today.

Pat's work on the cross recommender job is also important.

Re: Solr-recommender for Mahout 0.9

Posted by Andrew Musselman <an...@gmail.com>.
*Pat*, I opened a ticket (M-1420) for putting a new script in examples/ that
uses the solr-recommender.  Seems there's another, related ticket from
Suneel in M-1288.

Did the work described in the thread below make it into 0.9, and/or how
much more is needed on it?

*Ted*, if you have any code you could donate for this example from your and
Ellen's book I'd love to be able to re-use it.

Thanks
Andrew

On Sun, Nov 17, 2013 at 3:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Eventually I'd like to get MAP built into the solr-recommender. Used it at
> a client who had good data. It was very helpful for exploring what data was
> useful and what wasn't. We'd run MAP with and without detail-view data for
> instance and take the MAP as a measure of how predictive the data was. In
> our case the MAP@ numbers went down with purchase and detail-view mixed
> together. That was why I got interested in the cross-action recommender--as
> a way to scrub less predictive actions. Didn't finish it before I lost
> access to the data unfortunately.
>
> What form of precision calc will you use? Obviously we used mean average
> precision at different numbers of recommendations, which had the effect of
> producing a fall-off curve. The curve we took as a measure of how well
> our ranking was working.
>
> On Nov 17, 2013, at 10:47 AM, Ken Krugler <kk...@transpac.com>
> wrote:
>
> Hi Pat,
>
> On Nov 13, 2013, at 4:43pm, Pat Ferrel <pa...@gmail.com> wrote:
>
> > Ever done an offline precision calc?
>
> No, sorry.
>
> I do (finally) have one client with some data that could be used to
> calculate precision, and a willingness to pay for the work, so I'm hoping
> to include details on that in my next blog post about text feature
> selection.
>
> -- Ken
>
>
> >> On Nov 13, 2013, at 1:39 PM, Ken Krugler <kk...@transpac.com>
> wrote:
> >>
> >> Hi Pat,
> >>
> >>> On Nov 13, 2013, at 9:21am, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>>
> >>> A version is now checked in that uses mahout 0.9. Haven't tested it on
> a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1,
> which takes some time.
> >>>
> >>> Saw the Strata slides from Ted touting dithering of results, which
> I'll implement.
> >>>
> >>> Ken, did you have anything specific for "And usually I just use Solr
> to generate a candidate list, then I do more specific scoring to find the N
> best from N*4 candidates”?
> >>
> >> If I'm looking for the top N best matches, I'll do a Solr query with
> rows=N*4.
> >>
> >> Then I use all of the data from these potential matches, and calculate
> a more sophisticated similarity score (e.g. adding a weighting based on the
> user's activity level) between my target and these candidates.
> >>
> >> Regards,
> >>
> >> -- Ken
> >>
> >>>
> >>> Was planning to try boosting by something like genre/category in the
> recs query. For instance, in the demo data, each item will soon have a set
> of tags (actually genre names) so these could be a field being queried
> along with the item-item links. The query for recs would then include the
> user history against the item-item links, and the average genre tags
> preferred by the user against item genre tags. This would return recs
> skewed towards the user's genre preference.
> >>>
> >>> Another way this could be used is when showing similar items. You'd
> have the tags for the item being viewed and so could use them to skew
> towards items with similar tags. I think this works but would turn similar
> items from a lookup (they are pre-calculated by Mahout) into another Solr
> query.
> >>>
> >>>
> >>>
> >>> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>>
> >>> Not planning to do anything with weights at present. An ORed query
> should suffice for the time being and Solr weights. There is a good list
> of ways to do this later if it warrants an experiment. Thanks.
> >>>
> >>> Have similar items as input, recommendations from user "likes", and
> just got recs from recently viewed working. Once you have online recs from
> the pre-calculated model experimenting is super easy. The next step will be
> to get more metadata ingested so we can try boosting by context genre, or
> recent genre viewed, which is sort of in line with "more specific scoring
> to find the N best from N*4 candidates". Also want to do what Ted calls
> dithering to vary the choices you see.
> >>>
> >>> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kk...@transpac.com>
> wrote:
> >>>
> >>> One other thing I should have mentioned is that if you care about
> setting weights on incoming terms, you can boost them using the ^<value>
> syntax.
> >>>
> >>> E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0..."
> >>>
> >>> If you want to account for weights of terms in the index, it's a bit
> harder. You can do simple boosting by replicating terms, or you can use
> payload-based boosting, or you could code up your own Similarity class that
> takes advantage of side-channel data.
> >>>
> >>> But in my experience the gain from applying weights to terms in the
> index isn't very significant.
> >>>
> >>> And usually I just use Solr to generate a candidate list, then I do more
> >>> specific scoring to find the N best from N*4 candidates.
> >>>
> >>> -- Ken
> >>>
> >>>> On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:
> >>>>
> >>>> For recommendation work, I suggest that it would be better to simply
> code
> >>>> out an explicit OR query.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <
> kkrugler_lists@transpac.com>wrote:
> >>>>
> >>>>> Hi Pat,
> >>>>>
> >>>>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
> >>>>>>
> >>>>>> Another approach would be to weight the terms in the docs by their
> >>>>> Mahout similarity strength. But that will be for another day.
> >>>>>>
> >>>>>> My current question is whether Lucene looks at word proximity. I
> see the
> >>>>> query syntax supports proximity but I don't see that it is default so
> >>>>> that's good.
> >>>>>
> >>>>> Based on your description of what you do (generate an OR query of N
> terms)
> >>>>> then no, you shouldn't be getting a boost from proximity.
> >>>>>
> >>>>> Note that with edismax you can specify a phrase boost, but it will
> be on
> >>>>> the entire set of terms being searched, so unlikely to come into
> play even
> >>>>> if you were using that.
> >>>>>
> >>>>> -- Ken
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <
> James.Dyer@ingramcontent.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Best to my knowledge, Lucene does not care about the position of a
> >>>>> keyword within a document.
> >>>>>>
> >>>>>> You could bucket the ids into several fields.  Then use a dismax
> query
> >>>>> to boost the top-tier ids more than the second, etc.
> >>>>>>
> >>>>>> A more fine-grained approach would probably involve a custom
> Similarity
> >>>>> class that scales the score based on its position in the document.
>  If we
> >>>>> did this, it might be simpler to index as 1 single-valued field so
> each id
> >>>>> was position+1 rather than position+100, etc.
> >>>>>>
> >>>>>> James Dyer
> >>>>>> Ingram Content Group
> >>>>>> (615) 213-4311
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> >>>>>> Sent: Thursday, November 07, 2013 1:46 PM
> >>>>>> To: user@mahout.apache.org
> >>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>
> >>>>>> Interesting to think about ordering and adjacentness. The index ids
> are
> >>>>> sorted by Mahout strength so the first id is the most similar to the
> row
> >>>>> key and so forth. But the query is ordered by recency. In both
> cases the
> >>>>> first id is in some sense the most important. Does Solr/Lucene care
> about
> >>>>> closeness to the top of doc for queries or indexed docs? I don't
> recall any
> >>>>> mention of this.
> >>>>>>
> >>>>>> However adjacentness has no meaning in recommendations though I
> think
> >>>>> it's used in default queries so I may have to account for that.
> >>>>>>
> >>>>>> The object returned is an ordered list of ids. I use only the IDs
> now
> >>>>> but there are cases when the contents are also of interest; shopping
> >>>>> cart/watchlist queries for example.
> >>>>>>
> >>>>>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <
> James.Dyer@ingramcontent.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> The multivalued field will obey the "positionIncrementGap" value you
> >>>>> specify (default=100).  So for querying purposes, those id's will be
> 100
> >>>>> (or whatever you specified) positions apart.  So a phrase search for
> >>>>> adjacent ids would not match, unless you set the slop for >=
> >>>>> positionIncrementGap.  Other than this, both scenarios index the
> same.
> >>>>>>
> >>>>>> For stored fields, solr returns an array of values for multivalued
> >>>>> fields, which is convenient when writing a UI.
> >>>>>>
> >>>>>> James Dyer
> >>>>>> Ingram Content Group
> >>>>>> (615) 213-4311
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
> >>>>>> Sent: Thursday, November 07, 2013 11:23 AM
> >>>>>> To: user@mahout.apache.org
> >>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>
> >>>>>> Does anyone know what the difference is between keeping the ids in a
> >>>>> space delimited string and indexing a multivalued field of ids? I
> recently
> >>>>> tried the latter since ... it felt right, however I am not sure
> which of
> >>>>> both has which advantages.
> >>>>>>
> >>>>>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I have dismax (no edismax) but am not using it yet, using the
> default
> >>>>> query, which does use 'AND'. I had much the same thought as I slept
> on it.
> >>>>> Changing to OR is now working much much better. So obvious it almost
> bit
> >>>>> me, not good in this case...
> >>>>>>>
> >>>>>>> With only a trivially small amount of testing I'd say we have a new
> >>>>> recommender on the block.
> >>>>>>>
> >>>>>>> If anyone would like to help eyeball test the thing let me know
> >>>>> off-list. There are a few instructions I'll need to give. And it
> can't
> >>>>> handle much load right now due to intentional design limits.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <
> James.Dyer@ingramcontent.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> Pat,
> >>>>>>>
> >>>>>>> Can you give us the query it generates when you enter "vampire
> werewolf
> >>>>> zombie", q/qt/defType ?
> >>>>>>>
> >>>>>>> My guess is you're using the default query parser with "q.op=AND"
> , or,
> >>>>> you're using dismax/edismax with a high "mm" (minimum-should-match) value.
> >>>>>>>
> >>>>>>> James Dyer
> >>>>>>> Ingram Content Group
> >>>>>>> (615) 213-4311
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> >>>>>>> Sent: Wednesday, November 06, 2013 5:53 PM
> >>>>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
> >>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>>
> >>>>>>> Done,
> >>>>>>>
> >>>>>>> BTW I have the thing running on a demo site but am getting very
> poor
> >>>>> results that I think are related to the Solr setup. I'd appreciate
> any
> >>>>> ideas.
> >>>>>>>
> >>>>>>> The sample data has 27,000 items and something like 4000 users. The
> >>>>> preference data is fairly dense since the users are professional
> reviewers
> >>>>> and the items videos.
> >>>>>>>
> >>>>>>> 1) The number of item-item similarities that are kept is 100. Is
> this a
> >>>>> good starting point? Ted, do you recall how many you used before?
> >>>>>>> 2) The query is a simple text query made of space delimited video
> id
> >>>>> strings. These are the same ids as are stored in the item-item
> similarity
> >>>>> docs that Solr indexes.
> >>>>>>>
> >>>>>>> Hit thumbs up on one video and you get several recommendations. Hit
> >>>>> thumbs up on several videos you get no recs. I'm either using the
> wrong
> >>>>> query type or have it set up to be too restrictive. As I read
> through the
> >>>>> docs if someone has a suggestion or pointer I'd appreciate it.
> >>>>>>>
> >>>>>>> BTW the same sort of thing happens with Title search. Search for
> >>>>> "vampire werewolf zombie" you get no results, search for "zombie"
> you get
> >>>>> several.
> >>>>>>>
> >>>>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
> >>>>>>>
> >>>>>>> Hi Pat,
> >>>>>>>
> >>>>>>> can you create issues for 1) and 2) ? Then I will try to get this
> into
> >>>>>>> trunk asap.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
> >>>>>>>> Trying to integrate the Solr-recommender with the latest Mahout
> >>>>> snapshot. The project uses a modified RecommenderJob because it needs
> >>>>> SequenceFile output and to get the location of the
> preparePreferenceMatrix
> >>>>> directory. If #1 and #2 are addressed I can remove the modified
> Mahout code
> >>>>> from the project and rely on the default implementations in Mahout
> 0.9. #3
> >>>>> is a longer term issue related to the creation of a
> CrossRowSimilarityJob.
> >>>>>>>>
> >>>>>>>> I have dropped the modified code from the Solr-recommender
> project and
> >>>>> have a modified build of the current Mahout 0.9 snapshot. If the
> following
> >>>>> changes are made to Mahout I can test and release a Mahout 0.9
> version of
> >>>>> the Solr-recommender.
> >>>>>>>>
> >>>>>>>> 1. Option to change RecommenderJob output format
> >>>>>>>>
> >>>>>>>> Can someone add an option to output a SequenceFile. I modified the
> >>>>> code to do the following, note the SequenceFileOutputFormat.class as
> the
> >>>>> last parameter but this should really be determined with an option I
> think.
> >>>>>>>>
> >>>>>>>> Job aggregateAndRecommend = prepareJob(
> >>>>>>>>  new Path(aggregateAndRecommendInput), outputPath,
> >>>>> SequenceFileInputFormat.class,
> >>>>>>>>  PartialMultiplyMapper.class, VarLongWritable.class,
> >>>>> PrefAndSimilarityColumnWritable.class,
> >>>>>>>>  AggregateAndRecommendReducer.class, VarLongWritable.class,
> >>>>> RecommendedItemsWritable.class,
> >>>>>>>>  SequenceFileOutputFormat.class);
> >>>>>>>>
> >>>>>>>> 2. Visibility of preparePreferenceMatrix directory location
> >>>>>>>>
> >>>>>>>> The Solr-recommender needs to find where the RecommenderJob is
> putting
> >>>>> it's output.
> >>>>>>>>
> >>>>>>>> Mahout 0.8 RecommenderJob code was:
> >>>>>>>> public static final String DEFAULT_PREPARE_DIR =
> >>>>> "preparePreferenceMatrix";
> >>>>>>>>
> >>>>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
> >>>>> inline in the code:
> >>>>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
> >>>>>>>>
> >>>>>>>> This change to Mahout 0.9 works:
> >>>>>>>> public static final String DEFAULT_PREPARE_DIR =
> >>>>> "preparePreferenceMatrix";
> >>>>>>>> and
> >>>>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
> >>>>>>>>
> >>>>>>>> You could also make this a getter method on the RecommenderJob
> Class
> >>>>> instead of using a public constant.
> >>>>>>>>
> >>>>>>>> 3. Downsampling
> >>>>>>>>
> >>>>>>>> The downsampling for maximum prefs per user has been moved from
> >>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob
> uses
> >>>>> matrix math instead of RSJ so it will no longer support downsampling
> until
> >>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in
> it.
> >>>>>
> >>>>> --------------------------
> >>>>> Ken Krugler
> >>>>> +1 530-210-6378
> >>>>> http://www.scaleunlimited.com
> >>>>> custom big data solutions & training
> >>>>> Hadoop, Cascading, Cassandra & Solr
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --------------------------
> >>>>> Ken Krugler
> >>>>> +1 530-210-6378
> >>>>> http://www.scaleunlimited.com
> >>>>> custom big data solutions & training
> >>>>> Hadoop, Cascading, Cassandra & Solr
> >>>
> >>> --------------------------
> >>> Ken Krugler
> >>> +1 530-210-6378
> >>> http://www.scaleunlimited.com
> >>> custom big data solutions & training
> >>> Hadoop, Cascading, Cassandra & Solr
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --------------------------
> >>> Ken Krugler
> >>> +1 530-210-6378
> >>> http://www.scaleunlimited.com
> >>> custom big data solutions & training
> >>> Hadoop, Cascading, Cassandra & Solr
> >>
> >> --------------------------
> >> Ken Krugler
> >> +1 530-210-6378
> >> http://www.scaleunlimited.com
> >> custom big data solutions & training
> >> Hadoop, Cascading, Cassandra & Solr
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> --------------------------
> >> Ken Krugler
> >> +1 530-210-6378
> >> http://www.scaleunlimited.com
> >> custom big data solutions & training
> >> Hadoop, Cascading, Cassandra & Solr
> >>
> >>
> >>
> >>
> >>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>
>

Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Eventually I’d like to get MAP built into the solr-recommender. Used it at a client who had good data. It was very helpful for exploring what data was useful and what wasn’t. We’d run MAP with and without detail-view data for instance and take the MAP as a measure of how predictive the data was. In our case the MAP@ numbers went down with purchase and detail-view mixed together. That was why I got interested in the cross-action recommender—as a way to scrub less predictive actions. Didn’t finish it before I lost access to the data unfortunately.

What form of precision calc will you use? Obviously we used mean average precision at different numbers of recommendations, which had the effect of producing a fall-off curve. The curve we took as a measure of how well our ranking was working.
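Mean average precision at k, the metric Pat describes, can be computed offline against held-out preferences; sweeping k then yields the fall-off curve he mentions. A minimal sketch:

```python
def average_precision_at_k(recommended, held_out, k):
    """Average precision at k for one user: mean of precision@i over the
    ranks i (1-based) at which a held-out (relevant) item appears."""
    relevant = set(held_out)
    hits = 0
    precisions = []
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    if not precisions:
        return 0.0
    # normalize by the best achievable number of hits in the top k
    return sum(precisions) / min(len(relevant), k)

def mean_average_precision_at_k(all_recs, all_held_out, k):
    """MAP@k averaged over users; all_recs and all_held_out map a
    user id to ranked recommendations and held-out items."""
    users = list(all_recs)
    return sum(average_precision_at_k(all_recs[u], all_held_out[u], k)
               for u in users) / len(users)
```

Running this with and without a given action type (e.g. detail views) is the comparison Pat describes for judging how predictive that data is.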

On Nov 17, 2013, at 10:47 AM, Ken Krugler <kk...@transpac.com> wrote:

Hi Pat,

On Nov 13, 2013, at 4:43pm, Pat Ferrel <pa...@gmail.com> wrote:

> Ever done an offline precision calc?

No, sorry.

I do (finally) have one client with some data that could be used to calculate precision, and a willingness to pay for the work, so I'm hoping to include details on that in my next blog post about text feature selection.

-- Ken


>> On Nov 13, 2013, at 1:39 PM, Ken Krugler <kk...@transpac.com> wrote:
>> 
>> Hi Pat,
>> 
>>> On Nov 13, 2013, at 9:21am, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>> A version is now checked in that uses mahout 0.9. Haven’t tested it on a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1, which takes some time.
>>> 
>>> Saw the Strata slides from Ted touting dithering of results, which I’ll implement.
>>> 
>>> Ken, did you have anything specific for "And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates”?
>> 
>> If I'm looking for the top N best matches, I'll do a Solr query with rows=N*4.
>> 
>> Then I use all of the data from these potential matches, and calculate a more sophisticated similarity score (e.g. adding a weighting based on the user's activity level) between my target and these candidates.
>> 
>> Regards,
>> 
>> -- Ken
>> 
>>> 
>>> Was planning to try boosting by something like genre/category in the recs query. For instance, in the demo data, each item will soon have a set of tags (actually genre names) so these could be a field being queried along with the item-item links. The query for recs would then include the user history against the item-item links, and the average genre tags preferred by the user against item genre tags. This would return recs skewed towards the user’s genre preference.
>>> 
>>> Another way this could be used is when showing similar items. You’d have the tags for the item being viewed and so could use them to skew towards items with similar tags. I think this works but would turn similar items from a lookup (they are pre-calculated by Mahout) into another Solr query.
>>> 
>>> 
>>> 
>>> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>> Not planning to do anything with weights at present. An ORed query should suffice for the time being and Solr weights. There is a good list of ways to do this later if it warrants an experiment. Thanks.
>>> 
>>> Have similar items as input, recommendations from user “likes”, and just got recs from recently viewed working. Once you have online recs from the pre-calculated model experimenting is super easy. The next step will be to get more metadata ingested so we can try boosting by context genre, or recent genre viewed, which is sort of in line with "more specific scoring to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary the choices you see.
>>> 
>>> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kk...@transpac.com> wrote:
>>> 
>>> One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.
>>> 
>>> E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"
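Ken's ^value term-boost syntax can be generated mechanically from (id, weight) pairs, e.g. from Mahout similarity strengths. A sketch (weights here are made up):

```python
def boosted_or_query(weighted_ids):
    """Render (id, weight) pairs as an ORed query using Lucene's
    term^boost syntax, as in Ken's example above."""
    return " OR ".join("{}^{}".format(item_id, weight)
                       for item_id, weight in weighted_ids)
```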
>>> 
>>> If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.
>>> 
>>> But in my experience the gain from applying weights to terms in the index isn't very significant.
>>> 
>>> And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.
>>> 
>>> -- Ken
>>> 
>>>> On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:
>>>> 
>>>> For recommendation work, I suggest that it would be better to simply code
>>>> out an explicit OR query.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:
>>>> 
>>>>> Hi Pat,
>>>>> 
>>>>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>> 
>>>>>> Another approach would be to weight the terms in the docs by their
>>>>> Mahout similarity strength. But that will be for another day.
>>>>>> 
>>>>>> My current question is whether Lucene looks at word proximity. I see the
>>>>> query syntax supports proximity but I don’t see that it is default so
>>>>> that’s good.
>>>>> 
>>>>> Based on your description of what you do (generate an OR query of N terms)
>>>>> then no, you shouldn't be getting a boost from proximity.
>>>>> 
>>>>> Note that with edismax you can specify a phrase boost, but it will be on
>>>>> the entire set of terms being searched, so unlikely to come into play even
>>>>> if you were using that.
>>>>> 
>>>>> -- Ken
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Best to my knowledge, Lucene does not care about the position of a
>>>>> keyword within a document.
>>>>>> 
>>>>>> You could bucket the ids into several fields.  Then use a dismax query
>>>>> to boost the top-tier ids more than the second, etc.
>>>>>> 
>>>>>> A more fine-grained approach would probably involve a custom Similarity
>>>>> class that scales the score based on its position in the document.  If we
>>>>> did this, it might be simpler to index as 1 single-valued field so each id
>>>>> was position+1 rather than position+100, etc.
>>>>>> 
>>>>>> James Dyer
>>>>>> Ingram Content Group
>>>>>> (615) 213-4311
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>>>> Sent: Thursday, November 07, 2013 1:46 PM
>>>>>> To: user@mahout.apache.org
>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>>> 
>>>>>> Interesting to think about ordering and adjacentness. The index ids are
>>>>> sorted by Mahout strength so the first id is the most similar to the row
>>>>> key and so forth. But the query is ordered by recency. In both cases the
>>>>> first id is in some sense the most important. Does Solr/Lucene care about
>>>>> closeness to the top of doc for queries or indexed docs? I don't recall any
>>>>> mention of this.
>>>>>> 
>>>>>> However adjacentness has no meaning in recommendations though I think
>>>>> it's used in default queries so I may have to account for that.
>>>>>> 
>>>>>> The object returned is an ordered list of ids. I use only the IDs now
>>>>> but there are cases when the contents are also of interest; shopping
>>>>> cart/watchlist queries for example.
>>>>>> 
>>>>>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com>
>>>>>> wrote:
>>>>>> 
>>>>>> The multivalued field will obey the "positionIncrementGap" value you
>>>>> specify (default=100).  So for querying purposes, those ids will be 100
>>>>> (or whatever you specified) positions apart.  So a phrase search for
>>>>> adjacent ids would not match unless you set the slop to >=
>>>>> positionIncrementGap.  Other than this, both scenarios index the same.
>>>>>> 
>>>>>> For stored fields, solr returns an array of values for multivalued
>>>>> fields, which is convenient when writing a UI.
>>>>>> 
>>>>>> James Dyer
>>>>>> Ingram Content Group
>>>>>> (615) 213-4311
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>>>>> Sent: Thursday, November 07, 2013 11:23 AM
>>>>>> To: user@mahout.apache.org
>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>>> 
>>>>>> Does anyone know what the difference is between keeping the ids in a
>>>>> space-delimited string and indexing a multivalued field of ids? I recently
>>>>> tried the latter since ... it felt right, but I am not sure what the
>>>>> advantages of each are.
>>>>>> 
>>>>>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>> 
>>>>>>> I have dismax (not edismax) but am not using it yet; I'm using the default
>>>>> query, which does use 'AND'. I had much the same thought as I slept on it.
>>>>> Changing to OR is now working much, much better. So obvious it almost bit
>>>>> me, not good in this case...
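The AND-versus-OR behavior Pat hit can be shown with a toy matcher; this only simulates q.op semantics over a document's indexed ids, it is not Solr code:

```java
import java.util.List;
import java.util.Set;

public class QueryOpDemo {
    // With q.op=AND every queried term must appear in the doc; with q.op=OR
    // any single overlap is enough. For recommendations, requiring *all* of
    // a user's history ids to co-occur in one item's similarity list almost
    // always yields zero results, hence the switch to OR.
    public static boolean matches(Set<String> docTerms, List<String> queryTerms, boolean opAnd) {
        if (opAnd) {
            return docTerms.containsAll(queryTerms);
        }
        return queryTerms.stream().anyMatch(docTerms::contains);
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("vampire", "zombie");
        List<String> query = List.of("vampire", "werewolf", "zombie");
        System.out.println(matches(doc, query, true));   // AND: no match
        System.out.println(matches(doc, query, false));  // OR: match
    }
}
```

The same logic explains the title-search anecdote below: "vampire werewolf zombie" under AND needs all three words in one title.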
>>>>>>> 
>>>>>>> With only a trivially small amount of testing I'd say we have a new
>>>>> recommender on the block.
>>>>>>> 
>>>>>>> If anyone would like to help eyeball test the thing let me know
>>>>> off-list. There are a few instructions I'll need to give. And it can't
>>>>> handle much load right now due to intentional design limits.
>>>>>>> 
>>>>>>> 
>>>>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com>
>>>>> wrote:
>>>>>>> 
>>>>>>> Pat,
>>>>>>> 
>>>>>>> Can you give us the query it generates when you enter "vampire werewolf
>>>>> zombie", q/qt/defType ?
>>>>>>> 
>>>>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>>>>> you're using dismax/edismax with a high "mm" (min-must-match) value.
>>>>>>> 
>>>>>>> James Dyer
>>>>>>> Ingram Content Group
>>>>>>> (615) 213-4311
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>>>> 
>>>>>>> Done,
>>>>>>> 
>>>>>>> BTW I have the thing running on a demo site but am getting very poor
>>>>> results that I think are related to the Solr setup. I'd appreciate any
>>>>> ideas.
>>>>>>> 
>>>>>>> The sample data has 27,000 items and something like 4000 users. The
>>>>> preference data is fairly dense since the users are professional reviewers
>>>>> and the items videos.
>>>>>>> 
>>>>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>>>>> good starting point? Ted, do you recall how many you used before?
>>>>>>> 2) The query is a simple text query made of space delimited video id
>>>>> strings. These are the same ids as are stored in the item-item similarity
>>>>> docs that Solr indexes.
>>>>>>> 
>>>>>>> Hit thumbs up on one video and you get several recommendations. Hit
>>>>> thumbs up on several videos and you get no recs. I'm either using the wrong
>>>>> query type or have it set up to be too restrictive. As I read through the
>>>>> docs, if someone has a suggestion or pointer I'd appreciate it.
>>>>>>> 
>>>>>>> BTW the same sort of thing happens with Title search. Search for
>>>>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>>>>> several.
>>>>>>> 
>>>>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>>>>> 
>>>>>>> Hi Pat,
>>>>>>> 
>>>>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>>>>> trunk asap.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>> 
>>>>>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>>>>> snapshot. The project uses a modified RecommenderJob because it needs
>>>>> SequenceFile output and to get the location of the preparePreferenceMatrix
>>>>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>>>>> from the project and rely on the default implementations in Mahout 0.9. #3
>>>>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>>>>> 
>>>>>>>> I have dropped the modified code from the Solr-recommender project and
>>>>> have a modified build of the current Mahout 0.9 snapshot. If the following
>>>>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>>>>> the Solr-recommender.
>>>>>>>> 
>>>>>>>> 1. Option to change RecommenderJob output format
>>>>>>>> 
>>>>>>>> Can someone add an option to output a SequenceFile? I modified the
>>>>> code to do the following; note the SequenceFileOutputFormat.class as the
>>>>> last parameter, but this should really be determined by an option, I think.
>>>>>>>> 
>>>>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>>>>     new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>>>>>>>>     PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>>>>>>>>     AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>>>>>>>>     SequenceFileOutputFormat.class);
>>>>>>>> 
>>>>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>>>>> 
>>>>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>>>>> its output.
>>>>>>>> 
>>>>>>>> Mahout 0.8 RecommenderJob code was:
>>>>>>>>     public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>>>>> 
>>>>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix" inline:
>>>>>>>>     Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>>>>> 
>>>>>>>> This change to Mahout 0.9 works:
>>>>>>>>     public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>>>>> and
>>>>>>>>     Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>>>>> 
>>>>>>>> You could also make this a getter method on the RecommenderJob Class
>>>>> instead of using a public constant.
>>>>>>>> 
>>>>>>>> 3. Downsampling
>>>>>>>> 
>>>>>>>> The downsampling for maximum prefs per user has been moved from
>>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>>>>> matrix math instead of RSJ, so it will no longer support downsampling until
>>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
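For reference, capping the number of prefs kept per user can be sketched as sampling without replacement. This only captures the spirit of the cap, not the actual RowSimilarityJob mapper code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DownsamplePrefs {
    // Keep at most maxPrefs item ids for one user, chosen uniformly at
    // random (seeded for reproducibility). Users at or under the cap are
    // passed through unchanged.
    public static List<Long> downsample(List<Long> itemIds, int maxPrefs, long seed) {
        if (itemIds.size() <= maxPrefs) {
            return new ArrayList<>(itemIds);
        }
        List<Long> copy = new ArrayList<>(itemIds);
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, maxPrefs));
    }
}
```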
>>>>> 
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://www.scaleunlimited.com
>>>>> custom big data solutions & training
>>>>> Hadoop, Cascading, Cassandra & Solr







Re: Solr-recommender for Mahout 0.9

Posted by Ken Krugler <kk...@transpac.com>.
Hi Pat,

On Nov 13, 2013, at 4:43pm, Pat Ferrel <pa...@gmail.com> wrote:

> Ever done an offline precision calc?

No, sorry.

I do (finally) have one client with some data that could be used to calculate precision, and a willingness to pay for the work, so I'm hoping to include details on that in my next blog post about text feature selection.

-- Ken
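For anyone wanting to try this, an offline precision@k check is a small amount of code. This sketch assumes you hold out part of each user's history, recommend from the rest, and count top-k hits against the held-out set:

```java
import java.util.List;
import java.util.Set;

public class PrecisionAtK {
    // precision@k for one user: fraction of the top-k ranked recs that
    // appear in that user's held-out items. Average this over all users
    // to get mean precision@k.
    public static double precisionAtK(List<String> ranked, Set<String> heldOut, int k) {
        int cut = Math.min(k, ranked.size());
        long hits = ranked.subList(0, cut).stream().filter(heldOut::contains).count();
        return (double) hits / k;
    }

    public static void main(String[] args) {
        System.out.println(precisionAtK(List.of("a", "b", "c", "d"), Set.of("b", "d"), 4));
    }
}
```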


>> On Nov 13, 2013, at 1:39 PM, Ken Krugler <kk...@transpac.com> wrote:
>> 
>> Hi Pat,
>> 
>>> On Nov 13, 2013, at 9:21am, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>> A version is now checked in that uses mahout 0.9. Haven’t tested it on a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1, which takes some time.
>>> 
>>> Saw the Strata slides from Ted touting dithering of results, which I’ll implement.
>>> 
>>> Ken, did you have anything specific for "And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates”?
>> 
>> If I'm looking for the top N best matches, I'll do a Solr query with rows=N*4.
>> 
>> Then I use all of the data from these potential matches, and calculate a more sophisticated similarity score (e.g. adding a weighting based on the user's activity level) between my target and these candidates.
>> 
>> Regards,
>> 
>> -- Ken
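Ken's rows=N*4 step can be sketched as an over-fetch followed by a re-rank. The combined score here (Solr score times an activity weight) is purely illustrative; any richer scoring function slots in the same way:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class Rescore {
    public static class Candidate {
        final String id;
        final double solrScore;  // score returned by the rows=N*4 Solr query
        final double weight;     // side-channel signal, e.g. user activity level

        public Candidate(String id, double solrScore, double weight) {
            this.id = id;
            this.solrScore = solrScore;
            this.weight = weight;
        }
    }

    // Re-rank the oversized candidate list with the richer score and keep top N.
    public static List<String> topN(List<Candidate> candidates, int n) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble((Candidate c) -> c.solrScore * c.weight).reversed())
                .limit(n)
                .map(c -> c.id)
                .collect(Collectors.toList());
    }
}
```

The point of fetching 4x more rows than needed is that the cheap Solr score only has to be good enough to put the real winners somewhere in the candidate pool.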
>> 
>>> 
>>> Was planning to try boosting by something like genre/category in the recs query. For instance, in the demo data, each item will soon have a set of tags (actually genre names) so these could be a field being queried along with the item-item links. The query for recs would then include the user history against the item-item links, and the average genre tags preferred by the user against item genre tags. This would return recs skewed towards the user’s genre preference.
>>> 
>>> Another way this could be used is when showing similar items. You’d have the tags for the item being viewed and so could use them to skew towards items with similar tags. I think this works but would turn similar items from a lookup (they are pre-calculated by Mahout) into another Solr query.
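The genre-boosted query Pat describes might look like this as a simple string builder. The field names (indicators, tags) and the boost values are hypothetical, chosen only for illustration:

```java
import java.util.List;

public class GenreBoostQuery {
    // Combine the user's history (against the Mahout item-item id field)
    // with the user's preferred genre tags (against the item's tag field),
    // so recs skew toward the user's genre preference.
    public static String build(List<String> historyIds, List<String> genreTags,
                               double idBoost, double tagBoost) {
        return "indicators:(" + String.join(" OR ", historyIds) + ")^" + idBoost
             + " tags:(" + String.join(" OR ", genreTags) + ")^" + tagBoost;
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("item1", "item2"), List.of("comedy", "horror"), 1.0, 0.5));
    }
}
```

For the similar-items case, the same shape applies with the viewed item's id and its tags as the query terms.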
>>> 
>>> 
>>> 
>>> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>> Not planning to do anything with weights at present. An ORed query should suffice for the time being, along with Solr weights. There is a good list of ways to do this later if it warrants an experiment. Thanks.
>>> 
>>> I have similar items as input, recommendations from user “likes”, and just got recs from recently viewed working. Once you have online recs from the pre-calculated model, experimenting is super easy. The next step will be to get more metadata ingested so we can try boosting by context genre, or recent genre viewed, which is sort of in line with "more specific scoring to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary the choices you see.
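Dithering can be sketched as rank-order perturbation. The -log(rank) score plus Gaussian noise below is one common formulation, used here as an assumption rather than Ted's exact recipe:

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class Dithering {
    // Perturb a ranked rec list so users see varied results across visits
    // while top items still tend to stay near the top. Score each item by
    // -log(rank) plus seeded Gaussian noise, then sort by that score.
    public static List<String> dither(List<String> ranked, double noise, long seed) {
        Random rnd = new Random(seed);
        double[] score = new double[ranked.size()];
        for (int i = 0; i < ranked.size(); i++) {
            score[i] = -Math.log(i + 1) + noise * rnd.nextGaussian();
        }
        return IntStream.range(0, ranked.size())
                .boxed()
                .sorted((a, b) -> Double.compare(score[b], score[a]))
                .map(ranked::get)
                .collect(Collectors.toList());
    }
}
```

Seeding per-request (e.g. by user plus timestamp bucket) keeps the shuffle stable within a session but different across sessions.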
>>> 
>>> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kk...@transpac.com> wrote:
>>> 
>>> One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.
>>> 
>>> E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"
>>> 
>>> If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.
>>> 
>>> But in my experience the gain from applying weights to terms in the index isn't very significant.
>>> 
>>> And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.
>>> 
>>> -- Ken
>>> 
>>>> On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:
>>>> 
>>>> For recommendation work, I suggest that it would be better to simply code
>>>> out an explicit OR query.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:
>>>> 
>>>>> Hi Pat,
>>>>> 
>>>>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>> 
>>>>>> Another approach would be to weight the terms in the docs by their
>>>>> Mahout similarity strength. But that will be for another day.
>>>>>> 
>>>>>> My current question is whether Lucene looks at word proximity. I see the
>>>>> query syntax supports proximity, but I don't see that it is the default, so
>>>>> that's good.
>>>>> 
>>>>> Based on your description of what you do (generate an OR query of N terms)
>>>>> then no, you shouldn't be getting a boost from proximity.
>>>>> 
>>>>> Note that with edismax you can specify a phrase boost, but it will be on
>>>>> the entire set of terms being searched, so unlikely to come into play even
>>>>> if you were using that.
>>>>> 
>>>>> -- Ken
>>>>> 






Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
Ever done an offline precision calc?

Sent from my iPhone

> On Nov 13, 2013, at 1:39 PM, Ken Krugler <kk...@transpac.com> wrote:
> 
> Hi Pat,
> 
>> On Nov 13, 2013, at 9:21am, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>> A version is now checked in that uses mahout 0.9. Haven’t tested it on a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1, which takes some time.
>> 
>> Saw the Strata slides from Ted touting dithering of results, which I’ll implement.
>> 
>> Ken, did you have anything specific for "And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best form N*4 candidates”?
> 
> If I'm looking for the top N best matches, I'll do a Solr query with rows=N*4.
> 
> Then I use all of the data from these potential matches, and calculate a more sophisticated similarity score (e.g. adding a weighting based on the user's activity level) between my target and these candidates.
> 
> Regards,
> 
> -- Ken
> 
>> 
>> Was planning to try boosting by something like genre/category in the recs query. For instance, in the demo data, each item will soon have a set of tags (actually genre names) so these could be a field being queried along with the item-item links. The query for recs would then include the user history against the item-item links, and the average genre tags preferred by the user against item genre tags. This would return recs skewed towards the user’s genre preference.
>> 
>> Another way this could be used is when showing similar items. You’d have the tags for the item being viewed and so could use them to skew towards items with similar tags. I think this works but would turn similar items from a lookup (they are pre-calculated by Mahout) into another Solr query.
>> 
>> 
>> 
>> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>> Not planning to do anything with weights at present. An ORed query should suffice for the time being and Solr weights. There are a good list of ways to do this later if it warrants an experiment. Thanks.
>> 
>> Have, similar items as input, recommendations from user “likes”, and just got recs from recently viewed working. Once you have online recs from the pre-calculated model experimenting is super easy. The next step will be to get more metadata ingested so we can try boosting by context genre, or recent genre viewed, which is sort of in line with "more specific scoring to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary the choices you see.
>> 
>> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kk...@transpac.com> wrote:
>> 
>> One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.
>> 
>> E.g. "the_kings_speech^1.5 OR skyfalll^0.5 OR looper^3.0…"
>> 
>> If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.
>> 
>> But in my experience the gain from applying weights to terms int he index isn't very significant.
>> 
>> And usually I just Solr to generate a candidate list, then I do more specific scoring to find the N best form N*4 candidates.
>> 
>> -- Ken
>> 
>>> On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:
>>> 
>>> For recommendation work, I suggest that it would be better to simply code
>>> out an explicit OR query.
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:
>>> 
>>>> Hi Pat,
>>>> 
>>>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>>>>> 
>>>>> Another approach would be to weight the terms in the docs by there
>>>> Mahout similarity strength. But that will be for another day.
>>>>> 
>>>>> My current question is whether Lucene looks at word proximity. I see the
>>>> query syntax supports proximity but I don’t see that it is default so
>>>> that’s good.
>>>> 
>>>> Based on your description of what you do (generate an OR query of N terms)
>>>> then no, you shouldn't be getting a boost from proximity.
>>>> 
>>>> Note that with edismax you can specify a phrase boost, but it will be on
>>>> the entire set of terms being searched, so unlikely to come into play even
>>>> if you were using that.
>>>> 
>>>> -- Ken
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com>
>>>>> wrote:
>>>>> 
>>>>> Best to my knowledge, Lucene does not care about the position of a
>>>> keyword within a document.
>>>>> 
>>>>> You could bucket the ids into several fields.  Then use a dismax query
>>>> to boost the top-tier ids more than then second, etc.
>>>>> 
>>>>> A more fine-grained approach would probably involve a custom Similarity
>>>> class that scales the score based on its position in the document.  If we
>>>> did this, it might be simpler to index as 1 single-valued field so each id
>>>> was position+1 rather than position+100, etc.
>>>>> 
>>>>> James Dyer
>>>>> Ingram Content Group
>>>>> (615) 213-4311
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>>> Sent: Thursday, November 07, 2013 1:46 PM
>>>>> To: user@mahout.apache.org
>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>> 
>>>>> Interesting to think about ordering and adjacentness. The index ids are
>>>> sorted by Mahout strength so the first id is the most similar to the row
>>>> key and so forth. But the query is ordered buy recency. In both cases the
>>>> first id is in some sense the most important. Does Solr/Lucene care about
>>>> closeness to the top of doc for queries or indexed docs? I don't recall any
>>>> mention of this.
>>>>> 
>>>>> However adjacentness has no meaning in recommendations though I think
>>>> it's used in default queries so I may have to account for that.
>>>>> 
>>>>> The object returned is an ordered list of ids. I use only the IDs now
>>>> but there are cases when the contents are also of interest; shopping
>>>> cart/watchlist queries for example.
>>>>> 
>>>>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com>
>>>>> wrote:
>>>>> 
>>>>> The multivalued field will obey the "positionIncrementGap" value you
>>>> specify (default=100).  So for querying purposes, those id's will be 100
>>>> (or whatever you specified) positions apart.  So a phrase search for
>>>> adjacent ids would not match, unless you set the slop for >=
>>>> positionIncrementGap.  Other than this, both scenarios index the same.
>>>>> 
>>>>> For stored fields, solr returns an array of values for multivalued
>>>> fields, which is convienent when writing a UI.
>>>>> 
>>>>> James Dyer
>>>>> Ingram Content Group
>>>>> (615) 213-4311
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>>>> Sent: Thursday, November 07, 2013 11:23 AM
>>>>> To: user@mahout.apache.org
>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>> 
>>>>> Does anyone know what the difference is between keeping the ids in a
>>>> space delimited string and indexing a multivalued field of ids? I recently
>>>> tried the latter since ... it felt right, however I am not sure which of
>>>> both has which advantages.
>>>>> 
>>>>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>> 
>>>>>> I have dismax (no edismax) but am not using it yet, using the default
>>>> query, which does use 'AND'. I had much the same though as I slept on it.
>>>> Changing to OR is now working much much better. So obvious it almost bit
>>>> me, not good in this case...
>>>>>> 
>>>>>> With only a trivially small amount of testing I'd say we have a new
>>>> recommender on the block.
>>>>>> 
>>>>>> If anyone would like to help eyeball test the thing let me know
>>>> off-list. There are a few instructions I'll need to give. And it can't
>>>> handle much load right now due to intentional design limits.
>>>>>> 
>>>>>> 
>>>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com>
>>>> wrote:
>>>>>> 
>>>>>> Pat,
>>>>>> 
>>>>>> Can you give us the query it generates when you enter "vampire werewolf
>>>> zombie", q/qt/defType ?
>>>>>> 
>>>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>>>> you're using dismax/edismax with a high "mm" (min-must-match) value.
>>>>>> 
>>>>>> James Dyer
>>>>>> Ingram Content Group
>>>>>> (615) 213-4311
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>>> 
>>>>>> Done,
>>>>>> 
>>>>>> BTW I have the thing running on a demo site but am getting very poor
>>>> results that I think are related to the Solr setup. I'd appreciate any
>>>> ideas.
>>>>>> 
>>>>>> The sample data has 27,000 items and something like 4000 users. The
>>>> preference data is fairly dense since the users are professional reviewers
>>>> and the items are videos.
>>>>>> 
>>>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>>>> good starting point? Ted, do you recall how many you used before?
>>>>>> 2) The query is a simple text query made of space delimited video id
>>>> strings. These are the same ids as are stored in the item-item similarity
>>>> docs that Solr indexes.
>>>>>> 
>>>>>> Hit thumbs up on one video and you get several recommendations. Hit
>>>> thumbs up on several videos and you get no recs. I'm either using the wrong
>>>> query type or have it set up to be too restrictive. As I read through the
>>>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>>>> 
>>>>>> BTW the same sort of thing happens with Title search. Search for
>>>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>>>> several.
>>>>>> 
>>>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>>>> 
>>>>>> Hi Pat,
>>>>>> 
>>>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>>>> trunk asap.
>>>>>> 
>>>>>> Best,
>>>>>> Sebastian
>>>>>> 
>>>>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>>>> snapshot. The project uses a modified RecommenderJob because it needs
>>>> SequenceFile output and to get the location of the preparePreferenceMatrix
>>>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>>>> from the project and rely on the default implementations in Mahout 0.9. #3
>>>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>>>> 
>>>>>>> I have dropped the modified code from the Solr-recommender project and
>>>> have a modified build of the current Mahout 0.9 snapshot. If the following
>>>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>>>> the Solr-recommender.
>>>>>>> 
>>>>>>> 1. Option to change RecommenderJob output format
>>>>>>> 
>>>>>>> Can someone add an option to output a SequenceFile? I modified the
>>>> code to do the following; note the SequenceFileOutputFormat.class as the
>>>> last parameter, though this should really be determined by an option.
>>>>>>> 
>>>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>>>    new Path(aggregateAndRecommendInput), outputPath,
>>>> SequenceFileInputFormat.class,
>>>>>>>    PartialMultiplyMapper.class, VarLongWritable.class,
>>>> PrefAndSimilarityColumnWritable.class,
>>>>>>>    AggregateAndRecommendReducer.class, VarLongWritable.class,
>>>> RecommendedItemsWritable.class,
>>>>>>>    SequenceFileOutputFormat.class);
>>>>>>> 
>>>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>>>> 
>>>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>>>> its output.
>>>>>>> 
>>>>>>> Mahout 0.8 RecommenderJob code was:
>>>>>>> public static final String DEFAULT_PREPARE_DIR =
>>>> "preparePreferenceMatrix";
>>>>>>> 
>>>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>>>> inline in the code:
>>>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>>>> 
>>>>>>> This change to Mahout 0.9 works:
>>>>>>> public static final String DEFAULT_PREPARE_DIR =
>>>> "preparePreferenceMatrix";
>>>>>>> and
>>>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>>>> 
>>>>>>> You could also make this a getter method on the RecommenderJob class
>>>> instead of using a public constant.
>>>>>>> 
>>>>>>> 3. Downsampling
>>>>>>> 
>>>>>>> The downsampling for maximum prefs per user has been moved from
>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>>>> matrix math instead of RSJ so it will no longer support downsampling until
>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>> 
>>>> --------------------------
>>>> Ken Krugler
>>>> +1 530-210-6378
>>>> http://www.scaleunlimited.com
>>>> custom big data solutions & training
>>>> Hadoop, Cascading, Cassandra & Solr
>>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>> 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
> 
> 

Re: Solr-recommender for Mahout 0.9

Posted by Ken Krugler <kk...@transpac.com>.
Hi Pat,

On Nov 13, 2013, at 9:21am, Pat Ferrel <pa...@occamsmachete.com> wrote:

> A version is now checked in that uses Mahout 0.9. Haven’t tested it on a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1, which takes some time.
> 
> Saw the Strata slides from Ted touting dithering of results, which I’ll implement.
> 
> Ken, did you have anything specific for "And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates”?

If I'm looking for the top N best matches, I'll do a Solr query with rows=N*4.

Then I use all of the data from these potential matches, and calculate a more sophisticated similarity score (e.g. adding a weighting based on the user's activity level) between my target and these candidates.
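That two-stage flow can be sketched roughly as below, in plain Java. The class and field names are made up, and the activity-based weighting is just one illustrative second-stage signal, not Ken's actual scoring:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CandidateRescorer {

    // One Solr hit: item id plus its raw retrieval score.
    record Candidate(String itemId, double solrScore) {}

    // Stage 1 would ask Solr for rows = n * 4; stage 2 rescores those
    // candidates with extra signals and keeps only the top n.
    static List<String> topN(List<Candidate> candidates,
                             Map<String, Double> itemWeight, int n) {
        return candidates.stream()
            .sorted(Comparator.comparingDouble((Candidate c) ->
                // hypothetical blend: retrieval score times a side-channel weight
                -c.solrScore() * itemWeight.getOrDefault(c.itemId(), 1.0)))
            .limit(n)
            .map(Candidate::itemId)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> cands = List.of(
            new Candidate("skyfall", 2.0),
            new Candidate("looper", 1.5),
            new Candidate("the_kings_speech", 1.0));
        // weighting "looper" up lets it overtake "skyfall" after rescoring
        System.out.println(topN(cands, Map.of("looper", 2.0), 2));
        // prints [looper, skyfall]
    }
}
```

The point is that Solr only has to be fast and roughly right; the expensive, more accurate scoring runs over a small candidate set.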

Regards,

-- Ken

> 
> Was planning to try boosting by something like genre/category in the recs query. For instance, in the demo data, each item will soon have a set of tags (actually genre names) so these could be a field being queried along with the item-item links. The query for recs would then include the user history against the item-item links, and the average genre tags preferred by the user against item genre tags. This would return recs skewed towards the user’s genre preference.
> 
> Another way this could be used is when showing similar items. You’d have the tags for the item being viewed and so could use them to skew towards items with similar tags. I think this works but would turn similar items from a lookup (they are pre-calculated by Mahout) into another Solr query.
> 
> 
> 
> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> Not planning to do anything with weights at present. An ORed query with Solr weights should suffice for the time being. There is a good list of ways to do this later if it warrants an experiment. Thanks.
> 
> Have similar items as input, recommendations from user “likes”, and just got recs from recently viewed working. Once you have online recs from the pre-calculated model experimenting is super easy. The next step will be to get more metadata ingested so we can try boosting by context genre, or recent genre viewed, which is sort of in line with "more specific scoring to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary the choices you see.
> 
> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kk...@transpac.com> wrote:
> 
> One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.
> 
> E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"
> 
> If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.
> 
> But in my experience the gain from applying weights to terms in the index isn't very significant.
> 
> And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.
> 
> -- Ken
> 
> On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:
> 
>> For recommendation work, I suggest that it would be better to simply code
>> out an explicit OR query.
>> 
>> 
>> 
>> 
>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:
>> 
>>> Hi Pat,
>>> 
>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>>> 
>>>> Another approach would be to weight the terms in the docs by their
>>> Mahout similarity strength. But that will be for another day.
>>>> 
>>>> My current question is whether Lucene looks at word proximity. I see the
>>> query syntax supports proximity but I don’t see that it is default so
>>> that’s good.
>>> 
>>> Based on your description of what you do (generate an OR query of N terms)
>>> then no, you shouldn't be getting a boost from proximity.
>>> 
>>> Note that with edismax you can specify a phrase boost, but it will be on
>>> the entire set of terms being searched, so unlikely to come into play even
>>> if you were using that.
>>> 
>>> -- Ken
>>> 
>>> 
>>>> 
>>>> 
>>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com>
>>> wrote:
>>>> 
>>>> To the best of my knowledge, Lucene does not care about the position of a
>>> keyword within a document.
>>>> 
>>>> You could bucket the ids into several fields.  Then use a dismax query
>>> to boost the top-tier ids more than the second, etc.
>>>> 
>>>> A more fine-grained approach would probably involve a custom Similarity
>>> class that scales the score based on its position in the document.  If we
>>> did this, it might be simpler to index as 1 single-valued field so each id
>>> was position+1 rather than position+100, etc.
>>>> 
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>> Sent: Thursday, November 07, 2013 1:46 PM
>>>> To: user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>> 
>>>> Interesting to think about ordering and adjacency. The index ids are
>>> sorted by Mahout strength so the first id is the most similar to the row
>>> key and so forth. But the query is ordered by recency. In both cases the
>>> first id is in some sense the most important. Does Solr/Lucene care about
>>> closeness to the top of doc for queries or indexed docs? I don't recall any
>>> mention of this.
>>>> 
>>>> However, adjacency has no meaning in recommendations, though I think
>>> it's used in default queries so I may have to account for that.
>>>> 
>>>> The object returned is an ordered list of ids. I use only the IDs now
>>> but there are cases when the contents are also of interest; shopping
>>> cart/watchlist queries for example.
>>>> 
>>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com>
>>> wrote:
>>>> 
>>>> The multivalued field will obey the "positionIncrementGap" value you
>>> specify (default=100).  So for querying purposes, those id's will be 100
>>> (or whatever you specified) positions apart.  So a phrase search for
>>> adjacent ids would not match, unless you set the slop for >=
>>> positionIncrementGap.  Other than this, both scenarios index the same.
>>>> 
>>>> For stored fields, solr returns an array of values for multivalued
>>> fields, which is convenient when writing a UI.
>>>> 
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>>> Sent: Thursday, November 07, 2013 11:23 AM
>>>> To: user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>> 
>>>> Does anyone know what the difference is between keeping the ids in a
>>> space delimited string and indexing a multivalued field of ids? I recently
>>> tried the latter since ... it felt right, however I am not sure which of
>>> both has which advantages.
>>>> 
>>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
>>>> 
>>>>> I have dismax (not edismax) but am not using it yet, using the default
>>> query, which does use 'AND'. I had much the same thought as I slept on it.
>>> Changing to OR is now working much much better. So obvious it almost bit
>>> me, not good in this case...
>>>>> 
>>>>> With only a trivially small amount of testing I'd say we have a new
>>> recommender on the block.
>>>>> 
>>>>> If anyone would like to help eyeball test the thing let me know
>>> off-list. There are a few instructions I'll need to give. And it can't
>>> handle much load right now due to intentional design limits.
>>>>> 
>>>>> 
>>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com>
>>> wrote:
>>>>> 
>>>>> Pat,
>>>>> 
>>>>> Can you give us the query it generates when you enter "vampire werewolf
>>> zombie", q/qt/defType ?
>>>>> 
>>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>>> you're using dismax/edismax with a high "mm" (min-should-match) value.
>>>>> 
>>>>> James Dyer
>>>>> Ingram Content Group
>>>>> (615) 213-4311
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>> 
>>>>> Done,
>>>>> 
>>>>> BTW I have the thing running on a demo site but am getting very poor
>>> results that I think are related to the Solr setup. I'd appreciate any
>>> ideas.
>>>>> 
>>>>> The sample data has 27,000 items and something like 4000 users. The
>>> preference data is fairly dense since the users are professional reviewers
>>> and the items are videos.
>>>>> 
>>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>>> good starting point? Ted, do you recall how many you used before?
>>>>> 2) The query is a simple text query made of space delimited video id
>>> strings. These are the same ids as are stored in the item-item similarity
>>> docs that Solr indexes.
>>>>> 
>>>>> Hit thumbs up on one video and you get several recommendations. Hit
>>> thumbs up on several videos and you get no recs. I'm either using the wrong
>>> query type or have it set up to be too restrictive. As I read through the
>>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>>> 
>>>>> BTW the same sort of thing happens with Title search. Search for
>>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>>> several.
>>>>> 
>>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>>> 
>>>>> Hi Pat,
>>>>> 
>>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>>> trunk asap.
>>>>> 
>>>>> Best,
>>>>> Sebastian
>>>>> 
>>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>>> snapshot. The project uses a modified RecommenderJob because it needs
>>> SequenceFile output and to get the location of the preparePreferenceMatrix
>>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>>> from the project and rely on the default implementations in Mahout 0.9. #3
>>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>>> 
>>>>>> I have dropped the modified code from the Solr-recommender project and
>>> have a modified build of the current Mahout 0.9 snapshot. If the following
>>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>>> the Solr-recommender.
>>>>>> 
>>>>>> 1. Option to change RecommenderJob output format
>>>>>> 
>>>>>> Can someone add an option to output a SequenceFile? I modified the
>>> code to do the following; note the SequenceFileOutputFormat.class as the
>>> last parameter, though this should really be determined by an option.
>>>>>> 
>>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>>     new Path(aggregateAndRecommendInput), outputPath,
>>> SequenceFileInputFormat.class,
>>>>>>     PartialMultiplyMapper.class, VarLongWritable.class,
>>> PrefAndSimilarityColumnWritable.class,
>>>>>>     AggregateAndRecommendReducer.class, VarLongWritable.class,
>>> RecommendedItemsWritable.class,
>>>>>>     SequenceFileOutputFormat.class);
>>>>>> 
>>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>>> 
>>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>>> its output.
>>>>>> 
>>>>>> Mahout 0.8 RecommenderJob code was:
>>>>>> public static final String DEFAULT_PREPARE_DIR =
>>> "preparePreferenceMatrix";
>>>>>> 
>>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>>> inline in the code:
>>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>>> 
>>>>>> This change to Mahout 0.9 works:
>>>>>> public static final String DEFAULT_PREPARE_DIR =
>>> "preparePreferenceMatrix";
>>>>>> and
>>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>>> 
>>>>>> You could also make this a getter method on the RecommenderJob class
>>> instead of using a public constant.
>>>>>> 
>>>>>> 3. Downsampling
>>>>>> 
>>>>>> The downsampling for maximum prefs per user has been moved from
>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>>> matrix math instead of RSJ so it will no longer support downsampling until
>>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> --------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Cassandra & Solr
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
> 
> 
> 
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@occamsmachete.com>.
A version is now checked in that uses Mahout 0.9. Haven’t tested it on a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1, which takes some time.

Saw the Strata slides from Ted touting dithering of results, which I’ll implement.

Ken, did you have anything specific for "And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates”?

Was planning to try boosting by something like genre/category in the recs query. For instance, in the demo data, each item will soon have a set of tags (actually genre names) so these could be a field being queried along with the item-item links. The query for recs would then include the user history against the item-item links, and the average genre tags preferred by the user against item genre tags. This would return recs skewed towards the user’s genre preference.

Another way this could be used is when showing similar items. You’d have the tags for the item being viewed and so could use them to skew towards items with similar tags. I think this works but would turn similar items from a lookup (they are pre-calculated by Mahout) into another Solr query.
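A rough sketch of what such a combined recs query string could look like. The field names `item_similarity` and `genre_tags` and the boost value are invented for illustration; the real schema and boosts would differ:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RecsQueryBuilder {

    // Builds an ORed Solr query over two fields: the user's item history
    // against the item-item similarity field, and the user's preferred
    // genre tags against the item genre field (down-weighted by a boost).
    static String buildQuery(List<String> itemHistory,
                             List<String> preferredGenres,
                             double genreBoost) {
        String items = itemHistory.stream()
            .map(id -> "item_similarity:" + id)
            .collect(Collectors.joining(" OR "));
        String genres = preferredGenres.stream()
            .map(g -> "genre_tags:" + g + "^" + genreBoost)
            .collect(Collectors.joining(" OR "));
        return items + " OR " + genres;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(
            List.of("skyfall", "looper"), List.of("action"), 0.5));
        // prints item_similarity:skyfall OR item_similarity:looper OR genre_tags:action^0.5
    }
}
```

Tuning the genre boost then controls how strongly recs skew toward the user's genre preference versus pure item-item similarity.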



On Nov 8, 2013, at 1:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Not planning to do anything with weights at present. An ORed query with Solr weights should suffice for the time being. There is a good list of ways to do this later if it warrants an experiment. Thanks.

Have similar items as input, recommendations from user “likes”, and just got recs from recently viewed working. Once you have online recs from the pre-calculated model experimenting is super easy. The next step will be to get more metadata ingested so we can try boosting by context genre, or recent genre viewed, which is sort of in line with "more specific scoring to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary the choices you see.

On Nov 8, 2013, at 10:10 AM, Ken Krugler <kk...@transpac.com> wrote:

One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.

E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"
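Generating that boosted OR syntax from a term-to-weight map (e.g. Mahout similarity strengths or recency weights) is straightforward; a minimal sketch, with the weights purely illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class BoostedOrQuery {

    // Joins term -> weight pairs into Solr's term^boost OR syntax,
    // preserving the map's iteration order.
    static String fromWeights(Map<String, Double> termWeights) {
        return termWeights.entrySet().stream()
            .map(e -> e.getKey() + "^" + e.getValue())
            .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        Map<String, Double> w = new LinkedHashMap<>();
        w.put("the_kings_speech", 1.5);
        w.put("skyfall", 0.5);
        w.put("looper", 3.0);
        System.out.println(fromWeights(w));
        // prints the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0
    }
}
```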

If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.

But in my experience the gain from applying weights to terms in the index isn't very significant.

And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
> 
> 
> 
> 
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:
> 
>> Hi Pat,
>> 
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>> 
>>> Another approach would be to weight the terms in the docs by their
>> Mahout similarity strength. But that will be for another day.
>>> 
>>> My current question is whether Lucene looks at word proximity. I see the
>> query syntax supports proximity but I don’t see that it is default so
>> that’s good.
>> 
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>> 
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
>> 
>> -- Ken
>> 
>> 
>>> 
>>> 
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>> 
>>> To the best of my knowledge, Lucene does not care about the position of a
>> keyword within a document.
>>> 
>>> You could bucket the ids into several fields.  Then use a dismax query
>> to boost the top-tier ids more than the second, etc.
>>> 
>>> A more fine-grained approach would probably involve a custom Similarity
>> class that scales the score based on its position in the document.  If we
>> did this, it might be simpler to index as 1 single-valued field so each id
>> was position+1 rather than position+100, etc.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Interesting to think about ordering and adjacency. The index ids are
>> sorted by Mahout strength so the first id is the most similar to the row
>> key and so forth. But the query is ordered by recency. In both cases the
>> first id is in some sense the most important. Does Solr/Lucene care about
>> closeness to the top of doc for queries or indexed docs? I don't recall any
>> mention of this.
>>> 
>> However, adjacency has no meaning in recommendations, though I think
>> it's used in default queries so I may have to account for that.
>>> 
>>> The object returned is an ordered list of ids. I use only the IDs now
>> but there are cases when the contents are also of interest; shopping
>> cart/watchlist queries for example.
>>> 
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>> 
>>> The multivalued field will obey the "positionIncrementGap" value you
>> specify (default=100).  So for querying purposes, those id's will be 100
>> (or whatever you specified) positions apart.  So a phrase search for
>> adjacent ids would not match, unless you set the slop for >=
>> positionIncrementGap.  Other than this, both scenarios index the same.
>>> 
>>> For stored fields, solr returns an array of values for multivalued
>> fields, which is convenient when writing a UI.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Does anyone know what the difference is between keeping the ids in a
>> space delimited string and indexing a multivalued field of ids? I recently
>> tried the latter since ... it felt right, however I am not sure which of
>> both has which advantages.
>>> 
>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
>>> 
>>>> I have dismax (not edismax) but am not using it yet, using the default
>> query, which does use 'AND'. I had much the same thought as I slept on it.
>> Changing to OR is now working much much better. So obvious it almost bit
>> me, not good in this case...
>>>> 
>>>> With only a trivially small amount of testing I'd say we have a new
>> recommender on the block.
>>>> 
>>>> If anyone would like to help eyeball test the thing let me know
>> off-list. There are a few instructions I'll need to give. And it can't
>> handle much load right now due to intentional design limits.
>>>> 
>>>> 
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>>> 
>>>> Pat,
>>>> 
>>>> Can you give us the query it generates when you enter "vampire werewolf
>> zombie", q/qt/defType ?
>>>> 
>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>> you're using dismax/edismax with a high "mm" (min-should-match) value.
>>>> 
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>> 
>>>> Done,
>>>> 
>>>> BTW I have the thing running on a demo site but am getting very poor
>> results that I think are related to the Solr setup. I'd appreciate any
>> ideas.
>>>> 
>>>> The sample data has 27,000 items and something like 4000 users. The
>> preference data is fairly dense since the users are professional reviewers
>> and the items are videos.
>>>> 
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space delimited video id
>> strings. These are the same ids as are stored in the item-item similarity
>> docs that Solr indexes.
>>>> 
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>> thumbs up on several videos and you get no recs. I'm either using the wrong
>> query type or have it set up to be too restrictive. As I read through the
>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>> 
>>>> BTW the same sort of thing happens with Title search. Search for
>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>> several.
>>>> 
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>> 
>>>> Hi Pat,
>>>> 
>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>> trunk asap.
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>> snapshot. The project uses a modified RecommenderJob because it needs
>> SequenceFile output and to get the location of the preparePreferenceMatrix
>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>> from the project and rely on the default implementations in Mahout 0.9. #3
>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>> 
>>>>> I have dropped the modified code from the Solr-recommender project and
>> have a modified build of the current Mahout 0.9 snapshot. If the following
>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>> the Solr-recommender.
>>>>> 
>>>>> 1. Option to change RecommenderJob output format
>>>>> 
>>>>> Can someone add an option to output a SequenceFile? I modified the
>> code to do the following; note the SequenceFileOutputFormat.class as the
>> last parameter, though this should really be determined by an option.
>>>>> 
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>      new Path(aggregateAndRecommendInput), outputPath,
>> SequenceFileInputFormat.class,
>>>>>      PartialMultiplyMapper.class, VarLongWritable.class,
>> PrefAndSimilarityColumnWritable.class,
>>>>>      AggregateAndRecommendReducer.class, VarLongWritable.class,
>> RecommendedItemsWritable.class,
>>>>>      SequenceFileOutputFormat.class);
>>>>> 
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>> 
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>> its output.
>>>>> 
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> 
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>> 
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>> 
>>>>> You could also make this a getter method on the RecommenderJob class
>> instead of using a public constant.
>>>>> 
>>>>> 3. Downsampling
>>>>> 
>>>>> The downsampling for maximum prefs per user has been moved from
>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>> matrix math instead of RSJ so it will no longer support downsampling until
>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>> 
>> 
>> 
>> 
>> 
>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr









Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Not planning to do anything with weights at present. An ORed query with Solr weights should suffice for the time being. There is a good list of ways to do this later if it warrants an experiment. Thanks.

I have similar items as input, recommendations from user “likes”, and just got recs from recently viewed items working. Once you have online recs from the pre-calculated model, experimenting is super easy. The next step will be to ingest more metadata so we can try boosting by context genre, or recently viewed genre, which is sort of in line with "more specific scoring to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary the choices you see.

On Nov 8, 2013, at 10:10 AM, Ken Krugler <kk...@transpac.com> wrote:

One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.

E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"

If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.

But in my experience the gain from applying weights to terms in the index isn't very significant.

And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
> 
> 
> 
> 
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:
> 
>> Hi Pat,
>> 
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>> 
>>> Another approach would be to weight the terms in the docs by their
>> Mahout similarity strength. But that will be for another day.
>>> 
>>> My current question is whether Lucene looks at word proximity. I see that the
>> query syntax supports proximity, but I don’t see that it is the default, so
>> that’s good.
>> 
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>> 
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
>> 
>> -- Ken
>> 
>> 
>>> 
>>> 
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>> 
>>> Best to my knowledge, Lucene does not care about the position of a
>> keyword within a document.
>>> 
>>> You could bucket the ids into several fields.  Then use a dismax query
>> to boost the top-tier ids more than the second, etc.
>>> 
>>> A more fine-grained approach would probably involve a custom Similarity
>> class that scales the score based on its position in the document.  If we
>> did this, it might be simpler to index as 1 single-valued field so each id
>> was position+1 rather than position+100, etc.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Interesting to think about ordering and adjacentness. The index ids are
>> sorted by Mahout strength so the first id is the most similar to the row
>> key and so forth. But the query is ordered by recency. In both cases the
>> first id is in some sense the most important. Does Solr/Lucene care about
>> closeness to the top of doc for queries or indexed docs? I don't recall any
>> mention of this.
>>> 
>>> However adjacentness has no meaning in recommendations though I think
>> it's used in default queries so I may have to account for that.
>>> 
>>> The object returned is an ordered list of ids. I use only the IDs now
>> but there are cases when the contents are also of interest; shopping
>> cart/watchlist queries for example.
>>> 
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>> 
>>> The multivalued field will obey the "positionIncrementGap" value you
>> specify (default=100).  So for querying purposes, those id's will be 100
>> (or whatever you specified) positions apart.  So a phrase search for
>> adjacent ids would not match, unless you set the slop for >=
>> positionIncrementGap.  Other than this, both scenarios index the same.
>>> 
>>> For stored fields, solr returns an array of values for multivalued
>> fields, which is convenient when writing a UI.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Does anyone know what the difference is between keeping the ids in a
>> space delimited string and indexing a multivalued field of ids? I recently
>> tried the latter since ... it felt right, however I am not sure which of
>> both has which advantages.
>>> 
>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
>>> 
>>>> I have dismax (not edismax) but am not using it yet, using the default
>> query, which does use 'AND'. I had much the same thought as I slept on it.
>> Changing to OR is now working much much better. So obvious it almost bit
>> me, not good in this case...
>>>> 
>>>> With only a trivially small amount of testing I'd say we have a new
>> recommender on the block.
>>>> 
>>>> If anyone would like to help eyeball test the thing let me know
>> off-list. There are a few instructions I'll need to give. And it can't
>> handle much load right now due to intentional design limits.
>>>> 
>>>> 
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>>> 
>>>> Pat,
>>>> 
>>>> Can you give us the query it generates when you enter "vampire werewolf
>> zombie", q/qt/defType ?
>>>> 
>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>> you're using dismax/edismax with a high "mm" (min-must-match) value.
>>>> 
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>> 
>>>> Done,
>>>> 
>>>> BTW I have the thing running on a demo site but am getting very poor
>> results that I think are related to the Solr setup. I'd appreciate any
>> ideas.
>>>> 
>>>> The sample data has 27,000 items and something like 4000 users. The
>> preference data is fairly dense since the users are professional reviewers
>> and the items are videos.
>>>> 
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space delimited video id
>> strings. These are the same ids as are stored in the item-item similarity
>> docs that Solr indexes.
>>>> 
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>> thumbs up on several videos you get no recs. I'm either using the wrong
>> query type or have it set up to be too restrictive. As I read through the
>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>> 
>>>> BTW the same sort of thing happens with Title search. Search for
>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>> several.
>>>> 
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>> 
>>>> Hi Pat,
>>>> 
>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>> trunk asap.
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>> snapshot. The project uses a modified RecommenderJob because it needs
>> SequenceFile output and to get the location of the preparePreferenceMatrix
>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>> from the project and rely on the default implementations in Mahout 0.9. #3
>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>> 
>>>>> I have dropped the modified code from the Solr-recommender project and
>> have a modified build of the current Mahout 0.9 snapshot. If the following
>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>> the Solr-recommender.
>>>>> 
>>>>> 1. Option to change RecommenderJob output format
>>>>> 
>>>>> Can someone add an option to output a SequenceFile? I modified the
>> code to do the following, note the SequenceFileOutputFormat.class as the
>> last parameter but this should really be determined with an option I think.
>>>>> 
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>       new Path(aggregateAndRecommendInput), outputPath,
>> SequenceFileInputFormat.class,
>>>>>       PartialMultiplyMapper.class, VarLongWritable.class,
>> PrefAndSimilarityColumnWritable.class,
>>>>>       AggregateAndRecommendReducer.class, VarLongWritable.class,
>> RecommendedItemsWritable.class,
>>>>>       SequenceFileOutputFormat.class);
>>>>> 
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>> 
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>> its output.
>>>>> 
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> 
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>> 
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>> 
>>>>> You could also make this a getter method on the RecommenderJob Class
>> instead of using a public constant.
>>>>> 
>>>>> 3. Downsampling
>>>>> 
>>>>> The downsampling for maximum prefs per user has been moved from
>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>> matrix math instead of RSJ so it will no longer support downsampling until
>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Solr-recommender for Mahout 0.9

Posted by Ken Krugler <kk...@transpac.com>.
One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.

E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"

If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.

But in my experience the gain from applying weights to terms in the index isn't very significant.

And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.
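The ^&lt;value&gt; boost syntax above can be sketched as a small query-building helper. This is a hypothetical utility; the class and method names are mine, not from Solr, SolrJ, or the Solr-recommender:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that turns a map of item id -> weight into a
// boosted OR query of the form "id1^w1 OR id2^w2 OR ...".
public class BoostedQuery {

    static String build(Map<String, Double> termBoosts) {
        // LinkedHashMap iteration preserves insertion order, so the most
        // important term can be listed first.
        return termBoosts.entrySet().stream()
                .map(e -> e.getKey() + "^" + e.getValue())
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("the_kings_speech", 1.5);
        boosts.put("skyfall", 0.5);
        boosts.put("looper", 3.0);
        System.out.println(build(boosts));
        // the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0
    }
}
```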

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning <te...@gmail.com> wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
> 
> 
> 
> 
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:
> 
>> Hi Pat,
>> 
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>> 
>>> Another approach would be to weight the terms in the docs by their
>> Mahout similarity strength. But that will be for another day.
>>> 
>>> My current question is whether Lucene looks at word proximity. I see that the
>> query syntax supports proximity, but I don’t see that it is the default, so
>> that’s good.
>> 
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>> 
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
>> 
>> -- Ken
>> 
>> 
>>> 
>>> 
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>> 
>>> Best to my knowledge, Lucene does not care about the position of a
>> keyword within a document.
>>> 
>>> You could bucket the ids into several fields.  Then use a dismax query
>> to boost the top-tier ids more than the second, etc.
>>> 
>>> A more fine-grained approach would probably involve a custom Similarity
>> class that scales the score based on its position in the document.  If we
>> did this, it might be simpler to index as 1 single-valued field so each id
>> was position+1 rather than position+100, etc.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Interesting to think about ordering and adjacentness. The index ids are
>> sorted by Mahout strength so the first id is the most similar to the row
>> key and so forth. But the query is ordered by recency. In both cases the
>> first id is in some sense the most important. Does Solr/Lucene care about
>> closeness to the top of doc for queries or indexed docs? I don't recall any
>> mention of this.
>>> 
>>> However adjacentness has no meaning in recommendations though I think
>> it's used in default queries so I may have to account for that.
>>> 
>>> The object returned is an ordered list of ids. I use only the IDs now
>> but there are cases when the contents are also of interest; shopping
>> cart/watchlist queries for example.
>>> 
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>> 
>>> The multivalued field will obey the "positionIncrementGap" value you
>> specify (default=100).  So for querying purposes, those id's will be 100
>> (or whatever you specified) positions apart.  So a phrase search for
>> adjacent ids would not match, unless you set the slop for >=
>> positionIncrementGap.  Other than this, both scenarios index the same.
>>> 
>>> For stored fields, solr returns an array of values for multivalued
>> fields, which is convenient when writing a UI.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Does anyone know what the difference is between keeping the ids in a
>> space delimited string and indexing a multivalued field of ids? I recently
>> tried the latter since ... it felt right, however I am not sure which of
>> both has which advantages.
>>> 
>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
>>> 
>>>> I have dismax (not edismax) but am not using it yet, using the default
>> query, which does use 'AND'. I had much the same thought as I slept on it.
>> Changing to OR is now working much much better. So obvious it almost bit
>> me, not good in this case...
>>>> 
>>>> With only a trivially small amount of testing I'd say we have a new
>> recommender on the block.
>>>> 
>>>> If anyone would like to help eyeball test the thing let me know
>> off-list. There are a few instructions I'll need to give. And it can't
>> handle much load right now due to intentional design limits.
>>>> 
>>>> 
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com>
>> wrote:
>>>> 
>>>> Pat,
>>>> 
>>>> Can you give us the query it generates when you enter "vampire werewolf
>> zombie", q/qt/defType ?
>>>> 
>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>> you're using dismax/edismax with a high "mm" (min-must-match) value.
>>>> 
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>> 
>>>> Done,
>>>> 
>>>> BTW I have the thing running on a demo site but am getting very poor
>> results that I think are related to the Solr setup. I'd appreciate any
>> ideas.
>>>> 
>>>> The sample data has 27,000 items and something like 4000 users. The
>> preference data is fairly dense since the users are professional reviewers
>> and the items are videos.
>>>> 
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space delimited video id
>> strings. These are the same ids as are stored in the item-item similarity
>> docs that Solr indexes.
>>>> 
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>> thumbs up on several videos you get no recs. I'm either using the wrong
>> query type or have it set up to be too restrictive. As I read through the
>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>> 
>>>> BTW the same sort of thing happens with Title search. Search for
>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>> several.
>>>> 
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>> 
>>>> Hi Pat,
>>>> 
>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>> trunk asap.
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>> snapshot. The project uses a modified RecommenderJob because it needs
>> SequenceFile output and to get the location of the preparePreferenceMatrix
>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>> from the project and rely on the default implementations in Mahout 0.9. #3
>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>> 
>>>>> I have dropped the modified code from the Solr-recommender project and
>> have a modified build of the current Mahout 0.9 snapshot. If the following
>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>> the Solr-recommender.
>>>>> 
>>>>> 1. Option to change RecommenderJob output format
>>>>> 
>>>>> Can someone add an option to output a SequenceFile? I modified the
>> code to do the following, note the SequenceFileOutputFormat.class as the
>> last parameter but this should really be determined with an option I think.
>>>>> 
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>        new Path(aggregateAndRecommendInput), outputPath,
>> SequenceFileInputFormat.class,
>>>>>        PartialMultiplyMapper.class, VarLongWritable.class,
>> PrefAndSimilarityColumnWritable.class,
>>>>>        AggregateAndRecommendReducer.class, VarLongWritable.class,
>> RecommendedItemsWritable.class,
>>>>>        SequenceFileOutputFormat.class);
>>>>> 
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>> 
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>> its output.
>>>>> 
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> 
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>> 
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>> 
>>>>> You could also make this a getter method on the RecommenderJob Class
>> instead of using a public constant.
>>>>> 
>>>>> 3. Downsampling
>>>>> 
>>>>> The downsampling for maximum prefs per user has been moved from
>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>> matrix math instead of RSJ so it will no longer support downsampling until
>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Solr-recommender for Mahout 0.9

Posted by Ted Dunning <te...@gmail.com>.
For recommendation work, I suggest that it would be better to simply code
out an explicit OR query.
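Ted's suggestion of an explicit OR query can be sketched as follows. This is a hypothetical helper (the class name is mine, not from Mahout or Solr), built under the assumption that the recommender query is just the user's history of item ids:

```java
import java.util.List;

// Hypothetical sketch: join the user's history ids into an explicit
// OR query rather than relying on the parser's default operator (q.op).
public class ExplicitOrQuery {

    static String build(List<String> itemIds) {
        return String.join(" OR ", itemIds);
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("vampire", "werewolf", "zombie")));
        // vampire OR werewolf OR zombie
    }
}
```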




On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kk...@transpac.com>wrote:

> Hi Pat,
>
> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:
>
> > Another approach would be to weight the terms in the docs by their
> Mahout similarity strength. But that will be for another day.
> >
> > My current question is whether Lucene looks at word proximity. I see that the
> query syntax supports proximity, but I don’t see that it is the default, so
> that’s good.
>
> Based on your description of what you do (generate an OR query of N terms)
> then no, you shouldn't be getting a boost from proximity.
>
> Note that with edismax you can specify a phrase boost, but it will be on
> the entire set of terms being searched, so unlikely to come into play even
> if you were using that.
>
> -- Ken
>
>
> >
> >
> > On Nov 7, 2013, at 12:41 PM, Dyer, James <Ja...@ingramcontent.com>
> wrote:
> >
> > Best to my knowledge, Lucene does not care about the position of a
> keyword within a document.
> >
> > You could bucket the ids into several fields.  Then use a dismax query
> to boost the top-tier ids more than the second, etc.
> >
> > A more fine-grained approach would probably involve a custom Similarity
> class that scales the score based on its position in the document.  If we
> did this, it might be simpler to index as 1 single-valued field so each id
> was position+1 rather than position+100, etc.
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -----Original Message-----
> > From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> > Sent: Thursday, November 07, 2013 1:46 PM
> > To: user@mahout.apache.org
> > Subject: Re: Solr-recommender for Mahout 0.9
> >
> > Interesting to think about ordering and adjacentness. The index ids are
> sorted by Mahout strength so the first id is the most similar to the row
> key and so forth. But the query is ordered by recency. In both cases the
> first id is in some sense the most important. Does Solr/Lucene care about
> closeness to the top of doc for queries or indexed docs? I don't recall any
> mention of this.
> >
> > However adjacentness has no meaning in recommendations though I think
> it's used in default queries so I may have to account for that.
> >
> > The object returned is an ordered list of ids. I use only the IDs now
> but there are cases when the contents are also of interest; shopping
> cart/watchlist queries for example.
> >
> > On Nov 7, 2013, at 10:00 AM, Dyer, James <Ja...@ingramcontent.com>
> wrote:
> >
> > The multivalued field will obey the "positionIncrementGap" value you
> specify (default=100).  So for querying purposes, those id's will be 100
> (or whatever you specified) positions apart.  So a phrase search for
> adjacent ids would not match, unless you set the slop for >=
> positionIncrementGap.  Other than this, both scenarios index the same.
> >
> > For stored fields, solr returns an array of values for multivalued
> fields, which is convenient when writing a UI.
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -----Original Message-----
> > From: Dominik Hübner [mailto:contact@dhuebner.com]
> > Sent: Thursday, November 07, 2013 11:23 AM
> > To: user@mahout.apache.org
> > Subject: Re: Solr-recommender for Mahout 0.9
> >
> > Does anyone know what the difference is between keeping the ids in a
> space delimited string and indexing a multivalued field of ids? I recently
> tried the latter since ... it felt right, however I am not sure which of
> both has which advantages.
> >
> > On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
> >
> >> I have dismax (not edismax) but am not using it yet, using the default
> query, which does use 'AND'. I had much the same thought as I slept on it.
> Changing to OR is now working much much better. So obvious it almost bit
> me, not good in this case...
> >>
> >> With only a trivially small amount of testing I'd say we have a new
> recommender on the block.
> >>
> >> If anyone would like to help eyeball test the thing let me know
> off-list. There are a few instructions I'll need to give. And it can't
> handle much load right now due to intentional design limits.
> >>
> >>
> >> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com>
> wrote:
> >>
> >> Pat,
> >>
> >> Can you give us the query it generates when you enter "vampire werewolf
> zombie", q/qt/defType ?
> >>
> >> My guess is you're using the default query parser with "q.op=AND" , or,
> you're using dismax/edismax with a high "mm" (min-must-match) value.
> >>
> >> James Dyer
> >> Ingram Content Group
> >> (615) 213-4311
> >>
> >>
> >> -----Original Message-----
> >> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> >> Sent: Wednesday, November 06, 2013 5:53 PM
> >> To: ssc@apache.org Schelter; user@mahout.apache.org
> >> Subject: Re: Solr-recommender for Mahout 0.9
> >>
> >> Done,
> >>
> >> BTW I have the thing running on a demo site but am getting very poor
> results that I think are related to the Solr setup. I'd appreciate any
> ideas.
> >>
> >> The sample data has 27,000 items and something like 4000 users. The
> preference data is fairly dense since the users are professional reviewers
> and the items are videos.
> >>
> >> 1) The number of item-item similarities that are kept is 100. Is this a
> good starting point? Ted, do you recall how many you used before?
> >> 2) The query is a simple text query made of space delimited video id
> strings. These are the same ids as are stored in the item-item similarity
> docs that Solr indexes.
> >>
> >> Hit thumbs up on one video and you get several recommendations. Hit
> thumbs up on several videos you get no recs. I'm either using the wrong
> query type or have it set up to be too restrictive. As I read through the
> docs if someone has a suggestion or pointer I'd appreciate it.
> >>
> >> BTW the same sort of thing happens with Title search. Search for
> "vampire werewolf zombie" you get no results, search for "zombie" you get
> several.
> >>
> >> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
> >>
> >> Hi Pat,
> >>
> >> can you create issues for 1) and 2) ? Then I will try to get this into
> >> trunk asap.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 06.11.2013 19:13, Pat Ferrel wrote:
> >>> Trying to integrate the Solr-recommender with the latest Mahout
> snapshot. The project uses a modified RecommenderJob because it needs
> SequenceFile output and to get the location of the preparePreferenceMatrix
> directory. If #1 and #2 are addressed I can remove the modified Mahout code
> from the project and rely on the default implementations in Mahout 0.9. #3
> is a longer term issue related to the creation of a CrossRowSimilarityJob.
> >>>
> >>> I have dropped the modified code from the Solr-recommender project and
> have a modified build of the current Mahout 0.9 snapshot. If the following
> changes are made to Mahout I can test and release a Mahout 0.9 version of
> the Solr-recommender.
> >>>
> >>> 1. Option to change RecommenderJob output format
> >>>
> >>> Can someone add an option to output a SequenceFile? I modified the
> code to do the following, note the SequenceFileOutputFormat.class as the
> last parameter but this should really be determined with an option I think.
> >>>
> >>> Job aggregateAndRecommend = prepareJob(
> >>>         new Path(aggregateAndRecommendInput), outputPath,
> SequenceFileInputFormat.class,
> >>>         PartialMultiplyMapper.class, VarLongWritable.class,
> PrefAndSimilarityColumnWritable.class,
> >>>         AggregateAndRecommendReducer.class, VarLongWritable.class,
> RecommendedItemsWritable.class,
> >>>         SequenceFileOutputFormat.class);
> >>>
> >>> 2. Visibility of preparePreferenceMatrix directory location
> >>>
> >>> The Solr-recommender needs to find where the RecommenderJob is putting
> its output.
> >>>
> >>> Mahout 0.8 RecommenderJob code was:
> >>> public static final String DEFAULT_PREPARE_DIR =
> "preparePreferenceMatrix";
> >>>
> >>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
> inline in the code:
> >>> Path prepPath = getTempPath("preparePreferenceMatrix");
> >>>
> >>> This change to Mahout 0.9 works:
> >>> public static final String DEFAULT_PREPARE_DIR =
> "preparePreferenceMatrix";
> >>> and
> >>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
> >>>
> >>> You could also make this a getter method on the RecommenderJob Class
> instead of using a public constant.
> >>>
> >>> 3. Downsampling
> >>>
> >>> The downsampling for maximum prefs per user has been moved from
> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
> matrix math instead of RSJ so it will no longer support downsampling until
> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
> >>>
> >>>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
> >
> >
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>
>
>
>
>
>
>
>

Re: Solr-recommender for Mahout 0.9

Posted by Ken Krugler <kk...@transpac.com>.
Hi Pat,

On Nov 7, 2013, at 7:30pm, Pat Ferrel <pa...@gmail.com> wrote:

> > Another approach would be to weight the terms in the docs by their Mahout similarity strength. But that will be for another day. 
> 
> > My current question is whether Lucene looks at word proximity. I see that the query syntax supports proximity, but I don’t see that it is the default, so that’s good.

Based on your description of what you do (generate an OR query of N terms), no, you shouldn't be getting a boost from proximity.

Note that with edismax you can specify a phrase boost, but it would apply to the entire set of terms being searched, so it is unlikely to come into play even if you were using it.

-- Ken


> -----Original Message-----
> From: Dominik Hübner [mailto:contact@dhuebner.com] 
> Sent: Thursday, November 07, 2013 11:23 AM
> To: user@mahout.apache.org
> Subject: Re: Solr-recommender for Mahout 0.9
> 
> Does anyone know what the difference is between keeping the ids in a space-delimited string and indexing a multivalued field of ids? I recently tried the latter since ... it felt right; however, I am not sure which approach has which advantages.
> 
> On 07 Nov 2013, at 18:18, Pat Ferrel <pa...@gmail.com> wrote:
> 
>> I have dismax (not edismax) but am not using it yet; I am using the default query, which does use 'AND'. I had much the same thought as I slept on it. Changing to OR is now working much, much better. So obvious it almost bit me, not good in this case...
>> 
>> With only a trivially small amount of testing I'd say we have a new recommender on the block.
>> 
>> If anyone would like to help eyeball test the thing let me know off-list. There are a few instructions I'll need to give. And it can't handle much load right now due to intentional design limits.
>> 
>> 
>> On Nov 7, 2013, at 6:11 AM, Dyer, James <Ja...@ingramcontent.com> wrote:
>> 
>> Pat,
>> 
>> Can you give us the query it generates when you enter "vampire werewolf zombie", q/qt/defType ?
>> 
>> My guess is you're using the default query parser with "q.op=AND", or you're using dismax/edismax with a high "mm" (minimum-should-match) value.
>> 
>> James Dyer
>> Ingram Content Group
>> (615) 213-4311
>> 
>> 
>> -----Original Message-----
>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
>> Sent: Wednesday, November 06, 2013 5:53 PM
>> To: ssc@apache.org Schelter; user@mahout.apache.org
>> Subject: Re: Solr-recommender for Mahout 0.9
>> 
>> Done,
>> 
>> BTW I have the thing running on a demo site but am getting very poor results that I think are related to the Solr setup. I'd appreciate any ideas.
>> 
>> The sample data has 27,000 items and something like 4,000 users. The preference data is fairly dense since the users are professional reviewers and the items are videos.
>> 
>> 1) The number of item-item similarities that are kept is 100. Is this a good starting point? Ted, do you recall how many you used before?
>> 2) The query is a simple text query made of space delimited video id strings. These are the same ids as are stored in the item-item similarity docs that Solr indexes.
>> 
>> Hit thumbs up on one video and you get several recommendations. Hit thumbs up on several videos and you get no recs. I'm either using the wrong query type or have it set up to be too restrictive. I'm reading through the docs, but if someone has a suggestion or pointer I'd appreciate it.
>> 
>> BTW the same sort of thing happens with title search. Search for "vampire werewolf zombie" and you get no results; search for "zombie" and you get several.
>> 
>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:
>> 
>> Hi Pat,
>> 
>> can you create issues for 1) and 2) ? Then I will try to get this into
>> trunk asap.
>> 
>> Best,
>> Sebastian
>> 
>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>> Trying to integrate the Solr-recommender with the latest Mahout snapshot. The project uses a modified RecommenderJob because it needs SequenceFile output and to get the location of the preparePreferenceMatrix directory. If #1 and #2 are addressed I can remove the modified Mahout code from the project and rely on the default implementations in Mahout 0.9. #3 is a longer-term issue related to the creation of a CrossRowSimilarityJob.
>>> 
>>> I have dropped the modified code from the Solr-recommender project and have a modified build of the current Mahout 0.9 snapshot. If the following changes are made to Mahout I can test and release a Mahout 0.9 version of the Solr-recommender.
>>> 
>>> 1. Option to change RecommenderJob output format
>>> 
>>> Can someone add an option to output a SequenceFile? I modified the code to do the following; note the SequenceFileOutputFormat.class as the last parameter, but this should really be controlled by an option.
>>> 
>>> Job aggregateAndRecommend = prepareJob(
>>>         new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>>>         PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>>>         AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>>>         SequenceFileOutputFormat.class);
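[Editor's note: a minimal sketch of how such an option could be resolved. The option name "outputFormat", its values, and this standalone helper class are all invented for illustration — this is not actual Mahout code. In the real RecommenderJob the resolved class (e.g. SequenceFileOutputFormat.class vs. TextOutputFormat.class) would be passed as the last prepareJob argument.]

```java
// Hypothetical sketch: choose the Hadoop output-format class from a
// parsed command-line option. Option name and helper class are invented.
import java.util.Map;

public class OutputFormatOption {
  // Map the parsed option value to the output-format class name; real code
  // would return the Class object and hand it to prepareJob(...).
  public static String resolveOutputFormat(Map<String, String> parsedArgs) {
    String fmt = parsedArgs.getOrDefault("--outputFormat", "text");
    if ("seqfile".equals(fmt)) {
      return "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat";
    }
    return "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat";
  }
}
```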
>>> 
>>> 2. Visibility of preparePreferenceMatrix directory location
>>> 
>>> The Solr-recommender needs to find where the RecommenderJob is putting its output.
>>> 
>>> Mahout 0.8 RecommenderJob code was:
>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>> 
>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix" inline in the code:
>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>> 
>>> This change to Mahout 0.9 works:
>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>> and
>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>> 
>>> You could also make this a getter method on the RecommenderJob class instead of using a public constant.
>>> 
>>> 3. Downsampling
>>> 
>>> The downsampling for maximum prefs per user has been moved from PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses matrix math instead of RSJ, so it will no longer support downsampling until there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>> 
>>> 
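[Editor's note: the downsampling in item 3 caps the number of preferences kept per user. A standalone illustration of that idea follows — this is not Mahout's RowSimilarityJob implementation; the class and method names are invented, and a fixed seed is used only to keep the sketch reproducible.]

```java
// Illustrative per-user preference downsampling: keep at most
// maxPrefsPerUser randomly chosen item ids for one user.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Downsample {
  public static List<Long> samplePrefs(List<Long> itemIds, int maxPrefsPerUser, long seed) {
    if (itemIds.size() <= maxPrefsPerUser) {
      // Nothing to drop: return a copy of the user's preferences unchanged.
      return new ArrayList<>(itemIds);
    }
    List<Long> copy = new ArrayList<>(itemIds);
    Collections.shuffle(copy, new Random(seed));  // uniform random sample
    return new ArrayList<>(copy.subList(0, maxPrefsPerUser));
  }
}
```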

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr













Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
Another approach would be to weight the terms in the docs by their Mahout similarity strength. But that will be for another day.

My current question is whether Lucene looks at word proximity. I see the query syntax supports proximity, but I don't see that it is the default, so that's good.
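[Editor's note: the term-weighting idea above could amount to building a boosted OR query using Lucene's term^boost syntax. A minimal sketch, with ids and boost values purely illustrative:]

```java
// Build a boosted OR query string from item ids and their Mahout
// similarity strengths (term^boost syntax). Ids and strengths are
// illustrative, not from the actual recommender output.
import java.util.Map;

public class BoostedQuery {
  public static String build(Map<String, Double> idToStrength) {
    StringBuilder q = new StringBuilder();
    for (Map.Entry<String, Double> e : idToStrength.entrySet()) {
      if (q.length() > 0) {
        q.append(" OR ");
      }
      q.append(e.getKey()).append('^').append(e.getValue());
    }
    return q.toString();
  }
}
```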










RE: Solr-recommender for Mahout 0.9

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
To the best of my knowledge, Lucene does not care about the position of a keyword within a document.

You could bucket the ids into several fields, then use a dismax query to boost the top-tier ids more than the second tier, etc.

A more fine-grained approach would probably involve a custom Similarity class that scales the score based on its position in the document.  If we did this, it might be simpler to index as one single-valued field so each id was position+1 rather than position+100, etc.
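[Editor's note: the bucketing suggestion could be sketched as below. The field names ("tier1_ids", ...) and tier sizes are invented for illustration; the per-field dismax boosts would be configured separately in the query.]

```java
// Split a strength-ordered id list into tiered "fields" so a dismax
// query can boost the top tier hardest. Field names and tier sizes
// are illustrative.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TierBuckets {
  public static Map<String, List<String>> bucket(List<String> orderedIds, int... tierSizes) {
    Map<String, List<String>> fields = new LinkedHashMap<>();
    int start = 0;
    for (int t = 0; t < tierSizes.length && start < orderedIds.size(); t++) {
      int end = Math.min(start + tierSizes[t], orderedIds.size());
      fields.put("tier" + (t + 1) + "_ids", new ArrayList<>(orderedIds.subList(start, end)));
      start = end;  // next tier picks up where this one ended
    }
    return fields;
  }
}
```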

James Dyer
Ingram Content Group
(615) 213-4311








Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
Interesting to think about ordering and adjacency. The index ids are sorted by Mahout strength, so the first id is the most similar to the row key and so forth. But the query is ordered by recency. In both cases the first id is in some sense the most important. Does Solr/Lucene care about closeness to the top of the doc for queries or indexed docs? I don't recall any mention of this.

However, adjacency has no meaning in recommendations, though I think it's used in default queries, so I may have to account for that.

The object returned is an ordered list of ids. I use only the IDs now but there are cases when the contents are also of interest; shopping cart/watchlist queries for example.
 





RE: Solr-recommender for Mahout 0.9

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
The multivalued field will obey the "positionIncrementGap" value you specify (default=100).  So for querying purposes, those ids will be 100 (or whatever you specified) positions apart, and a phrase search for adjacent ids would not match unless you set the slop to >= positionIncrementGap.  Other than this, both scenarios index the same.

For stored fields, Solr returns an array of values for multivalued fields, which is convenient when writing a UI.
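[Editor's note: the position arithmetic described above can be modeled directly. This is a simplified illustration following the description in this post — not Lucene's actual indexing code.]

```java
// Simplified model of token positions in a multivalued field: the first
// token of each subsequent value lands positionIncrementGap positions
// past the previous value's last token; tokens within a value advance by 1.
import java.util.ArrayList;
import java.util.List;

public class PositionGap {
  public static List<Integer> tokenPositions(List<List<String>> fieldValues, int positionIncrementGap) {
    List<Integer> positions = new ArrayList<>();
    int next = 0;
    for (int v = 0; v < fieldValues.size(); v++) {
      if (v > 0 && !positions.isEmpty()) {
        // Jump the gap between the previous value's last token and this one.
        next = positions.get(positions.size() - 1) + positionIncrementGap;
      }
      for (int t = 0; t < fieldValues.get(v).size(); t++) {
        positions.add(next);
        next++;
      }
    }
    return positions;
  }
}
```

With gap=100, ids in different values end up 100 positions apart, which is why a phrase query needs slop of at least the gap to match across values.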

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Dominik Hübner [mailto:contact@dhuebner.com] 
Sent: Thursday, November 07, 2013 11:23 AM
To: user@mahout.apache.org
Subject: Re: Solr-recommender for Mahout 0.9

Does anyone know what the difference is between keeping the ids in a space delimited string and indexing a multivalued field of ids? I recently tried the latter since ... it felt right, however I am not sure which of both has which advantages.





Re: Solr-recommender for Mahout 0.9

Posted by Dominik Hübner <co...@dhuebner.com>.
Does anyone know what the difference is between keeping the ids in a space-delimited string and indexing a multivalued field of ids? I recently tried the latter since ... it felt right, but I am not sure which of the two has which advantages.



Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
I have dismax (not edismax) configured but am not using it yet; I’m using the default query parser, which does use ‘AND’. I had much the same thought as I slept on it. Changing to OR is now working much, much better. So obvious it almost bit me, not good in this case...

With only a trivially small amount of testing I’d say we have a new recommender on the block.

If anyone would like to help eyeball test the thing let me know off-list. There are a few instructions I’ll need to give. And it can’t handle much load right now due to intentional design limits.
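The switch to OR can also be made per request rather than in the config. A sketch of such a query, with hypothetical core and field names (q.op is the only parameter taken from the discussion):

```
http://localhost:8983/solr/collection1/select?q=similar_items:(id1 id2 id3)&q.op=OR&fl=id,score
```

With q.op=OR, any document sharing at least one id with the query can match, and scoring ranks documents that share more ids higher.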



RE: Solr-recommender for Mahout 0.9

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Pat,

Can you give us the query it generates when you enter "vampire werewolf zombie", q/qt/defType ?

My guess is you're using the default query parser with "q.op=AND", or you're using dismax/edismax with a high "mm" (minimum-should-match) value.
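For illustration, a dismax request with a permissive minimum-should-match might use parameters like these (field name and ids hypothetical; defType, qf, and mm are standard dismax parameters):

```
defType=dismax&qf=similar_items&mm=1&q=id1 id2 id3
```

With mm=1, a document matching any single id qualifies, which avoids the all-terms-required behavior described above.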

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
Sent: Wednesday, November 06, 2013 5:53 PM
To: ssc@apache.org Schelter; user@mahout.apache.org
Subject: Re: Solr-recommender for Mahout 0.9

Done,

BTW I have the thing running on a demo site but am getting very poor results that I think are related to the Solr setup. I'd appreciate any ideas.

The sample data has 27,000 items and something like 4000 users. The preference data is fairly dense since the users are professional reviewers and the items videos.

1) The number of item-item similarities that are kept is 100. Is this a good starting point? Ted, do you recall how many you used before?
2) The query is a simple text query made of space delimited video id strings. These are the same ids as are stored in the item-item similarity docs that Solr indexes.

Hit thumbs up on one video you you get several recommendations. Hit thumbs up on several videos you get no recs. I'm either using the wrong query type or have it set up to be too restrictive. As I read through the docs if someone has a suggestion or pointer I'd appreciate it. 

BTW the same sort of thing happens with Title search. Search for "vampire werewolf zombie" you get no results, search for "zombie" you get several.

On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ss...@apache.org> wrote:

Hi Pat,

can you create issues for 1) and 2) ? Then I will try to get this into
trunk asap.

Best,
Sebastian

On 06.11.2013 19:13, Pat Ferrel wrote:
> Trying to integrate the Solr-recoemmender with the latest Mahout snapshot. The project uses a modified RecommenderJob because it needs SequenceFile output and to get the location of the preparePreferenceMatrix directory. If #1 and #2 are addressed I can remove the modified Mahout code from the project and rely on the default implementations in Mahout 0.9. #3 is a longer term issue related to the creation of a CrossRowSimilarityJob. 
> 
> I have dropped the modified code from the Solr-recommender project and have a modified build of the current Mahout 0.9 snapshot. If the following changes are made to Mahout I can test and release a Mahout 0.9 version of the Solr-recommender.
> 
> 1. Option to change RecommenderJob output format
> 
> Can someone add an option to output a SequenceFile. I modified the code to do the following, note the SequenceFileOutputFormat.class as the last parameter but this should really be determined with an option I think.
> 
>      Job aggregateAndRecommend = prepareJob(
>              new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>              PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>              AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>              SequenceFileOutputFormat.class);
> 
> 2. Visibility of preparePreferenceMatrix directory location
> 
> The Solr-recommender needs to find where the RecommenderJob is putting it's output. 
> 
> Mahout 0.8 RecommenderJob code was:
>    public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
> 
> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix" inline in the code:
>    Path prepPath = getTempPath("preparePreferenceMatrix");
> 
> This change to Mahout 0.9 works:
>    public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
> and
>    Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
> 
> You could also make this a getter method on the RecommenderJob Class instead of using a public constant.
> 
> 3. Downsampling
> 
> The downsampling for maximum prefs per user has been moved from PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses matrix math instead of RSJ so it will no longer support downsampling until there is a hypothetical CrossRowSimilairtyJob with downsampling in it.
> 
> 





Re: Solr-recommender for Mahout 0.9

Posted by Pat Ferrel <pa...@gmail.com>.
Done,

BTW I have the thing running on a demo site but am getting very poor results that I think are related to the Solr setup. I’d appreciate any ideas.

The sample data has 27,000 items and something like 4000 users. The preference data is fairly dense since the users are professional reviewers and the items videos.

1) The number of item-item similarities that are kept is 100. Is this a good starting point? Ted, do you recall how many you used before?
2) The query is a simple text query made of space delimited video id strings. These are the same ids as are stored in the item-item similarity docs that Solr indexes.

Hit thumbs up on one video and you get several recommendations. Hit thumbs up on several videos and you get no recs. I’m either using the wrong query type or have it set up to be too restrictive. As I read through the docs, if someone has a suggestion or pointer I’d appreciate it. 

BTW the same sort of thing happens with Title search. Search for “vampire werewolf zombie” you get no results, search for “zombie” you get several.
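For context, each item-item similarity document pairs an item id with its space-delimited similar-item ids. A sketch of such a document in Solr's XML update format, with hypothetical ids and field names:

```xml
<add>
  <doc>
    <!-- Hypothetical example: one item and its space-delimited similar-item ids. -->
    <field name="id">video123</field>
    <field name="similar_items">video7 video42 video99</field>
  </doc>
</add>
```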



Re: Solr-recommender for Mahout 0.9

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat,

can you create issues for 1) and 2) ? Then I will try to get this into
trunk asap.

Best,
Sebastian


