Posted to user@mahout.apache.org by Serega Sheypak <se...@gmail.com> on 2014/08/16 14:10:36 UTC

mapreduce ItemSimilarity input optimization

Hi, we are trying to calculate ItemSimilarity.
Right now we have 2*10^7 input lines. I provide input data as raw text
each day to recalculate item similarities. We get 100..1000 new items
each day.
1. It takes too much time to prepare the input data.
2. It takes too much time to convert user_id, item_id to Mahout IDs.

Is there any possibility to provide data to the Mahout mapreduce
ItemSimilarity using some binary format with compression?

Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
If you have purchase data, train with that. Purchase data is always much better than view data for recommendations. Don’t worry that you have 100 views per purchase; trust me, this will give you much better recommendations.

Filter out unrelated categories. So for electronics, throw away clothes and home appliances, but don’t filter out other possibly related categories like accessories. There are other ways to do this but they are more complicated; let’s get this working first.

You should consider doing personalized recommendations next instead of only “other people purchased these items”. All users see the same recs, since you have no personalization yet.

If you are keeping user purchase history you can train itemsimilarity with that—it will give you “other people purchased these items”, i.e. similar items. Then index the similarity data with a search engine (Solr, Elasticsearch) and use the current user’s purchase history as the query. The results will be an ordered list of items that are personalized recs.

I think you can get a free copy of “Practical Machine Learning” from the MapR site: https://www.mapr.com/practical-machine-learning It describes the Solr recommender method that uses itemsimilarity. 
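
As an illustration of that Solr query flow, here is a minimal Python sketch. It assumes a Solr core named "recs" whose "indicators" field holds each item's similar-items list from itemsimilarity; the core name, field names and item IDs are hypothetical, not from the book or this thread:

import requests

def personalized_recs(solr_url, purchase_history, num=10):
    # The user's purchased items become the query terms; Solr's scoring
    # then ranks catalog items whose indicator lists best match that
    # history, which yields an ordered, personalized result list.
    params = {"q": "indicators:(%s)" % " ".join(purchase_history),
              "rows": num, "wt": "json"}
    resp = requests.get(solr_url + "/select", params=params)
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

# e.g. personalized_recs("http://localhost:8983/solr/recs", ["iphone5s", "case42"])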

On Aug 19, 2014, at 10:23 AM, Serega Sheypak <se...@gmail.com> wrote:

Yes, ecommerce.
>> #2 data includes #1 data, right?
Yes, #1 is the "raw" output of ItemSimilarity recommendations.
#2 is recommendations #1 with the category filter applied.

I can't drop #1 "look-with" since #2 (with category filter) doesn't have
accessories. The category filter would remove accessory recommendations for
iPhone and leave only other iPhones.

The idea is to provide different "dimensions" of data: look-with, similar
(look-with with the category filter applied), recommendations based on sales
input would be named "also-bought", etc.

"cross-cooccurrence" - what does it mean? Run itemsimilarity with "views",
then with "sales", and provide a merged result where item->item pairs exist
in both outputs?



2014-08-19 20:37 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> emon is a typo
> 
> I still don’t understand the difference between these “recommendations” 1)
> "look-with recommendations" = recommended items clicked? 2) similar = items
> viewed by others? The recommendations clicked will lead to viewing an item
> so #2 data includes #1 data, right? I would drop #1 and use only #2 data.
> Besides if you only recommend items that have been recommended you will
> decrease sales because you will never show other items. Over time the
> recommended items will become out of date since you never mix-in new items.
> You may always recommend an iPhone 5 even after it has been discontinued.
> 
> If you know the category of an item, filter the recs by that category or
> related categories. You are already doing this in #2 below, so if you drop
> #1 there is no problem, correct? Users will not see the water heater with
> the iPhone.
> 
> Question) Why do you get a water heater with iPhone? Unless there is a bug
> somewhere, the data says that similar people looked at both. Item view data
> is not very predictive and in any case you will get this type of thing if
> it exists in user behavior. There may even be a correlation between the
> need for an iPhone and a water heater that you don’t know about, or it may
> just be a coincidence. But for now let’s say it’s an anomaly in the data
> and just filter those out by category.
> 
> What I was beginning to say is that it sounds like you have an ECOM site.
> If so do you have purchase data? Purchase data is usually much, much better
> than item view data. People tend to look at a lot of things but when they
> purchase something it means a much higher preference than merely looking at
> something.
> 
> The first rule of making a good recommender is to find the best action, one
> that shows a user preference in the strongest possible way. For ecommerce
> that usually means a purchase. Then once you have that working you can add
> more actions, but only with cross-cooccurrence; adding by weighting will not
> work with this type of recommender, it will only pollute your strong data
> with weaker actions.
> 
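For a concrete picture of cooccurrence versus cross-cooccurrence, here is a toy numpy sketch using raw counts. This only shows the shape of the idea; Mahout's spark-itemsimilarity applies LLR weighting and downsampling rather than raw counts, and the matrices here are invented:

import numpy as np

# Rows are users, columns are items; 1 means the user did the action.
P = np.array([[1, 0, 1],    # purchases (the primary action)
              [0, 1, 1],
              [1, 1, 0]])
V = np.array([[1, 1, 1],    # views (a secondary action), same users/items
              [0, 1, 1],
              [1, 0, 1]])

cooccurrence = P.T.dot(P)        # items purchased by the same users
cross_cooccurrence = P.T.dot(V)  # views that co-occur with each purchase

print(cooccurrence)        # drives "other people purchased these items"
print(cross_cooccurrence)  # lets view data reinforce purchase-based recs
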
> On Aug 19, 2014, at 8:18 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
> Hi, what is "emon"?
> 1. I create "look-with recommendations". Really it's just the "raw" output
> from itemSimilarityJob with booleanData=true and LLR as the similarity
> function (your suggestion).
> 2. I create "similar" recommendations. I apply a category filter before
> serving recommendations.
> 
> "look-with" means other users watched an iPhone case and other accessories
> with an iPhone. I do have accessories for iPhone here, but also a water
> heating device...
> similar - means show only other smartphones as recommendations for an iPhone.
> 
> Right now the problem is the water heating device in 'look-with' (category
> filter not applied). How can I get rid of such recommendations, and why do
> I get them?
> 
> 
> 
> 2014-08-19 18:01 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> 
>> That sounds much better.
>> 
>> Do you have metadata like product category? Electronics vs. home
>> appliance? One easy thing to do if you have categories in your catalog is
>> filter by the same category as the item being viewed.
>> 
>> BTW it sounds like you have an emon
>> On Aug 19, 2014, at 12:53 AM, Serega Sheypak <se...@gmail.com>
>> wrote:
>> 
>> Hi, I've used LLR with the properties you've suggested.
>> Right now I have a problem:
>> a water heating device (
>> http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
>> )
>> is recommended for iPhone, and it has one of the highest scores.
>> Good things:
>> iPhone cases (
>> https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
>> )
>> are recommended for iPhone. It's good.
>> Other smartphones are recommended for iPhone. It's good.
>> Other iPhones are recommended for iPhone. It's good. 16GB recommended for
>> 32GB, etc.
>> 
>> What could be the reason for recommending a "water heat device" to iPhone?
>> iPhone is one of the most popular items. There should be a lot of people
>> viewing iPhone together with the "water heat device"?
>> 
>> 
>> 
>> 2014-08-18 20:15 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>> 
>>> Oh, and as to using different algorithms, this is an “ensemble” method. In
>>> the paper they are talking about using widely differing algorithms like ALS
>>> + Cooccurrence + … This technique was used to win the Netflix prize, but in
>>> practice the improvements may be too small to warrant running multiple
>>> pipelines. In any case it isn’t the first improvement you may want to try.
>>> For instance your UI will have a drastic effect on how well your recs do,
>>> and there are other much easier techniques that we can talk about once you
>>> get the basics working.
>>> 
>>> 
>>> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>> 
>>> When beginning to use a recommender from Mahout I always suggest you start
>>> from the defaults. These often give the best results—then tune afterwards
>>> to improve.
>>> 
>>> Your intuition is correct that multiple actions can be used to improve
>>> results, but get the basics working first. The easiest way to use multiple
>>> actions is to use spark-itemsimilarity, so since you are using mapreduce
>>> for now, just use one action.
>>> 
>>> I would not try to combine the results from two similarity measures; there
>>> is no benefit since LLR is better than any of them, at least I’ve never
>>> seen it lose. Below is my experience with trying many of the similarity
>>> metrics on exactly the same data. I did cross-validation with precision
>>> (MAP, mean average precision). LLR wins in other cases I’ve tried too. So
>>> LLR is the only method presently used in the Spark version of
>>> itemsimilarity.
>>> 
>>> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
>>> 
>>> If you still get weird results, double check your ID mapping. Run a small
>>> bit of data through and spot check the mapping by hand.
>>> 
>>> At some point you will want to create a cross-validation test. This is
>>> good as a sort of integration sanity check when making changes to the
>>> recommender. You run cross-validation using standard test data to see if
>>> the score changes drastically between releases. Big changes may indicate a
>>> bug. At the beginning it will help you tune, as in the case above where it
>>> helped decide on LLR.
>>> 
>>> 
>>> 
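For reference, the LLR score Pat recommends comes from a 2x2 contingency table of user counts for an item pair. A small Python sketch of the standard formula (this mirrors Mahout's LogLikelihood.logLikelihoodRatio; the example numbers are invented):

from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: users with both items, k12: first item only,
    # k21: second item only, k22: users with neither.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Popular items that merely co-occur by chance score low; llr(10, 100, 50, 10000)
# is large only because 10 shared users exceed what popularity alone predicts.
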
>>> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <se...@gmail.com>
>>> wrote:
>>> 
>>> Thank you very much. I'll do what you are saying in bullets 1...5 and try
>>> again.
>>> 
>>> I also tried:
>>> 1. calc data using SIMILARITY_COSINE
>>> 2. calc the same data using SIMILARITY_COOCCURRENCE
>>> 3. join #1 and #2 where cooccurrence >= $threshold
>>> 
>>> where threshold is some empirical integer value; I've used "2". The idea is
>>> to filter out item pairs which never-ever met together...
>>> Please see this link:
>>> 
>>> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
>>> 
>>> If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
>>> approach still make sense, or is it a useless waste of time?
>>> 
>>> "What do you mean the similar items are terrible? How are you measuring
>>> that?" I have eyeball testing only.
>>> I did automate preparation -> calculation -> HBase upload -> web-app
>>> serving; I didn't automate testing.
>>> 
>>> 
>>> 
>>> 
>>> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
>>> 
>>>> the things that stand out:
>>>> 
>>>> 1) remove your maxSimilaritiesPerItem option! 50000 maxSimilaritiesPerItem
>>>> will _kill_ performance and give no gain; leave this setting at the default
>>>> of 500
>>>> 2) use only one action. What do you want the user to do? Do you want them
>>>> to read a page? Then train on item page views. If those pages lead to a
>>>> purchase then you want to recommend purchases, so train on user purchases.
>>>> 3) remove your minPrefsPerUser option; this should never be 0 or it will
>>>> leave users in the training data that have no data and may contribute to
>>>> longer runs with no gain.
>>>> 4) this is a pretty small Hadoop cluster for the size of your data, but I
>>>> bet changing #1 will noticeably reduce the runtime
>>>> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
>>>> 6) remove your --booleanData option since LLR ignores weights (the trimmed
>>>> invocation is sketched just after this list).
>>>> 
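Taken together, points 1, 3, 5 and 6 boil down to a much shorter argument list than the Oozie action quoted further down. A minimal sketch of the equivalent invocation, driven from Python for illustration; the paths are placeholders and it assumes the stock bin/mahout driver is on the PATH:

import subprocess

args = [
    "mahout", "itemsimilarity",
    "--input", "/path/to/projPrefs",
    "--output", "/path/to/output/primary",
    "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
    # maxSimilaritiesPerItem, minPrefsPerUser and booleanData are simply
    # omitted so they stay at their defaults, per points 1, 3 and 6.
    "--tempDir", "/path/to/tmp/run-mahout-ItemSimilarityJob/primary",
]
subprocess.check_call(args)
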
>>>> Remember that this is not the same as personalized recommendations. This
>>>> method alone will show the same “similar items” for all users.
>>>> 
>>>> Sorry but both your “recommendation” types sound like the same thing.
>>>> Using both item page views _and_ clicks on recommended items will both
>>>> lead to an item page view, so you have two actions that lead to the same
>>>> thing, right? Just train on an item page view (unless you really want the
>>>> user to make a purchase)
>>>> 
>>>> What do you mean the similar items are terrible? How are you measuring
>>>> that? Are you doing cross-validation measuring precision or A/B testing?
>>>> What looks bad to you may be good; the eyeball test is not always
>>>> reliable. If they are coming up completely crazy or random then you may
>>>> have a bug in your ID translation logic.
>>>> 
>>>> It sounds like you have enough data to produce good results.
>>>> 
>>>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <serega.sheypak@gmail.com>
>>>> wrote:
>>>> 
>>>> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too
>>>> much, but enough for the start.
>>>> 2. I run it as an oozie action.
>>>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
>>>>    <java>
>>>>        <job-tracker>${jobTracker}</job-tracker>
>>>>        <name-node>${nameNode}</name-node>
>>>>        <prepare>
>>>>            <delete path="${mahoutOutputDir}/primary" />
>>>>            <delete
>>>> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
>>>>        </prepare>
>>>>        <configuration>
>>>>            <property>
>>>>                <name>mapred.queue.name</name>
>>>>                <value>default</value>
>>>>            </property>
>>>> 
>>>>        </configuration>
>>>> 
>>>>        <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
>>>>        <arg>--input</arg>
>>>>        <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
>>>>             item_id, pref [can be 3 or 5: 3 is VIEW item, 5 is CLICK on
>>>>             recommendation, a kind of attempt to increase recommender
>>>>             quality...] -->
>>>> 
>>>>        <arg>--output</arg>
>>>>        <arg>${mahoutOutputDir}/primary</arg>
>>>> 
>>>>        <arg>--similarityClassname</arg>
>>>>        <arg>SIMILARITY_COSINE</arg>
>>>> 
>>>>        <arg>--maxSimilaritiesPerItem</arg>
>>>>        <arg>50000</arg>
>>>> 
>>>>        <arg>--minPrefsPerUser</arg>
>>>>        <arg>0</arg>
>>>> 
>>>>        <arg>--booleanData</arg>
>>>>        <arg>false</arg>
>>>> 
>>>>        <arg>--tempDir</arg>
>>>>        <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
>>>> 
>>>>    </java>
>>>>    <ok to="to-narrow-table"/>
>>>>    <error to="kill"/>
>>>> </action>
>>>> 
>>>> 3) RANK does it, here is a script:
>>>> 
>>>> -- user, item, pref previously prepared by Hive
>>>> user_item_pref = LOAD '$user_item_pref' USING PigStorage(',') AS
>>>>     (user_id:chararray, item_id:long, pref:double);
>>>> 
>>>> -- get distinct users from the whole input
>>>> distUserId = DISTINCT (FOREACH user_item_pref GENERATE user_id);
>>>> 
>>>> -- get distinct items from the whole input
>>>> distItemId = DISTINCT (FOREACH user_item_pref GENERATE item_id);
>>>> 
>>>> -- rank users 1..N
>>>> rankUsers_ = RANK distUserId;
>>>> rankUsers = FOREACH rankUsers_ GENERATE $0 AS rank_id, user_id;
>>>> 
>>>> -- rank items 1..M
>>>> rankItems_ = RANK distItemId;
>>>> rankItems = FOREACH rankItems_ GENERATE $0 AS rank_id, item_id;
>>>> 
>>>> -- join and remap natural user_id, item_id to ranks 1..N, 1..M
>>>> joinedUsers = JOIN user_item_pref BY user_id, rankUsers BY user_id USING 'skewed';
>>>> joinedItems = JOIN joinedUsers BY user_item_pref::item_id, rankItems BY item_id USING 'replicated';
>>>> 
>>>> projPrefs = FOREACH joinedItems GENERATE
>>>>     joinedUsers::rankUsers::rank_id AS user_id,
>>>>     rankItems::rank_id AS item_id,
>>>>     joinedUsers::user_item_pref::pref AS pref;
>>>> 
>>>> -- store mappings for later remapping from rank back to natural values
>>>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) INTO '$rankUsers' USING PigStorage('\t');
>>>> STORE (FOREACH rankItems GENERATE rank_id, item_id) INTO '$rankItems' USING PigStorage('\t');
>>>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) INTO '$projPrefs' USING PigStorage('\t');
>>>> 
>>>> 4) I've seen this idea in a different discussion, that different weights
>>>> for different actions are not good. Sorry, I don't understand what you
>>>> suggest.
>>>> I have two kinds of actions: user viewed item, user clicked on recommended
>>>> item (recommended item produced by my item similarity system).
>>>> I want to produce two kinds of recommendations:
>>>> 1. current item + recommend other items which other users visit in
>>>> conjunction with the current item
>>>> 2. similar item: recommend items similar to the currently viewed item.
>>>> What can I try?
>>>> LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = SIMILARITY_LOGLIKELIHOOD?
>>>> 
>>>> Right now I get awful recommendations and I can't understand what I can
>>>> try next :((((((((((((
>>>> 
>>>> 
>>>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>>>> 
>>>>> 1) how many cores in the cluster? The whole idea behind mapreduce is that
>>>>> if you buy more CPUs you get a nearly linear decrease in runtime.
>>>>> 2) what is your mahout command line with options, or how are you invoking
>>>>> mahout? I have seen the Mahout mapreduce recommender take this long, so we
>>>>> should check what you are doing with downsampling.
>>>>> 3) do you really need to RANK your ids? That’s a full sort. When using pig
>>>>> I usually get DISTINCT ones and assign an incrementing integer as the
>>>>> corresponding Mahout ID.
>>>>> 4) your #2, assigning different weights to different actions, usually does
>>>>> not work. I’ve done this before, compared offline metrics, and seen
>>>>> precision go down. I’d get this working using only your primary actions
>>>>> first. What are you trying to get the user to do? View something, buy
>>>>> something? Use that action as the primary preference and start out with a
>>>>> weight of 1 using LLR. With LLR the weights are not used anyway, so your
>>>>> data may not produce good results with mixed actions.
>>>>> 
>>>>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
>>>>> 1) output from 2 can be directly ingested and will create output.
>>>>> 2) multiple actions can be used with cross-cooccurrence, not by guessing
>>>>> at weights.
>>>>> 3) output has your application-specific IDs preserved.
>>>>> 4) it’s about 10x faster than mapreduce and will do away with your ID
>>>>> translation steps.
>>>>> 
>>>>> One caveat is that your cluster machines will need lots of memory. I have
>>>>> 8-16g on mine.
>>>>> 
>>>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <serega.sheypak@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> 1. I collect preferences for items using a 60-day sliding window: today -
>>>>> 60 days.
>>>>> 2. I prepare triples user_id, item_id, discrete_pref_value (3 for item
>>>>> view, 5 for clicking a recommendation block; the idea is to give more
>>>>> value to recommendations which attract visitor attention). I get ~
>>>>> 20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct
>>>>> users.
>>>>> 3. I use the apache pig RANK function to rank all distinct user_id
>>>>> 4. I do the same for item_id
>>>>> 5. I join the input dataset with the ranked datasets and provide input to
>>>>> mahout with dense integer user_id, item_id
>>>>> 6. I take the mahout output and join integer item_id back to get the
>>>>> natural key value.
>>>>> 
>>>>> step #1-2 takes ~40min
>>>>> step #3-5 takes ~1 hour
>>>>> mahout calc takes ~3 hours
>>>>> 
>>>>> 
>>>>> 
>>>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>>>> 
>>>>>> This really doesn't sound right.  It should be possible to process
>>>>>> almost a thousand times that much data every night without that much
>>>>>> problem.
>>>>>> 
>>>>>> How are you preparing the input data?
>>>>>> 
>>>>>> How are you converting to Mahout id's?
>>>>>> 
>>>>>> Even using python, you should be able to do the conversion in just a
>>> few
>>>>>> minutes without any parallelism whatsoever.
>>>>>> 
>>>>>> 
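For example, a single-pass Python sketch of the ID conversion Ted describes, assigning dense integer IDs on first sight and keeping the mapping for the reverse join; the file names and comma-separated layout are assumptions:

user_ids, item_ids = {}, {}

def dense_id(table, natural_id):
    # Assign the next dense integer the first time an ID is seen.
    if natural_id not in table:
        table[natural_id] = len(table) + 1
    return table[natural_id]

with open("user_item_pref.csv") as src, open("projPrefs.tsv", "w") as dst:
    for line in src:
        user, item, pref = line.rstrip("\n").split(",")
        dst.write("%s\t%s\t%s\n" % (dense_id(user_ids, user),
                                    dense_id(item_ids, item), pref))

# Keep the item mapping so Mahout output can be joined back to natural keys.
with open("rankItems.tsv", "w") as f:
    for natural, dense in item_ids.items():
        f.write("%s\t%s\n" % (dense, natural))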
>>>>>> 
>>>>>> 
>>>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <serega.sheypak@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi, we are trying to calculate ItemSimilarity.
>>>>>>> Right now we have 2*10^7 input lines. I provide input data as raw text
>>>>>>> each day to recalculate item similarities. We get 100..1000 new items
>>>>>>> each day.
>>>>>>> 1. It takes too much time to prepare the input data.
>>>>>>> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
>>>>>>> 
>>>>>>> Is there any possibility to provide data to the Mahout mapreduce
>>>>>>> ItemSimilarity using some binary format with compression?
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Yes, ecommerce.
>>#2 data includes #1 data, right?
Yes, #1 are "raw" output of ItemSimilarity recommendtions
#2 are recommednations #1 with category filter applied.

I can't drop #1 "look-with" since #2(ith category filter) doesn't have
accessories. Category filter would remove accessory recommendations for
iphone and leave only other iphones.

The idea is to provide different "demensions" of data: look-with, similar
(look-with with category filter applied), recommnedations based on sale
input would have name "also-bought", e.t.c.

"cross-cooccurrence" - what does it mean? Run itemsimilairty with "views",
then with "sales" and provide merged result? where item->item pairs exist
in both outputs?



2014-08-19 20:37 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> emon is a typo
>
> I still don’t understand the difference between these “recommendations” 1)
> "look-with recommendations" = recommended items clicked? 2) similar = items
> viewed by others? The recommendations clicked will lead to viewing an item
> so #2 data includes #1 data, right? I would drop #1 and use only #2 data.
> Besides if you only recommend items that have been recommended you will
> decrease sales because you will never show other items. Over time the
> recommended items will become out of date since you never mix-in new items.
> You may always recommend an iPhone 5 even after it has been discontinued.
>
> If you know the category of an item--filter the recs by that category or
> related categories. You are already doing this in #2 below so if you drop
> #1 there is no problem, correct? Users will not see the water heater with
> the iphone.
>
> Question) Why do you get a water heater with iphone? Unless there is a bug
> somewhere the data says that similar people looked at both. Item view data
> is not very predictive and in any case you will get this type if thing if
> it exists in user behavior. There may even be a correlation between the
> need for an iphone and a water heater that you don’t know about or it may
> just be a coincidence. But for now let’s say it’s an anomaly in the data
> and just filter those out by category.
>
> What I was beginning to say is that it sounds like you have an ECOM site.
> If so do you have purchase data? Purchase data is usually much, much better
> than item view data. People tend to look at a lot of things but when they
> purchase something it means a much higher preference than merely looking at
> something.
>
> The first rule of making a good recommender is find the best action, one
> that shows a user preference in the strongest possible way. For ecommerce
> that usually means a purchase. Then once you have that working you can add
> more actions but only with cross-cooccurrence, adding by weighting will not
> work with this type of recommender, it will only pollute your strong data
> with weaker actions.
>
> On Aug 19, 2014, at 8:18 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> Hi, what is "emon"?
> 1. I do create "look-with recommendations". I really it's just "raw" output
> from itemSimilarityJob with booleanData=true and LLR as similarity function
> (your suggestion)
> 2. I do create "similar" recommendations. I do apply category filter before
> serving recommendations
>
> "look-with", means other users watched iPhone case and other accessory with
> iphone. I do have accessory for iPhone here, but also water heating
> device...
> similar - means show only other smarphones as recommendations to iPhone.
>
> Right now the problem is in water heating device in 'look-with' (category
> filter not applied). How can I put away such kind of recommendations and
> why Do I get them?
>
>
>
> 2014-08-19 18:01 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>
> > That sounds much better.
> >
> > Do you have metadata like product category? Electronics vs. home
> > appliance? One easy thing to do if you have categories in your catalog is
> > filter by the same category as the item being viewed.
> >
> > BTW it sounds like you have an emon
> > On Aug 19, 2014, at 12:53 AM, Serega Sheypak <se...@gmail.com>
> > wrote:
> >
> > Hi, I 've used LLR with properties you've suggested.
> > Right now I have a trouble.
> > A trouble:
> > Water heat device (
> >
> http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
> > )
> > is recommedned for iPhone. And it has one of the highest score.
> > good things:
> > iPhone cases (
> >
> >
> https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
> > )
> > are recommedned for iPhone, It's good
> > Other smartphones are recommended to iPhone, it's good
> > Other iPhones are recommedned to iPhone. It's good. 16GB recommended to
> > 32GB, e.t.c.
> >
> > What could be a reason for recommending "Water heat device " to iPhone?
> > iPhone is one of the most popular item. There should be a lot of people
> > viewing iPhone with "Water heat device "?
> >
> >
> >
> > 2014-08-18 20:15 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> >
> >> Oh, and as to using different algorithms, this is an “ensemble” method.
> > In
> >> the paper they are talking about using widely differing algorithms like
> > ALS
> >> + Cooccurrence + … This technique was used to win the Netflix prize but
> > in
> >> practice the improvements may be to small to warrant running multiple
> >> pipelines. In any case it isn’t the first improvement you may want to
> > try.
> >> For instance your UI will have a drastic effect on how well you recs do,
> >> and there are other much easier techniques that we can talk about once
> > you
> >> get the basics working.
> >>
> >>
> >> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pa...@gmail.com> wrote:
> >>
> >> When beginning to use a recommender from Mahout I always suggest you
> > start
> >> from the defaults. These often give the best results—then tune
> afterwards
> >> to improve.
> >>
> >> Your intuition is correct that multiple actions can be used to improve
> >> results but get the basics working first. The easiest way to use
> multiple
> >> actions is to use spark-itemsimilarity so since you are using mapreduce
> > for
> >> now, just use one action.
> >>
> >> I would not try to combine the results from two similarity measures
> there
> >> is no benefit since LLR is better than any of them, at least I’ve never
> >> seen it loose. Below is my experience with trying many of the similarity
> >> metrics on exactly the same data. I did cross-validation with precision
> >> (MAP, mean average precision). LLR wins in other cases I’ve tried too.
> So
> >> LLR is the only method presently used in the Spark version of
> >> itemsimilarity.
> >>
> >> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
> >>
> >> If you still get weird results double check your ID mapping. Run a small
> >> bit of data through and spot check the mapping by hand.
> >>
> >> At some point you will want to create a cross-validation test. This is
> >> good as a sort of integration sanity check when making changes to the
> >> recommender. You run cross-validation using standard test data to see if
> >> the score changes drastically between releases. Big changes may indicate
> > a
> >> bug. At the beginning it will help you tune as in the case above where
> it
> >> helped decide on LLR.
> >>
> >>
> >>
> >> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <se...@gmail.com>
> >> wrote:
> >>
> >> Thank you very much. I'll do what you are sayning in bullets 1...5 and
> > try
> >> again.
> >>
> >> I also tried:
> >> 1. calc data using COUSINE_SIMILARITY
> >> 2. calc the same data using COOCCURENCE_SIMILARTY
> >> 3. join #1 and #2 where COOCURENCE >= $threshold
> >>
> >> Where threshold is some emperical integer value. I've used  "2" The idea
> > is
> >> to filter out item pairs which never-ever met together...
> >> Please see this link:
> >>
> >>
> >
> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
> >>
> >> If I replace COUSINE_SIMILARITY with LLR and booleanData=true, does this
> >> approach still make sense, or it's useless waste of time?
> >>
> >> "What do you mean the similar items are terrible? How are you measuring
> >> that? " I have eye testing only,
> >> I did automate preparation->calculation->hbase upload-> web-app serving,
> > I
> >> didn't automate testing.
> >>
> >>
> >>
> >>
> >> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
> >>
> >>> the things that stand out:
> >>>
> >>> 1) remove your maxSimilaritiesPerItem option! 50000
> >> maxSimilaritiesPerItem
> >>> will _kill_ performance and give no gain, leave this setting at the
> >> default
> >>> of 500
> >>> 2) use only one action. What do you want the user to do? Do you want
> > them
> >>> to read a page? Then train on item page views. If those pages lead to a
> >>> purchase then you want to recommend purchases so train on user
> > purchases.
> >>> 3) remove your minPrefsPerUser option, this should never be 0 or it
> will
> >>> leave users in the training data that have no data and may contribute
> to
> >>> longer runs with no gain.
> >>> 4) this is a pretty small Hadoop cluster for the size of your data but
> I
> >>> bet changing #1 will noticeably reduce the runtime
> >>> 5) change —similarityClassname to SIMILARITY_LOGLIKELIHOOD
> >>> 6) remove your —booleanData option since LLR ignores weights.
> >>>
> >>> Remember that this is not the same as personalized recommendations.
> This
> >>> method alone will show the same “similar items” for all users.
> >>>
> >>> Sorry but both your “recommendation” types sound like the same thing.
> >>> Using both item page view  _and_ clicks on recommended items will both
> >> lead
> >>> to an item page view so you have two actions that lead to the same
> > thing,
> >>> right? Just train on an item page view (unless you really want the user
> >> to
> >>> make a purchase)
> >>>
> >>> What do you mean the similar items are terrible? How are you measuring
> >>> that? Are you doing cross-validation measuring precision or A/B
> testing?
> >>> What looks bad to you may be good, the eyeball test is not always
> >> reliable.
> >>> If they are coming up completely crazy or random then you may have a
> bug
> >> in
> >>> your ID translation logic.
> >>>
> >>> It sounds like you have enough data to produce good results.
> >>>
> >>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <serega.sheypak@gmail.com
> >
> >>> wrote:
> >>>
> >>> 1. 7 nodes 4 CPU per node, 48 GB ram, 2 HDD for MR and HDFS. Not too
> > much
> >>> but enough for the start..
> >>> 2. I run it as oozie action.
> >>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >>>     <java>
> >>>         <job-tracker>${jobTracker}</job-tracker>
> >>>         <name-node>${nameNode}</name-node>
> >>>         <prepare>
> >>>             <delete path="${mahoutOutputDir}/primary" />
> >>>             <delete
> >>> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >>>         </prepare>
> >>>         <configuration>
> >>>             <property>
> >>>                 <name>mapred.queue.name</name>
> >>>                 <value>default</value>
> >>>             </property>
> >>>
> >>>         </configuration>
> >>>
> >>>
> >>>
> >>
> >
> <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >>>         <arg>--input</arg>
> >>>         <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
> >>> item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on
> >> recommendation,
> >>> a kind of try to increase quality of recommender...]-->
> >>>
> >>>         <arg>--output</arg>
> >>>         <arg>${mahoutOutputDir}/primary</arg>
> >>>
> >>>         <arg>--similarityClassname</arg>
> >>>         <arg>SIMILARITY_COSINE</arg>
> >>>
> >>>         <arg>--maxSimilaritiesPerItem</arg>
> >>>         <arg>50000</arg>
> >>>
> >>>         <arg>--minPrefsPerUser</arg>
> >>>         <arg>0</arg>
> >>>
> >>>         <arg>--booleanData</arg>
> >>>         <arg>false</arg>
> >>>
> >>>         <arg>--tempDir</arg>
> >>>         <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >>>
> >>>     </java>
> >>>     <ok to="to-narrow-table"/>
> >>>     <error to="kill"/>
> >>> </action>
> >>>
> >>> 3) RANK does it, here is a script:
> >>>
> >>> --user, item, pref previously prepared by hive
> >>> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> >>> (user_id:chararray, item_id:long, pref:double);
> >>>
> >>> --get distinct user from the whole input
> >>> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >>>
> >>> --get distinct item from the whole input
> >>> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >>>
> >>> --rank user 1....N
> >>> rankUsers_ = RANK distUserId;
> >>> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >>>
> >>> --rank items 1....M
> >>> rankItems_ = RANK distItemId;
> >>> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >>>
> >>> --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
> >>> joinedUsers = join user_item_pref by user_id, rankUsers by user_id
> USING
> >>> 'skewed';
> >>> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
> >>> item_id using 'replicated';
> >>>
> >>> projPrefs = FOREACH joinedItems GENERATE
> joinedUsers::rankUsers::rank_id
> >>> as user_id,
> >>>                                      rankItems::rank_id
> >>> as item_id,
> >>>                                      joinedUsers::user_item_pref::pref
> >>> as pref;
> >>>
> >>> --store mapping for later remapping from RANK back to natural values
> >>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers'
> >> using
> >>> PigStorage('\t');
> >>> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems'
> >> using
> >>> PigStorage('\t');
> >>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into
> >> '$projPrefs'
> >>> using PigStorage('\t');
> >>>
> >>> 4) I've seen this idea in different discussion, that different weight
> > for
> >>> different actions are not good. Sorry, I don't understand what you do
> >>> suggest.
> >>> I have two kind of actions: user viewed item, user clicked on
> > recommended
> >>> item (recommended item produced by my item similarity system).
> >>> I want to produce two kinds of recommendations:
> >>> 1. current item + recommend other items which other users visit in
> >>> conjuction with current item
> >>> 2. similar item: recommend items similar to current viewed item.
> >>> What can I try?
> >>> LLR=http://en.wikipedia.org/wiki/Log-likelihood_ratio= LOG_LIKEHOOD?
> >>>
> >>> Right now I do get awful recommendations and I can't understand what
> can
> >> I
> >>> try next :((((((((((((
> >>>
> >>>
> >>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> >>>
> >>>> 1) how many cores in the cluster? The whole idea behind mapreduce is
> > you
> >>>> buy more cpus you get nearly linear decrease in runtime.
> >>>> 2) what is your mahout command line with options, or how are you
> >> invoking
> >>>> mahout. I have seen the Mahout mapreduce recommender take this long so
> >> we
> >>>> should check what you are doing with downsampling.
> >>>> 3) do you really need to RANK your ids, that’s a full sort? When using
> >>> pig
> >>>> I usually get DISTINCT ones and assign an incrementing integer as the
> >>>> Mahout ID corresponding
> >>>> 4) your #2 assigning different weights to different actions usually
> > does
> >>>> not work. I’ve done this before and compared offline metrics and seen
> >>>> precision go down. I’d get this working using only your primary
> actions
> >>>> first. What are you trying to get the user to do? View something, buy
> >>>> something? Use that action as the primary preference and start out
> with
> >> a
> >>>> weight of 1 using LLR. With LLR the weights are not used anyway so
> your
> >>>> data may not produce good results with mixed actions.
> >>>>
> >>>> A plug for the (admittedly pre-alpha) spark-itemsimilairty:
> >>>> 1) output from 2 can be directly ingested and will create output.
> >>>> 2) multiple actions can be used with cross-cooccurrence, not by
> > guessing
> >>>> at weights.
> >>>> 3) output has your application specific IDs preserved.
> >>>> 4) its about 10x faster than mapreduce and will do aways with your ID
> >>>> translation steps
> >>>>
> >>>> One caveat is that your cluster machines will need lots of memory. I
> >> have
> >>>> 8-16g on mine.
> >>>>
> >>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <serega.sheypak@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>> 1. I do collect preferences for items using 60days sliding window.
> > today
> >>> -
> >>>> 60 days.
> >>>> 2. I do prepare triples user_id, item_id, descrete_pref_value (3 for
> >> item
> >>>> view, 5 for clicking recommndation block. The idea is to give more
> > value
> >>>> for recommendations which attact visitor attention). I get ~
> 20.000.000
> >>> of
> >>>> lines with ~1.000.000 distinct items and ~2.000.000 distinct users
> >>>> 3. I do use apache pig RANK function to rank all distinct user_id
> >>>> 4. I do the same for item_id
> >>>> 5. I do join input dataset with ranked datasets and provide input to
> >>> mahout
> >>>> with dense interger user_id, item_id
> >>>> 6. I do get mahout output and join integer item_id back to get natural
> >>> key
> >>>> value.
> >>>>
> >>>> step #1-2 takes ~ 40min
> >>>> step #3-5 takes ~1 hour
> >>>> mahout calc takes ~3hours
> >>>>
> >>>>
> >>>>
> >>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >>>>
> >>>>> This really doesn't sound right.  It should be possible to process
> >>>> almost a
> >>>>> thousand times that much data every night without that much problem.
> >>>>>
> >>>>> How are you preparing the input data?
> >>>>>
> >>>>> How are you converting to Mahout id's?
> >>>>>
> >>>>> Even using python, you should be able to do the conversion in just a
> >> few
> >>>>> minutes without any parallelism whatsoever.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <
> >>>> serega.sheypak@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi, We are trying calculate ItemSimilarity.
> >>>>>> Right now we have 2*10^7 input lines. I do provide input data as raw
> >>>> text
> >>>>>> each day to recalculate item similarities. We do get +100..1000 new
> >>>> items
> >>>>>> each day.
> >>>>>> 1. It takes too much time to prepare input data.
> >>>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
> >>>>>>
> >>>>>> Is there any poissibility to provide data to mahout mapreduce
> >>>>>> ItemSimilarity using some binary format with compression?
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
>
>

Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
emon is a typo

I still don’t understand the difference between these “recommendations” 1) "look-with recommendations" = recommended items clicked? 2) similar = items viewed by others? The recommendations clicked will lead to viewing an item so #2 data includes #1 data, right? I would drop #1 and use only #2 data. Besides if you only recommend items that have been recommended you will decrease sales because you will never show other items. Over time the recommended items will become out of date since you never mix-in new items. You may always recommend an iPhone 5 even after it has been discontinued.

If you know the category of an item--filter the recs by that category or related categories. You are already doing this in #2 below so if you drop #1 there is no problem, correct? Users will not see the water heater with the iphone.

Question) Why do you get a water heater with iphone? Unless there is a bug somewhere the data says that similar people looked at both. Item view data is not very predictive and in any case you will get this type if thing if it exists in user behavior. There may even be a correlation between the need for an iphone and a water heater that you don’t know about or it may just be a coincidence. But for now let’s say it’s an anomaly in the data and just filter those out by category.

What I was beginning to say is that it sounds like you have an ECOM site. If so do you have purchase data? Purchase data is usually much, much better than item view data. People tend to look at a lot of things but when they purchase something it means a much higher preference than merely looking at something.

The first rule of making a good recommender is find the best action, one that shows a user preference in the strongest possible way. For ecommerce that usually means a purchase. Then once you have that working you can add more actions but only with cross-cooccurrence, adding by weighting will not work with this type of recommender, it will only pollute your strong data with weaker actions. 

On Aug 19, 2014, at 8:18 AM, Serega Sheypak <se...@gmail.com> wrote:

Hi, what is "emon"?
1. I do create "look-with recommendations". I really it's just "raw" output
from itemSimilarityJob with booleanData=true and LLR as similarity function
(your suggestion)
2. I do create "similar" recommendations. I do apply category filter before
serving recommendations

"look-with", means other users watched iPhone case and other accessory with
iphone. I do have accessory for iPhone here, but also water heating
device...
similar - means show only other smarphones as recommendations to iPhone.

Right now the problem is in water heating device in 'look-with' (category
filter not applied). How can I put away such kind of recommendations and
why Do I get them?



2014-08-19 18:01 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> That sounds much better.
> 
> Do you have metadata like product category? Electronics vs. home
> appliance? One easy thing to do if you have categories in your catalog is
> filter by the same category as the item being viewed.
> 
> BTW it sounds like you have an emon
> On Aug 19, 2014, at 12:53 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
> Hi, I 've used LLR with properties you've suggested.
> Right now I have a trouble.
> A trouble:
> Water heat device (
> http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
> )
> is recommedned for iPhone. And it has one of the highest score.
> good things:
> iPhone cases (
> 
> https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
> )
> are recommedned for iPhone, It's good
> Other smartphones are recommended to iPhone, it's good
> Other iPhones are recommedned to iPhone. It's good. 16GB recommended to
> 32GB, e.t.c.
> 
> What could be a reason for recommending "Water heat device " to iPhone?
> iPhone is one of the most popular item. There should be a lot of people
> viewing iPhone with "Water heat device "?
> 
> 
> 
> 2014-08-18 20:15 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> 
>> Oh, and as to using different algorithms, this is an “ensemble” method.
> In
>> the paper they are talking about using widely differing algorithms like
> ALS
>> + Cooccurrence + … This technique was used to win the Netflix prize but
> in
>> practice the improvements may be to small to warrant running multiple
>> pipelines. In any case it isn’t the first improvement you may want to
> try.
>> For instance your UI will have a drastic effect on how well you recs do,
>> and there are other much easier techniques that we can talk about once
> you
>> get the basics working.
>> 
>> 
>> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pa...@gmail.com> wrote:
>> 
>> When beginning to use a recommender from Mahout I always suggest you
> start
>> from the defaults. These often give the best results—then tune afterwards
>> to improve.
>> 
>> Your intuition is correct that multiple actions can be used to improve
>> results but get the basics working first. The easiest way to use multiple
>> actions is to use spark-itemsimilarity so since you are using mapreduce
> for
>> now, just use one action.
>> 
>> I would not try to combine the results from two similarity measures there
>> is no benefit since LLR is better than any of them, at least I’ve never
>> seen it loose. Below is my experience with trying many of the similarity
>> metrics on exactly the same data. I did cross-validation with precision
>> (MAP, mean average precision). LLR wins in other cases I’ve tried too. So
>> LLR is the only method presently used in the Spark version of
>> itemsimilarity.
>> 
>> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
>> 
>> If you still get weird results double check your ID mapping. Run a small
>> bit of data through and spot check the mapping by hand.
>> 
>> At some point you will want to create a cross-validation test. This is
>> good as a sort of integration sanity check when making changes to the
>> recommender. You run cross-validation using standard test data to see if
>> the score changes drastically between releases. Big changes may indicate
> a
>> bug. At the beginning it will help you tune as in the case above where it
>> helped decide on LLR.
>> 
>> 
>> 
>> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <se...@gmail.com>
>> wrote:
>> 
>> Thank you very much. I'll do what you are sayning in bullets 1...5 and
> try
>> again.
>> 
>> I also tried:
>> 1. calc data using COUSINE_SIMILARITY
>> 2. calc the same data using COOCCURENCE_SIMILARTY
>> 3. join #1 and #2 where COOCURENCE >= $threshold
>> 
>> Where threshold is some emperical integer value. I've used  "2" The idea
> is
>> to filter out item pairs which never-ever met together...
>> Please see this link:
>> 
>> 
> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
>> 
>> If I replace COUSINE_SIMILARITY with LLR and booleanData=true, does this
>> approach still make sense, or it's useless waste of time?
>> 
>> "What do you mean the similar items are terrible? How are you measuring
>> that? " I have eye testing only,
>> I did automate preparation->calculation->hbase upload-> web-app serving,
> I
>> didn't automate testing.
>> 
>> 
>> 
>> 
>> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
>> 
>>> the things that stand out:
>>> 
>>> 1) remove your maxSimilaritiesPerItem option! 50000
>> maxSimilaritiesPerItem
>>> will _kill_ performance and give no gain, leave this setting at the
>> default
>>> of 500
>>> 2) use only one action. What do you want the user to do? Do you want
> them
>>> to read a page? Then train on item page views. If those pages lead to a
>>> purchase then you want to recommend purchases so train on user
> purchases.
>>> 3) remove your minPrefsPerUser option, this should never be 0 or it will
>>> leave users in the training data that have no data and may contribute to
>>> longer runs with no gain.
>>> 4) this is a pretty small Hadoop cluster for the size of your data but I
>>> bet changing #1 will noticeably reduce the runtime
>>> 5) change —similarityClassname to SIMILARITY_LOGLIKELIHOOD
>>> 6) remove your —booleanData option since LLR ignores weights.
>>> 
>>> Remember that this is not the same as personalized recommendations. This
>>> method alone will show the same “similar items” for all users.
>>> 
>>> Sorry but both your “recommendation” types sound like the same thing.
>>> Using both item page view  _and_ clicks on recommended items will both
>> lead
>>> to an item page view so you have two actions that lead to the same
> thing,
>>> right? Just train on an item page view (unless you really want the user
>> to
>>> make a purchase)
>>> 
>>> What do you mean the similar items are terrible? How are you measuring
>>> that? Are you doing cross-validation measuring precision or A/B testing?
>>> What looks bad to you may be good, the eyeball test is not always
>> reliable.
>>> If they are coming up completely crazy or random then you may have a bug
>> in
>>> your ID translation logic.
>>> 
>>> It sounds like you have enough data to produce good results.
>>> 
>>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <se...@gmail.com>
>>> wrote:
>>> 
>>> 1. 7 nodes 4 CPU per node, 48 GB ram, 2 HDD for MR and HDFS. Not too
> much
>>> but enough for the start..
>>> 2. I run it as oozie action.
>>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
>>>     <java>
>>>         <job-tracker>${jobTracker}</job-tracker>
>>>         <name-node>${nameNode}</name-node>
>>>         <prepare>
>>>             <delete path="${mahoutOutputDir}/primary" />
>>>             <delete
>>> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
>>>         </prepare>
>>>         <configuration>
>>>             <property>
>>>                 <name>mapred.queue.name</name>
>>>                 <value>default</value>
>>>             </property>
>>> 
>>>         </configuration>
>>> 
>>> 
>>> 
>> 
> <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
>>>         <arg>--input</arg>
>>>         <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
>>> item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on
>> recommendation,
>>> a kind of try to increase quality of recommender...]-->
>>> 
>>>         <arg>--output</arg>
>>>         <arg>${mahoutOutputDir}/primary</arg>
>>> 
>>>         <arg>--similarityClassname</arg>
>>>         <arg>SIMILARITY_COSINE</arg>
>>> 
>>>         <arg>--maxSimilaritiesPerItem</arg>
>>>         <arg>50000</arg>
>>> 
>>>         <arg>--minPrefsPerUser</arg>
>>>         <arg>0</arg>
>>> 
>>>         <arg>--booleanData</arg>
>>>         <arg>false</arg>
>>> 
>>>         <arg>--tempDir</arg>
>>>         <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
>>> 
>>>     </java>
>>>     <ok to="to-narrow-table"/>
>>>     <error to="kill"/>
>>> </action>
>>> 
>>> 3) RANK does it, here is a script:
>>> 
>>> --user, item, pref previously prepared by hive
>>> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
>>> (user_id:chararray, item_id:long, pref:double);
>>> 
>>> --get distinct user from the whole input
>>> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
>>> 
>>> --get distinct item from the whole input
>>> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
>>> 
>>> --rank user 1....N
>>> rankUsers_ = RANK distUserId;
>>> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
>>> 
>>> --rank items 1....M
>>> rankItems_ = RANK distItemId;
>>> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
>>> 
>>> --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
>>> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
>>> 'skewed';
>>> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
>>> item_id using 'replicated';
>>> 
>>> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
>>> as user_id,
>>>                                      rankItems::rank_id
>>> as item_id,
>>>                                      joinedUsers::user_item_pref::pref
>>> as pref;
>>> 
>>> --store mapping for later remapping from RANK back to natural values
>>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers'
>> using
>>> PigStorage('\t');
>>> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems'
>> using
>>> PigStorage('\t');
>>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into
>> '$projPrefs'
>>> using PigStorage('\t');
>>> 
>>> 4) I've seen this idea in different discussion, that different weight
> for
>>> different actions are not good. Sorry, I don't understand what you do
>>> suggest.
>>> I have two kind of actions: user viewed item, user clicked on
> recommended
>>> item (recommended item produced by my item similarity system).
>>> I want to produce two kinds of recommendations:
>>> 1. current item + recommend other items which other users visit in
>>> conjuction with current item
>>> 2. similar item: recommend items similar to current viewed item.
>>> What can I try?
>>> LLR=http://en.wikipedia.org/wiki/Log-likelihood_ratio= LOG_LIKEHOOD?
>>> 
>>> Right now I do get awful recommendations and I can't understand what can
>> I
>>> try next :((((((((((((
>>> 
>>> 
>>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>>> 
>>>> 1) how many cores in the cluster? The whole idea behind mapreduce is
> you
>>>> buy more cpus you get nearly linear decrease in runtime.
>>>> 2) what is your mahout command line with options, or how are you
>> invoking
>>>> mahout. I have seen the Mahout mapreduce recommender take this long so
>> we
>>>> should check what you are doing with downsampling.
>>>> 3) do you really need to RANK your ids, that’s a full sort? When using
>>> pig
>>>> I usually get DISTINCT ones and assign an incrementing integer as the
>>>> Mahout ID corresponding
>>>> 4) your #2 assigning different weights to different actions usually
> does
>>>> not work. I’ve done this before and compared offline metrics and seen
>>>> precision go down. I’d get this working using only your primary actions
>>>> first. What are you trying to get the user to do? View something, buy
>>>> something? Use that action as the primary preference and start out with
>> a
>>>> weight of 1 using LLR. With LLR the weights are not used anyway so your
>>>> data may not produce good results with mixed actions.
>>>> 
>>>> A plug for the (admittedly pre-alpha) spark-itemsimilairty:
>>>> 1) output from 2 can be directly ingested and will create output.
>>>> 2) multiple actions can be used with cross-cooccurrence, not by
> guessing
>>>> at weights.
>>>> 3) output has your application specific IDs preserved.
>>>> 4) its about 10x faster than mapreduce and will do aways with your ID
>>>> translation steps
>>>> 
>>>> One caveat is that your cluster machines will need lots of memory. I
>> have
>>>> 8-16g on mine.
>>>> 
>>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <se...@gmail.com>
>>>> wrote:
>>>> 
>>>> 1. I do collect preferences for items using 60days sliding window.
> today
>>> -
>>>> 60 days.
>>>> 2. I do prepare triples user_id, item_id, descrete_pref_value (3 for
>> item
>>>> view, 5 for clicking recommndation block. The idea is to give more
> value
>>>> for recommendations which attact visitor attention). I get ~ 20.000.000
>>> of
>>>> lines with ~1.000.000 distinct items and ~2.000.000 distinct users
>>>> 3. I do use apache pig RANK function to rank all distinct user_id
>>>> 4. I do the same for item_id
>>>> 5. I do join input dataset with ranked datasets and provide input to
>>> mahout
>>>> with dense interger user_id, item_id
>>>> 6. I do get mahout output and join integer item_id back to get natural
>>> key
>>>> value.
>>>> 
>>>> step #1-2 takes ~ 40min
>>>> step #3-5 takes ~1 hour
>>>> mahout calc takes ~3hours
>>>> 
>>>> 
>>>> 
>>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>>> 
>>>>> This really doesn't sound right.  It should be possible to process
>>>> almost a
>>>>> thousand times that much data every night without that much problem.
>>>>> 
>>>>> How are you preparing the input data?
>>>>> 
>>>>> How are you converting to Mahout id's?
>>>>> 
>>>>> Even using python, you should be able to do the conversion in just a
>> few
>>>>> minutes without any parallelism whatsoever.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi, We are trying calculate ItemSimilarity.
>>>>>> Right now we have 2*10^7 input lines. I do provide input data as raw
>>>> text
>>>>>> each day to recalculate item similarities. We do get +100..1000 new
>>>> items
>>>>>> each day.
>>>>>> 1. It takes too much time to prepare input data.
>>>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
>>>>>> 
>>>>>> Is there any poissibility to provide data to mahout mapreduce
>>>>>> ItemSimilarity using some binary format with compression?
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 


Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Hi, what is "emon"?
1. I do create "look-with" recommendations. It's really just the "raw" output
from ItemSimilarityJob with booleanData=true and LLR as the similarity
function (your suggestion).
2. I do create "similar" recommendations: I apply a category filter before
serving the recommendations.

"look-with" means other users viewed an iPhone case and other accessories
together with the iPhone. I do get iPhone accessories here, but also a water
heating device...
"similar" means showing only other smartphones as recommendations for an
iPhone.

Right now the problem is the water heating device in "look-with" (no category
filter applied). How can I filter out such recommendations, and why do I get
them?
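A minimal sketch in python of the category post-filter discussed below; the
item-to-category lookup is a hypothetical stand-in for catalog data:

def filter_by_category(source_item, similars, category_of):
    """Keep only similar items from the same category as the viewed item.

    similars:    (item_id, score) pairs from ItemSimilarityJob output
    category_of: hypothetical dict mapping item_id -> catalog category
    """
    wanted = category_of.get(source_item)
    return [(item, score) for item, score in similars
            if category_of.get(item) == wanted]

Applied to an iPhone this would drop the water heating device from the
"similar" list while keeping other smartphones; the unfiltered output can
still be served as "look-with".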



2014-08-19 18:01 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> That sounds much better.
>
> Do you have metadata like product category? Electronics vs. home
> appliance? One easy thing to do if you have categories in your catalog is
> filter by the same category as the item being viewed.
>
> BTW it sounds like you have an emon
> On Aug 19, 2014, at 12:53 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> Hi, I've used LLR with the properties you've suggested.
> Right now I have a problem:
> A water heating device (
> http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
> )
> is recommended for iPhone, and it has one of the highest scores.
> Good things:
> iPhone cases (
> https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
> )
> are recommended for iPhone. It's good.
> Other smartphones are recommended for iPhone. It's good.
> Other iPhones are recommended for iPhone. It's good: 16GB recommended to
> 32GB, etc.
>
> What could be the reason for recommending the "water heating device" to
> iPhone? iPhone is one of the most popular items. Could there be a lot of
> people viewing iPhone together with the "water heating device"?
>
>
>
> 2014-08-18 20:15 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>
> > Oh, and as to using different algorithms, this is an “ensemble” method.
> In
> > the paper they are talking about using widely differing algorithms like
> ALS
> > + Cooccurrence + … This technique was used to win the Netflix prize but
> in
> > practice the improvements may be too small to warrant running multiple
> > pipelines. In any case it isn’t the first improvement you may want to
> try.
> > For instance your UI will have a drastic effect on how well your recs do,
> > and there are other much easier techniques that we can talk about once
> you
> > get the basics working.
> >
> >
> > On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pa...@gmail.com> wrote:
> >
> > When beginning to use a recommender from Mahout I always suggest you
> start
> > from the defaults. These often give the best results—then tune afterwards
> > to improve.
> >
> > Your intuition is correct that multiple actions can be used to improve
> > results but get the basics working first. The easiest way to use multiple
> > actions is to use spark-itemsimilarity so since you are using mapreduce
> for
> > now, just use one action.
> >
> > I would not try to combine the results from two similarity measures; there
> > is no benefit, since LLR is better than any of them, at least I’ve never
> > seen it lose. Below is my experience with trying many of the similarity
> > metrics on exactly the same data. I did cross-validation with precision
> > (MAP, mean average precision). LLR wins in other cases I’ve tried too. So
> > LLR is the only method presently used in the Spark version of
> > itemsimilarity.
> >
> > [attached: map-comparison.jpeg, MAP scores by similarity metric]
> >
> > If you still get weird results double check your ID mapping. Run a small
> > bit of data through and spot check the mapping by hand.
> >
> > At some point you will want to create a cross-validation test. This is
> > good as a sort of integration sanity check when making changes to the
> > recommender. You run cross-validation using standard test data to see if
> > the score changes drastically between releases. Big changes may indicate
> a
> > bug. At the beginning it will help you tune as in the case above where it
> > helped decide on LLR.
> >
> >
> >
> > On Aug 18, 2014, at 1:43 AM, Serega Sheypak <se...@gmail.com>
> > wrote:
> >
> > Thank you very much. I'll do what you are saying in bullets 1...5 and try
> > again.
> >
> > I also tried:
> > 1. calc data using COSINE_SIMILARITY
> > 2. calc the same data using COOCCURRENCE_SIMILARITY
> > 3. join #1 and #2 where COOCCURRENCE >= $threshold
> >
> > Where threshold is some empirical integer value. I've used "2". The idea
> > is to filter out item pairs which never occur together...
> > Please see this link:
> >
> > http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
> >
> > If I replace COSINE_SIMILARITY with LLR and booleanData=true, does this
> > approach still make sense, or is it a useless waste of time?
> >
> > "What do you mean the similar items are terrible? How are you measuring
> > that?" I have eyeball testing only;
> > I did automate preparation -> calculation -> hbase upload -> web-app
> > serving, but I didn't automate testing.
> >
> >
> >
> >
> > 2014-08-18 5:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
> >
> >> the things that stand out:
> >>
> >> 1) remove your maxSimilaritiesPerItem option! 50000
> > maxSimilaritiesPerItem
> >> will _kill_ performance and give no gain, leave this setting at the
> > default
> >> of 500
> >> 2) use only one action. What do you want the user to do? Do you want
> them
> >> to read a page? Then train on item page views. If those pages lead to a
> >> purchase then you want to recommend purchases so train on user
> purchases.
> >> 3) remove your minPrefsPerUser option, this should never be 0 or it will
> >> leave users in the training data that have no data and may contribute to
> >> longer runs with no gain.
> >> 4) this is a pretty small Hadoop cluster for the size of your data but I
> >> bet changing #1 will noticeably reduce the runtime
> >> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
> >> 6) remove your --booleanData option since LLR ignores weights.
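Applied to the oozie action quoted later in this message, those changes would
leave an argument list roughly like this; a sketch against the paths already
shown, with the removed options simply falling back to their defaults:

<arg>--input</arg>
<arg>${tempDir}/to-mahout-id/projPrefs</arg>
<arg>--output</arg>
<arg>${mahoutOutputDir}/primary</arg>
<arg>--similarityClassname</arg>
<arg>SIMILARITY_LOGLIKELIHOOD</arg>
<arg>--tempDir</arg>
<arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>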
> >>
> >> Remember that this is not the same as personalized recommendations. This
> >> method alone will show the same “similar items” for all users.
> >>
> >> Sorry but both your “recommendation” types sound like the same thing.
> >> Using both item page view  _and_ clicks on recommended items will both
> > lead
> >> to an item page view so you have two actions that lead to the same
> thing,
> >> right? Just train on an item page view (unless you really want the user
> > to
> >> make a purchase)
> >>
> >> What do you mean the similar items are terrible? How are you measuring
> >> that? Are you doing cross-validation measuring precision or A/B testing?
> >> What looks bad to you may be good, the eyeball test is not always
> > reliable.
> >> If they are coming up completely crazy or random then you may have a bug
> > in
> >> your ID translation logic.
> >>
> >> It sounds like you have enough data to produce good results.
> >>
> >> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <se...@gmail.com>
> >> wrote:
> >>
> >> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too
> >> much, but enough for the start.
> >> 2. I run it as oozie action.
> >> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >>      <java>
> >>          <job-tracker>${jobTracker}</job-tracker>
> >>          <name-node>${nameNode}</name-node>
> >>          <prepare>
> >>              <delete path="${mahoutOutputDir}/primary" />
> >>              <delete
> >> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >>          </prepare>
> >>          <configuration>
> >>              <property>
> >>                  <name>mapred.queue.name</name>
> >>                  <value>default</value>
> >>              </property>
> >>
> >>          </configuration>
> >>
> >>
> >>
> >
> <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >>          <arg>--input</arg>
> >>          <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
> >> item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on
> > recommendation,
> >> a kind of try to increase quality of recommender...]-->
> >>
> >>          <arg>--output</arg>
> >>          <arg>${mahoutOutputDir}/primary</arg>
> >>
> >>          <arg>--similarityClassname</arg>
> >>          <arg>SIMILARITY_COSINE</arg>
> >>
> >>          <arg>--maxSimilaritiesPerItem</arg>
> >>          <arg>50000</arg>
> >>
> >>          <arg>--minPrefsPerUser</arg>
> >>          <arg>0</arg>
> >>
> >>          <arg>--booleanData</arg>
> >>          <arg>false</arg>
> >>
> >>          <arg>--tempDir</arg>
> >>          <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >>
> >>      </java>
> >>      <ok to="to-narrow-table"/>
> >>      <error to="kill"/>
> >>  </action>
> >>
> >> 3) RANK does it, here is a script:
> >>
> >> --user, item, pref previously prepared by hive
> >> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> >> (user_id:chararray, item_id:long, pref:double);
> >>
> >> --get distinct user from the whole input
> >> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >>
> >> --get distinct item from the whole input
> >> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >>
> >> --rank user 1....N
> >> rankUsers_ = RANK distUserId;
> >> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >>
> >> --rank items 1....M
> >> rankItems_ = RANK distItemId;
> >> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >>
> >> --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
> >> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
> >> 'skewed';
> >> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
> >> item_id using 'replicated';
> >>
> >> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
> >> as user_id,
> >>                                       rankItems::rank_id
> >> as item_id,
> >>                                       joinedUsers::user_item_pref::pref
> >> as pref;
> >>
> >> --store mapping for later remapping from RANK back to natural values
> >> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers'
> > using
> >> PigStorage('\t');
> >> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems'
> > using
> >> PigStorage('\t');
> >> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into
> > '$projPrefs'
> >> using PigStorage('\t');
> >>
> >> 4) I've seen this idea in other discussions, that different weights for
> >> different actions are not good. Sorry, I don't understand what you
> >> suggest.
> >> I have two kinds of actions: user viewed item, user clicked on a
> >> recommended item (the recommended item produced by my item similarity
> >> system).
> >> I want to produce two kinds of recommendations:
> >> 1. current item + recommend other items which other users visit in
> >> conjunction with the current item
> >> 2. similar item: recommend items similar to the currently viewed item.
> >> What can I try?
> >> LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = LOG_LIKELIHOOD?
> >>
> >> Right now I do get awful recommendations and I can't understand what I
> >> can try next :((((((((((((
> >>
> >>
> >> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> >>
> >>> 1) how many cores in the cluster? The whole idea behind mapreduce is
> >>> that you buy more CPUs and get a nearly linear decrease in runtime.
> >>> 2) what is your mahout command line with options, or how are you
> > invoking
> >>> mahout. I have seen the Mahout mapreduce recommender take this long so
> > we
> >>> should check what you are doing with downsampling.
> >>> 3) do you really need to RANK your ids, that’s a full sort? When using
> >> pig
> >>> I usually get DISTINCT ones and assign an incrementing integer as the
> >>> Mahout ID corresponding
> >>> 4) your #2 assigning different weights to different actions usually
> does
> >>> not work. I’ve done this before and compared offline metrics and seen
> >>> precision go down. I’d get this working using only your primary actions
> >>> first. What are you trying to get the user to do? View something, buy
> >>> something? Use that action as the primary preference and start out with
> > a
> >>> weight of 1 using LLR. With LLR the weights are not used anyway so your
> >>> data may not produce good results with mixed actions.
> >>>
> >>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> >>> 1) output from 2 can be directly ingested and will create output.
> >>> 2) multiple actions can be used with cross-cooccurrence, not by
> guessing
> >>> at weights.
> >>> 3) output has your application specific IDs preserved.
> >>> 4) it's about 10x faster than mapreduce and will do away with your ID
> >>> translation steps
> >>>
> >>> One caveat is that your cluster machines will need lots of memory. I
> > have
> >>> 8-16g on mine.
> >>>
> >>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <se...@gmail.com>
> >>> wrote:
> >>>
> >>> 1. I do collect preferences for items using a 60-day sliding window:
> >>> today - 60 days.
> >>> 2. I do prepare triples user_id, item_id, discrete_pref_value (3 for
> >>> item view, 5 for clicking the recommendation block. The idea is to give
> >>> more value to recommendations which attract visitor attention). I get
> >>> ~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct
> >>> users
> >>> 3. I do use the apache pig RANK function to rank all distinct user_id
> >>> 4. I do the same for item_id
> >>> 5. I do join the input dataset with the ranked datasets and provide
> >>> input to mahout with dense integer user_id, item_id
> >>> 6. I do get the mahout output and join the integer item_id back to get
> >>> the natural key value.
> >>>
> >>> step #1-2 takes ~ 40min
> >>> step #3-5 takes ~1 hour
> >>> mahout calc takes ~3hours
> >>>
> >>>
> >>>
> >>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >>>
> >>>> This really doesn't sound right.  It should be possible to process
> >>> almost a
> >>>> thousand times that much data every night without that much problem.
> >>>>
> >>>> How are you preparing the input data?
> >>>>
> >>>> How are you converting to Mahout id's?
> >>>>
> >>>> Even using python, you should be able to do the conversion in just a
> > few
> >>>> minutes without any parallelism whatsoever.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <
> >>> serega.sheypak@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi, We are trying to calculate ItemSimilarity.
> >>>>> Right now we have 2*10^7 input lines. I do provide input data as raw
> >>>>> text each day to recalculate item similarities. We do get +100..1000
> >>>>> new items each day.
> >>>>> 1. It takes too much time to prepare input data.
> >>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
> >>>>>
> >>>>> Is there any possibility to provide data to mahout mapreduce
> >>>>> ItemSimilarity using some binary format with compression?
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
>
>

Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
That sounds much better.

Do you have metadata like product category? Electronics vs. home appliance? One easy thing to do if you have categories in your catalog is filter by the same category as the item being viewed.

BTW it sounds like you have an emon
On Aug 19, 2014, at 12:53 AM, Serega Sheypak <se...@gmail.com> wrote:

Hi, I've used LLR with the properties you've suggested.
Right now I have a problem:
A water heating device (
http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg)
is recommended for iPhone, and it has one of the highest scores.
Good things:
iPhone cases (
https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg)
are recommended for iPhone. It's good.
Other smartphones are recommended for iPhone. It's good.
Other iPhones are recommended for iPhone. It's good: 16GB recommended to
32GB, etc.

What could be the reason for recommending the "water heating device" to
iPhone? iPhone is one of the most popular items. Could there be a lot of
people viewing iPhone together with the "water heating device"?



2014-08-18 20:15 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> Oh, and as to using different algorithms, this is an “ensemble” method. In
> the paper they are talking about using widely differing algorithms like ALS
> + Cooccurrence + … This technique was used to win the Netflix prize but in
> practice the improvements may be too small to warrant running multiple
> pipelines. In any case it isn’t the first improvement you may want to try.
> For instance your UI will have a drastic effect on how well your recs do,
> and there are other much easier techniques that we can talk about once you
> get the basics working.
> 
> 
> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pa...@gmail.com> wrote:
> 
> When beginning to use a recommender from Mahout I always suggest you start
> from the defaults. These often give the best results—then tune afterwards
> to improve.
> 
> Your intuition is correct that multiple actions can be used to improve
> results but get the basics working first. The easiest way to use multiple
> actions is to use spark-itemsimilarity so since you are using mapreduce for
> now, just use one action.
> 
> I would not try to combine the results from two similarity measures; there
> is no benefit, since LLR is better than any of them, at least I’ve never
> seen it lose. Below is my experience with trying many of the similarity
> metrics on exactly the same data. I did cross-validation with precision
> (MAP, mean average precision). LLR wins in other cases I’ve tried too. So
> LLR is the only method presently used in the Spark version of
> itemsimilarity.
> 
> [attached: map-comparison.jpeg, MAP scores by similarity metric]
> 
> If you still get weird results double check your ID mapping. Run a small
> bit of data through and spot check the mapping by hand.
> 
> At some point you will want to create a cross-validation test. This is
> good as a sort of integration sanity check when making changes to the
> recommender. You run cross-validation using standard test data to see if
> the score changes drastically between releases. Big changes may indicate a
> bug. At the beginning it will help you tune as in the case above where it
> helped decide on LLR.
> 
> 
> 
> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
> Thank you very much. I'll do what you are saying in bullets 1...5 and try
> again.
>
> I also tried:
> 1. calc data using COSINE_SIMILARITY
> 2. calc the same data using COOCCURRENCE_SIMILARITY
> 3. join #1 and #2 where COOCCURRENCE >= $threshold
>
> Where threshold is some empirical integer value. I've used "2". The idea is
> to filter out item pairs which never occur together...
> Please see this link:
>
> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
>
> If I replace COSINE_SIMILARITY with LLR and booleanData=true, does this
> approach still make sense, or is it a useless waste of time?
>
> "What do you mean the similar items are terrible? How are you measuring
> that?" I have eyeball testing only;
> I did automate preparation -> calculation -> hbase upload -> web-app
> serving, but I didn't automate testing.
> 
> 
> 
> 
> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
> 
>> the things that stand out:
>> 
>> 1) remove your maxSimilaritiesPerItem option! 50000
> maxSimilaritiesPerItem
>> will _kill_ performance and give no gain, leave this setting at the
> default
>> of 500
>> 2) use only one action. What do you want the user to do? Do you want them
>> to read a page? Then train on item page views. If those pages lead to a
>> purchase then you want to recommend purchases so train on user purchases.
>> 3) remove your minPrefsPerUser option, this should never be 0 or it will
>> leave users in the training data that have no data and may contribute to
>> longer runs with no gain.
>> 4) this is a pretty small Hadoop cluster for the size of your data but I
>> bet changing #1 will noticeably reduce the runtime
>> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
>> 6) remove your --booleanData option since LLR ignores weights.
>> 
>> Remember that this is not the same as personalized recommendations. This
>> method alone will show the same “similar items” for all users.
>> 
>> Sorry but both your “recommendation” types sound like the same thing.
>> Using both item page view  _and_ clicks on recommended items will both
> lead
>> to an item page view so you have two actions that lead to the same thing,
>> right? Just train on an item page view (unless you really want the user
> to
>> make a purchase)
>> 
>> What do you mean the similar items are terrible? How are you measuring
>> that? Are you doing cross-validation measuring precision or A/B testing?
>> What looks bad to you may be good, the eyeball test is not always
> reliable.
>> If they are coming up completely crazy or random then you may have a bug
> in
>> your ID translation logic.
>> 
>> It sounds like you have enough data to produce good results.
>> 
>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <se...@gmail.com>
>> wrote:
>> 
>> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too
>> much, but enough for the start.
>> 2. I run it as oozie action.
>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
>>      <java>
>>          <job-tracker>${jobTracker}</job-tracker>
>>          <name-node>${nameNode}</name-node>
>>          <prepare>
>>              <delete path="${mahoutOutputDir}/primary" />
>>              <delete
>> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
>>          </prepare>
>>          <configuration>
>>              <property>
>>                  <name>mapred.queue.name</name>
>>                  <value>default</value>
>>              </property>
>> 
>>          </configuration>
>> 
>> 
>> 
> <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
>>          <arg>--input</arg>
>>          <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
>> item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on
> recommendation,
>> a kind of try to increase quality of recommender...]-->
>> 
>>          <arg>--output</arg>
>>          <arg>${mahoutOutputDir}/primary</arg>
>> 
>>          <arg>--similarityClassname</arg>
>>          <arg>SIMILARITY_COSINE</arg>
>> 
>>          <arg>--maxSimilaritiesPerItem</arg>
>>          <arg>50000</arg>
>> 
>>          <arg>--minPrefsPerUser</arg>
>>          <arg>0</arg>
>> 
>>          <arg>--booleanData</arg>
>>          <arg>false</arg>
>> 
>>          <arg>--tempDir</arg>
>>          <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
>> 
>>      </java>
>>      <ok to="to-narrow-table"/>
>>      <error to="kill"/>
>>  </action>
>> 
>> 3) RANK does it, here is a script:
>> 
>> --user, item, pref previously prepared by hive
>> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
>> (user_id:chararray, item_id:long, pref:double);
>> 
>> --get distinct user from the whole input
>> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
>> 
>> --get distinct item from the whole input
>> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
>> 
>> --rank user 1....N
>> rankUsers_ = RANK distUserId;
>> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
>> 
>> --rank items 1....M
>> rankItems_ = RANK distItemId;
>> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
>> 
>> --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
>> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
>> 'skewed';
>> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
>> item_id using 'replicated';
>> 
>> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
>> as user_id,
>>                                       rankItems::rank_id
>> as item_id,
>>                                       joinedUsers::user_item_pref::pref
>> as pref;
>> 
>> --store mapping for later remapping from RANK back to natural values
>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers'
> using
>> PigStorage('\t');
>> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems'
> using
>> PigStorage('\t');
>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into
> '$projPrefs'
>> using PigStorage('\t');
>> 
>> 4) I've seen this idea in other discussions, that different weights for
>> different actions are not good. Sorry, I don't understand what you
>> suggest.
>> I have two kinds of actions: user viewed item, user clicked on a
>> recommended item (the recommended item produced by my item similarity
>> system).
>> I want to produce two kinds of recommendations:
>> 1. current item + recommend other items which other users visit in
>> conjunction with the current item
>> 2. similar item: recommend items similar to the currently viewed item.
>> What can I try?
>> LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = LOG_LIKELIHOOD?
>>
>> Right now I do get awful recommendations and I can't understand what I can
>> try next :((((((((((((
>> 
>> 
>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>> 
>>> 1) how many cores in the cluster? The whole idea behind mapreduce is that
>>> you buy more CPUs and get a nearly linear decrease in runtime.
>>> 2) what is your mahout command line with options, or how are you
> invoking
>>> mahout. I have seen the Mahout mapreduce recommender take this long so
> we
>>> should check what you are doing with downsampling.
>>> 3) do you really need to RANK your ids, that’s a full sort? When using
>> pig
>>> I usually get DISTINCT ones and assign an incrementing integer as the
>>> Mahout ID corresponding
>>> 4) your #2 assigning different weights to different actions usually does
>>> not work. I’ve done this before and compared offline metrics and seen
>>> precision go down. I’d get this working using only your primary actions
>>> first. What are you trying to get the user to do? View something, buy
>>> something? Use that action as the primary preference and start out with
> a
>>> weight of 1 using LLR. With LLR the weights are not used anyway so your
>>> data may not produce good results with mixed actions.
>>> 
>>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
>>> 1) output from 2 can be directly ingested and will create output.
>>> 2) multiple actions can be used with cross-cooccurrence, not by guessing
>>> at weights.
>>> 3) output has your application specific IDs preserved.
>>> 4) it's about 10x faster than mapreduce and will do away with your ID
>>> translation steps
>>> 
>>> One caveat is that your cluster machines will need lots of memory. I
> have
>>> 8-16g on mine.
>>> 
>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <se...@gmail.com>
>>> wrote:
>>> 
>>> 1. I do collect preferences for items using a 60-day sliding window:
>>> today - 60 days.
>>> 2. I do prepare triples user_id, item_id, discrete_pref_value (3 for item
>>> view, 5 for clicking the recommendation block. The idea is to give more
>>> value to recommendations which attract visitor attention). I get
>>> ~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct
>>> users
>>> 3. I do use the apache pig RANK function to rank all distinct user_id
>>> 4. I do the same for item_id
>>> 5. I do join the input dataset with the ranked datasets and provide input
>>> to mahout with dense integer user_id, item_id
>>> 6. I do get the mahout output and join the integer item_id back to get
>>> the natural key value.
>>> 
>>> step #1-2 takes ~ 40min
>>> step #3-5 takes ~1 hour
>>> mahout calc takes ~3hours
>>> 
>>> 
>>> 
>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>> 
>>>> This really doesn't sound right.  It should be possible to process
>>> almost a
>>>> thousand times that much data every night without that much problem.
>>>> 
>>>> How are you preparing the input data?
>>>> 
>>>> How are you converting to Mahout id's?
>>>> 
>>>> Even using python, you should be able to do the conversion in just a
> few
>>>> minutes without any parallelism whatsoever.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <
>>> serega.sheypak@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi, We are trying to calculate ItemSimilarity.
>>>>> Right now we have 2*10^7 input lines. I do provide input data as raw
>>>>> text each day to recalculate item similarities. We do get +100..1000
>>>>> new items each day.
>>>>> 1. It takes too much time to prepare input data.
>>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
>>>>>
>>>>> Is there any possibility to provide data to mahout mapreduce
>>>>> ItemSimilarity using some binary format with compression?
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 


Re: mapreduce ItemSimilarity input optimization

Posted by Ted Dunning <te...@gmail.com>.
No.

Go for this more recent (and much shorter) one:

http://www.mapr.com/practical-machine-learning

And if you like it, leave a review on Amazon:

http://www.amazon.com/Practical-Machine-Learning-Innovations-Recommendation-ebook/dp/B00JRHVNT4






On Thu, Aug 21, 2014 at 11:31 PM, Serega Sheypak <se...@gmail.com>
wrote:

> Ok, I got it. Is it Ted's book?
>
> http://www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684/ref=la_B00EHXC1NK_1_1?s=books&ie=UTF8&qid=1408689021&sr=1-1
>
> I've read this one:
>
> http://www.amazon.com/Apache-Mahout-Cookbook-Piero-Giacomelli-ebook/dp/B00HJR6R86/ref=sr_1_2?s=books&ie=UTF8&qid=1408689063&sr=1-2&keywords=mahout
>
> No satisfaction at all
>
>
>
>
> 2014-08-21 20:32 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
>
> > Sorry that wasn’t clear.
> >
> > Given your purchase volume, you may not have very good coverage training
> > on purchases only. So using views may be your best bet. Ted’s metrics
> were:
> > "How many people got each item?  How many people total?  How many people
> > got both?” This is how you tell what action has enough data to be useful.
> > In your case that may be views.
> >
> > The other point was about doing personalization. Since you have
> > itemsimilarity working well you can add personalization with a search
> > engine using methods described in Ted’s book. This requires that you
> > capture user history (views in this case) and use that as a query on the
> > itemsimilarity data. If you know enough of the current user’s recent
> > history this will allow you to show “people with the same taste as you
> also
> > looked at these items”.
> >
> > Currently you are not personalizing, you are showing the same “similar
> > items” to every user. That is fine but personalization may improve things
> > further.
> >
> >
> > On Aug 21, 2014, at 8:08 AM, Serega Sheypak <se...@gmail.com>
> > wrote:
> >
> > Excuse me, it looks like I've missed an important point:
> > "Ah, then using Ted’s metrics views _is_ probably your best bet."
> > Are you talking about serving "personal recommendations" from a search
> > engine? The idea was to get the active visitor's view history and give
> > him "similar" view histories from the search engine at runtime?
> >
> >
> > 2014-08-21 18:50 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> >
> > >>
> > >> On Aug 21, 2014, at 1:22 AM, Serega Sheypak <serega.sheypak@gmail.com
> >
> > > wrote:
> > >>
> > >>>> What you are doing is best practice for showing similar “views”. The
> > >> technique for using multiple actions will be covered in a series of
> > blogs
> > >> posts and may be put on the Mahout site eventually
> > >> Great thanks!
> > >>
> > >>>> People look at 100 things and buy 1, as you say. The question is: Do
> > > you
> > >> want people to buy something or just browse your site?
> > >> No objections for your point. I understand it. It should work for
> pretty
> > >> big ecom, right? Small ecom sell 100-200 items per day and have wide
> > > range
> > >> of items.
> > >
> > > Ah, then using Ted’s metrics views _is_ probably your best bet. You can
> > > probably still personalize view recommendations. Since you are already
> > > using itemsimilarity it can be a second step that builds on the first.
> > >
> > >>
> > >>>> Filter out any items not in the catalog from your recommendations.
> > >> We have it on data preparation stage. We recalculate item similarity
> > each
> > >> day sliding back for 60 days excluding non-available items on
> > preparation
> > >> stage.
> > >>
> > >> Thank you! We did reach good results, business guys got satisfaction
> :)
> > >>
> > >>
> > >> 2014-08-20 20:28 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
> > >>
> > >>>>
> > >>>> On Aug 19, 2014, at 11:26 PM, Serega Sheypak <
> > serega.sheypak@gmail.com
> > >>
> > >>> wrote:
> > >>>>
> > >>>> Hi!
> > >>>> 1. There was a bug in UI, I've checked raw recommendations. "water
> > >>> heating
> > >>>> device" has low score. So first 30 recommended items really fits
> > > iPhone,
> > >>>> next are not so good. Anyway result is good, thank you very much.
> > >>>> 2. I've inspected "sessions" of users, really there are people who
> > > viewed
> > >>>> iphone and heating device. 10 people for last month.
> > >>>> 3. I will calculate relative measurment, I didn't calc what is % of
> > > these
> > >>>> people comparing to others and how they fluence on score result.
> > >>>>
> > >>>
> > >>> That’s great. The Spark version sorts the result by weights, but I
> > think
> > >>> the mapreduce version doesn't
> > >>>
> > >>>> *You wrote:*
> > >>>> Then once you have that working you can add more actions but only
> with
> > >>>> cross-cooccurrence, adding by weighting* will not work with this
> type
> > > of
> > >>>> recommender*, which recommender can work with weights for actions?
> > >>>
> > >>> What you are doing is best practice for showing similar “views”. The
> > >>> technique for using multiple actions will be covered in a series of
> > > blogs
> > >>> posts and may be put on the Mahout site eventually. It requires
> > >>> spark-itemsimilarity. For now I’d strongly suggest you look at
> training
> > > on
> > >>> purchase data alone - see the comments below.
> > >>>
> > >>>>
> > >>>> *About building recommendations using sales.*
> > >>>> Sales are less than 1% from item views. You will recommend only
> stuff
> > >>>> people buy.
> > >>>
> > >>> The point is not volume of data but quality of data. I once measured
> > how
> > >>> predictive of purchases the views were and found them a rather poor
> > >>> predictor. People look at 100 things and buy 1, as you say. The
> > question
> > >>> is: Do you want people to buy something or just browse your site?
> > >>>
> > >>> On the other hand you would need to see how good your coverage is of
> > >>> purchases. Do you have enough items purchased by several people
> (Ted’s
> > >>> questions below will guide you)? If there is good coverage then you
> _do
> > >>> not_ restrict the range by using only purchase data. You increase the
> > >>> quality.
> > >>>
> > >>>> If you recommend what people see you significantly widen range
> > >>>> of possible buy actions. People always buy case "XXX" with iphone.
> You
> > >>>> would never recommened them to buy case "YYY". If people watch "XXX"
> > > and
> > >>>> "YYY" it's reasonable to recommened "YYY". Maybe "YYY" it's more
> > >>> expensive
> > >>>> that is why people prefer cheaper "XXX". What's wrong with this
> > >>> assumption?
> > >>>
> > >>> Nothing at all. Remember that your goal is to cause a purchase but
> > using
> > >>> views requires some “scrubbing” of views. You want, in effect,
> > >>> views-that-lead-to-purchases. In a cooccurrence recommender this can
> be
> > >>> done with cross-cooccurrence and I’ll describe that elsewhere, it’s
> too
> > >>> long for an email to describe but pretty easy to use.
> > >>>
> > >>> I’d wager that if you restrict to purchases your sales will go up
> over
> > >>> recommending views. But that is without looking at your data. If you
> > > need
> > >>> more data try increase the sliding time window to add more purchases.
> > > This
> > >>> will eventually start including things that are no longer in your
> > > catalog
> > >>> so will have diminishing returns but 60 days seem like a short time
> > > period.
> > >>> Filter out any items not in the catalog from your recommendations.
> > >>>
> > >>> You want recency to matter, this is good intuition. The in-catalog
> > > filter
> > >>> is one simple way, and there are others when you get to
> > personalization.
> > >>>
> > >>>>
> > >>>> *About our obsessive desire to add weights for actions.*
> > >>>> We would like to self-tune our recommendations. If user clicks our
> > >>>> recommendation it's a signal for us that items are related. So next
> > > time
> > >>>> this link should have higher score. What are the approaches to do
> it?
> > >>>>
> > >>>
> > >>> Yes, you do want the things that lead to purchases to go into the
> > > training
> > >>> data. This is good intuition. But you don’t do it with weights you
> > > train on
> > >>> new purchases, regardless of whether they came from random views,
> > >>> rec-views, or … You don’t care whether a rec was clicked on; you care
> > > if a
> > >>> purchase was made and you don’t care what part of the UI caused it.
> UI
> > >>> analysis is very very important but doesn’t help the recommender, it
> > > guides
> > >>> UI decisions. So measuring clicks is good but shouldn’t be used to
> > > change
> > >>> recs.
> > >>>
> > >>> One way to increase the value of your recs is to add a little
> > randomness
> > >>> to their ordering. If you have 10 things to recommend get 20 from
> > >>> itemsimilarity and apply a normally distributed random weighting,
> then
> > >>> re-sort and show the top 10. This will move some things up in order
> and
> > >>> show them where without the re-ordering they would never be shown.
> The
> > >>> technique allows you to expose more items to possible purchase and
> > >>> therefore affect the ordering the next time you train. The actual
> > > algorithm
> > >>> takes more space to describe but the idea is a lot like a multi-armed
> > >>> bandit where the best bandit eventually gets all trials. In this case
> > > the
> > >>> best rec leads to a purchase and gets into the new training data and
> so
> > >>> will be shown more often the next time.
> > >>>
> > >>> Another thing you can do is create a “shopping cart” recommender.
> This
> > >>> looks at items purchased together—an item-set. It is a strong
> indicator
> > > of
> > >>> relatedness.
> > >>>
> > >>> Suggestions:
> > >>> 1) personalize: this is likely to make the most difference since you
> > > will
> > >>> be showing different things to different people. The “Practical
> Machine
> > >>> Learning” is short and easy to read—it describes this.
> > >>> 2) move to purchase data training, wait for cross-cooccurrence to add
> > in
> > >>> view data. Do this if you have good coverage (Ted’s questions below
> > > relate
> > >>> to this).
> > >>> 3) increase the training period if needed to get good catalog
> coverage
> > >>> 4) consider dithering your recs to expose more items to purchase and
> > >>> therefore self-tune by increasing the quality of your training data.
> > >>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> 2014-08-20 7:18 GMT+04:00 Ted Dunning <te...@gmail.com>:
> > >>>>
> > >>>>> On Tue, Aug 19, 2014 at 12:53 AM, Serega Sheypak <
> > >>> serega.sheypak@gmail.com
> > >>>>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> What could be a reason for recommending "Water heat device " to
> > > iPhone?
> > >>>>>> iPhone is one of the most popular item. There should be a lot of
> > > people
> > >>>>>> viewing iPhone with "Water heat device "?
> > >>>>>>
> > >>>>>
> > >>>>> What are the numbers?
> > >>>>>
> > >>>>> How many people got each item?  How many people total?  How many
> > > people
> > >>> got
> > >>>>> both?
> > >>>>>
> > >>>>> What about the same for the iPhone related items?
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >
> >
>

Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Ok, I got it. Is it Ted's book?
http://www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684/ref=la_B00EHXC1NK_1_1?s=books&ie=UTF8&qid=1408689021&sr=1-1

I've read this one:
http://www.amazon.com/Apache-Mahout-Cookbook-Piero-Giacomelli-ebook/dp/B00HJR6R86/ref=sr_1_2?s=books&ie=UTF8&qid=1408689063&sr=1-2&keywords=mahout

No satisfaction at all




2014-08-21 20:32 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:

> Sorry that wasn’t clear.
>
> Given your purchase volume, you may not have very good coverage training
> on purchases only. So using views may be your best bet. Ted’s metrics were:
> "How many people got each item?  How many people total?  How many people
> got both?” This is how you tell what action has enough data to be useful.
> In your case that may be views.
>
> The other point was about doing personalization. Since you have
> itemsimilarity working well you can add personalization with a search
> engine using methods described in Ted’s book. This requires that you
> capture user history (views in this case) and use that as a query on the
> itemsimilarity data. If you know enough of the current user’s recent
> history this will allow you to show “people with the same taste as you also
> looked at these items”.
>
> Currently you are not personalizing, you are showing the same “similar
> items” to every user. That is fine but personalization may improve things
> further.
>
>
> On Aug 21, 2014, at 8:08 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> Excuse me, it looks like I've missed an important point:
> "Ah, then using Ted’s metrics views _is_ probably your best bet."
> Are you talking about serving "personal recommendations" from a search
> engine? The idea was to get the active visitor's view history and give him
> "similar" view histories from the search engine at runtime?
>
>
> 2014-08-21 18:50 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>
> >>
> >> On Aug 21, 2014, at 1:22 AM, Serega Sheypak <se...@gmail.com>
> > wrote:
> >>
> >>>> What you are doing is best practice for showing similar “views”. The
> >> technique for using multiple actions will be covered in a series of
> blogs
> >> posts and may be put on the Mahout site eventually
> >> Great thanks!
> >>
> >>>> People look at 100 things and buy 1, as you say. The question is: Do
> > you
> >> want people to buy something or just browse your site?
> >> No objections for your point. I understand it. It should work for pretty
> >> big ecom, right? Small ecom sell 100-200 items per day and have wide
> > range
> >> of items.
> >
> > Ah, then using Ted’s metrics views _is_ probably your best bet. You can
> > probably still personalize view recommendations. Since you are already
> > using itemsimilarity it can be a second step that builds on the first.
> >
> >>
> >>>> Filter out any items not in the catalog from your recommendations.
> >> We have it on data preparation stage. We recalculate item similarity
> each
> >> day sliding back for 60 days excluding non-available items on
> preparation
> >> stage.
> >>
> >> Thank you! We did reach good results, business guys got satisfaction :)
> >>
> >>
> >> 2014-08-20 20:28 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
> >>
> >>>>
> >>>> On Aug 19, 2014, at 11:26 PM, Serega Sheypak <
> serega.sheypak@gmail.com
> >>
> >>> wrote:
> >>>>
> >>>> Hi!
> >>>> 1. There was a bug in UI, I've checked raw recommendations. "water
> >>>> heating device" has a low score. So the first 30 recommended items
> >>>> really fit iPhone; the next are not so good. Anyway the result is
> >>>> good, thank you very much.
> >>>> 2. I've inspected "sessions" of users; really, there are people who
> >>>> viewed iphone and heating device. 10 people for the last month.
> >>>> 3. I will calculate a relative measurement; I didn't calc what the %
> >>>> of these people is compared to others and how they influence the score
> >>>> result.
> >>>>
> >>>
> >>> That’s great. The Spark version sorts the result by weights, but I
> think
> >>> the mapreduce version doesn't
> >>>
> >>>> *You wrote:*
> >>>> Then once you have that working you can add more actions but only with
> >>>> cross-cooccurrence, adding by weighting* will not work with this type
> > of
> >>>> recommender*, which recommender can work with weights for actions?
> >>>
> >>> What you are doing is best practice for showing similar “views”. The
> >>> technique for using multiple actions will be covered in a series of
> > blogs
> >>> posts and may be put on the Mahout site eventually. It requires
> >>> spark-itemsimilarity. For now I’d strongly suggest you look at training
> > on
> >>> purchase data alone - see the comments below.
> >>>
> >>>>
> >>>> *About building recommendations using sales.*
> >>>> Sales are less than 1% from item views. You will recommend only stuff
> >>>> people buy.
> >>>
> >>> The point is not volume of data but quality of data. I once measured
> how
> >>> predictive of purchases the views were and found them a rather poor
> >>> predictor. People look at 100 things and buy 1, as you say. The
> question
> >>> is: Do you want people to buy something or just browse your site?
> >>>
> >>> On the other hand you would need to see how good your coverage is of
> >>> purchases. Do you have enough items purchased by several people (Ted’s
> >>> questions below will guide you)? If there is good coverage then you _do
> >>> not_ restrict the range by using only purchase data. You increase the
> >>> quality.
> >>>
> >>>> If you recommend what people see you significantly widen range
> >>>> of possible buy actions. People always buy case "XXX" with iphone. You
> >>>> would never recommened them to buy case "YYY". If people watch "XXX"
> > and
> >>>> "YYY" it's reasonable to recommened "YYY". Maybe "YYY" it's more
> >>> expensive
> >>>> that is why people prefer cheaper "XXX". What's wrong with this
> >>> assumption?
> >>>
> >>> Nothing at all. Remember that your goal is to cause a purchase but
> using
> >>> views requires some “scrubbing” of views. You want, in effect,
> >>> views-that-lead-to-purchases. In a cooccurrence recommender this can be
> >>> done with cross-cooccurrence and I’ll describe that elsewhere, it’s too
> >>> long for an email to describe but pretty easy to use.
> >>>
> >>> I’d wager that if you restrict to purchases your sales will go up over
> >>> recommending views. But that is without looking at your data. If you
> > need
> >>> more data try increasing the sliding time window to add more purchases.
> > This
> >>> will eventually start including things that are no longer in your
> > catalog
> >>> so will have diminishing returns but 60 days seems like a short time
> > period.
> >>> Filter out any items not in the catalog from your recommendations.
> >>>
> >>> You want recency to matter, this is good intuition. The in-catalog
> > filter
> >>> is one simple way, and there are others when you get to
> personalization.
> >>>
> >>>>
> >>>> *About our obsessive desire to add weights for actions.*
> >>>> We would like to self-tune our recommendations. If user clicks our
> >>>> recommendation it's a signal for us that items are related. So next
> > time
> >>>> this link should have higher score. What are the approaches to do it?
> >>>>
> >>>
> >>> Yes, you do want the things that lead to purchases to go into the
> > training
> >>> data. This is good intuition. But you don’t do it with weights; you
> >>> train on new purchases, regardless of whether they came from random views,
> >>> rec-views, or … You don’t care whether a rec was clicked on; you care
> > if a
> >>> purchase was made and you don’t care what part of the UI caused it. UI
> >>> analysis is very very important but doesn’t help the recommender, it
> > guides
> >>> UI decisions. So measuring clicks is good but shouldn’t be used to
> > change
> >>> recs.
> >>>
> >>> One way to increase the value of your recs is to add a little
> randomness
> >>> to their ordering. If you have 10 things to recommend get 20 from
> >>> itemsimilarity and apply a normally distributed random weighting, then
> >>> re-sort and show the top 10. This will move some things up in order and
> >>> show them where without the re-ordering they would never be shown. The
> >>> technique allows you to expose more items to possible purchase and
> >>> therefore affect the ordering the next time you train. The actual
> > algorithm
> >>> takes more space to describe but the idea is a lot like a multi-armed
> >>> bandit where the best bandit eventually gets all trials. In this case
> > the
> >>> best rec leads to a purchase and gets into the new training data and so
> >>> will be shown more often the next time.
> >>>
> >>> Another thing you can do is create a “shopping cart” recommender. This
> >>> looks at items purchased together—an item-set. It is a strong indicator
> > of
> >>> relatedness.
> >>>
> >>> Suggestions:
> >>> 1) personalize: this is likely to make the most difference since you
> > will
> >>> be showing different things to different people. The “Practical Machine
> >>> Learning” is short and easy to read—it describes this.
> >>> 2) move to purchase data training, wait for cross-cooccurrence to add
> in
> >>> view data. Do this if you have good coverage (Ted’s questions below
> > relate
> >>> to this).
> >>> 3) increase the training period if needed to get good catalog coverage
> >>> 4) consider dithering your recs to expose more items to purchase and
> >>> therefore self-tune by increasing the quality of your training data.
> >>>
> >>>>
> >>>>
> >>>>
> >>>> 2014-08-20 7:18 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >>>>
> >>>>> On Tue, Aug 19, 2014 at 12:53 AM, Serega Sheypak <
> >>> serega.sheypak@gmail.com
> >>>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> What could be a reason for recommending "Water heat device " to
> > iPhone?
> >>>>>> iPhone is one of the most popular item. There should be a lot of
> > people
> >>>>>> viewing iPhone with "Water heat device "?
> >>>>>>
> >>>>>
> >>>>> What are the numbers?
> >>>>>
> >>>>> How many people got each item?  How many people total?  How many
> > people
> >>> got
> >>>>> both?
> >>>>>
> >>>>> What about the same for the iPhone related items?
> >>>>>
> >>>>
> >>>
> >>
> >
>
>

Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry that wasn’t clear.

Given your purchase volume, you may not have very good coverage training on purchases only. So using views may be your best bet. Ted’s metrics were: “How many people got each item? How many people total? How many people got both?” This is how you tell which action has enough data to be useful. In your case that may be views.
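Those three counts are exactly what the log-likelihood ratio test consumes. A
rough sketch in python of the G^2 computation that LLR similarity is based
on; the counts at the bottom are made-up illustrative numbers, not data from
this thread:

import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy: xlogx(total) minus the sum of xlogx(count).
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(both, only_a, only_b, neither):
    # 2x2 contingency table over users: both items, only A, only B, neither.
    row = entropy(both + only_a, only_b + neither)
    col = entropy(both + only_b, only_a + neither)
    mat = entropy(both, only_a, only_b, neither)
    return 2.0 * (row + col - mat)

# e.g. 10 users saw both items, 100000 saw A, 5000 saw B, 2000000 users total:
n_total, n_a, n_b, n_both = 2000000, 100000, 5000, 10
print(llr(n_both, n_a - n_both, n_b - n_both, n_total - n_a - n_b + n_both))

A score that is low relative to other pairs suggests the cooccurrence is
noise; a high one suggests a real association.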

The other point was about doing personalization. Since you have itemsimilarity working well you can add personalization with a search engine using methods described in Ted’s book. This requires that you capture user history (views in this case) and use that as a query on the itemsimilarity data. If you know enough of the current user’s recent history this will allow you to show “people with the same taste as you also looked at these items”.
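A minimal sketch of that query step, assuming the itemsimilarity output has
been indexed with one document per item and a hypothetical "indicators" field
holding its similar-item IDs; the field and item IDs are illustrative:

def history_query(recent_item_ids, field="indicators", rows=10):
    """Turn a user's recent views into a search-engine OR query.

    Items whose indicator lists best match the history score highest,
    which yields personalized "people with your taste also viewed" recs.
    """
    clause = " OR ".join("%s:%s" % (field, item) for item in recent_item_ids)
    return {"q": clause, "rows": rows}

print(history_query(["iphone5s_16gb", "iphone5s_case_red"]))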

Currently you are not personalizing, you are showing the same “similar items” to every user. That is fine but personalization may improve things further.


On Aug 21, 2014, at 8:08 AM, Serega Sheypak <se...@gmail.com> wrote:

Excuse me, it looks like I've missed an important point:
"Ah, then using Ted’s metrics views _is_ probably your best bet."
Are you talking about serving "personal recommendations" from a search
engine? The idea was to get the active visitor's view history and give him
"similar" view histories from the search engine at runtime?


2014-08-21 18:50 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

>> 
>> On Aug 21, 2014, at 1:22 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>> 
>>>> What you are doing is best practice for showing similar “views”. The
>> technique for using multiple actions will be covered in a series of blogs
>> posts and may be put on the Mahout site eventually
>> Great thanks!
>> 
>>>> People look at 100 things and buy 1, as you say. The question is: Do
> you
>> want people to buy something or just browse your site?
>> No objections for your point. I understand it. It should work for pretty
>> big ecom, right? Small ecom sell 100-200 items per day and have wide
> range
>> of items.
> 
> Ah, then using Ted’s metrics views _is_ probably your best bet. You can
> probably still personalize view recommendations. Since you are already
> using itemsimilarity it can be a second step that builds on the first.
> 
>> 
>>>> Filter out any items not in the catalog from your recommendations.
>> We have it on data preparation stage. We recalculate item similarity each
>> day sliding back for 60 days excluding non-available items on preparation
>> stage.
>> 
>> Thank you! We did reach good results, business guys got satisfaction :)
>> 
>> 
>> 2014-08-20 20:28 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:
>> 
>>>> 
>>>> On Aug 19, 2014, at 11:26 PM, Serega Sheypak <serega.sheypak@gmail.com
>> 
>>> wrote:
>>>> 
>>>> Hi!
>>>> 1. There was a bug in UI, I've checked raw recommendations. "water
>>>> heating device" has a low score. So the first 30 recommended items
>>>> really fit iPhone; the next are not so good. Anyway the result is good,
>>>> thank you very much.
>>>> 2. I've inspected "sessions" of users; really, there are people who
>>>> viewed iphone and heating device. 10 people for the last month.
>>>> 3. I will calculate a relative measurement; I didn't calc what the % of
>>>> these people is compared to others and how they influence the score
>>>> result.
>>>> 
>>> 
>>> That’s great. The Spark version sorts the result by weights, but I think
>>> the mapreduce version doesn't
>>> 
>>>> *You wrote:*
>>>> Then once you have that working you can add more actions but only with
>>>> cross-cooccurrence, adding by weighting* will not work with this type
> of
>>>> recommender*, which recommender can work with weights for actions?
>>> 
>>> What you are doing is best practice for showing similar “views”. The
>>> technique for using multiple actions will be covered in a series of
> blogs
>>> posts and may be put on the Mahout site eventually. It requires
>>> spark-itemsimilarity. For now I’d strongly suggest you look at training
> on
>>> purchase data alone - see the comments below.
>>> 
>>>> 
>>>> *About building recommendations using sales.*
>>>> Sales are less than 1% from item views. You will recommend only stuff
>>>> people buy.
>>> 
>>> The point is not volume of data but quality of data. I once measured how
>>> predictive of purchases the views were and found them a rather poor
>>> predictor. People look at 100 things and buy 1, as you say. The question
>>> is: Do you want people to buy something or just browse your site?
>>> 
>>> On the other hand you would need to see how good your coverage is of
>>> purchases. Do you have enough items purchased by several people (Ted’s
>>> questions below will guide you)? If there is good coverage then you _do
>>> not_ restrict the range by using only purchase data. You increase the
>>> quality.
>>> 
>>>> If you recommend what people see you significantly widen range
>>>> of possible buy actions. People always buy case "XXX" with iphone. You
>>>> would never recommened them to buy case "YYY". If people watch "XXX"
> and
>>>> "YYY" it's reasonable to recommened "YYY". Maybe "YYY" it's more
>>> expensive
>>>> that is why people prefer cheaper "XXX". What's wrong with this
>>> assumption?
>>> 
>>> Nothing at all. Remember that your goal is to cause a purchase but using
>>> views requires some “scrubbing” of views. You want, in effect,
>>> views-that-lead-to-purchases. In a cooccurrence recommender this can be
>>> done with cross-cooccurrence and I’ll describe that elsewhere, it’s too
>>> long for an email to describe but pretty easy to use.
>>> 
>>> I’d wager that if you restrict to purchases your sales will go up over
>>> recommending views. But that is without looking at your data. If you
> need
>>> more data try increasing the sliding time window to add more purchases.
> This
>>> will eventually start including things that are no longer in your
> catalog
>>> so will have diminishing returns but 60 days seems like a short time
> period.
>>> Filter out any items not in the catalog from your recommendations.
>>> 
>>> You want recency to matter, this is good intuition. The in-catalog
> filter
>>> is one simple way, and there are others when you get to personalization.
>>> 
>>>> 
>>>> *About our obsessive desire to add weights for actions.*
>>>> We would like to self-tune our recommendations. If user clicks our
>>>> recommendation it's a signal for us that items are related. So next
> time
>>>> this link should have higher score. What are the approaches to do it?
>>>> 
>>> 
>>> Yes, you do want the things that lead to purchases to go into the
> training
>>> data. This is good intuition. But you don’t do it with weights you
> train on
>>> new purchases, regardless of whether they came from random views,
>>> rec-views, or … You don’t care whether a rec was clicked on; you care
> if a
>>> purchase was made and you don’t care what part of the UI caused it. UI
>>> analysis is very very important but doesn’t help the recommender, it
> guides
>>> UI decisions. So measuring clicks is good but shouldn’t be used to
> change
>>> recs.
>>> 
>>> One way to increase the value of your recs is to add a little randomness
>>> to their ordering. If you have 10 things to recommend get 20 from
>>> itemsimilarity and apply a normally distributed random weighting, then
>>> re-sort and show the top 10. This will move some things up in order and
>>> show them where without the re-ordering they would never be shown. The
>>> technique allows you to expose more items to possible purchase and
>>> therefore affect the ordering the next time you train. The actual
> algorithm
>>> takes more space to describe but the idea is a lot like a multi-armed
>>> bandit where the best bandit eventually gets all trials. In this case
> the
>>> best rec leads to a purchase and gets into the new training data and so
>>> will be shown more often the next time.
>>> 
>>> Another thing you can do is create a “shopping cart” recommender. This
>>> looks at items purchased together—an item-set. It is a strong indicator
> of
>>> relatedness.
>>> 
>>> Suggestions:
>>> 1) personalize: this is likely to make the most difference since you
> will
>>> be showing different things to different people. The “Practical Machine
>>> Learning” is short and easy to read—it describes this.
>>> 2) move to purchase data training, wait for cross-cooccurrence to add in
>>> view data. Do this if you have good coverage (Ted’s questions below
> relate
>>> to this).
>>> 3) increase the training period if needed to get good catalog coverage
>>> 4) consider dithering your recs to expose more items to purchase and
>>> therefore self-tune by increasing the quality of your training data.
>>> 
>>>> 
>>>> 
>>>> 
>>>> 2014-08-20 7:18 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>>> 
>>>>> On Tue, Aug 19, 2014 at 12:53 AM, Serega Sheypak <
>>> serega.sheypak@gmail.com
>>>>>> 
>>>>> wrote:
>>>>> 
>>>>>> What could be a reason for recommending "Water heat device " to
> iPhone?
>>>>>> iPhone is one of the most popular item. There should be a lot of
> people
>>>>>> viewing iPhone with "Water heat device "?
>>>>>> 
>>>>> 
>>>>> What are the numbers?
>>>>> 
>>>>> How many people got each item?  How many people total?  How many
> people
>>> got
>>>>> both?
>>>>> 
>>>>> What about the same for the iPhone related items?
>>>>> 
>>>> 
>>> 
>> 
> 


Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Excuse me, it looks like I've missed an important point:
"Ah, then, going by Ted’s metrics, views _are_ probably your best bet."
You are talking about serving "personal recommendations" from a search
engine? The idea was to get the active visitor's view history and give him
"similar" view histories from the search engine at runtime?


2014-08-21 18:50 GMT+04:00 Pat Ferrel <pa...@gmail.com>:


Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
> 
> On Aug 21, 2014, at 1:22 AM, Serega Sheypak <se...@gmail.com> wrote:
> 
>>> What you are doing is best practice for showing similar “views”. The
> technique for using multiple actions will be covered in a series of blogs
> posts and may be put on the Mahout site eventually
> Great, thanks!
> 
>>> People look at 100 things and buy 1, as you say. The question is: Do you
> want people to buy something or just browse your site?
> No objections to your point, I understand it. It should work for a pretty
> big ecom site, right? A small ecom site sells 100-200 items per day and
> has a wide range of items.

Ah, then, going by Ted’s metrics, views _are_ probably your best bet. You can probably still personalize view recommendations. Since you are already using itemsimilarity it can be a second step that builds on the first.

> 
>>> Filter out any items not in the catalog from your recommendations.
> We do that at the data preparation stage. We recalculate item similarity
> each day, sliding back 60 days and excluding non-available items at the
> preparation stage.
> 
> Thank you! We did reach good results, the business guys are satisfied :)
> 

Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
>>What you are doing is best practice for showing similar “views”. The
technique for using multiple actions will be covered in a series of blogs
posts and may be put on the Mahout site eventually
Great, thanks!

>>People look at 100 things and buy 1, as you say. The question is: Do you
want people to buy something or just browse your site?
No objections to your point, I understand it. It should work for a pretty
big ecom site, right? A small ecom site sells 100-200 items per day and
has a wide range of items.

>>Filter out any items not in the catalog from your recommendations.
We do that at the data preparation stage. We recalculate item similarity
each day, sliding back 60 days and excluding non-available items at the
preparation stage.

Thank you! We did reach good results, the business guys are satisfied :)


2014-08-20 20:28 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:


Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@occamsmachete.com>.
> 
> On Aug 19, 2014, at 11:26 PM, Serega Sheypak <se...@gmail.com> wrote:
> 
> Hi!
> 1. There was a bug in the UI; I've checked the raw recommendations. The "water
> heating device" has a low score. So the first 30 recommended items really fit
> the iPhone; the next ones are not so good. Anyway, the result is good, thank
> you very much.
> 2. I've inspected users' "sessions"; there really are people who viewed the
> iPhone and the heating device. 10 people in the last month.
> 3. I will calculate a relative measurement; I didn't calculate what % of these
> people there are compared to the others and how they influence the score result.
> 

That’s great. The Spark version sorts the result by weights, but I think the mapreduce version doesn’t.

> *You wrote:*
> Then once you have that working you can add more actions but only with
> cross-cooccurrence, adding by weighting* will not work with this type of
> recommender*, which recommender can work with weights for actions?

What you are doing is best practice for showing similar “views”. The technique for using multiple actions will be covered in a series of blogs posts and may be put on the Mahout site eventually. It requires spark-itemsimilarity. For now I’d strongly suggest you look at training on purchase data alone - see the comments below. 

> 
> *About building recommendations using sales.*
> Sales are less than 1% of item views. You will recommend only stuff
> people buy.

The point is not volume of data but quality of data. I once measured how predictive of purchases the views were and found them a rather poor predictor. People look at 100 things and buy 1, as you say. The question is: Do you want people to buy something or just browse your site?

On the other hand you would need to see how good your coverage is of purchases. Do you have enough items purchased by several people (Ted’s questions below will guide you)? If there is good coverage then you _do not_ restrict the range by using only purchase data. You increase the quality.

> If you recommend what people see, you significantly widen the range
> of possible buy actions. People always buy case "XXX" with the iPhone. You
> would never recommend they buy case "YYY". If people view "XXX" and
> "YYY" it's reasonable to recommend "YYY". Maybe "YYY" is more expensive
> and that is why people prefer the cheaper "XXX". What's wrong with this
> assumption?

Nothing at all. Remember that your goal is to cause a purchase but using views requires some “scrubbing” of views. You want, in effect, views-that-lead-to-purchases. In a cooccurrence recommender this can be done with cross-cooccurrence and I’ll describe that elsewhere, it’s too long for an email to describe but pretty easy to use.
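To make those counts concrete, here is a minimal sketch in plain Java (illustrative names only, not a Mahout API). For plain cooccurrence both maps come from the same action, e.g. purchases; for cross-cooccurrence one map holds who purchased item A and the other holds who viewed item B. The four counts then feed the LLR test shown later in the thread.

import java.util.*;

/** Sketch: contingency counts behind cooccurrence and cross-cooccurrence. */
public class CrossCooccurrenceCounts {

  /** 2x2 table for "users who did action A on itemA" vs "action B on itemB". */
  static long[] counts(Map<String, Set<String>> usersByItemA,
                       Map<String, Set<String>> usersByItemB,
                       String itemA, String itemB, long totalUsers) {
    Set<String> a = usersByItemA.getOrDefault(itemA, Collections.emptySet());
    Set<String> b = usersByItemB.getOrDefault(itemB, Collections.emptySet());
    long k11 = a.stream().filter(b::contains).count(); // did both
    long k12 = a.size() - k11;                         // itemA only
    long k21 = b.size() - k11;                         // itemB only
    long k22 = totalUsers - k11 - k12 - k21;           // neither
    return new long[] {k11, k12, k21, k22};
  }

  public static void main(String[] args) {
    // built from (user, item) purchase and view lines
    Map<String, Set<String>> purchased = new HashMap<>();
    Map<String, Set<String>> viewed = new HashMap<>();
    purchased.computeIfAbsent("iphone", k -> new HashSet<>()).add("u1");
    viewed.computeIfAbsent("case-xxx", k -> new HashSet<>()).addAll(Set.of("u1", "u2"));

    // purchase vs view = cross-cooccurrence; purchase vs purchase = cooccurrence
    long[] k = counts(purchased, viewed, "iphone", "case-xxx", 2_000_000L);
    System.out.println(Arrays.toString(k)); // feed these into an LLR test
  }
}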

I’d wager that if you restrict to purchases your sales will go up over recommending views. But that is without looking at your data. If you need more data, try increasing the sliding time window to add more purchases. This will eventually start including things that are no longer in your catalog so will have diminishing returns, but 60 days seems like a short time period. Filter out any items not in the catalog from your recommendations.
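The in-catalog filter itself is a one-liner; a sketch in plain Java, assuming the catalog is available as a set of live item ids (all names here are illustrative):

import java.util.*;

/** Sketch: drop recommended items that are no longer in the catalog. */
public class CatalogFilter {
  static List<String> filterToCatalog(List<String> recs, Set<String> catalog) {
    List<String> kept = new ArrayList<>();
    for (String itemId : recs) {        // recs arrive already ordered by score
      if (catalog.contains(itemId)) {
        kept.add(itemId);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    System.out.println(filterToCatalog(
        List.of("iphone-16gb", "discontinued-item", "case-xxx"),
        Set.of("iphone-16gb", "case-xxx"))); // [iphone-16gb, case-xxx]
  }
}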

You want recency to matter, this is good intuition. The in-catalog filter is one simple way, and there are others when you get to personalization.

> 
> *About our obsessive desire to add weights for actions.*
> We would like to self-tune our recommendations. If a user clicks our
> recommendation, it's a signal for us that the items are related. So next time
> this link should have a higher score. What are the approaches to do this?
> 

Yes, you do want the things that lead to purchases to go into the training data. This is good intuition. But you don’t do it with weights; you train on new purchases, regardless of whether they came from random views, rec-views, or … You don’t care whether a rec was clicked on; you care if a purchase was made, and you don’t care what part of the UI caused it. UI analysis is very, very important but doesn’t help the recommender, it guides UI decisions. So measuring clicks is good but shouldn’t be used to change recs.

One way to increase the value of your recs is to add a little randomness to their ordering. If you have 10 things to recommend, get 20 from itemsimilarity and apply a normally distributed random weighting, then re-sort and show the top 10. This will move some items up in the order and show them where, without the re-ordering, they would never be shown. The technique allows you to expose more items to possible purchase and therefore affect the ordering the next time you train. The actual algorithm takes more space to describe, but the idea is a lot like a multi-armed bandit where the best bandit eventually gets all trials. In this case the best rec leads to a purchase, gets into the new training data, and so will be shown more often the next time.
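A minimal sketch of that dithering step in plain Java (the noise level and names are illustrative assumptions, not from this thread): fetch more candidates than you will show, jitter each score with Gaussian noise, re-sort, and cut to the display size.

import java.util.*;

/** Sketch: dither an ordered candidate list before display. */
public class Dither {
  record Rec(String itemId, double score) {}

  static List<Rec> dither(List<Rec> candidates, int show, double noise, Random rnd) {
    List<Rec> jittered = new ArrayList<>();
    for (Rec r : candidates) {
      // normally distributed random weighting added to each similarity score
      jittered.add(new Rec(r.itemId(), r.score() + noise * rnd.nextGaussian()));
    }
    jittered.sort(Comparator.comparingDouble(Rec::score).reversed());
    return jittered.subList(0, Math.min(show, jittered.size()));
  }

  public static void main(String[] args) {
    List<Rec> top20 = List.of(
        new Rec("case-xxx", 4.2), new Rec("case-yyy", 3.9)
        /* ... 18 more from itemsimilarity ... */);
    // fetch 20, show 10; noise around 10% of the score range is a starting
    // point to tune, since too much noise buries the best recs entirely
    System.out.println(dither(top20, 10, 0.4, new Random()));
  }
}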

Another thing you can do is create a “shopping cart” recommender. This looks at items purchased together—an item-set. It is a strong indicator of relatedness. 
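The raw material for that shopping-cart recommender is just pair counts over order item-sets. A sketch in plain Java (hypothetical data); the resulting counts can be scored with the same LLR test as view cooccurrence:

import java.util.*;

/** Sketch: count item pairs that appear in the same order (item-sets). */
public class CartPairs {
  public static void main(String[] args) {
    List<List<String>> orders = List.of(
        List.of("iphone", "case-xxx"),
        List.of("iphone", "case-xxx", "charger"),
        List.of("kettle", "teapot"));

    Map<String, Integer> pairCounts = new HashMap<>();
    for (List<String> order : orders) {
      // dedupe and sort so each unordered pair is counted once per order
      List<String> items = new ArrayList<>(new TreeSet<>(order));
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          pairCounts.merge(items.get(i) + "|" + items.get(j), 1, Integer::sum);
        }
      }
    }
    System.out.println(pairCounts); // e.g. {case-xxx|iphone=2, ...}
  }
}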

Suggestions:
1) personalize: this is likely to make the most difference since you will be showing different things to different people. The “Practical Machine Learning” is short and easy to read—it describes this.
2) move to purchase data training, wait for cross-cooccurrence to add in view data. Do this if you have good coverage (Ted’s questions below relate to this).
3) increase the training period if needed to get good catalog coverage
4) consider dithering your recs to expose more items to purchase and therefore self-tune by increasing the quality of your training data.


Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Hi!
1. There was a bug in the UI; I've checked the raw recommendations. The "water
heating device" has a low score. So the first 30 recommended items really fit
the iPhone; the next ones are not so good. Anyway, the result is good, thank
you very much.
2. I've inspected users' "sessions"; there really are people who viewed the
iPhone and the heating device. 10 people in the last month.
3. I will calculate a relative measurement; I didn't calculate what % of these
people there are compared to the others and how they influence the score result.

*You wrote:*
 Then once you have that working you can add more actions but only with
cross-cooccurrence, adding by weighting* will not work with this type of
recommender*, which recommender can work with weights for actions?

*About building recommendations using sales.*
Sales are less than 1% of item views. You will recommend only stuff
people buy. If you recommend what people see, you significantly widen the range
of possible buy actions. People always buy case "XXX" with the iPhone. You
would never recommend they buy case "YYY". If people view "XXX" and
"YYY" it's reasonable to recommend "YYY". Maybe "YYY" is more expensive
and that is why people prefer the cheaper "XXX". What's wrong with this
assumption?

*About our obsessive desire to add weights for actions.*
We would like to self-tune our recommendations. If a user clicks our
recommendation, it's a signal for us that the items are related. So next time
this link should have a higher score. What are the approaches to do this?




2014-08-20 7:18 GMT+04:00 Ted Dunning <te...@gmail.com>:


Re: mapreduce ItemSimilarity input optimization

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Aug 19, 2014 at 12:53 AM, Serega Sheypak <se...@gmail.com>
wrote:

> What could be the reason for recommending the "Water heat device" for the
> iPhone? The iPhone is one of the most popular items. There would have to be a
> lot of people viewing the iPhone together with the "Water heat device", right?
>

What are the numbers?

How many people got each item?  How many people total?  How many people got
both?

What about the same for the iPhone related items?
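These four counts are the 2x2 contingency table behind the log-likelihood ratio score discussed in this thread. A sketch of the standard G-squared computation in plain Java (the general formula, not Mahout's own class):

/** Sketch: LLR (G^2) from the 2x2 table the questions above describe. */
public class Llr {
  static double xLogX(double x) {
    return x > 0 ? x * Math.log(x) : 0.0;
  }

  // k11: users with both items, k12: first item only,
  // k21: second item only, k22: neither; their sum is all users.
  static double llr(long k11, long k12, long k21, long k22) {
    double n = k11 + k12 + k21 + k22;
    double rows = xLogX(k11 + k12) + xLogX(k21 + k22);
    double cols = xLogX(k11 + k21) + xLogX(k12 + k22);
    double cells = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    return 2.0 * (cells + xLogX(n) - rows - cols);
  }

  public static void main(String[] args) {
    // e.g. 10 users saw both the iPhone and the heater, out of 2,000,000
    System.out.println(llr(10, 99_990, 490, 1_899_510));
  }
}

A high score here says the overlap is larger than chance would predict given how popular each item is on its own, which is exactly what the questions above are probing.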

Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Hi, I've used LLR with the properties you've suggested.
Right now I have a problem.
The problem:
A water heating device (
http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg)
is recommended for the iPhone. And it has one of the highest scores.
The good things:
iPhone cases (
https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg)
are recommended for the iPhone, which is good.
Other smartphones are recommended for the iPhone, which is good.
Other iPhones are recommended for the iPhone, which is good: 16GB recommended
for 32GB, etc.

What could be the reason for recommending the "Water heat device" for the
iPhone? The iPhone is one of the most popular items. There would have to be a
lot of people viewing the iPhone together with the "Water heat device", right?



2014-08-18 20:15 GMT+04:00 Pat Ferrel <pa...@gmail.com>:


Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
Oh, and as to using different algorithms, this is an “ensemble” method. In the paper they are talking about using widely differing algorithms like ALS + Cooccurrence + … This technique was used to win the Netflix prize, but in practice the improvements may be too small to warrant running multiple pipelines. In any case it isn’t the first improvement you may want to try. For instance your UI will have a drastic effect on how well your recs do, and there are other much easier techniques that we can talk about once you get the basics working.


On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pa...@gmail.com> wrote:

When beginning to use a recommender from Mahout I always suggest you start from the defaults. These often give the best results—then tune afterwards to improve.

Your intuition is correct that multiple actions can be used to improve results but get the basics working first. The easiest way to use multiple actions is to use spark-itemsimilarity so since you are using mapreduce for now, just use one action. 

I would not try to combine the results from two similarity measures; there is no benefit, since LLR is better than any of them, at least I’ve never seen it lose. Below is my experience with trying many of the similarity metrics on exactly the same data. I did cross-validation with precision (MAP, mean average precision). LLR wins in other cases I’ve tried too. So LLR is the only method presently used in the Spark version of itemsimilarity.

[attached chart: map-comparison, MAP by similarity metric]

If you still get weird results double check your ID mapping. Run a small bit of data through and spot check the mapping by hand.

At some point you will want to create a cross-validation test. This is good as a sort of integration sanity check when making changes to the recommender. You run cross-validation using standard test data to see if the score changes drastically between releases. Big changes may indicate a bug. At the beginning it will help you tune as in the case above where it helped decide on LLR.
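As a rough sketch of such a test in plain Java (hypothetical names; hold out part of each user's history, recommend from the rest, and score precision at k):

import java.util.*;

/** Sketch: precision@k against held-out interactions. */
public class PrecisionAtK {
  static double precisionAtK(Map<String, List<String>> recsByUser,
                             Map<String, Set<String>> heldOutByUser, int k) {
    double sum = 0;
    int users = 0;
    for (Map.Entry<String, List<String>> e : recsByUser.entrySet()) {
      Set<String> heldOut = heldOutByUser.get(e.getKey());
      if (heldOut == null || heldOut.isEmpty()) continue;
      List<String> topK = e.getValue().subList(0, Math.min(k, e.getValue().size()));
      long hits = topK.stream().filter(heldOut::contains).count();
      sum += (double) hits / k;   // MAP would average precision over ranks instead
      users++;
    }
    return users == 0 ? 0 : sum / users;
  }

  public static void main(String[] args) {
    Map<String, List<String>> recs = Map.of("u1", List.of("a", "b", "c"));
    Map<String, Set<String>> held = Map.of("u1", Set.of("b"));
    System.out.println(precisionAtK(recs, held, 3)); // 0.333...
  }
}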



On Aug 18, 2014, at 1:43 AM, Serega Sheypak <se...@gmail.com> wrote:

Thank you very much. I'll do what you are saying in bullets 1...5 and try
again.

I also tried:
1. calc data using SIMILARITY_COSINE
2. calc the same data using SIMILARITY_COOCCURRENCE
3. join #1 and #2 where the cooccurrence count >= $threshold

Where the threshold is some empirical integer value; I've used "2". The idea
is to filter out item pairs which almost never occur together...
Please see this link:
http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html

If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
approach still make sense, or is it a useless waste of time?

"What do you mean the similar items are terrible? How are you measuring
that? " I have eye testing only,
I did automate preparation->calculation->hbase upload-> web-app serving, I
didn't automate testing.




2014-08-18 5:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:

> the things that stand out:
> 
> 1) remove your maxSimilaritiesPerItem option! 50000 maxSimilaritiesPerItem
> will _kill_ performance and give no gain, leave this setting at the default
> of 500
> 2) use only one action. What do you want the user to do? Do you want them
> to read a page? Then train on item page views. If those pages lead to a
> purchase then you want to recommend purchases so train on user purchases.
> 3) remove your minPrefsPerUser option, this should never be 0 or it will
> leave users in the training data that have no data and may contribute to
> longer runs with no gain.
> 4) this is a pretty small Hadoop cluster for the size of your data but I
> bet changing #1 will noticeably reduce the runtime
> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
> 6) remove your --booleanData option since LLR ignores weights (the revised
> options are combined in the sketch below).
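Bullets 1, 2, 3, 5 and 6 combined amount to an invocation like the following sketch, run here through ToolRunner rather than oozie (the paths are placeholders; the same arguments drop straight into the oozie action quoted further down):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

/** Sketch: the job with the corrections above applied. */
public class RunItemSimilarity {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new ItemSimilarityJob(), new String[] {
        "--input", "/tmp/to-mahout-id/projPrefs",   // one action only
        "--output", "/out/primary",
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        // no --maxSimilaritiesPerItem (keep the default of 500),
        // no --minPrefsPerUser, no --booleanData (LLR ignores weights)
        "--tempDir", "/tmp/run-mahout-ItemSimilarityJob/primary"
    });
  }
}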
> 
> Remember that this is not the same as personalized recommendations. This
> method alone will show the same “similar items” for all users.
> 
> Sorry, but both your “recommendation” types sound like the same thing.
> Item page views _and_ clicks on recommended items will both lead
> to an item page view, so you have two actions that lead to the same thing,
> right? Just train on an item page view (unless you really want the user to
> make a purchase).
> 
> What do you mean the similar items are terrible? How are you measuring
> that? Are you doing cross-validation measuring precision or A/B testing?
> What looks bad to you may be good, the eyeball test is not always reliable.
> If they are coming up completely crazy or random then you may have a bug in
> your ID translation logic.
> 
> It sounds like you have enough data to produce good results.
> 
> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too much,
> but enough for a start.
> 2. I run it as oozie action.
> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
>       <java>
>           <job-tracker>${jobTracker}</job-tracker>
>           <name-node>${nameNode}</name-node>
>           <prepare>
>               <delete path="${mahoutOutputDir}/primary" />
>               <delete
> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
>           </prepare>
>           <configuration>
>               <property>
>                   <name>mapred.queue.name</name>
>                   <value>default</value>
>               </property>
> 
>           </configuration>
> 
> 
> <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
>           <arg>--input</arg>
>           <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
> item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on recommendation,
> a kind of try to increase quality of recommender...]-->
> 
>           <arg>--output</arg>
>           <arg>${mahoutOutputDir}/primary</arg>
> 
>           <arg>--similarityClassname</arg>
>           <arg>SIMILARITY_COSINE</arg>
> 
>           <arg>--maxSimilaritiesPerItem</arg>
>           <arg>50000</arg>
> 
>           <arg>--minPrefsPerUser</arg>
>           <arg>0</arg>
> 
>           <arg>--booleanData</arg>
>           <arg>false</arg>
> 
>           <arg>--tempDir</arg>
>           <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> 
>       </java>
>       <ok to="to-narrow-table"/>
>       <error to="kill"/>
>   </action>
> 
> 3) RANK does it, here is a script:
> 
> --user, item, pref previously prepared by hive
> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> (user_id:chararray, item_id:long, pref:double);
> 
> --get distinct user from the whole input
> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> 
> --get distinct item from the whole input
> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> 
> --rank user 1....N
> rankUsers_ = RANK distUserId;
> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> 
> --rank items 1....M
> rankItems_ = RANK distItemId;
> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> 
> --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
> 'skewed';
> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
> item_id using 'replicated';
> 
> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
> as user_id,
>                                        rankItems::rank_id
> as item_id,
>                                        joinedUsers::user_item_pref::pref
> as pref;
> 
> --store mapping for later remapping from RANK back to natural values
> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using
> PigStorage('\t');
> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using
> PigStorage('\t');
> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs'
> using PigStorage('\t');
> 
> 4) I've seen this idea in different discussion, that different weight for
> different actions are not good. Sorry, I don't understand what you do
> suggest.
> I have two kind of actions: user viewed item, user clicked on recommended
> item (recommended item produced by my item similarity system).
> I want to produce two kinds of recommendations:
> 1. current item + recommend other items which other users visit in
> conjuction with current item
> 2. similar item: recommend items similar to current viewed item.
> What can I try?
> LLR=http://en.wikipedia.org/wiki/Log-likelihood_ratio= LOG_LIKEHOOD?
> 
> Right now I do get awful recommendations and I can't understand what can I
> try next :((((((((((((
> 
> 
> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> 
>> 1) how many cores in the cluster? The whole idea behind mapreduce is you
>> buy more cpus you get nearly linear decrease in runtime.
>> 2) what is your mahout command line with options, or how are you invoking
>> mahout. I have seen the Mahout mapreduce recommender take this long so we
>> should check what you are doing with downsampling.
>> 3) do you really need to RANK your ids, that’s a full sort? When using
> pig
>> I usually get DISTINCT ones and assign an incrementing integer as the
>> Mahout ID corresponding
>> 4) your #2 assigning different weights to different actions usually does
>> not work. I’ve done this before and compared offline metrics and seen
>> precision go down. I’d get this working using only your primary actions
>> first. What are you trying to get the user to do? View something, buy
>> something? Use that action as the primary preference and start out with a
>> weight of 1 using LLR. With LLR the weights are not used anyway so your
>> data may not produce good results with mixed actions.
>> 
>> A plug for the (admittedly pre-alpha) spark-itemsimilairty:
>> 1) output from 2 can be directly ingested and will create output.
>> 2) multiple actions can be used with cross-cooccurrence, not by guessing
>> at weights.
>> 3) output has your application specific IDs preserved.
>> 4) its about 10x faster than mapreduce and will do aways with your ID
>> translation steps
>> 
>> One caveat is that your cluster machines will need lots of memory. I have
>> 8-16g on mine.
>> 
>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <se...@gmail.com>
>> wrote:
>> 
>> 1. I do collect preferences for items using 60days sliding window. today
> -
>> 60 days.
>> 2. I do prepare triples user_id, item_id, descrete_pref_value (3 for item
>> view, 5 for clicking recommndation block. The idea is to give more value
>> for recommendations which attact visitor attention). I get ~ 20.000.000
> of
>> lines with ~1.000.000 distinct items and ~2.000.000 distinct users
>> 3. I do use apache pig RANK function to rank all distinct user_id
>> 4. I do the same for item_id
>> 5. I do join input dataset with ranked datasets and provide input to
> mahout
>> with dense interger user_id, item_id
>> 6. I do get mahout output and join integer item_id back to get natural
> key
>> value.
>> 
>> step #1-2 takes ~ 40min
>> step #3-5 takes ~1 hour
>> mahout calc takes ~3hours
>> 
>> 
>> 
>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:
>> 
>>> This really doesn't sound right.  It should be possible to process
>> almost a
>>> thousand times that much data every night without that much problem.
>>> 
>>> How are you preparing the input data?
>>> 
>>> How are you converting to Mahout id's?
>>> 
>>> Even using python, you should be able to do the conversion in just a few
>>> minutes without any parallelism whatsoever.
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <
>> serega.sheypak@gmail.com>
>>> wrote:
>>> 
>>>> Hi, We are trying calculate ItemSimilarity.
>>>> Right now we have 2*10^7 input lines. I do provide input data as raw
>> text
>>>> each day to recalculate item similarities. We do get +100..1000 new
>> items
>>>> each day.
>>>> 1. It takes too much time to prepare input data.
>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
>>>> 
>>>> Is there any poissibility to provide data to mahout mapreduce
>>>> ItemSimilarity using some binary format with compression?
>>>> 
>>> 
>> 
>> 
> 
> 



Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
When beginning to use a recommender from Mahout I always suggest you start from the defaults. These often give the best results—then tune afterwards to improve.

Your intuition is correct that multiple actions can be used to improve results, but get the basics working first. The easiest way to use multiple actions is spark-itemsimilarity, so since you are using mapreduce for now, just use one action.

I would not try to combine the results from two similarity measures; there is no benefit, since LLR is better than any of them (at least I've never seen it lose). In my experience trying many of the similarity metrics on exactly the same data, doing cross-validation with precision (MAP, mean average precision), LLR wins; it has won in the other cases I've tried too. So LLR is the only method presently used in the Spark version of itemsimilarity.



If you still get weird results, double check your ID mapping. Run a small bit of data through and spot check the mapping by hand.

At some point you will want to create a cross-validation test. This is good as a sort of integration sanity check when making changes to the recommender: you run cross-validation on standard test data to see if the score changes drastically between releases, and big changes may indicate a bug. At the beginning it will help you tune, as in the case above where it helped decide on LLR.
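
If you want to roll the MAP@k calculation yourself it only takes a few lines; this is a generic Python sketch (hypothetical names, not Mahout code):

def average_precision(recommended, held_out, k=10):
    # recommended: ranked list of item ids the recommender returned
    # held_out: set of item ids this user actually acted on in the test split
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in held_out:
            hits += 1
            score += hits / rank   # precision at this rank
    return score / min(len(held_out), k) if held_out else 0.0

def mean_average_precision(per_user_results, k=10):
    # per_user_results: list of (recommended_list, held_out_set) pairs
    aps = [average_precision(r, h, k) for r, h in per_user_results]
    return sum(aps) / len(aps) if aps else 0.0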





Re: mapreduce ItemSimilarity input optimization

Posted by Ted Dunning <te...@gmail.com>.
Serega,

Use the LLR similarity.
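
For intuition, LLR here is the G^2 log-likelihood ratio test on the 2x2 cooccurrence counts for an item pair. A minimal Python sketch of the same formula (illustrative only, not Mahout's code):

from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized form used in the G^2 derivation:
    # xlogx(total) minus the sum of xlogx over the individual counts
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # k11: users who touched both items; k12/k21: one item but not the other;
    # k22: users who touched neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

A pair that cooccurs far more than chance predicts, e.g. llr(100, 1000, 1000, 100000), scores much higher than a pair near its expected cooccurrence such as llr(10, 1000, 1000, 100000).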

Make sure you reduce the downsampling setting as Pat F suggested; that
will make a huge difference.

The filtering of low frequency items is already done.

Also, consider delivering your similar items and recommendations via a
search engine.  Search engine deployment really facilitates exploration.
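
To make the search-engine idea concrete, here is a toy in-memory stand-in for the pattern; a real deployment would index each item's LLR-similar items as a field in Solr or Elasticsearch and let the engine do the scoring:

from collections import defaultdict

index = defaultdict(set)  # indicator item -> items whose document contains it

def add_document(item, similar_items):
    # each item's "document" carries its LLR-similar items as indexed terms
    for indicator in similar_items:
        index[indicator].add(item)

def recommend(user_history, n=10):
    # the user's recent items form the query; score candidates by term overlap
    scores = defaultdict(int)
    for term in user_history:
        for candidate in index[term]:
            scores[candidate] += 1
    seen = set(user_history)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [c for c in ranked if c not in seen][:n]

The result is an ordered list of items personalized by the query history, which is roughly what the engine's relevance scoring gives you for free at scale.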





Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Thank you very much. I'll do what you are saying in bullets 1...5 and try
again.

I also tried:
1. calc data using SIMILARITY_COSINE
2. calc the same data using SIMILARITY_COOCCURRENCE
3. join #1 and #2 where cooccurrence >= $threshold

Where the threshold is some empirical integer value; I've used "2". The idea
is to filter out item pairs which almost never occur together...
Please see this link:
http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
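
For concreteness, that join-and-threshold step amounts to something like this Python sketch, with pair-keyed dicts standing in for the real datasets:

def merge_similarities(cosine, cooccurrence, threshold=2):
    # cosine: (item_a, item_b) -> cosine similarity score
    # cooccurrence: (item_a, item_b) -> raw cooccurrence count
    return {pair: score for pair, score in cosine.items()
            if cooccurrence.get(pair, 0) >= threshold}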

If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
approach still make sense, or is it a useless waste of time?

"What do you mean the similar items are terrible? How are you measuring
that? " I have eye testing only,
I did automate preparation->calculation->hbase upload-> web-app serving, I
didn't automate testing.





Re: mapreduce ItemSimilarity input optimization

Posted by Ted Dunning <te...@gmail.com>.
Dang Pat.

Bottle that advice and sell it!  Put it on the web site.  Shout it from the
rooftops.







Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@occamsmachete.com>.
the things that stand out:

1) remove your maxSimilaritiesPerItem option! A maxSimilaritiesPerItem of 50000 will _kill_ performance and give no gain; leave this setting at the default of 500
2) use only one action. What do you want the user to do? Do you want them to read a page? Then train on item page views. If those pages lead to a purchase then you want to recommend purchases, so train on user purchases.
3) remove your minPrefsPerUser option; this should never be 0 or it will leave users in the training data that have no data and may contribute to longer runs with no gain.
4) this is a pretty small Hadoop cluster for the size of your data, but I bet changing #1 will noticeably reduce the runtime
5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
6) remove your --booleanData option since LLR ignores weights.

Remember that this is not the same as personalized recommendations. This method alone will show the same “similar items” for all users.

Sorry, but both your “recommendation” types sound like the same thing. Item page views _and_ clicks on recommended items both lead to an item page view, so you have two actions that lead to the same thing, right? Just train on item page views (unless you really want the user to make a purchase).

What do you mean the similar items are terrible? How are you measuring that? Are you doing cross-validation measuring precision, or A/B testing? What looks bad to you may be good; the eyeball test is not always reliable. If they are coming up completely crazy or random then you may have a bug in your ID translation logic.

It sounds like you have enough data to produce good results.



Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too much,
but enough for a start.
2. I run it as oozie action.
 <action name="run-mahout-primary-similarity-ItemSimilarityJob">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${mahoutOutputDir}/primary" />
                <delete
path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
            </prepare>
            <configuration>
                <property>
                    <name>mapred.queue.name</name>
                    <value>default</value>
                </property>

            </configuration>

            <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
            <arg>--input</arg>
            <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
item_id, pref [can be 3 or 5: 3 is a VIEW of an item, 5 is a CLICK on a
recommendation; an attempt to increase the quality of the recommender...]-->

            <arg>--output</arg>
            <arg>${mahoutOutputDir}/primary</arg>

            <arg>--similarityClassname</arg>
            <arg>SIMILARITY_COSINE</arg>

            <arg>--maxSimilaritiesPerItem</arg>
            <arg>50000</arg>
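            <!-- NB: 50000 is far above ItemSimilarityJob's default of 100,
                 which effectively disables the per-item cap on similarities
                 and is a likely contributor to the long runtime -->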

            <arg>--minPrefsPerUser</arg>
            <arg>0</arg>

            <arg>--booleanData</arg>
            <arg>false</arg>

            <arg>--tempDir</arg>
            <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>

        </java>
        <ok to="to-narrow-table"/>
        <error to="kill"/>
    </action>

3) RANK does it; here is the script:

--user, item, pref previously prepared by Hive
user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
(user_id:chararray, item_id:long, pref:double);

--get distinct user from the whole input
distUserId = distinct(FOREACH user_item_pref GENERATE user_id);

--get distinct item from the whole input
distItemId = distinct(FOREACH user_item_pref GENERATE item_id);

--rank user 1....N
rankUsers_ = RANK distUserId;
rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;

--rank items 1....M
rankItems_ = RANK distItemId;
rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;

--join and remap natural user_id, item_id to RANKS: 1..N, 1..M
joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
'skewed';
joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
item_id using 'replicated';

projPrefs = FOREACH joinedItems GENERATE
    joinedUsers::rankUsers::rank_id    as user_id,
    rankItems::rank_id                 as item_id,
    joinedUsers::user_item_pref::pref  as pref;

--store mapping for later remapping from RANK back to natural values
STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using
PigStorage('\t');
STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using
PigStorage('\t');
STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs'
using PigStorage('\t');

4) I've seen this idea in other discussions, that different weights for
different actions are not good. Sorry, I don't understand what you
suggest.
I have two kinds of actions: user viewed an item, user clicked on a
recommended item (a recommendation produced by my item-similarity system).
I want to produce two kinds of recommendations:
1. current item + recommend other items which other users visit in
conjunction with the current item
2. similar item: recommend items similar to the currently viewed item.
What can I try?
LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = SIMILARITY_LOGLIKELIHOOD?

Right now I get awful recommendations and I can't understand what I can
try next :((((((((((((


2014-08-17 19:02 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> 1) How many cores are in the cluster? The whole idea behind mapreduce is
> that if you buy more CPUs you get a nearly linear decrease in runtime.
> 2) What is your mahout command line with options, or how are you invoking
> mahout? I have seen the Mahout mapreduce recommender take this long, so we
> should check what you are doing with downsampling.
> 3) Do you really need to RANK your ids? That’s a full sort. When using pig
> I usually get DISTINCT ones and assign an incrementing integer as the
> corresponding Mahout ID.
> 4) Your #2, assigning different weights to different actions, usually does
> not work. I’ve done this before, compared offline metrics, and seen
> precision go down. I’d get this working using only your primary action
> first. What are you trying to get the user to do? View something, buy
> something? Use that action as the primary preference and start out with a
> weight of 1 using LLR. With LLR the weights are not used anyway, so your
> data may not produce good results with mixed actions.
>
> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> 1) output from 2 can be directly ingested and will create output.
> 2) multiple actions can be used with cross-cooccurrence, not by guessing
> at weights.
> 3) output has your application-specific IDs preserved.
> 4) it's about 10x faster than mapreduce and will do away with your ID
> translation steps
>
> One caveat is that your cluster machines will need lots of memory. I have
> 8-16g on mine.
>
> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> 1. I collect preferences for items using a 60-day sliding window (today -
> 60 days).
> 2. I prepare triples user_id, item_id, discrete_pref_value (3 for an item
> view, 5 for clicking a recommendation block; the idea is to give more
> weight to recommendations which attract visitor attention). I get
> ~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct
> users.
> 3. I use the Apache Pig RANK function to rank all distinct user_id
> 4. I do the same for item_id
> 5. I join the input dataset with the ranked datasets and provide input to
> Mahout with dense integer user_id, item_id
> 6. I take the Mahout output and join the integer item_id back to get the
> natural key value.
>
> step #1-2 takes ~ 40min
> step #3-5 takes ~1 hour
> mahout calc takes ~3hours
>
>
>
> 2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:
>
> > This really doesn't sound right.  It should be possible to process
> almost a
> > thousand times that much data every night without that much problem.
> >
> > How are you preparing the input data?
> >
> > How are you converting to Mahout id's?
> >
> > Even using python, you should be able to do the conversion in just a few
> > minutes without any parallelism whatsoever.
> >
> >
> >
> >
> > On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> >> Hi, we are trying to calculate ItemSimilarity.
> >> Right now we have 2*10^7 input lines. I do provide input data as raw
> text
> >> each day to recalculate item similarities. We do get +100..1000 new
> items
> >> each day.
> >> 1. It takes too much time to prepare input data.
> >> 2. It takes too much time to convert user_id, item_id to mahout ids
> >>
> >> Is there any possibility to provide data to the Mahout mapreduce
> >> ItemSimilarity using some binary format with compression?
> >>
> >
>
>

Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
1) How many cores are in the cluster? The whole idea behind mapreduce is that if you buy more CPUs you get a nearly linear decrease in runtime.
2) What is your mahout command line with options, or how are you invoking mahout? I have seen the Mahout mapreduce recommender take this long, so we should check what you are doing with downsampling.
3) Do you really need to RANK your ids? That’s a full sort. When using pig I usually get DISTINCT ones and assign an incrementing integer as the corresponding Mahout ID.
4) Your #2, assigning different weights to different actions, usually does not work. I’ve done this before, compared offline metrics, and seen precision go down. I’d get this working using only your primary action first. What are you trying to get the user to do? View something, buy something? Use that action as the primary preference and start out with a weight of 1 using LLR. With LLR the weights are not used anyway, so your data may not produce good results with mixed actions.

A plug for the (admittedly pre-alpha) spark-itemsimilarity:
1) output from 2 can be directly ingested and will create output.
2) multiple actions can be used with cross-cooccurrence, not by guessing at weights. 
3) output has your application-specific IDs preserved.
4) it's about 10x faster than mapreduce and will do away with your ID translation steps

One caveat is that your cluster machines will need lots of memory. I have 8-16g on mine.
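
(For reference: LLR is the log-likelihood ratio test, selected in the Hadoop
job with --similarityClassname SIMILARITY_LOGLIKELIHOOD. Below is a minimal
Python sketch of the score on a 2x2 contingency table of user counts,
mirroring the entropy formulation Mahout's LogLikelihood class uses; since
it consumes only counts, per-event weights like 3 vs 5 never enter the
result.)

from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # "unnormalized" Shannon entropy over raw counts, i.e. N * H(p)
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: users who interacted with both items
    # k12: users with item A only; k21: item B only; k22: neither
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))

# e.g. llr(100, 900, 400, 8600) for 100 co-occurrences among 10,000 users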

On Aug 17, 2014, at 1:26 AM, Serega Sheypak <se...@gmail.com> wrote:

1. I collect preferences for items using a 60-day sliding window (today -
60 days).
2. I prepare triples user_id, item_id, discrete_pref_value (3 for an item
view, 5 for clicking a recommendation block; the idea is to give more
weight to recommendations which attract visitor attention). I get
~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct
users.
3. I use the Apache Pig RANK function to rank all distinct user_id
4. I do the same for item_id
5. I join the input dataset with the ranked datasets and provide input to
Mahout with dense integer user_id, item_id
6. I take the Mahout output and join the integer item_id back to get the
natural key value.

step #1-2 takes ~ 40min
step #3-5 takes ~1 hour
mahout calc takes ~3hours



2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:

> This really doesn't sound right.  It should be possible to process almost a
> thousand times that much data every night without that much problem.
> 
> How are you preparing the input data?
> 
> How are you converting to Mahout id's?
> 
> Even using python, you should be able to do the conversion in just a few
> minutes without any parallelism whatsoever.
> 
> 
> 
> 
> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
>> Hi, we are trying to calculate ItemSimilarity.
>> Right now we have 2*10^7 input lines. I do provide input data as raw text
>> each day to recalculate item similarities. We do get +100..1000 new items
>> each day.
>> 1. It takes too much time to prepare input data.
>> 2. It takes too much time to convert user_id, item_id to mahout ids
>> 
>> Is there any possibility to provide data to the Mahout mapreduce
>> ItemSimilarity using some binary format with compression?
>> 
> 


Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
1. I collect preferences for items using a 60-day sliding window (today -
60 days).
2. I prepare triples user_id, item_id, discrete_pref_value (3 for an item
view, 5 for clicking a recommendation block; the idea is to give more
weight to recommendations which attract visitor attention). I get
~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct
users.
3. I use the Apache Pig RANK function to rank all distinct user_id
4. I do the same for item_id
5. I join the input dataset with the ranked datasets and provide input to
Mahout with dense integer user_id, item_id
6. I take the Mahout output and join the integer item_id back to get the
natural key value.

step #1-2 takes ~ 40min
step #3-5 takes ~1 hour
mahout calc takes ~3hours
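
(For step 6, a minimal single-process Python sketch — the file names here
are hypothetical — that joins Mahout's integer item ids back to natural
keys using the rankItems mapping stored by the Pig script:)

import csv

# rank_items.tsv: "rank_id <TAB> natural_item_id", as stored by the Pig script
with open("rank_items.tsv") as f:
    natural = dict(csv.reader(f, delimiter="\t"))

# Mahout ItemSimilarityJob output lines: "item_a <TAB> item_b <TAB> similarity"
with open("mahout_output.tsv") as src, open("similar_items.tsv", "w") as dst:
    out = csv.writer(dst, delimiter="\t")
    for item_a, item_b, score in csv.reader(src, delimiter="\t"):
        out.writerow([natural[item_a], natural[item_b], score])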



2014-08-17 10:45 GMT+04:00 Ted Dunning <te...@gmail.com>:

> This really doesn't sound right.  It should be possible to process almost a
> thousand times that much data every night without that much problem.
>
> How are you preparing the input data?
>
> How are you converting to Mahout id's?
>
> Even using python, you should be able to do the conversion in just a few
> minutes without any parallelism whatsoever.
>
>
>
>
> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> > Hi, we are trying to calculate ItemSimilarity.
> > Right now we have 2*10^7 input lines. I do provide input data as raw text
> > each day to recalculate item similarities. We do get +100..1000 new items
> > each day.
> > 1. It takes too much time to prepare input data.
> > 2. It takes too much time to convert user_id, item_id to mahout ids
> >
> > Is there any possibility to provide data to the Mahout mapreduce
> > ItemSimilarity using some binary format with compression?
> >
>

Re: mapreduce ItemSimilarity input optimization

Posted by Ted Dunning <te...@gmail.com>.
This really doesn't sound right.  It should be possible to process almost a
thousand times that much data every night without that much problem.

How are you preparing the input data?

How are you converting to Mahout id's?

Even using python, you should be able to do the conversion in just a few
minutes without any parallelism whatsoever.
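
(A minimal sketch of that conversion in Python — file names are
hypothetical, and the id tables are plain in-memory dicts:)

import csv

user_ids, item_ids = {}, {}

def dense_id(table, key):
    # assign the next integer the first time a key is seen
    return table.setdefault(key, len(table))

with open("user_item_pref.csv") as src, open("mahout_input.tsv", "w") as dst:
    out = csv.writer(dst, delimiter="\t")
    for user, item, pref in csv.reader(src):
        out.writerow([dense_id(user_ids, user), dense_id(item_ids, item), pref])

# keep the reverse mapping so Mahout's output can be translated back
with open("item_id_map.tsv", "w") as f:
    csv.writer(f, delimiter="\t").writerows((v, k) for k, v in item_ids.items())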




On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <se...@gmail.com>
wrote:

> Hi, we are trying to calculate ItemSimilarity.
> Right now we have 2*10^7 input lines. I do provide input data as raw text
> each day to recalculate item similarities. We do get +100..1000 new items
> each day.
> 1. It takes too much time to prepare input data.
> 2. It takes too much time to convert user_id, item_id to mahout ids
>
> Is there any possibility to provide data to the Mahout mapreduce
> ItemSimilarity using some binary format with compression?
>

Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@gmail.com>.
Hadoop supports automatic compression/decompression of text files, so that shouldn’t be a problem. At some point the data is almost always text; our goal is to take data in that form and make as much data prep as possible unnecessary. 

I don’t know what is in Cloudera 4.7, but Mahout will work with it out of the box. You’ll have to ask Cloudera for other specifics. Mahout 0.9 pre-built artifacts are all you need to use sequence file input with RSJ (RowSimilarityJob).

The spark-itemsimilarity is only in the snapshot of Mahout 1.0, so you’ll need to download and build it. It requires Spark as well as Hadoop, so you’ll need to see if that is installed. If you want to write your own wrappers for CooccurrenceAnalysis.Cooccurrence you can implement whatever format you want, but you’ll have to do your own ID translation. Cooccurrence at that level of the pipeline requires Mahout IDs.

Remember that most work is done in Spark with RDDs, which do away with the need for intermediate files. You will generally only use them to store results and input. Think of them as import/export files, not working data. 

On Aug 16, 2014, at 10:32 AM, Serega Sheypak <se...@gmail.com> wrote:

Hi, I'm sitting on Cloudera 4.7; does it work out of the box?
Right now I expect a simple interface from Mahout: user_id, item_id, pref.
I expect support for seq file / Avro. Really, it's impossible to work
with TDF. Too much data... ^(





2014-08-16 20:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:

> The Spark version “spark-itemsimilarity” uses _your_ IDs. It is ready to
> try and I’d love it if you could. The IDs are kept in a HashBiMap in memory
> on each cluster machine and so it's memory limited to the size of the
> dictionary but in practice that will probably work for many (most)
> applications. This conversion of your ID into Mahout ID is done in the job
> and in parallel so it's about as fast as can be though we may be able to
> optimize the memory footprint in time.
> 
> Run “mahout spark-itemsimilarity” to get a full list of options. You can
> specify some form of text-delimited format for input—the default uses [\t,
> ] for the delimiter and expects (userID,itemID,ignored-text) but you can
> specify which column in the TDF contains which ID and even use filters to
> capture only the lines with data if you are using log files.
> 
> I’ll see if I can get a doc up on the mahout site to explain it a bit
> better.
> 
> As to providing input to Mahout in binary form, the Hadoop version of
> “rowsimilarity” takes a DRM sequence file. This would be a row per user
> containing a Mahout userID and Mahout SparseVector of the item
> interactions. You will still have to convert IDs though.
> 
> On Aug 16, 2014, at 5:10 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
> Hi, we are trying to calculate ItemSimilarity.
> Right now we have 2*10^7 input lines. I do provide input data as raw text
> each day to recalculate item similarities. We do get +100..1000 new items
> each day.
> 1. It takes too much time to prepare input data.
> 2. It takes too much time to convert user_id, item_id to mahout ids
> 
> Is there any possibility to provide data to the Mahout mapreduce
> ItemSimilarity using some binary format with compression?
> 
> 


Re: mapreduce ItemSimilarity input optimization

Posted by Serega Sheypak <se...@gmail.com>.
Hi, I'm sitting on Cloudera 4.7; does it work out of the box?
Right now I expect a simple interface from Mahout: user_id, item_id, pref.
I expect support for seq file / Avro. Really, it's impossible to work
with TDF. Too much data... ^(





2014-08-16 20:16 GMT+04:00 Pat Ferrel <pa...@occamsmachete.com>:

> The Spark version “spark-itemsimilarity” uses _your_ IDs. It is ready to
> try and I’d love it if you could. The IDs are kept in a HashBiMap in memory
> on each cluster machine and so it's memory limited to the size of the
> dictionary but in practice that will probably work for many (most)
> applications. This conversion of your ID into Mahout ID is done in the job
> and in parallel so it's about as fast as can be though we may be able to
> optimize the memory footprint in time.
>
> Run “mahout spark-itemsimilarity” to get a full list of options. You can
> specify some form of text-delimited format for input—the default uses [\t,
> ] for the delimiter and expects (userID,itemID,ignored-text) but you can
> specify which column in the TDF contains which ID and even use filters to
> capture only the lines with data if you are using log files.
>
> I’ll see if I can get a doc up on the mahout site to explain it a bit
> better.
>
> As to providing input to Mahout in binary form, the Hadoop version of
> “rowsimilarity” takes a DRM sequence file. This would be a row per user
> containing a Mahout userID and Mahout SparseVector of the item
> interactions. You will still have to convert IDs though.
>
> On Aug 16, 2014, at 5:10 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> Hi, we are trying to calculate ItemSimilarity.
> Right now we have 2*10^7 input lines. I do provide input data as raw text
> each day to recalculate item similarities. We do get +100..1000 new items
> each day.
> 1. It takes too much time to prepare input data.
> 2. It takes too much time to convert user_id, item_id to mahout ids
>
> Is there any possibility to provide data to the Mahout mapreduce
> ItemSimilarity using some binary format with compression?
>
>

Re: mapreduce ItemSimilarity input optimization

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The Spark version “spark-itemsimilarity” uses _your_ IDs. It is ready to try and I’d love it if you could. The IDs are kept in a HashBiMap in memory on each cluster machine and so it's memory limited to the size of the dictionary but in practice that will probably work for many (most) applications. This conversion of your ID into Mahout ID is done in the job and in parallel so it's about as fast as can be though we may be able to optimize the memory footprint in time.

Run “mahout spark-itemsimilarity” to get a full list of options. You can specify some form of text-delimited format for input—the default uses [\t, ] for the delimiter and expects (userID,itemID,ignored-text) but you can specify which column in the TDF contains which ID and even use filters to capture only the lines with data if you are using log files.

I’ll see if I can get a doc up on the mahout site to explain it a bit better.

As to providing input to Mahout in binary form, the Hadoop version of “rowsimilarity” takes a DRM sequence file. This would be a row per user containing a Mahout userID and Mahout SparseVector of the item interactions. You will still have to convert IDs though.
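
(For instance, with the defaults a hypothetical input file would just be
tab-delimited lines such as

u10355	iphone-5s	view
u10355	ipad-air	view

with the application's own user ID in the first column, its item ID in the
second, and anything after that ignored.)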

On Aug 16, 2014, at 5:10 AM, Serega Sheypak <se...@gmail.com> wrote:

Hi, we are trying to calculate ItemSimilarity.
Right now we have 2*10^7 input lines. I do provide input data as raw text
each day to recalculate item similarities. We do get +100..1000 new items
each day.
1. It takes too much time to prepare input data.
2. It takes too much time to convert user_id, item_id to mahout ids

Is there any possibility to provide data to the Mahout mapreduce
ItemSimilarity using some binary format with compression?