Posted to user@mahout.apache.org by Serega Sheypak <se...@gmail.com> on 2014/07/20 20:57:35 UTC

recommenditembased returns 0 records from last map-reduce job

Hi, I'm trying to compute item similarity.
I gather the items which users visit during shopping and then create a file:
user_id, item_id, weight (where weight is one of [1.0, 1.6, 1.9], depending on
user action type and data source)
UNION
-item_id, item_id, 1.0 (from the items dictionary)

I also provide a usersFile, where user_id = -item_id.
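
The input layout described above could be assembled roughly as follows. This is only a sketch: the function names and the sample values are made up for illustration, and it assumes item ids are strictly positive so that a negated item id never collides with a real user id.

```python
def build_input_rows(visits, item_ids):
    """Return (user_id, item_id, weight) rows: the real visits plus one
    synthetic row per item whose user_id is the negated item_id."""
    rows = [(user_id, item_id, weight) for user_id, item_id, weight in visits]
    # The "UNION" part: -item_id, item_id, 1.0 for every item in the dictionary.
    rows += [(-item_id, item_id, 1.0) for item_id in sorted(item_ids)]
    return rows


def users_file_ids(item_ids):
    """Ids to list in the users file: one negated id per item."""
    return [-item_id for item_id in sorted(item_ids)]


# Hypothetical sample data, not taken from the real data set.
visits = [(11, 1, 1.6), (11, 2, 1.6), (333, 1, 1.9)]
rows = build_input_rows(visits, {1, 2})
for user_id, item_id, weight in rows:
    print(f"{user_id},{item_id},{weight}")
```

Each synthetic "user" has exactly one preference, which is worth keeping in mind when reading the counters further down.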

The idea is to get item similarity. If any user visits an item named "A", I want
to show him items "B", "c", "xxx" using the preferences of other users.

The problem is that the last (???) MapReduce job returns 0 rows.

Here are my settings:


sudo -u oozie mahout recommenditembased \
                    --input visited_items_with_inverted_items \
                    --output result \
                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
                    --usersFile inverted_items \
                    --numRecommendations 500 \
                    --booleanData false \
                    --maxPrefsPerUser 100 \
                    --maxSimilaritiesPerItem 500 \
                    --minPrefsPerUser 0 \
                    --maxPrefsPerUserInItemSimilarity 30 \
                    --threshold 0.91 \
                    --tempDir temp

Some counters... I don't understand what they mean:

14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693

14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879

14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879

14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
--------
14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
--------

why 0???
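
To find the stage where records vanish without reading the whole log by eye, the record counters can be pulled out with a small script. This is a minimal sketch that assumes the JobClient line format shown above; the regular expression and function names are my own, and grouping by timestamp is only a rough proxy for "one job".

```python
import re

# Matches lines such as:
# 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
RECORD_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO mapred\.JobClient:\s+"
    r"(?P<name>(?:Map|Reduce) (?:input|output) records)=(?P<value>\d+)\s*$"
)


def parse_record_counters(log_text):
    """Return a list of (timestamp, counter_name, value) tuples."""
    counters = []
    for line in log_text.splitlines():
        match = RECORD_LINE.match(line.strip())
        if match:
            counters.append(
                (match.group("ts"), match.group("name"), int(match.group("value")))
            )
    return counters


def empty_reduce_outputs(counters):
    """Timestamps of every 'Reduce output records' counter equal to zero."""
    return [
        ts
        for ts, name, value in counters
        if name == "Reduce output records" and value == 0
    ]


sample = """\
14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
"""
counters = parse_record_counters(sample)
print(empty_reduce_outputs(counters))
```

On the excerpt above this points at the 22:48:26 job: its reducer receives 3313251 records and emits none, so the loss happens inside that reducer rather than in an earlier stage.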

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Thank you for your input.


2014-07-21 12:00 GMT+04:00 Peng Zhang <pz...@gmail.com>:

> My personal comments:
> 1. Data cleansing. One beautiful characteristic of Mahout’s CF
> recommendation is the simplicity of input data, often times just three
> columns (user, item, preference). If any value is missing, just don’t put
> the record in the input file. Therefore I don’t see there is any need to do
> data cleaning given that the application has recorded user-item-preference
> correctly and you have translated user-id and item-id properly.
> 2. Oftentimes Loglikelihood has a better performance than
> PearsonCorrelation in Mahout’s Collaborative Filtering. The former is
> focused on discrete values and the latter is focused on continuous values.
> Refer to Ted’s popular post Surprise and Coincidence about the former.

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Peng Zhang <pz...@gmail.com>.

My personal comments:
1. Data cleansing. One beautiful characteristic of Mahout’s CF recommendation is the simplicity of its input data, often just three columns (user, item, preference). If any value is missing, just don’t put the record in the input file. Therefore I don’t see any need for data cleaning, provided that the application has recorded user-item-preference correctly and you have translated user-id and item-id properly.
2. LogLikelihood often performs better than PearsonCorrelation in Mahout’s collaborative filtering. The former is geared toward discrete values and the latter toward continuous values. For the former, refer to Ted’s popular post "Surprise and Coincidence".
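
To give some intuition for point 2, here is a small sketch of the log-likelihood ratio test from "Surprise and Coincidence", computed on a 2x2 table of co-occurrence counts. It works purely on counts (who interacted with what), which is why it suits discrete data. The function names are mine, and the final squashing of the ratio into [0, 1) is an assumption about how the score is turned into a similarity, not something stated in this thread.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: users who interacted with both A and B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def llr_similarity(k11, k12, k21, k22):
    """Assumed mapping of the ratio into [0, 1): 1 - 1 / (1 + LLR)."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))
```

A table in which A and B are independent, such as (1, 1, 1, 1), scores 0, while a table where they always appear together scores higher as the counts grow. Under the assumed mapping, a similarity threshold of 0.91 corresponds to a ratio of roughly 10.1, which is a fairly strict cutoff.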


Peng Zhang
pzhang.xjtu@gmail.com





On Jul 21, 2014, at 3:37 PM, Serega Sheypak <se...@gmail.com> wrote:

> Thanks! I'll report this evening.
> 
> Are there any articles about data preparation for mahout item
> recommendation? There are many books but most of them are copy-paste of
> javadoc and guides from mahout site.
> I'm -1 at math, my challenges are:
> 
> 1. approaches for data cleaning, do I have to apply dead-simple statisical
> rules?
> "The empirical rule also states that approximately 95 percent of the data
> values will fall within two standard deviations from the mean."
> So If my user visits are described as normal distirbution Does it make
> sense? The idea is to put away all noise.
> 
> 2. similarityClassname - don't have any intuition here... I see that people
> use SIMILARITY_LOGLIKELIHOOD and PEARSON
> 

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Thanks! I'll report back this evening.

Are there any articles about data preparation for Mahout item
recommendation? There are many books, but most of them are copy-paste of
the javadoc and guides from the Mahout site.
I'm -1 at math; my challenges are:

1. Approaches to data cleaning: do I have to apply dead-simple statistical
rules?
"The empirical rule also states that approximately 95 percent of the data
values will fall within two standard deviations from the mean."
So if my user visits follow a normal distribution, does it make
sense to apply this rule? The idea is to filter out all the noise.

2. similarityClassname: I don't have any intuition here... I see that people
use SIMILARITY_LOGLIKELIHOOD and PEARSON.
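
For reference, the two-standard-deviations rule from question 1 would look roughly like this if applied to per-user visit counts. This is only a sketch with made-up numbers and a hypothetical function name; the 95 percent figure holds only when the data is approximately normal, which visit counts often are not.

```python
import statistics


def within_two_sigma(values):
    """Keep only the values within two standard deviations of the mean."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) <= 2 * sigma]


# Hypothetical visits-per-user counts with one extreme value.
visits_per_user = [10, 12, 11, 9, 10, 11, 200]
print(within_two_sigma(visits_per_user))
```

Note that a single extreme value inflates both the mean and the standard deviation, so on heavily skewed data this filter can be much weaker than the quoted rule suggests.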


2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:

> Serega,
>
> See the last line on how to pass outputPathForSimilarityMatrix options to
> the recommenditembased command:
>
> sudo -u oozie mahout recommenditembased \
>                    --input visited_items_with_inverted_items \
>                    --output result \
>                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>                    --usersFile inverted_items \
>                    --numRecommendations 500 \
>                    --booleanData false \
>                    --maxPrefsPerUser 100 \
>                    --maxSimilaritiesPerItem 500 \
>                    --minPrefsPerUser 0 \
>                    --maxPrefsPerUserInItemSimilarity 30 \
>                    --threshold 0.91 \
>                    --tempDir temp \
>                    --outputPathForSimilarityMatrix similarityMatri
>
>
> Peng Zhang
> pzhang.xjtu@gmail.com
>
>
>
>
>
> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> > I've inspected the code; our approach wouldn't work with booleanData=false.
> > We calculate item similarity in the wrong way...(((
> > Thank you
> > 1. We provide a "fake" user_id and --usersFile in order to get
> > recommendations for the "fake" user_id, where user_id is a negative item_id.
> > It worked when we provided user_id->item_id pairs without preference.
> > 2. Our target is to get item similarities. We tried
> > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
> > returns bad results compared to RecommenderJob with our "fake" user_id
> > (inverted item_id)
> >
> > 1. I'll try the option you provided.
> > 2. I will remove the input with fake user_id and the usersFile with these fake ids
> >
> > 3.
> > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> > I don't understand how to pass the --outputPathForSimilarityMatrix option to
> > RecommenderJob
> >
> >
> > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> >
> >> Serega,
> >>
> >> I have two comments:
> >> 1. Don’t use negative user ids. Since Mahout uses user id as well as item
> >> id as the row/column index, you’d better use 0, 1, 2, etc. as ids
> >> 2. If you want to get the item similarity information, you can use
> >> --outputPathForSimilarityMatrix in the command
> >>
> >> Regards,
> >> Peng Zhang
> >> M: +86 186-1658-7856
> >> pzhang.xjtu@gmail.com
> >>
> >>
> >>
> >>
> >>
> >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <se...@gmail.com>
> >> wrote:
> >>
> >>> All bad things happen here:
> >>>
> >>>
> >>>
> >>> Name
> >>>
> >>> RecommenderJob-PartialMultiplyMapper-Reducer
> >>>
> >>> User
> >>>
> >>> oozie
> >>>
> >>> Process User
> >>>
> >>> oozie
> >>>
> >>> Group
> >>>
> >>> oozie
> >>>
> >>> Mapper Class
> >>>
> >>> PartialMultiplyMapper
> >>>
> >>> Reducer Class
> >>>
> >>> AggregateAndRecommendReducer
> >>>
> >>>
> >>> Job Input Directory
> >>>
> >>> hdfs://nameservice1/itemrec/temp/partialMultiply
> >>>
> >>> Job Output Directory
> >>>
> >>> hdfs://nameservice1/itemrec/output/
> >>>
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> >>>
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> >>>
> >>>
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> >>>
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> >>>
> >>> Why does Mahout return 0 rows? It works when booleanData=true (preferences
> >>> are ignored...?)
> >>>
> >>>
> >>>
> >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <se...@gmail.com>:
> >>>
> >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> >>>> users_file:
> >>>> --inverted_item_id
> >>>> -1
> >>>> -2
> >>>> -3
> >>>> -4
> >>>>
> >>>> users_items_prefs
> >>>> --inverted item_id
> >>>> -1 1 1.0
> >>>> -2 2 1.0
> >>>> -3 3 1.0
> >>>> -4 4 1.0
> >>>> --user_id item_id pref_value
> >>>> 11   1 1.6
> >>>> 11   2 1.6
> >>>> 123 3 2.0
> >>>> 123 4 2.0
> >>>> 333 1 2.0
> >>>> 333 2 1.6
> >>>> --e.t.c.
> >>>>
> >>>> if I set --booleanData true
> >>>> then mahout returns the result.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
> >>>>
> >>>>> I'm confused about how you're constructing the user file, and why there
> >>>>> are negated item ids here.
> >>>>>
> >>>>> Can you post some more details please, including Mahout version and some
> >>>>> sample data sets?
> >>>>>
> >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi, I'm trying to create item similarity.
> >>>>>> I gather items which users visit during shopping and then create a file:
> >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
> >>>>>> user action type and data source)
> >>>>>> UNION
> >>>>>> -item_id, item_id, 1.0 (from items dictionary)
> >>>>>>
> >>>>>> and I do provide a userFile, where user_id = -item_id
> >>>>>>
> >>>>>> The idea is to get item similarity. If any user visits item named "A", I
> >>>>>> want to show him items "B", "c", "xxx" using preferences of other users.
> >>>>>>
> >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> >>>>>>
> >>>>>> Here are my settings:
> >>>>>>
> >>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>                  --input visited_items_with_inverted_items \
> >>>>>>                  --output result \
> >>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>                  --usersFile inverted_items \
> >>>>>>                  --numRecommendations 500 \
> >>>>>>                  --booleanData false \
> >>>>>>                  --maxPrefsPerUser 100 \
> >>>>>>                  --maxSimilaritiesPerItem 500 \
> >>>>>>                  --minPrefsPerUser 0 \
> >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>                  --threshold 0.91 \
> >>>>>>                  --tempDir temp
> >>>>>>
> >>>>>> Some counters... I don't get what they mean....
> >>>>>>
> >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
> >>>>>>   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
> >>>>>>   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> >>>>>>
> >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
> >>>>>>   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
> >>>>>>   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>> --------
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> >>>>>> --------
> >>>>>>
> >>>>>> why 0???

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Hi, nothing helps. I remapped the natural user_id values to sequential keys
(1, 2, 3, 4, ...) and did the same for item ids.
The result is the same, as if I hadn't done any mapping.
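
A quick way to confirm that a remapped preference file follows the ID rule discussed in this thread (non-negative integers running from 0 to count - 1) is to scan it before running the job. The sketch below is illustrative only: `check_mahout_ids` is a hypothetical helper, not part of Mahout, and it assumes non-empty input made of comma-separated `user,item,pref` lines.

```python
def check_mahout_ids(lines):
    """Summarise the user and item IDs found in 'user,item,pref' lines.

    For each column, report the number of unique IDs, the smallest and
    largest ID, and whether the IDs are exactly 0..count-1 (the rule
    quoted in this thread). Hypothetical helper, not a Mahout API.
    """
    users, items = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, _pref = line.split(",")
        users.add(int(user))
        items.add(int(item))

    def summary(ids):
        return {
            "count": len(ids),
            "min": min(ids),
            "max": max(ids),
            "dense_from_zero": min(ids) == 0 and max(ids) == len(ids) - 1,
        }

    return {"users": summary(users), "items": summary(items)}


# Made-up sample: IDs start at 1, like the remapping described above.
sample = ["1,1,1.0", "2,1,2.0", "3,2,1.0"]
report = check_mahout_ids(sample)
print(report["users"])
print(report["items"])
```

With this sample, both columns are reported as not dense from zero because the IDs start at 1. Whether that matters for a given job should be verified against the Mahout version in use.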


Command line arguments: {
--booleanData=[false],
--endPhase=[2147483647],
--input=[projPrefs],
--maxPrefs=[500],
--maxSimilaritiesPerItem=[100000],
--minPrefsPerUser=[0],
--output=[output],
--similarityClassname=[SIMILARITY_PEARSON_CORRELATION],
--startPhase=[0],
--tempDir=[temp],
--threshold=[0.91]}

USERS=4056935
NEGLECTED_OBSERVATIONS=1211304
ROWS=779547
USED_OBSERVATIONS=9369782

COOCCURRENCES=12326601
PRUNED_COOCCURRENCES=90241722 (*??? why so many ???*)

And on the last map-reduce job:

Map input records=689597
Map output records=3436

Reduce input records=3436
Reduce output records=*1718*
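
One possible reading of the large PRUNED_COOCCURRENCES count is that --threshold=0.91 is compared against the similarity score of each item pair rather than against preference values. That is an assumption here, not something confirmed in this thread. Under that assumption, the sketch below uses a plain textbook Pearson correlation (not Mahout's implementation) on made-up rating vectors to show how a high threshold discards most pairs, and how a threshold above 1.0 discards all of them. `pearson` and `keep_pairs` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Textbook Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined correlation, treated as "no similarity"
    return cov / sqrt(var_x * var_y)


def keep_pairs(similarities, threshold):
    """Keep only pairs whose score reaches the threshold.

    Assumption: the cut-off is applied to similarity scores with >=.
    """
    return {pair: s for pair, s in similarities.items() if s >= threshold}


# Made-up item vectors (one rating per user), values 1.0 and 2.0 only.
ratings = {
    "A": [1.0, 2.0, 2.0, 1.0],
    "B": [1.0, 2.0, 2.0, 1.0],
    "C": [1.0, 2.0, 1.0, 2.0],
}

similarities = {
    (i, j): pearson(ratings[i], ratings[j])
    for i, j in combinations(sorted(ratings), 2)
}

print(len(similarities))                    # all candidate pairs
print(len(keep_pairs(similarities, 0.91)))  # pairs surviving a 0.91 cut-off
print(len(keep_pairs(similarities, 1.1)))   # Pearson never exceeds 1.0
```

If the assumption holds, a threshold such as 1.1 or 1.2 can never be met by a measure bounded by 1.0, which is one more reason to leave the option unset while debugging.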




2014-07-27 15:29 GMT+04:00 Serega Sheypak <se...@gmail.com>:

> Thank you! I could have spent my whole life trying to get a result without
> knowing the requirements for input data.
>
> BTW:
> we used mahout 0.7-cdh-4.4...cdh
> 4.7 org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> and did get results close to reality. We just provided long user_id,
> item_id and didn't do anything special.
> Why did it work?

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Thank you! I could have spent my whole life trying to get a result without
knowing the requirements for input data.

BTW:
we used mahout 0.7-cdh-4.4...cdh
4.7 org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
and did get results close to reality. We just provided long user_id,
item_id and didn't do anything special.
Why did it work?
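
The input requirement mentioned above, as explained in the quoted reply below, amounts to keeping a two-way dictionary between application IDs and sequential Mahout IDs. A minimal sketch, assuming plain in-memory dictionaries are enough: `IdDictionary` is a hypothetical class, standing in for something like the Guava HashBiMap mentioned in the thread, and the rows are the sample rows from the quoted example.

```python
class IdDictionary:
    """Two-way mapping between application IDs and sequential IDs from 0."""

    def __init__(self):
        self._to_mahout = {}  # application ID -> Mahout ID
        self._to_app = []     # Mahout ID (list index) -> application ID

    def mahout_id(self, app_id):
        """Return the Mahout ID, assigning the next free one on first sight."""
        if app_id not in self._to_mahout:
            self._to_mahout[app_id] = len(self._to_app)
            self._to_app.append(app_id)
        return self._to_mahout[app_id]

    def app_id(self, mahout_id):
        """Translate a Mahout ID from the job output back to the original ID."""
        return self._to_app[mahout_id]


users, items = IdDictionary(), IdDictionary()

# Sample rows from the quoted example: (user ID, item ID, preference).
raw = [("-92", "abc", 1.0), ("75000x", "jkl", 2.0)]

mahout_input = [
    f"{users.mahout_id(u)},{items.mahout_id(i)},{p}" for u, i, p in raw
]
print(mahout_input)     # ['0,0,1.0', '1,1,2.0']
print(users.app_id(0))  # -92
```

Both dictionaries have to be persisted alongside the job input, since the job output can only be translated back with the same mapping.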


2014-07-27 5:18 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> Both those jobs require that you create Mahout IDs for users and items. For
> most Hadoop based Mahout jobs, taking either text input or sequence files,
> the IDs must follow the rules mentioned below. There are a few exceptions
> but none you are using. The Wiki was rewritten for 0.9 and so the ID
> requirements may not be documented well. You can file a Jira so someone
> documents this.
>
> BTW spark-itemsimilarity will take any IDs and can read any text-delimited
> file format. Unfortunately, it’s not quite ready yet.
>
> On Jul 26, 2014, at 3:14 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> Hm... rather confusing... You are talking about input for:
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> or
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
>
> My goal is to get item-item similarity. Right now ItemSimilarityJob
> returns only a few similarities.
>
> I'm reading this:
> https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html
> and this:
> https://mahout.apache.org/users/recommender/userbased-5-minutes.html
>
> I don't see anything there about "Your IDs must be in the range from 0 to
> the number of rows" for both items and users. Where does this requirement
> come from?
>
>
> 2014-07-25 23:57 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>
> > I think I did explain below. Your IDs must be in the range from 0 to the
> > number of rows - 1 and the same for item IDs. This is done by taking your
> > application specific IDs and mapping them to sequential non-negative
> > Integers. You need to maintain a mapping to/from Mahout IDs somewhere in
> > your own code.
> >
> > For example imagine input of the form
> > -92, abc, 1.0
> > 75000x, jkl, 2.0
> >
> > Your first user ID is -92, give it Mahout ID = 0. For your next user ID
> > 75000x give it Mahout ID = 1
> > Your first item ID is abc, give it Mahout ID = 0. For your next item ID
> > jkl give it Mahout ID = 1
> > keep doing this the first time you see a unique id from your input. A Map
> > will do this for you.
> >
> > And so on. Then the input to Mahout would be:
> > 0,0,1.0
> > 1,1,2.0
> >
> > The output will have Mahout IDs too so you need to map recommendations for
> > Mahout User ID 0 back to your User ID of -92, and the same for all item IDs.
> >
> >
> > On Jul 25, 2014, at 11:55 AM, Serega Sheypak <se...@gmail.com>
> > wrote:
> >
> > I'm preparing data using Apache Hive: user_id:long, item_id:long,
> > preference[1.0, 2.0]
> > I don't understand "For most Mahout jobs you have to prepare your data to
> > have Mahout IDs". What are "Mahout IDs"? I tried to follow the Mahout site
> > docs and didn't find anything there related to Mahout IDs.
> > Please explain.
> >
> >
> > 2014-07-25 22:39 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> >
> >> Sorry I haven’t read this thread carefully but it looks like you may be
> >> using the wrong IDs.
> >>
> >> For most Mahout jobs you have to prepare your data to have Mahout IDs. You
> >> do this by looking at each datum and as you see a new unique application
> >> specific user or item ID you give it a Mahout ID starting from 0. So Mahout
> >> IDs can be thought of as row and column numbers in a matrix. The Mahout IDs
> >> for rows will be 0 thru # of rows-1, same for columns.
> >>
> >> This always requires that you translate into Mahout IDs then after the job
> >> is run translate back into your application IDs. You need a bi-directional
> >> dictionary of some type. I use a HashBiMap from Guava.
> >>
> >> Also I’d avoid the threshold for now. If you get that wrong it will mess
> >> things up badly and is very hard to tune. It’s there for completeness but I
> >> never use it.
> >>
> >>
> >> On Jul 25, 2014, at 12:55 AM, Serega Sheypak <se...@gmail.com>
> >> wrote:
> >>
> >> Hi, nothing helps...
> >> I do use mahout 0.9 compiled for CDH 4.7
> >> I do provide only positive values
> >> I do use itemsimilarityJob and do get 2000 similarities for 1400 unique
> >> items
> >> Input data is:
> >> 16*10^6 preferences
> >> 4*10^6 users
> >> 0.6*10^ items
> >> I do use Pearson correlation and preference values are: 1.0 and 2.0
> >>
> >>
> >> 2014-07-22 9:32 GMT+04:00 Serega Sheypak <se...@gmail.com>:
> >>
> >>> Ok, I have recompiled mahout 0.9 for CDH 4.7. I'll try this evening.
> >>> Right now I don't see how it can help me. As far as I know the stuff I'm
> >>> trying to use is pretty old and stable.
> >>> Looks like I'm applying it in the wrong way.
> >>>
> >>> There is an option for recommenditembased named "--threshold". I do
> >>> provide data for recommenditembased with preference values in range
> >>> [1.1..2.0].
> >>> I set --threshold to 1.2
> >>> Is --threshold absolute, in the range [1.1 .. 2+], or is it relative,
> >>> in the range [0.0 .. 0.99999]?
> >>>
> >>>
> >>> 2014-07-22 3:54 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >>>
> >>> That version is no longer supported.  You should upgrade to 0.9
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <
> >>>> serega.sheypak@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> 0.7-cdh4.7.0
> >>>>> Anyway, recommenditembased does produce these catalogs:
> >>>>>
> >>>>> /recommenditembased/temp/maxValues.bin
> >>>>> /recommenditembased/temp/norms.bin
> >>>>> /recommenditembased/temp/numNonZeroEntries.bin
> >>>>> /recommenditembased/temp/pairwiseSimilarity
> >>>>> /recommenditembased/temp/partialMultiply
> >>>>> /recommenditembased/temp/prePartialMultiply1
> >>>>> /recommenditembased/temp/prePartialMultiply2
> >>>>> /recommenditembased/temp/preparePreferenceMatrix
> >>>>> /recommenditembased/temp/similarityMatrix
> >>>>> /recommenditembased/temp/weights
> >>>>>
> >>>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
> >>>>> need. Right now I try to read it using
> >>>>>
> >>>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
> >>>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
> >>>>>  '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
> >>>>>  '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >>>>> )  as (intId: int, vector:tuple(cardinality:int,
> >>>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
> >>>>>
> >>>>>
> >>>>> Looks like the vector is empty... Or i do something wrong.
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >>>>>
> >>>>>> Which version of Mahout?
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
> >>>>> serega.sheypak@gmail.com
> >>>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
> >>>>>>> processing Job-Specific
> >>>>>>>
> >>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> >>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> >>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>                  --input \
> >>>>>>>                  hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
> >>>>>>>                  --output \
> >>>>>>>                  hdfs://nameservice1/recommenditembased/output \
> >>>>>>>                  --similarityClassname \
> >>>>>>>                  SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>                  --numRecommendations \
> >>>>>>>                  500 \
> >>>>>>>                  --booleanData \
> >>>>>>>                  false \
> >>>>>>>                  --maxPrefsPerUser \
> >>>>>>>                  1000 \
> >>>>>>>                  --maxSimilaritiesPerItem \
> >>>>>>>                  1000 \
> >>>>>>>                  --minPrefsPerUser \
> >>>>>>>                  5 \
> >>>>>>>                  --maxPrefsPerUserInItemSimilarity \
> >>>>>>>                  30 \
> >>>>>>>                  --threshold \
> >>>>>>>                  1.1 \
> >>>>>>>                  --tempDir \
> >>>>>>>                  hdfs://nameservice1/recommenditembased/temp \
> >>>>>>>                  --outputPathForSimilarityMatrix \
> >>>>>>>                  hdfs://nameservice1/recommenditembased/sim_matrix
> >>>>>>>
> >>>>>>>
> >>>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
> >>>>>>>
> >>>>>>>
> >>>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> >>>>>>>
> >>>>>>>> Serega,
> >>>>>>>>
> >>>>>>>> See the last line on how to pass outputPathForSimilarityMatrix options to
> >>>>>>>> the recommenditembased command:
> >>>>>>>>
> >>>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>>                 --input visited_items_with_inverted_items \
> >>>>>>>>                 --output result \
> >>>>>>>>                 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>>                 --usersFile inverted_items \
> >>>>>>>>                 --numRecommendations 500 \
> >>>>>>>>                 --booleanData false \
> >>>>>>>>                 --maxPrefsPerUser 100 \
> >>>>>>>>                 --maxSimilaritiesPerItem 500 \
> >>>>>>>>                 --minPrefsPerUser 0 \
> >>>>>>>>                 --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>>>                 --threshold 0.91 \
> >>>>>>>>                 --tempDir  temp \
> >>>>>>>>                 --outputPathForSimilarityMatrix similarityMatri \
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Peng Zhang
> >>>>>>>> pzhang.xjtu@gmail.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
> >>>>> serega.sheypak@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I've inspected the code, our approach wouldn't work with
> >>>>>>>>> booleanData=false.
> >>>>>>>>> We do calculate item similarity in the wrong way...(((
> >>>>>>>>> Thank you
> >>>>>>>>> 1. We provide "fake" user_id and provide --usersFile in order to get
> >>>>>>>>> recommendations for "fake" user_id, where user_id is a negative item_id.
> >>>>>>>>> It worked when we did provide user_id->item_id pairs without preference.
> >>>>>>>>> 2. Our target is to get item similarities. We tried
> >>>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
> >>>>>>>>> it returns bad results compared to RecommenderJob with our "fake" user_id
> >>>>>>>>> (inverted item_id)
> >>>>>>>>>
> >>>>>>>>> 1. I'll try the option you provided.
> >>>>>>>>> 2. I will remove input with fake user_id and usersFile with these fake
> >>>>>>>>> ids
> >>>>>>>>>
> >>>>>>>>> 3.
> >>>>>>>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> >>>>>>>>> I don't understand how to pass the --outputPathForSimilarityMatrix option
> >>>>>>>>> to RecommenderJob
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> >>>>>>>>>
> >>>>>>>>>> Seraga,
> >>>>>>>>>>
> >>>>>>>>>> I have two comments:
> >>>>>>>>>> 1. Don’t use negative user ids. Since Mahout uses user id as well as item
> >>>>>>>>>> id as the row/column index, you’d better use 0, 1, 2, etc as ids
> >>>>>>>>>> 2. If you want to get the item similarity information, you can use
> >>>>>>>>>> --outputPathForSimilarityMatrix in the command
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Peng Zhang
> >>>>>>>>>> M: +86 186-1658-7856
> >>>>>>>>>> pzhang.xjtu@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
> >>>>>> serega.sheypak@gmail.com
> >>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> All bad things happen here:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Name
> >>>>>>>>>>>
> >>>>>>>>>>> RecommenderJob-PartialMultiplyMapper-Reducer
> >>>>>>>>>>>
> >>>>>>>>>>> User
> >>>>>>>>>>>
> >>>>>>>>>>> oozie
> >>>>>>>>>>>
> >>>>>>>>>>> Process User
> >>>>>>>>>>>
> >>>>>>>>>>> oozie
> >>>>>>>>>>>
> >>>>>>>>>>> Group
> >>>>>>>>>>>
> >>>>>>>>>>> oozie
> >>>>>>>>>>>
> >>>>>>>>>>> Mapper Class
> >>>>>>>>>>>
> >>>>>>>>>>> PartialMultiplyMapper
> >>>>>>>>>>>
> >>>>>>>>>>> Reducer Class
> >>>>>>>>>>>
> >>>>>>>>>>> AggregateAndRecommendReducer
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Job Input Directory
> >>>>>>>>>>>
> >>>>>>>>>>> hdfs://nameservice1/itemrec/temp/partialMultiply
> >>>>>>>>>>>
> >>>>>>>>>>> Job Output Directory
> >>>>>>>>>>>
> >>>>>>>>>>> hdfs://nameservice1/itemrec/output/
> >>>>>>>>>>>
> >>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> >>>>>>>>>>>
> >>>>>>>>>>> Why does Mahout return 0 rows? It works when booleanData=true
> >>>>>>>>>>> (preferences are ignored...?)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
> >>>>>> serega.sheypak@gmail.com
> >>>>>>>> :
> >>>>>>>>>>>
> >>>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> >>>>>>>>>>>> users_file:
> >>>>>>>>>>>> --inverted_item_id
> >>>>>>>>>>>> -1
> >>>>>>>>>>>> -2
> >>>>>>>>>>>> -3
> >>>>>>>>>>>> -4
> >>>>>>>>>>>>
> >>>>>>>>>>>> users_items_prefs
> >>>>>>>>>>>> --inverted item_id
> >>>>>>>>>>>> -1 1 1.0
> >>>>>>>>>>>> -2 2 1.0
> >>>>>>>>>>>> -3 3 1.0
> >>>>>>>>>>>> -4 4 1.0
> >>>>>>>>>>>> --user_id item_id pref_value
> >>>>>>>>>>>> 11   1 1.6
> >>>>>>>>>>>> 11   2 1.6
> >>>>>>>>>>>> 123 3 2.0
> >>>>>>>>>>>> 123 4 2.0
> >>>>>>>>>>>> 333 1 2.0
> >>>>>>>>>>>> 333 2 1.6
> >>>>>>>>>>>> --e.t.c.
> >>>>>>>>>>>>
> >>>>>>>>>>>> if I set --booleanData true
> >>>>>>>>>>>> then mahout returns the result.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
> >>>>>>>> andrew.musselman@gmail.com
> >>>>>>>>>>> :
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I'm confused about how you're constructing the user file, and why there
> >>>>>>>>>>>>> are negated item ids here.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Can you post some more details please, including Mahout version and some
> >>>>>>>>>>>>> sample data sets?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
> >>>>>>>>>> serega.sheypak@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi, I'm trying to create item similarity.
> >>>>>>>>>>>>>> I gather items which users visit during shopping and then create a file:
> >>>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
> >>>>>>>>>>>>>> user action type and data source)
> >>>>>>>>>>>>>> UNION
> >>>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The idea is to get item similarity. If any user visits item named "A", I want
> >>>>>>>>>>>>>> to show him items "B", "c", "xxx" using preferences of other users.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Here are my settings:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>>>>>>>>               --input visited_items_with_inverted_items \
> >>>>>>>>>>>>>>               --output result \
> >>>>>>>>>>>>>>               --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>>>>>>>>               --usersFile inverted_items \
> >>>>>>>>>>>>>>               --numRecommendations 500 \
> >>>>>>>>>>>>>>               --booleanData false \
> >>>>>>>>>>>>>>               --maxPrefsPerUser 100 \
> >>>>>>>>>>>>>>               --maxSimilaritiesPerItem 500 \
> >>>>>>>>>>>>>>               --minPrefsPerUser 0 \
> >>>>>>>>>>>>>>               --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>>>>>>>>>               --threshold 0.91 \
> >>>>>>>>>>>>>>               --tempDir  temp \
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Some counters... I don't get what do they mean....
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> >>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> >>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> >>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> >>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> >>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> >>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> >>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> >>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> >>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> >>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> >>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> >>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> >>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> >>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>>>>>>>> --------
> >>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> >>>>>>>>>>>>>> --------
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> why 0???

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Pat Ferrel <pa...@gmail.com>.

Both of those jobs require you to create Mahout IDs for users and items. For most Hadoop-based Mahout jobs, whether they take text input or sequence files, the IDs must follow the rules mentioned below. There are a few exceptions, but none that you are using. The wiki was rewritten for 0.9, so the ID requirements may not be documented well. You can file a JIRA so someone documents this.

BTW spark-itemsimilarity will take any IDs and can read any text-delimited file format. Unfortunately, it’s not quite ready yet.
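
A minimal sketch of the ID translation described in the quoted message below, assuming comma-separated "user, item, preference" text input and a dictionary small enough to hold in memory. IdDictionary and translate are hypothetical names for illustration, not part of Mahout; two plain Python containers stand in for the bi-directional dictionary (a Guava HashBiMap would play the same role in Java).

```python
# Hypothetical helper, not part of Mahout: assigns sequential non-negative
# integer IDs ("Mahout IDs") to application-specific IDs in first-seen order.
class IdDictionary:
    def __init__(self):
        self.to_mahout = {}    # application ID -> Mahout ID
        self.from_mahout = []  # list index is the Mahout ID, value is the application ID

    def encode(self, app_id):
        """Return the Mahout ID for app_id, assigning the next free one if it is new."""
        if app_id not in self.to_mahout:
            self.to_mahout[app_id] = len(self.from_mahout)
            self.from_mahout.append(app_id)
        return self.to_mahout[app_id]

    def decode(self, mahout_id):
        """Return the application ID that was assigned this Mahout ID."""
        return self.from_mahout[mahout_id]


def translate(lines, users, items):
    """Rewrite 'user, item, pref' lines so that they use Mahout IDs."""
    out = []
    for line in lines:
        user, item, pref = [field.strip() for field in line.split(",")]
        out.append("%d,%d,%s" % (users.encode(user), items.encode(item), pref))
    return out


# Separate dictionaries for users and items, as both start from 0.
users, items = IdDictionary(), IdDictionary()
mahout_input = translate(["-92, abc, 1.0", "75000x, jkl, 2.0"], users, items)
print(mahout_input)                        # ['0,0,1.0', '1,1,2.0']
print(users.decode(0), items.decode(1))    # -92 jkl
```

The first printed line is the Mahout-ready input for the two example rows, and the second is the reverse lookup you would apply to the job output. The dictionaries have to be saved alongside the job output, otherwise the Mahout IDs in the results cannot be translated back.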
 
On Jul 26, 2014, at 3:14 AM, Serega Sheypak <se...@gmail.com> wrote:

Hm... rather confusing... Are you talking about the input for
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
or
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob?

My goal is to get item-item similarity. Right now ItemSimilarityJob
returns only a few similarities.

I'm reading this:
https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html
and that:
https://mahout.apache.org/users/recommender/userbased-5-minutes.html

I don't see anything there about "Your IDs must be in the range from 0 to
the number of rows" for both items and users. Where does this requirement
come from?


2014-07-25 23:57 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> I think I did explain below. Your IDs must be in the range from 0 to the
> number of rows - 1 and the same for item IDs. This is done by taking your
> application specific IDs and mapping them to sequential non-negative
> Integers. You need to maintain a mapping to/from Mahout IDs somewhere in
> your own code.
> 
> For example imagine input of the form
> -92, abc, 1.0
> 75000x, jkl, 2.0
> 
> Your first user ID is -92, give it Mahout ID = 0. For your next user ID
> 75000x give it Mahout ID = 1
> Your first item ID is abc, give it Mahout ID = 0. For your next item ID
> jkl give it Mahout ID = 1
> keep doing this the first time you see a unique id from your input. A Map
> will do this for you.
> 
> And so on. Then the input to Mahout would be:
> 0,0,1.0
> 1,1,2.0
> 
> The output will have Mahout IDs too so you need to map recommendations for
> Mahout User ID 0 back to your User ID of -92, and the same for all item IDs.
> 
> 
> On Jul 25, 2014, at 11:55 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
> I'm preparing data using Apache Hive: user_id:long, item_id:long,
> preference[1.0, 2.0]
> I don't understand "For most Mahout jobs you have to prepare your data to
> have Mahout IDs". What are "Mahout IDs"? I tried to follow the Mahout site
> docs, but I didn't find anything there related to Mahout IDs.
> Please explain.
> 
> 
> 2014-07-25 22:39 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
> 
>> Sorry I haven’t read this thread carefully but it looks like you may be
>> using the wrong IDs.
>> 
>> For most Mahout jobs you have to prepare your data to have Mahout IDs. You
>> do this by looking at each datum and, as you see a new unique
>> application-specific user or item ID, giving it a Mahout ID starting from
>> 0. So Mahout IDs can be thought of as row and column numbers in a matrix.
>> The Mahout IDs for rows will be 0 through # of rows - 1, and the same for
>> columns.
>> 
>> This always requires that you translate into Mahout IDs, then after the
>> job is run translate back into your application IDs. You need a
>> bi-directional dictionary of some type. I use a HashBiMap from Guava.
>> 
>> Also I’d avoid the threshold for now. If you get that wrong it will mess
>> things up badly and is very hard to tune. It’s there for completeness but
>> I never use it.
>> 
>> 
>> On Jul 25, 2014, at 12:55 AM, Serega Sheypak <se...@gmail.com>
>> wrote:
>> 
>> Hi, nothing helps...
>> I do use mahout 0.9 compiled for CDH 4.7
>> I do provide only positive values
>> I do use itemsimilarityJob and do get 2000 similarities for 1400 unique
>> items
>> Input data is:
>> 16*10^6 preferences
>> 4*10^6 users
>> 0.6*10^ items
>> I do use Pearson correlation and preference values are: 1.0 and 2.0
>> 
>> 
>> 2014-07-22 9:32 GMT+04:00 Serega Sheypak <se...@gmail.com>:
>> 
>>> Ok, I have recompiled mahout 0.9 for CDH 4.7. I'll try this evening.
>>> Right now I don't see how it can help me. As far as I know the stuff I'm
>>> trying to use is pretty old and stable.
>>> Looks like I'm applying it in the wrong way.
>>> 
>>> There is an option for recommenditembased named "--threshold". I do
>>> provide data for recommenditembased with preference values in range
>>> [1.1..2.0].
>>> I set --threshold to 1.2
>>> Is --threshold absolute, in the range [1.1 .. 2+], or is it relative,
>>> in the range [0.0 .. 0.99999]?
>>> 
>>> 
>>> 2014-07-22 3:54 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>> 
>>> That version is no longer supported.  You should upgrade to 0.9
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com>
>>>> wrote:
>>>> 
>>>>> 0.7-cdh4.7.0
>>>>> Anyway, recommenditembased does produce these catalogs:
>>>>> 
>>>>> /recommenditembased/temp/maxValues.bin
>>>>> /recommenditembased/temp/norms.bin
>>>>> /recommenditembased/temp/numNonZeroEntries.bin
>>>>> /recommenditembased/temp/pairwiseSimilarity
>>>>> /recommenditembased/temp/partialMultiply
>>>>> /recommenditembased/temp/prePartialMultiply1
>>>>> /recommenditembased/temp/prePartialMultiply2
>>>>> /recommenditembased/temp/preparePreferenceMatrix
>>>>> /recommenditembased/temp/similarityMatrix
>>>>> /recommenditembased/temp/weights
>>>>> 
>>>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
>>>>> need. Right now I try to read it using
>>>>> 
>>>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>>>  '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>>>  '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>>>> )  as (intId: int, vector:tuple(cardinality:int,
>>>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>>>> 
>>>>> 
>>>>> Looks like the vector is empty... Or i do something wrong.
>>>>> 
>>>>> 
>>>>> 
>>>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>>>> 
>>>>>> Which version of Mahout?
>>>>>> 
>>>>>> 
>>>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>>>>> serega.sheypak@gmail.com
>>>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>>>>>>> processing Job-Specific
>>>>>>> 
>>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>                  --input \
>>>>>>>                  hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>>>>>                  --output \
>>>>>>>                  hdfs://nameservice1/recommenditembased/output \
>>>>>>>                  --similarityClassname \
>>>>>>>                  SIMILARITY_LOGLIKELIHOOD \
>>>>>>>                  --numRecommendations \
>>>>>>>                  500 \
>>>>>>>                  --booleanData \
>>>>>>>                  false \
>>>>>>>                  --maxPrefsPerUser \
>>>>>>>                  1000 \
>>>>>>>                  --maxSimilaritiesPerItem \
>>>>>>>                  1000 \
>>>>>>>                  --minPrefsPerUser \
>>>>>>>                  5 \
>>>>>>>                  --maxPrefsPerUserInItemSimilarity \
>>>>>>>                  30 \
>>>>>>>                  --threshold \
>>>>>>>                  1.1 \
>>>>>>>                  --tempDir \
>>>>>>>                  hdfs://nameservice1/recommenditembased/temp \
>>>>>>>                  --outputPathForSimilarityMatrix \
>>>>>>>                  hdfs://nameservice1/recommenditembased/sim_matrix
>>>>>>> 
>>>>>>> 
>>>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>>>>>> 
>>>>>>> 
>>>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>>>> 
>>>>>>>> Serega,
>>>>>>>> 
>>>>>>>> See the last line on how to pass outputPathForSimilarityMatrix options to
>>>>>>>> the recommenditembased command:
>>>>>>>> 
>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>                 --input visited_items_with_inverted_items \
>>>>>>>>                 --output result \
>>>>>>>>                 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>>>                 --usersFile inverted_items \
>>>>>>>>                 --numRecommendations 500 \
>>>>>>>>                 --booleanData false \
>>>>>>>>                 --maxPrefsPerUser 100 \
>>>>>>>>                 --maxSimilaritiesPerItem 500 \
>>>>>>>>                 --minPrefsPerUser 0 \
>>>>>>>>                 --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>                 --threshold 0.91 \
>>>>>>>>                 --tempDir  temp \
>>>>>>>>                 --outputPathForSimilarityMatrix similarityMatri \
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Peng Zhang
>>>>>>>> pzhang.xjtu@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>>>>> serega.sheypak@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I've inspected the code, our approach wouldn't work with
>>>>>>>>> booleanData=false.
>>>>>>>>> We do calculate item similarity in the wrong way...(((
>>>>>>>>> Thank you
>>>>>>>>> 1. We provide "fake" user_id and provide --usersFile in order to get
>>>>>>>>> recommendations for "fake" user_id, where user_id is a negative item_id.
>>>>>>>>> It worked when we did provide user_id->item_id pairs without preference.
>>>>>>>>> 2. Our target is to get item similarities. We tried
>>>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
>>>>>>>>> it returns bad results compared to RecommenderJob with our "fake" user_id
>>>>>>>>> (inverted item_id)
>>>>>>>>> 
>>>>>>>>> 1. I'll try the option you provided.
>>>>>>>>> 2. I will remove input with fake user_id and usersFile with these fake
>>>>>>>>> ids
>>>>>>>>> 
>>>>>>>>> 3.
>>>>>>>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>>>>>> I don't understand how to pass the --outputPathForSimilarityMatrix option
>>>>>>>>> to RecommenderJob
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>>>>>> 
>>>>>>>>>> Seraga,
>>>>>>>>>> 
>>>>>>>>>> I have two comments:
>>>>>>>>>> 1. Don’t use negative user ids. Since Mahout uses user id as well as item
>>>>>>>>>> id as the row/column index, you’d better use 0, 1, 2, etc as ids
>>>>>>>>>> 2. If you want to get the item similarity information, you can use
>>>>>>>>>> --outputPathForSimilarityMatrix in the command
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Peng Zhang
>>>>>>>>>> M: +86 186-1658-7856
>>>>>>>>>> pzhang.xjtu@gmail.com
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>>>>>> serega.sheypak@gmail.com
>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> All bad things happen here:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Name
>>>>>>>>>>> 
>>>>>>>>>>> RecommenderJob-PartialMultiplyMapper-Reducer
>>>>>>>>>>> 
>>>>>>>>>>> User
>>>>>>>>>>> 
>>>>>>>>>>> oozie
>>>>>>>>>>> 
>>>>>>>>>>> Process User
>>>>>>>>>>> 
>>>>>>>>>>> oozie
>>>>>>>>>>> 
>>>>>>>>>>> Group
>>>>>>>>>>> 
>>>>>>>>>>> oozie
>>>>>>>>>>> 
>>>>>>>>>>> Mapper Class
>>>>>>>>>>> 
>>>>>>>>>>> PartialMultiplyMapper
>>>>>>>>>>> 
>>>>>>>>>>> Reducer Class
>>>>>>>>>>> 
>>>>>>>>>>> AggregateAndRecommendReducer
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Job Input Directory
>>>>>>>>>>> 
>>>>>>>>>>> hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>>>>>>>> 
>>>>>>>>>>> Job Output Directory
>>>>>>>>>>> 
>>>>>>>>>>> hdfs://nameservice1/itemrec/output/
>>>>>>>>>>> 
>>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
>>>>>>>>>>> 
>>>>>>>>>>> Why does Mahout return 0 rows? It works when booleanData=true
>>>>>>>>>>> (preferences are ignored...?)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>>>>>> serega.sheypak@gmail.com
>>>>>>>> :
>>>>>>>>>>> 
>>>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>>>>>>>>> users_file:
>>>>>>>>>>>> --inverted_item_id
>>>>>>>>>>>> -1
>>>>>>>>>>>> -2
>>>>>>>>>>>> -3
>>>>>>>>>>>> -4
>>>>>>>>>>>> 
>>>>>>>>>>>> users_items_prefs
>>>>>>>>>>>> --inverted item_id
>>>>>>>>>>>> -1 1 1.0
>>>>>>>>>>>> -2 2 1.0
>>>>>>>>>>>> -3 3 1.0
>>>>>>>>>>>> -4 4 1.0
>>>>>>>>>>>> --user_id item_id pref_value
>>>>>>>>>>>> 11   1 1.6
>>>>>>>>>>>> 11   2 1.6
>>>>>>>>>>>> 123 3 2.0
>>>>>>>>>>>> 123 4 2.0
>>>>>>>>>>>> 333 1 2.0
>>>>>>>>>>>> 333 2 1.6
>>>>>>>>>>>> --e.t.c.
>>>>>>>>>>>> 
>>>>>>>>>>>> if I set --booleanData true
>>>>>>>>>>>> then mahout returns the result.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>>>>>>>> andrew.musselman@gmail.com
>>>>>>>>>>> :
>>>>>>>>>>>> 
>>>>>>>>>>>>> I'm confused about how you're constructing the user file, and why there
>>>>>>>>>>>>> are negated item ids here.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Can you post some more details please, including Mahout version and some
>>>>>>>>>>>>> sample data sets?
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
>>>>>>>>>> serega.sheypak@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi, I'm trying to create item similarity.
>>>>>>>>>>>>>> I gather items which users visit during shopping and then create a file:
>>>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
>>>>>>>>>>>>>> user action type and data source)
>>>>>>>>>>>>>> UNION
>>>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The idea is to get item similarity. If any user visits item named "A", I want
>>>>>>>>>>>>>> to show him items "B", "c", "xxx" using preferences of other users.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here are my settings:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>>>>>>>               --input visited_items_with_inverted_items
>>>> \
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>               --output result \
>>>>>>>>>>>>>>               --similarityClassname
>>>>> SIMILARITY_LOGLIKELIHOOD
>>>>>> \
>>>>>>>>>>>>>>               --usersFile inverted_items \
>>>>>>>>>>>>>>               --numRecommendations 500 \
>>>>>>>>>>>>>>               --booleanData false \
>>>>>>>>>>>>>>               --maxPrefsPerUser 100 \
>>>>>>>>>>>>>>               --maxSimilaritiesPerItem 500 \
>>>>>>>>>>>>>>               --minPrefsPerUser 0\
>>>>>>>>>>>>>>               --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>>>>>>>               --threshold 0.91 \
>>>>>>>>>>>>>>               --tempDir  temp \
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Some counters... I don't get what do they mean....
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
>>>>>>>>>>>>>> 
>>>>>>> org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>>>> USER_RATINGS_NEGLECTED=1,798,738
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>>> USER_RATINGS_USED=12,429,693
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>>>> COOCCURRENCES=35882374
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>>>> PRUNED_COOCCURRENCES=0
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input
>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output
>>>>>>>>>> records=17570268
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>>> records=5221907
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input
>>>>>>>> records=7528530
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output
>>>>>>>>>> records=3313251
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>>> records=3313251
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>>> records=3313251
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input
>>>>>>>> records=6626130
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output
>>>>>>>>>> records=6626130
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>>> records=6626130
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input
>>>>>>>> records=3312879
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output
>>>>>>>>>> records=3313251
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>>> records=3313251
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --------
>>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output
>>>>>>> records=0
>>>>>>>>>>>>>> --------
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> why 0???

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Hm... rather confusing... You are talking about input for:
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
or
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

My goal is to get item-item similarity. Right now ItemSimilarityJob
returns only a few similarities.

I'm reading this:
https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html
and this:
https://mahout.apache.org/users/recommender/userbased-5-minutes.html

I don't see anything there about "Your IDs must be in the range from 0 to
the number of rows" for both items and users. Where does this requirement
come from?


2014-07-25 23:57 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> I think I did explain below. Your IDs must be in the range from 0 to the
> number of rows - 1 and the same for item IDs. This is done by taking your
> application specific IDs and mapping them to sequential non-negative
> Integers. You need to maintain a mapping to/from Mahout IDs somewhere in
> your own code.
>
> For example imagine input of the form
> -92, abc, 1.0
> 75000x, jkl, 2.0
>
> Your first user ID is -92, give it Mahout ID = 0. For your next user ID
> 75000x give it Mahout ID = 1
> Your first item ID is abc, give it Mahout ID = 0. For your next item ID
> jkl give it Mahout ID = 1
> keep doing this the first time you see a unique id from your input. A Map
> will do this for you.
>
> And so on. Then the input to Mahout would be:
> 0,0,1.0
> 1,1,2.0
>
> The output will have Mahout IDs too so you need to map recommendations for
> Mahout User ID 0 back to your User ID of -92, and the same for all item IDs.
>
>
> On Jul 25, 2014, at 11:55 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> I'm preparing data using apache hive: user_id:long, item_it:long,
> preference[1.0, 2.0]
> I don't understand "For most Mahout jobs you have to prepare you data to
> have Mahout IDs". What is "Mahout IDs"? I try to follow mahout site docs, I
> didn't find there something related to mahout ids.
> Please explain.
>
>
> 2014-07-25 22:39 GMT+04:00 Pat Ferrel <pa...@gmail.com>:
>
> > Sorry, I haven’t read this thread carefully, but it looks like you may be
> > using the wrong IDs.
> >
> > For most Mahout jobs you have to prepare your data to have Mahout IDs. You
> > do this by looking at each datum, and whenever you see a new unique
> > application-specific user or item ID you give it a Mahout ID, starting
> > from 0. So a Mahout ID can be thought of as a row or column number in a
> > matrix. The Mahout IDs for rows will be 0 through # of rows - 1, and the
> > same for columns.
> >
> > This always requires that you translate into Mahout IDs, then after the
> > job has run translate back into your application IDs. You need a
> > bi-directional dictionary of some type. I use a HashBiMap from Guava.
> >
> > Also, I’d avoid the threshold for now. If you get that wrong it will mess
> > things up badly, and it is very hard to tune. It’s there for completeness,
> > but I never use it.
> >
> >
> > On Jul 25, 2014, at 12:55 AM, Serega Sheypak <se...@gmail.com>
> > wrote:
> >
> > Hi, nothing helps...
> > I use Mahout 0.9 compiled for CDH 4.7
> > I provide only positive values
> > I use ItemSimilarityJob and get 2000 similarities for 1400 unique
> > items
> > Input data is:
> > 16*10^6 preferences
> > 4*10^6 users
> > 0.6*10^ items
> > I use Pearson correlation, and the preference values are 1.0 and 2.0
> >
> >
> > 2014-07-22 9:32 GMT+04:00 Serega Sheypak <se...@gmail.com>:
> >
> >> OK, I have recompiled Mahout 0.9 for CDH 4.7. I'll try it this evening.
> >> Right now I don't see how it can help me. As far as I know, the stuff I'm
> >> trying to use is pretty old and stable.
> >> It looks like I'm applying it the wrong way.
> >>
> >> There is an option for recommenditembased named "--threshold". I
> >> provide data for recommenditembased with preference values in the range
> >> [1.1..2.0].
> >> I set --threshold to 1.2.
> >> Is --threshold absolute, so it can be in [1.1 .. 2+], or is it relative,
> >> so it can be in [0.0 .. 0.99999]?
> >>
> >>
> >> 2014-07-22 3:54 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >>
> >> That version is no longer supported.  You should upgrade to 0.9
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <
> >>> serega.sheypak@gmail.com>
> >>> wrote:
> >>>
> >>>> 0.7-cdh4.7.0
> >>>> Anyway, recommenditembased does produce these catalogs:
> >>>>
> >>>> /recommenditembased/temp/maxValues.bin
> >>>> /recommenditembased/temp/norms.bin
> >>>> /recommenditembased/temp/numNonZeroEntries.bin
> >>>> /recommenditembased/temp/pairwiseSimilarity
> >>>> /recommenditembased/temp/partialMultiply
> >>>> /recommenditembased/temp/prePartialMultiply1
> >>>> /recommenditembased/temp/prePartialMultiply2
> >>>> /recommenditembased/temp/preparePreferenceMatrix
> >>>> /recommenditembased/temp/similarityMatrix
> >>>> /recommenditembased/temp/weights
> >>>>
> >>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
> >>>> need. Right now I try to read it using
> >>>>
> >>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
> >>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
> >>>>   '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
> >>>>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >>>> )  as (intId: int, vector:tuple(cardinality:int,
> >>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
> >>>>
> >>>>
> >>>> Looks like the vector is empty... Or i do something wrong.
> >>>>
> >>>>
> >>>>
> >>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >>>>
> >>>>> Which version of Mahout?
> >>>>>
> >>>>>
> >>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
> >>>> serega.sheypak@gmail.com
> >>>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing
> >>>>>> Job-Specific
> >>>>>>
> >>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> >>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> >>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>                   --input \
> >>>>>>                   hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
> >>>>>>                   --output \
> >>>>>>                   hdfs://nameservice1/recommenditembased/output \
> >>>>>>                   --similarityClassname \
> >>>>>>                   SIMILARITY_LOGLIKELIHOOD \
> >>>>>>                   --numRecommendations \
> >>>>>>                   500 \
> >>>>>>                   --booleanData \
> >>>>>>                   false \
> >>>>>>                   --maxPrefsPerUser \
> >>>>>>                   1000 \
> >>>>>>                   --maxSimilaritiesPerItem \
> >>>>>>                   1000 \
> >>>>>>                   --minPrefsPerUser \
> >>>>>>                   5 \
> >>>>>>                   --maxPrefsPerUserInItemSimilarity \
> >>>>>>                   30 \
> >>>>>>                   --threshold \
> >>>>>>                   1.1 \
> >>>>>>                   --tempDir \
> >>>>>>                   hdfs://nameservice1/recommenditembased/temp \
> >>>>>>                   --outputPathForSimilarityMatrix \
> >>>>>>                   hdfs://nameservice1/recommenditembased/sim_matrix
> >>>>>>
> >>>>>>
> >>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
> >>>>>>
> >>>>>>
> >>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> >>>>>>
> >>>>>>> Serega,
> >>>>>>>
> >>>>>>> See the last line on how to pass outputPathForSimilarityMatrix options to
> >>>>>>> the recommenditembased command:
> >>>>>>>
> >>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>                  --input visited_items_with_inverted_items \
> >>>>>>>                  --output result \
> >>>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>                  --usersFile inverted_items \
> >>>>>>>                  --numRecommendations 500 \
> >>>>>>>                  --booleanData false \
> >>>>>>>                  --maxPrefsPerUser 100 \
> >>>>>>>                  --maxSimilaritiesPerItem 500 \
> >>>>>>>                  --minPrefsPerUser 0\
> >>>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>>                  --threshold 0.91 \
> >>>>>>>                  --tempDir  temp \
> >>>>>>>                  --outputPathForSimilarityMatrix similarityMatri \
> >>>>>>>
> >>>>>>>
> >>>>>>> Peng Zhang
> >>>>>>> pzhang.xjtu@gmail.com
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
> >>>> serega.sheypak@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I've inspected the code; our approach wouldn't work with booleanData=false.
> >>>>>>>> We calculate item similarity in the wrong way...(((
> >>>>>>>> Thank you
> >>>>>>>> 1. We provide a "fake" user_id and provide --usersFile in order to get
> >>>>>>>> recommendations for the "fake" user_id, where user_id is a negative item_id.
> >>>>>>>> It worked when we provided user_id->item_id pairs without preference.
> >>>>>>>> 2. Our target is to get item similarities. We tried
> >>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
> >>>>>>>> returns bad results compared to RecommenderJob with our "fake" user_id
> >>>>>>>> (inverted item_id)
> >>>>>>>>
> >>>>>>>> 1. I'll try the option you provided.
> >>>>>>>> 2. I will remove the input with fake user_ids and the usersFile with these
> >>>>>>>> fake ids
> >>>>>>>>
> >>>>>>>> 3.
> >>>>>>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> >>>>>>>> I don't understand how to pass the --outputPathForSimilarityMatrix option to
> >>>>>>>> RecommenderJob
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> >>>>>>>>
> >>>>>>>>> Seraga,
> >>>>>>>>>
> >>>>>>>>> I have two comments:
> >>>>>>>>> 1. Don’t use negative user ids. Since Mahout uses user id as
> >>> well
> >>>> as
> >>>>>>> item
> >>>>>>>>> id as the row/column index, you’d better use 0, 1, 2, etc as
> >>> ids
> >>>>>>>>> 2. If you want to get the item similarity information, you can
> >>> use
> >>>>>>>>> --outputPathForSimilarityMatrix in the command
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Peng Zhang
> >>>>>>>>> M: +86 186-1658-7856
> >>>>>>>>> pzhang.xjtu@gmail.com
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
> >>>>> serega.sheypak@gmail.com
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> All bad things happen here:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
> >>>>>>>>>> User: oozie
> >>>>>>>>>> Process User: oozie
> >>>>>>>>>> Group: oozie
> >>>>>>>>>> Mapper Class: PartialMultiplyMapper
> >>>>>>>>>> Reducer Class: AggregateAndRecommendReducer
> >>>>>>>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
> >>>>>>>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
> >>>>>>>>>>
> >>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> >>>>>>>>>>
> >>>>>>>>>> Why does Mahout return 0 rows? It works when booleanData=true
> >>>>>>>>>> (preferences are ignored...?)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
> >>>>>>>>>>
> >>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> >>>>>>>>>>> users_file:
> >>>>>>>>>>> --inverted_item_id
> >>>>>>>>>>> -1
> >>>>>>>>>>> -2
> >>>>>>>>>>> -3
> >>>>>>>>>>> -4
> >>>>>>>>>>>
> >>>>>>>>>>> users_items_prefs
> >>>>>>>>>>> --inverted item_id
> >>>>>>>>>>> -1 1 1.0
> >>>>>>>>>>> -2 2 1.0
> >>>>>>>>>>> -3 3 1.0
> >>>>>>>>>>> -4 4 1.0
> >>>>>>>>>>> --user_id item_id pref_value
> >>>>>>>>>>> 11   1 1.6
> >>>>>>>>>>> 11   2 1.6
> >>>>>>>>>>> 123 3 2.0
> >>>>>>>>>>> 123 4 2.0
> >>>>>>>>>>> 333 1 2.0
> >>>>>>>>>>> 333 2 1.6
> >>>>>>>>>>> --e.t.c.
> >>>>>>>>>>>
> >>>>>>>>>>> if I set --booleanData true
> >>>>>>>>>>> then mahout returns the result.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
> >>>>>>>>>>>
> >>>>>>>>>>>> I'm confused about how you're constructing the user file, and why there
> >>>>>>>>>>>> are negated item ids here.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Can you post some more details please, including Mahout version and some
> >>>>>>>>>>>> sample data sets?
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi, I'm trying to create item similarity.
> >>>>>>>>>>>>> I gather items which users visit during shopping and then create a file:
> >>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
> >>>>>>>>>>>>> user action type and data source)
> >>>>>>>>>>>>> UNION
> >>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The idea is to get item similary. If any user visits item named "A", i want
> >>>>>>>>>>>>> to show him items "B", "c", "xxx" using preferences of other users.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here are my settings:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>>>>>>>                --input visited_items_with_inverted_items \
> >>>>>>>>>>>>>                --output result \
> >>>>>>>>>>>>>                --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>>>>>>>                --usersFile inverted_items \
> >>>>>>>>>>>>>                --numRecommendations 500 \
> >>>>>>>>>>>>>                --booleanData false \
> >>>>>>>>>>>>>                --maxPrefsPerUser 100 \
> >>>>>>>>>>>>>                --maxSimilaritiesPerItem 500 \
> >>>>>>>>>>>>>                --minPrefsPerUser 0\
> >>>>>>>>>>>>>                --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>>>>>>>>                --threshold 0.91 \
> >>>>>>>>>>>>>                --tempDir  temp \
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Some counters... I don't get what do they mean....
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> >>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> >>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> >>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> >>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> >>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> >>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> >>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> >>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> >>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> >>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> >>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> >>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> >>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> >>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --------
> >>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> >>>>>>>>>>>>> --------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> why 0???

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Pat Ferrel <pa...@gmail.com>.

I think I did explain below. Your user IDs must be in the range from 0 to the number of rows - 1, and the same goes for item IDs (0 to the number of columns - 1). You do this by taking your application-specific IDs and mapping them to sequential non-negative integers. You need to maintain a mapping to and from Mahout IDs somewhere in your own code.

For example, imagine input of the form
-92, abc, 1.0
75000x, jkl, 2.0

Your first user ID is -92; give it Mahout ID = 0. Your next user ID is 75000x; give it Mahout ID = 1.
Your first item ID is abc; give it Mahout ID = 0. Your next item ID is jkl; give it Mahout ID = 1.
Keep doing this each time you see a new unique ID in your input. A Map will do this for you.

And so on. Then the input to Mahout would be:
0,0,1.0
1,1,2.0

The output will have Mahout IDs too, so you need to map recommendations for Mahout user ID 0 back to your user ID of -92, and the same for all item IDs.
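
To make the translation step concrete, here is a minimal sketch in Python. It is only an illustration and not Mahout code: the IdDictionary class, its encode/decode methods, and the encode_preferences helper are hypothetical names, and the third input row is made up to show an ID being reused. In a Java job a bidirectional map would play the same role.

```python
class IdDictionary:
    """Bidirectional map between application IDs and sequential integer IDs.

    Hypothetical helper for illustration only; not part of Mahout.
    """

    def __init__(self):
        self._to_mahout = {}  # application ID -> sequential int
        self._to_app = []     # sequential int -> application ID

    def encode(self, app_id):
        """Return the integer for app_id, assigning the next free one on first sight."""
        if app_id not in self._to_mahout:
            self._to_mahout[app_id] = len(self._to_app)
            self._to_app.append(app_id)
        return self._to_mahout[app_id]

    def decode(self, mahout_id):
        """Translate a sequential integer back to the original application ID."""
        return self._to_app[mahout_id]


def encode_preferences(rows, users, items):
    """Turn (user, item, weight) rows into 'userID,itemID,weight' lines."""
    return [f"{users.encode(u)},{items.encode(i)},{w}" for u, i, w in rows]


users, items = IdDictionary(), IdDictionary()
# The first two rows come from the example above; the third is an assumption
# added to show that a repeated ID keeps its first assignment.
raw = [("-92", "abc", 1.0), ("75000x", "jkl", 2.0), ("-92", "jkl", 2.0)]
encoded = encode_preferences(raw, users, items)
print("\n".join(encoded))
```

After the job has run, the same two dictionaries translate the output back, for example users.decode(0) gives "-92" and items.decode(1) gives "jkl". The dictionaries have to be persisted alongside the job output, otherwise the integer IDs cannot be mapped back.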


On Jul 25, 2014, at 11:55 AM, Serega Sheypak <se...@gmail.com> wrote:

I'm preparing data using Apache Hive: user_id:long, item_id:long,
preference [1.0, 2.0]
I don't understand "For most Mahout jobs you have to prepare your data to
have Mahout IDs". What are "Mahout IDs"? I tried to follow the Mahout site docs, but I
didn't find anything there related to Mahout IDs.
Please explain.


2014-07-25 22:39 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> Sorry, I haven’t read this thread carefully, but it looks like you may be
> using the wrong IDs.
>
> For most Mahout jobs you have to prepare your data to have Mahout IDs. You
> do this by looking at each datum, and whenever you see a new unique
> application-specific user or item ID you give it a Mahout ID, starting
> from 0. So a Mahout ID can be thought of as a row or column number in a
> matrix. The Mahout IDs for rows will be 0 through # of rows - 1, and the
> same for columns.
>
> This always requires that you translate into Mahout IDs, then after the job
> has run translate back into your application IDs. You need a bi-directional
> dictionary of some type. I use a HashBiMap from Guava.
>
> Also, I’d avoid the threshold for now. If you get that wrong it will mess
> things up badly, and it is very hard to tune. It’s there for completeness,
> but I never use it.
> 
> 
> On Jul 25, 2014, at 12:55 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
> Hi, nothing helps...
> I use Mahout 0.9 compiled for CDH 4.7
> I provide only positive values
> I use ItemSimilarityJob and get 2000 similarities for 1400 unique
> items
> Input data is:
> 16*10^6 preferences
> 4*10^6 users
> 0.6*10^ items
> I use Pearson correlation, and the preference values are 1.0 and 2.0
> 
> 
> 2014-07-22 9:32 GMT+04:00 Serega Sheypak <se...@gmail.com>:
> 
>> OK, I have recompiled Mahout 0.9 for CDH 4.7. I'll try it this evening.
>> Right now I don't see how it can help me. As far as I know, the stuff I'm
>> trying to use is pretty old and stable.
>> It looks like I'm applying it the wrong way.
>>
>> There is an option for recommenditembased named "--threshold". I
>> provide data for recommenditembased with preference values in the range
>> [1.1..2.0].
>> I set --threshold to 1.2.
>> Is --threshold absolute, so it can be in [1.1 .. 2+], or is it relative,
>> so it can be in [0.0 .. 0.99999]?
>> 
>> 
>> 2014-07-22 3:54 GMT+04:00 Ted Dunning <te...@gmail.com>:
>> 
>> That version is no longer supported.  You should upgrade to 0.9
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <
>>> serega.sheypak@gmail.com>
>>> wrote:
>>> 
>>>> 0.7-cdh4.7.0
>>>> Anyway, recommenditembased does produce these catalogs:
>>>> 
>>>> /recommenditembased/temp/maxValues.bin
>>>> /recommenditembased/temp/norms.bin
>>>> /recommenditembased/temp/numNonZeroEntries.bin
>>>> /recommenditembased/temp/pairwiseSimilarity
>>>> /recommenditembased/temp/partialMultiply
>>>> /recommenditembased/temp/prePartialMultiply1
>>>> /recommenditembased/temp/prePartialMultiply2
>>>> /recommenditembased/temp/preparePreferenceMatrix
>>>> /recommenditembased/temp/similarityMatrix
>>>> /recommenditembased/temp/weights
>>>> 
>>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
>>>> need. Right now I try to read it using
>>>> 
>>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>>   '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>>> )  as (intId: int, vector:tuple(cardinality:int,
>>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>>> 
>>>> 
>>>> Looks like the vector is empty... Or i do something wrong.
>>>> 
>>>> 
>>>> 
>>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>>> 
>>>>> Which version of Mahout?
>>>>> 
>>>>> 
>>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com
>>>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing
>>>>>> Job-Specific
>>>>>>
>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>                   --input \
>>>>>>                   hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>>>>                   --output \
>>>>>>                   hdfs://nameservice1/recommenditembased/output \
>>>>>>                   --similarityClassname \
>>>>>>                   SIMILARITY_LOGLIKELIHOOD \
>>>>>>                   --numRecommendations \
>>>>>>                   500 \
>>>>>>                   --booleanData \
>>>>>>                   false \
>>>>>>                   --maxPrefsPerUser \
>>>>>>                   1000 \
>>>>>>                   --maxSimilaritiesPerItem \
>>>>>>                   1000 \
>>>>>>                   --minPrefsPerUser \
>>>>>>                   5 \
>>>>>>                   --maxPrefsPerUserInItemSimilarity \
>>>>>>                   30 \
>>>>>>                   --threshold \
>>>>>>                   1.1 \
>>>>>>                   --tempDir \
>>>>>>                   hdfs://nameservice1/recommenditembased/temp \
>>>>>>                   --outputPathForSimilarityMatrix \
>>>>>>                   hdfs://nameservice1/recommenditembased/sim_matrix
>>>>>> 
>>>>>> 
>>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>>>>> 
>>>>>> 
>>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>>> 
>>>>>>> Serega,
>>>>>>> 
>>>>>>> See the last line on how to pass outputPathForSimilarityMatrix options to
>>>>>>> the recommenditembased command:
>>>>>>>
>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>                  --input visited_items_with_inverted_items \
>>>>>>>                  --output result \
>>>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>>                  --usersFile inverted_items \
>>>>>>>                  --numRecommendations 500 \
>>>>>>>                  --booleanData false \
>>>>>>>                  --maxPrefsPerUser 100 \
>>>>>>>                  --maxSimilaritiesPerItem 500 \
>>>>>>>                  --minPrefsPerUser 0\
>>>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>                  --threshold 0.91 \
>>>>>>>                  --tempDir  temp \
>>>>>>>                  --outputPathForSimilarityMatrix similarityMatri \
>>>>>>> 
>>>>>>> 
>>>>>>> Peng Zhang
>>>>>>> pzhang.xjtu@gmail.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>>>> serega.sheypak@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> I've inspected the code, our approach wouldn't work with
>>>>>>> booleanData=false.
>>>>>>>> We do calcualte imte similarity in the wrong way...(((
>>>>>>>> Thank you
>>>>>>>> 1. We provide "fake" user_id and provide --usersFile in order to
>>>> get
>>>>>>>> recommendations for "fake user_id, where user_id is a negative
>>>>> item_id.
>>>>>>> It
>>>>>>>> worked when we did provide user_id->item_id pairs without
>>>> preference.
>>>>>>>> 2. Our target is to get item similarities. We tried
>>>>>>>> 
>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>>> but
>>>>>>> it
>>>>>>>> returns bad result comparing to RecommenderJob with our "fake"
>>>>> user_id
>>>>>>>> (inverted item_id)
>>>>>>>> 
>>>>>>>> 1. I'll try the option you provided.
>>>>>>>> 2. I will remove input with fake user_id and usersFile with
>>> these
>>>>> fake
>>>>>>> ids
>>>>>>>> 
>>>>>>>> 3.
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>>>>> I don't understand how to pass ---outputPathForSimilarityMatrix
>>>>> option
>>>>>> to
>>>>>>>> RecommenderJob
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>>>>> 
>>>>>>>>> Seraga,
>>>>>>>>> 
>>>>>>>>> I have two comments:
>>>>>>>>> 1. Don’t use negative user ids. Since Mahout uses user id as
>>> well
>>>> as
>>>>>>> item
>>>>>>>>> id as the row/column index, you’d better use 0, 1, 2, etc as
>>> ids
>>>>>>>>> 2. If you want to get the item similarity information, you can
>>> use
>>>>>>>>> --outputPathForSimilarityMatrix in the command
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Peng Zhang
>>>>>>>>> M: +86 186-1658-7856
>>>>>>>>> pzhang.xjtu@gmail.com
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>>>>> serega.sheypak@gmail.com
>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> All bad things happen here:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Name
>>>>>>>>>> 
>>>>>>>>>> RecommenderJob-PartialMultiplyMapper-Reducer
>>>>>>>>>> 
>>>>>>>>>> User
>>>>>>>>>> 
>>>>>>>>>> oozie
>>>>>>>>>> 
>>>>>>>>>> Process User
>>>>>>>>>> 
>>>>>>>>>> oozie
>>>>>>>>>> 
>>>>>>>>>> Group
>>>>>>>>>> 
>>>>>>>>>> oozie
>>>>>>>>>> 
>>>>>>>>>> Mapper Class
>>>>>>>>>> 
>>>>>>>>>> PartialMultiplyMapper
>>>>>>>>>> 
>>>>>>>>>> Reducer Class
>>>>>>>>>> 
>>>>>>>>>> AggregateAndRecommendReducer
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Job Input Directory
>>>>>>>>>> 
>>>>>>>>>> hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>>>>>>> 
>>>>>>>>>> Job Output Directory
>>>>>>>>>> 
>>>>>>>>>> hdfs://nameservice1/itemrec/output/
>>>>>>>>>> 
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input
>>>>>> records=3312879
>>>>>>>>>> 
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output
>>>>>> records=3313251
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input
>>>>>>> records=3313251
>>>>>>>>>> 
>>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output
>>>>> records=0
>>>>>>>>>> 
>>>>>>>>>> Why does mahout returns 0 rows? it works when booleanData=true
>>>>>>>>> (preferences
>>>>>>>>>> are ignored...?)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>>>>> serega.sheypak@gmail.com
>>>>>>> :
>>>>>>>>>> 
>>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>>>>>>>> users_file:
>>>>>>>>>>> --inverted_item_id
>>>>>>>>>>> -1
>>>>>>>>>>> -2
>>>>>>>>>>> -3
>>>>>>>>>>> -4
>>>>>>>>>>> 
>>>>>>>>>>> users_items_prefs
>>>>>>>>>>> --inverted item_id
>>>>>>>>>>> -1 1 1.0
>>>>>>>>>>> -2 2 1.0
>>>>>>>>>>> -3 3 1.0
>>>>>>>>>>> -4 4 1.0
>>>>>>>>>>> --user_id item_id pref_value
>>>>>>>>>>> 11   1 1.6
>>>>>>>>>>> 11   2 1.6
>>>>>>>>>>> 123 3 2.0
>>>>>>>>>>> 123 4 2.0
>>>>>>>>>>> 333 1 2.0
>>>>>>>>>>> 333 2 1.6
>>>>>>>>>>> --e.t.c.
>>>>>>>>>>> 
>>>>>>>>>>> if I set --booleanData true
>>>>>>>>>>> then mahout returns the result.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>>>>>>> andrew.musselman@gmail.com
>>>>>>>>>> :
>>>>>>>>>>> 
>>>>>>>>>>> I'm confused about how you're constructing the user file, and
>>>> why
>>>>>>> there
>>>>>>>>>>>> are negated item ids here.
>>>>>>>>>>>> 
>>>>>>>>>>>> Can you post some more details please, including Mahout
>>> version
>>>>> and
>>>>>>>>> some
>>>>>>>>>>>> sample data sets?
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
>>>>>>>>> serega.sheypak@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi, I'm trying to create item similarity.
>>>>>>>>>>>>> I gather items which users visit during shopping and then
>>>>> create a
>>>>>>>>> file:
>>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6,
>>>> 1.9],
>>>>>>>>> depends
>>>>>>>>>>>> on
>>>>>>>>>>>>> user action type and data source)
>>>>>>>>>>>>> UNION
>>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The idea is to get item similary. If any user visits item
>>>> named
>>>>>>> "A", i
>>>>>>>>>>>> want
>>>>>>>>>>>>> to show him items "B", "c", "xxx" using preferences of
>>> other
>>>>>> users.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0
>>>> rows:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here are my settings:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>>>>>>                --input visited_items_with_inverted_items
>>> \
>>>>>>>>>>>>> 
>>>>>>>>>>>>>                --output result \
>>>>>>>>>>>>>                --similarityClassname
>>>> SIMILARITY_LOGLIKELIHOOD
>>>>> \
>>>>>>>>>>>>>                --usersFile inverted_items \
>>>>>>>>>>>>>                --numRecommendations 500 \
>>>>>>>>>>>>>                --booleanData false \
>>>>>>>>>>>>>                --maxPrefsPerUser 100 \
>>>>>>>>>>>>>                --maxSimilaritiesPerItem 500 \
>>>>>>>>>>>>>                --minPrefsPerUser 0\
>>>>>>>>>>>>>                --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>>>>>>                --threshold 0.91 \
>>>>>>>>>>>>>                --tempDir  temp \
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Some counters... I don't get what do they mean....
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
>>>>>>>>>>>>> 
>>>>>> org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>>> USER_RATINGS_NEGLECTED=1,798,738
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>> USER_RATINGS_USED=12,429,693
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>>> COOCCURRENCES=35882374
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>>> PRUNED_COOCCURRENCES=0
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input
>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output
>>>>>>>>> records=17570268
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>> records=5221907
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input
>>>>>>> records=7528530
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output
>>>>>>>>> records=3313251
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>> records=3313251
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>> records=3313251
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input
>>>>>>> records=6626130
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output
>>>>>>>>> records=6626130
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>> records=6626130
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input
>>>>>>> records=3312879
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output
>>>>>>>>> records=3313251
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>>> records=3313251
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --------
>>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output
>>>>>> records=0
>>>>>>>>>>>>> --------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> why 0???
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
> 
>

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

I'm preparing the data using Apache Hive: user_id:long, item_id:long,
preference [1.0, 2.0]
I don't understand "For most Mahout jobs you have to prepare your data to
have Mahout IDs". What are "Mahout IDs"? I've been following the docs on the
Mahout site and didn't find anything there related to Mahout IDs.
Please explain.


2014-07-25 22:39 GMT+04:00 Pat Ferrel <pa...@gmail.com>:

> Sorry I haven’t read this thread carefully but it looks like you may be
> using the wrong IDs.
>
> For most Mahout jobs you have to prepare your data to have Mahout IDs. You
> do this by looking at each datum, and as you see a new unique
> application-specific user or item ID you give it a Mahout ID, starting from 0.
> So a Mahout ID can be thought of as a row or column number in a matrix. The
> Mahout IDs for rows will be 0 through (number of rows - 1), and likewise for
> columns.
>
> This always requires that you translate into Mahout IDs then after the job
> is run translate back into your application IDs. You need a bi-directional
> dictionary of some type. I use a HashBiMap from Guava.
>
> Also I’d avoid the threshold for now. If you get that wrong it will mess
> things up badly and is very hard to tune. It’s there for completeness but I
> never use it.

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Pat Ferrel <pa...@gmail.com>.

Sorry, I haven’t read this thread carefully, but it looks like you may be using the wrong IDs.

For most Mahout jobs you have to prepare your data to have Mahout IDs. You do this by looking at each datum, and as you see a new unique application-specific user or item ID you give it a Mahout ID, starting from 0. So a Mahout ID can be thought of as a row or column number in a matrix. The Mahout IDs for rows will be 0 through (number of rows - 1), and likewise for columns.

This always requires that you translate into Mahout IDs then after the job is run translate back into your application IDs. You need a bi-directional dictionary of some type. I use a HashBiMap from Guava.
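
As a rough illustration of the translation step described above, here is a minimal sketch in plain Python. It is not Mahout code and not the Guava HashBiMap mentioned above; the names IdDictionary and encode_preferences are hypothetical, and the sample rows are made up. It assigns each new application ID the next free integer starting from 0 and keeps the reverse mapping so results can be translated back after the job.

```python
class IdDictionary:
    """Hypothetical bi-directional dictionary: application ID <-> contiguous ID."""

    def __init__(self):
        self._to_index = {}    # application ID -> contiguous ID (0, 1, 2, ...)
        self._to_app_id = []   # contiguous ID -> application ID

    def index_of(self, app_id):
        """Return the contiguous ID for app_id, assigning the next free one if new."""
        if app_id not in self._to_index:
            self._to_index[app_id] = len(self._to_app_id)
            self._to_app_id.append(app_id)
        return self._to_index[app_id]

    def app_id_of(self, index):
        """Translate a contiguous ID back to the original application ID."""
        return self._to_app_id[index]

    def __len__(self):
        return len(self._to_app_id)


def encode_preferences(rows):
    """Rewrite (user_id, item_id, pref) rows using contiguous user and item IDs."""
    users, items = IdDictionary(), IdDictionary()
    encoded = [(users.index_of(u), items.index_of(i), pref) for u, i, pref in rows]
    return encoded, users, items


# Made-up sample rows: (user_id, item_id, preference).
rows = [(11, 1, 1.6), (11, 2, 1.6), (123, 3, 2.0),
        (123, 4, 2.0), (333, 1, 2.0), (333, 2, 1.6)]
encoded, users, items = encode_preferences(rows)
```

Users and items get separate dictionaries here because they index different dimensions of the matrix (rows and columns); after the job, app_id_of turns the IDs in the output back into application IDs.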

Also I’d avoid the threshold for now. If you get that wrong it will mess things up badly and is very hard to tune. It’s there for completeness but I never use it.
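
To show why a badly chosen threshold can wipe out the output, here is a small sketch in plain Python, not Mahout code. It assumes the option simply discards item pairs whose similarity falls below the given value, and that the similarity scores are bounded above by 1; both are assumptions to verify against the similarity measure actually in use. The function name apply_threshold and all scores are made up.

```python
def apply_threshold(similarities, threshold):
    """Keep only item pairs whose similarity is at least `threshold`."""
    return {pair: s for pair, s in similarities.items() if s >= threshold}


# Made-up similarity scores, assumed to lie in [0, 1).
similarities = {("A", "B"): 0.95, ("A", "C"): 0.40, ("B", "C"): 0.92}

# A threshold inside the score range keeps the strongest pairs.
kept_strict = apply_threshold(similarities, 0.91)

# A threshold above the largest possible score discards every pair,
# which leaves nothing to build recommendations from.
kept_too_high = apply_threshold(similarities, 1.1)
```

Under these assumptions, a value such as 1.1 can never be reached by any pair, so leaving the option out until the rest of the pipeline works is the safer starting point.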


On Jul 25, 2014, at 12:55 AM, Serega Sheypak <se...@gmail.com> wrote:

Hi, nothing helps...
I am using Mahout 0.9 compiled for CDH 4.7.
I provide only positive values.
I use ItemSimilarityJob and get 2000 similarities for 1400 unique
items.
Input data is:
16*10^6 preferences
4*10^6 users
0.6*10^ items
I use Pearson correlation, and the preference values are 1.0 and 2.0.


2014-07-22 9:32 GMT+04:00 Serega Sheypak <se...@gmail.com>:

> OK, I have recompiled Mahout 0.9 for CDH 4.7. I'll try it this evening.
> Right now I don't see how it can help me. As far as I know, the stuff I'm
> trying to use is pretty old and stable.
> It looks like I'm applying it in the wrong way.
> 
> There is an option for recommenditembased named "--threshold". I
> provide data for recommenditembased with preference values in the range
> [1.1..2.0].
> I set --threshold to 1.2.
> Is --threshold absolute, so it can be in [1.1..2+], or is it relative, so
> it can be in [0.0..0.99999]?
> 
> 
> 2014-07-22 3:54 GMT+04:00 Ted Dunning <te...@gmail.com>:
> 
> That version is no longer supported.  You should upgrade to 0.9
>> 
>> 
>> 
>> 
>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <
>> serega.sheypak@gmail.com>
>> wrote:
>> 
>>> 0.7-cdh4.7.0
>>> Anyway, recommenditembased does produce these files and directories:
>>> 
>>> /recommenditembased/temp/maxValues.bin
>>> /recommenditembased/temp/norms.bin
>>> /recommenditembased/temp/numNonZeroEntries.bin
>>> /recommenditembased/temp/pairwiseSimilarity
>>> /recommenditembased/temp/partialMultiply
>>> /recommenditembased/temp/prePartialMultiply1
>>> /recommenditembased/temp/prePartialMultiply2
>>> /recommenditembased/temp/preparePreferenceMatrix
>>> /recommenditembased/temp/similarityMatrix
>>> /recommenditembased/temp/weights
>>> 
>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
>>> need. Right now I am trying to read it using
>>> 
>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>    '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>    '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>> )  as (intId: int, vector:tuple(cardinality:int,
>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>> 
>>> 
>>> Looks like the vector is empty... Or i do something wrong.
>>> 
>>> 
>>> 
>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>> 
>>>> Which version of Mahout?
>>>> 
>>>> 
>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>>> serega.sheypak@gmail.com
>>>>> 
>>>> wrote:
>>>> 
>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>>>> processing
>>>>> Job-Specific
>>>>> 
>>>>> sudo -u hdfs hadoop fs -rm -r
>>>> hdfs://nameservice1/recommenditembased/output
>>>>> sudo -u hdfs hadoop fs -rm -r
>>> hdfs://nameservice1/recommenditembased/temp
>>>>> sudo -u oozie mahout recommenditembased \
>>>>>                    --input \
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks
>>>>> \
>>>>>                    --output \
>>>>>                    hdfs://nameservice1/recommenditembased/output \
>>>>>                    --similarityClassname \
>>>>>                    SIMILARITY_LOGLIKELIHOOD \
>>>>>                   --numRecommendations \
>>>>>                    500 \
>>>>>                    --booleanData \
>>>>>                    false \
>>>>>                    --maxPrefsPerUser \
>>>>>                    1000 \
>>>>>                    --maxSimilaritiesPerItem \
>>>>>                    1000 \
>>>>>                    --minPrefsPerUser \
>>>>>                    5 \
>>>>>                    --maxPrefsPerUserInItemSimilarity \
>>>>>                    30 \
>>>>>                    --threshold \
>>>>>                   1.1 \
>>>>>                    --tempDir \
>>>>>                    hdfs://nameservice1/recommenditembased/temp \
>>>>>                    --outputPathForSimilarityMatrix \
>>>>> 
>> hdfs://nameservice1/recommenditembased/sim_matrix
>>>>> 
>>>>> 
>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>>>> 
>>>>> 
>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>> 
>>>>>> Serega,
>>>>>> 
>>>>>> See the last line on how to pass outputPathForSimilarityMatrix
>>> options
>>>> to
>>>>>> the recommenditembased command:
>>>>>> 
>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>                   --input visited_items_with_inverted_items \
>>>>>> 
>>>>>>                   --output result \
>>>>>>                   --similarityClassname SIMILARITY_LOGLIKELIHOOD
>> \
>>>>>>                   --usersFile inverted_items \
>>>>>>                   --numRecommendations 500 \
>>>>>>                   --booleanData false \
>>>>>>                   --maxPrefsPerUser 100 \
>>>>>>                   --maxSimilaritiesPerItem 500 \
>>>>>>                   --minPrefsPerUser 0\
>>>>>>                   --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>                   --threshold 0.91 \
>>>>>>                   --tempDir  temp \
>>>>>>                   --outputPathForSimilarityMatrix
>> similarityMatri \
>>>>>> 
>>>>>> 
>>>>>> Peng Zhang
>>>>>> pzhang.xjtu@gmail.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>>> serega.sheypak@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> I've inspected the code, our approach wouldn't work with
>>>>>> booleanData=false.
>>>>>>> We do calcualte imte similarity in the wrong way...(((
>>>>>>> Thank you
>>>>>>> 1. We provide "fake" user_id and provide --usersFile in order to
>>> get
>>>>>>> recommendations for "fake user_id, where user_id is a negative
>>>> item_id.
>>>>>> It
>>>>>>> worked when we did provide user_id->item_id pairs without
>>> preference.
>>>>>>> 2. Our target is to get item similarities. We tried
>>>>>>> 
>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>> but
>>>>>> it
>>>>>>> returns bad result comparing to RecommenderJob with our "fake"
>>>> user_id
>>>>>>> (inverted item_id)
>>>>>>> 
>>>>>>> 1. I'll try the option you provided.
>>>>>>> 2. I will remove input with fake user_id and usersFile with
>> these
>>>> fake
>>>>>> ids
>>>>>>> 
>>>>>>> 3.
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>>>> I don't understand how to pass ---outputPathForSimilarityMatrix
>>>> option
>>>>> to
>>>>>>> RecommenderJob
>>>>>>> 
>>>>>>> 
>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>>>> 
>>>>>>>> Seraga,
>>>>>>>> 
>>>>>>>> I have two comments:
>>>>>>>> 1. Don’t use negative user ids. Since Mahout uses user id as
>> well
>>> as
>>>>>> item
>>>>>>>> id as the row/column index, you’d better use 0, 1, 2, etc as
>> ids
>>>>>>>> 2. If you want to get the item similarity information, you can
>> use
>>>>>>>> --outputPathForSimilarityMatrix in the command
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Peng Zhang
>>>>>>>> M: +86 186-1658-7856
>>>>>>>> pzhang.xjtu@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> All bad things happen here:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Name
>>>>>>>>> 
>>>>>>>>> RecommenderJob-PartialMultiplyMapper-Reducer
>>>>>>>>> 
>>>>>>>>> User
>>>>>>>>> 
>>>>>>>>> oozie
>>>>>>>>> 
>>>>>>>>> Process User
>>>>>>>>> 
>>>>>>>>> oozie
>>>>>>>>> 
>>>>>>>>> Group
>>>>>>>>> 
>>>>>>>>> oozie
>>>>>>>>> 
>>>>>>>>> Mapper Class
>>>>>>>>> 
>>>>>>>>> PartialMultiplyMapper
>>>>>>>>> 
>>>>>>>>> Reducer Class
>>>>>>>>> 
>>>>>>>>> AggregateAndRecommendReducer
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Job Input Directory
>>>>>>>>> 
>>>>>>>>> hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>>>>>> 
>>>>>>>>> Job Output Directory
>>>>>>>>> 
>>>>>>>>> hdfs://nameservice1/itemrec/output/
>>>>>>>>> 
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input
>>>>> records=3312879
>>>>>>>>> 
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output
>>>>> records=3313251
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input
>>>>>> records=3313251
>>>>>>>>> 
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output
>>>> records=0
>>>>>>>>> 
>>>>>>>>> Why does mahout returns 0 rows? it works when booleanData=true
>>>>>>>> (preferences
>>>>>>>>> are ignored...?)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>>>> serega.sheypak@gmail.com
>>>>>> :
>>>>>>>>> 
>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>>>>>>> users_file:
>>>>>>>>>> --inverted_item_id
>>>>>>>>>> -1
>>>>>>>>>> -2
>>>>>>>>>> -3
>>>>>>>>>> -4
>>>>>>>>>> 
>>>>>>>>>> users_items_prefs
>>>>>>>>>> --inverted item_id
>>>>>>>>>> -1 1 1.0
>>>>>>>>>> -2 2 1.0
>>>>>>>>>> -3 3 1.0
>>>>>>>>>> -4 4 1.0
>>>>>>>>>> --user_id item_id pref_value
>>>>>>>>>> 11   1 1.6
>>>>>>>>>> 11   2 1.6
>>>>>>>>>> 123 3 2.0
>>>>>>>>>> 123 4 2.0
>>>>>>>>>> 333 1 2.0
>>>>>>>>>> 333 2 1.6
>>>>>>>>>> --e.t.c.
>>>>>>>>>> 
>>>>>>>>>> if I set --booleanData true
>>>>>>>>>> then mahout returns the result.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>>>>>> andrew.musselman@gmail.com
>>>>>>>>> :
>>>>>>>>>> 
>>>>>>>>>> I'm confused about how you're constructing the user file, and
>>> why
>>>>>> there
>>>>>>>>>>> are negated item ids here.
>>>>>>>>>>> 
>>>>>>>>>>> Can you post some more details please, including Mahout
>> version
>>>> and
>>>>>>>> some
>>>>>>>>>>> sample data sets?
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
>>>>>>>> serega.sheypak@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi, I'm trying to create item similarity.
>>>>>>>>>>>> I gather items which users visit during shopping and then
>>>> create a
>>>>>>>> file:
>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6,
>>> 1.9],
>>>>>>>> depends
>>>>>>>>>>> on
>>>>>>>>>>>> user action type and data source)
>>>>>>>>>>>> UNION
>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>>>>>>>>> 
>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
>>>>>>>>>>>> 
>>>>>>>>>>>> The idea is to get item similary. If any user visits item
>>> named
>>>>>> "A", i
>>>>>>>>>>> want
>>>>>>>>>>>> to show him items "B", "c", "xxx" using preferences of
>> other
>>>>> users.
>>>>>>>>>>>> 
>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0
>>> rows:
>>>>>>>>>>>> 
>>>>>>>>>>>> Here are my settings:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>>>>>                 --input visited_items_with_inverted_items
>> \
>>>>>>>>>>>> 
>>>>>>>>>>>>                 --output result \
>>>>>>>>>>>>                 --similarityClassname
>>> SIMILARITY_LOGLIKELIHOOD
>>>> \
>>>>>>>>>>>>                 --usersFile inverted_items \
>>>>>>>>>>>>                 --numRecommendations 500 \
>>>>>>>>>>>>                 --booleanData false \
>>>>>>>>>>>>                 --maxPrefsPerUser 100 \
>>>>>>>>>>>>                 --maxSimilaritiesPerItem 500 \
>>>>>>>>>>>>                 --minPrefsPerUser 0\
>>>>>>>>>>>>                 --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>>>>>                 --threshold 0.91 \
>>>>>>>>>>>>                 --tempDir  temp \
>>>>>>>>>>>> 
>>>>>>>>>>>> Some counters... I don't get what do they mean....
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
>>>>>>>>>>>> 
>>>>> org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>>> USER_RATINGS_NEGLECTED=1,798,738
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>>>>>>>> USER_RATINGS_USED=12,429,693
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>> COOCCURRENCES=35882374
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>> PRUNED_COOCCURRENCES=0
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input
>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output
>>>>>>>> records=17570268
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>> records=5221907
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input
>>>>>> records=7528530
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output
>>>>>>>> records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>> records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>> records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input
>>>>>> records=6626130
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output
>>>>>>>> records=6626130
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>> records=6626130
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output
>>>>>>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input
>>>>>> records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output
>>>>>>>> records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input
>>>>>>>>>>> records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> --------
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output
>>>>> records=0
>>>>>>>>>>>> --------
>>>>>>>>>>>> 
>>>>>>>>>>>> why 0???
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
>

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Hi, nothing helps...
I use Mahout 0.9 compiled for CDH 4.7.
I provide only positive values.
I use ItemSimilarityJob and get 2000 similarities for 1400 unique
items.
Input data is:
16*10^6 preferences
4*10^6 users
0.6*10^ items
I use Pearson correlation, and the preference values are 1.0 and 2.0.
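
A possible reason for getting so few similarities, offered as a guess rather than a diagnosis: 16*10^6 preferences over 4*10^6 users is about 4 preferences per user, and with only two distinct values (1.0 and 2.0) many item pairs will have co-rated values with no variance, where Pearson correlation is undefined. The sketch below is textbook Pearson in plain Python, not Mahout's implementation; the function name and the sample vectors are made up for illustration.

```python
import math

# Average number of preferences per user, from the figures in the message.
avg_prefs_per_user = 16e6 / 4e6


def pearson(xs, ys):
    """Textbook Pearson correlation over co-rated values.

    Returns None when the correlation is undefined (fewer than two
    co-rated values, or zero variance on either side).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # all values equal on one side: no variance
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical co-rated preference values for two item pairs.
constant_pair = pearson([1.0, 1.0, 1.0], [1.0, 2.0, 2.0])
varying_pair = pearson([1.0, 2.0, 1.0], [1.0, 2.0, 2.0])
print(avg_prefs_per_user, constant_pair, varying_pair)
```

If this is what happens, pairs like the first one would produce no similarity at all, regardless of how many users co-rated them.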


2014-07-22 9:32 GMT+04:00 Serega Sheypak <se...@gmail.com>:

> Ok, I have recompiled mahout 0.9 for CDH 4.7. I'll try this evening.
> Right now I don't see how can it help me. As far as I know the stuff I try
> to use is pretty old and stable.
> looks like I do apply it in a wrong way.
>
> There is an option for recommenditembased named "--threshold". I do
> provide data for recommenditembased with preference values in range
> [1.1..2.0].
> I set --threshold to 1.2
> --threshold is absolute and can be from [1.1 . .2+] or it's relative and
> can be [0.0 .. 0.99999]?
>
>
> 2014-07-22 3:54 GMT+04:00 Ted Dunning <te...@gmail.com>:
>
> That version is no longer supported.  You should upgrade to 0.9
>>
>>
>>
>>
>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <
>> serega.sheypak@gmail.com>
>> wrote:
>>
>> > 0.7-cdh4.7.0
>> > Anyway, recommenditembased does produce these catalogs:
>> >
>> > /recommenditembased/temp/maxValues.bin
>> > /recommenditembased/temp/norms.bin
>> > /recommenditembased/temp/numNonZeroEntries.bin
>> > /recommenditembased/temp/pairwiseSimilarity
>> > /recommenditembased/temp/partialMultiply
>> > /recommenditembased/temp/prePartialMultiply1
>> > /recommenditembased/temp/prePartialMultiply2
>> > /recommenditembased/temp/preparePreferenceMatrix
>> > /recommenditembased/temp/similarityMatrix
>> > /recommenditembased/temp/weights
>> >
>> > I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
>> > need. Right now I try to read it using
>> >
>> > matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>> >  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>> >     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>> >     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> > )  as (intId: int, vector:tuple(cardinality:int,
>> > entries:bag{t:tuple(some_id:long, some_value:double)}));
>> >
>> >
>> > Looks like the vector is empty... Or i do something wrong.
>> >
>> >
>> >
>> > 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
>> >
>> > > Which version of Mahout?
>> > >
>> > >
>> > > On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>> > serega.sheypak@gmail.com
>> > > >
>> > > wrote:
>> > >
>> > > > Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>> > > > processing Job-Specific
>> > > >
>> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>> > > > sudo -u oozie mahout recommenditembased \
>> > > >                     --input \
>> > > >                     hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>> > > >                     --output \
>> > > >                     hdfs://nameservice1/recommenditembased/output \
>> > > >                     --similarityClassname \
>> > > >                     SIMILARITY_LOGLIKELIHOOD \
>> > > >                    --numRecommendations \
>> > > >                     500 \
>> > > >                     --booleanData \
>> > > >                     false \
>> > > >                     --maxPrefsPerUser \
>> > > >                     1000 \
>> > > >                     --maxSimilaritiesPerItem \
>> > > >                     1000 \
>> > > >                     --minPrefsPerUser \
>> > > >                     5 \
>> > > >                     --maxPrefsPerUserInItemSimilarity \
>> > > >                     30 \
>> > > >                     --threshold \
>> > > >                    1.1 \
>> > > >                     --tempDir \
>> > > >                     hdfs://nameservice1/recommenditembased/temp \
>> > > >                     --outputPathForSimilarityMatrix \
>> > > >                     hdfs://nameservice1/recommenditembased/sim_matrix
>> > > >
>> > > >
>> > > > I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>> > > >
>> > > >
>> > > > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>> > > >
>> > > > > Serega,
>> > > > >
>> > > > > See the last line on how to pass outputPathForSimilarityMatrix
>> > options
>> > > to
>> > > > > the recommenditembased command:
>> > > > >
>> > > > > sudo -u oozie mahout recommenditembased \
>> > > > >                    --input visited_items_with_inverted_items \
>> > > > >
>> > > > >                    --output result \
>> > > > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>> > > > >                    --usersFile inverted_items \
>> > > > >                    --numRecommendations 500 \
>> > > > >                    --booleanData false \
>> > > > >                    --maxPrefsPerUser 100 \
>> > > > >                    --maxSimilaritiesPerItem 500 \
>> > > > >                    --minPrefsPerUser 0\
>> > > > >                    --maxPrefsPerUserInItemSimilarity 30 \
>> > > > >                    --threshold 0.91 \
>> > > > >                    --tempDir  temp \
>> > > > >                    --outputPathForSimilarityMatrix similarityMatri \
>> > > > >
>> > > > >
>> > > > > Peng Zhang
>> > > > > pzhang.xjtu@gmail.com
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>> > serega.sheypak@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > I've inspected the code; our approach wouldn't work with
>> > > > > > booleanData=false.
>> > > > > > We calculate item similarity in the wrong way...(((
>> > > > > > Thank you
>> > > > > > 1. We provide "fake" user_id and provide --usersFile in order to
>> > get
>> > > > > > recommendations for "fake user_id, where user_id is a negative
>> > > item_id.
>> > > > > It
>> > > > > > worked when we did provide user_id->item_id pairs without
>> > preference.
>> > > > > > 2. Our target is to get item similarities. We tried
>> > > > > >
>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>> > > but
>> > > > > it
>> > > > > > returns bad result comparing to RecommenderJob with our "fake"
>> > > user_id
>> > > > > > (inverted item_id)
>> > > > > >
>> > > > > > 1. I'll try the option you provided.
>> > > > > > 2. I will remove input with fake user_id and usersFile with
>> these
>> > > fake
>> > > > > ids
>> > > > > >
>> > > > > > 3.
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>> > > > > > I don't understand how to pass ---outputPathForSimilarityMatrix
>> > > option
>> > > > to
>> > > > > > RecommenderJob
>> > > > > >
>> > > > > >
>> > > > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>> > > > > >
>> > > > > >> Seraga,
>> > > > > >>
>> > > > > >> I have two comments:
>> > > > > >> 1. Don’t use negative user ids. Since Mahout uses user id as
>> well
>> > as
>> > > > > item
>> > > > > >> id as the row/column index, you’d better use 0, 1, 2, etc as
>> ids
>> > > > > >> 2. If you want to get the item similarity information, you can
>> use
>> > > > > >> --outputPathForSimilarityMatrix in the command
>> > > > > >>
>> > > > > >> Regards,
>> > > > > >> Peng Zhang
>> > > > > >> M: +86 186-1658-7856
>> > > > > >> pzhang.xjtu@gmail.com
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>> > > serega.sheypak@gmail.com
>> > > > >
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >>> All bad things happen here:
>> > > > > >>>
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> Name
>> > > > > >>>
>> > > > > >>> RecommenderJob-PartialMultiplyMapper-Reducer
>> > > > > >>>
>> > > > > >>> User
>> > > > > >>>
>> > > > > >>> oozie
>> > > > > >>>
>> > > > > >>> Process User
>> > > > > >>>
>> > > > > >>> oozie
>> > > > > >>>
>> > > > > >>> Group
>> > > > > >>>
>> > > > > >>> oozie
>> > > > > >>>
>> > > > > >>> Mapper Class
>> > > > > >>>
>> > > > > >>> PartialMultiplyMapper
>> > > > > >>>
>> > > > > >>> Reducer Class
>> > > > > >>>
>> > > > > >>> AggregateAndRecommendReducer
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> Job Input Directory
>> > > > > >>>
>> > > > > >>> hdfs://nameservice1/itemrec/temp/partialMultiply
>> > > > > >>>
>> > > > > >>> Job Output Directory
>> > > > > >>>
>> > > > > >>> hdfs://nameservice1/itemrec/output/
>> > > > > >>>
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input
>> > > > records=3312879
>> > > > > >>>
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output
>> > > > records=3313251
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input
>> > > > > records=3313251
>> > > > > >>>
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output
>> > > records=0
>> > > > > >>>
>> > > > > >>> Why does mahout returns 0 rows? it works when booleanData=true
>> > > > > >> (preferences
>> > > > > >>> are ignored...?)
>> > > > > >>>
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>> > > serega.sheypak@gmail.com
>> > > > >:
>> > > > > >>>
>> > > > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>> > > > > >>>> users_file:
>> > > > > >>>> --inverted_item_id
>> > > > > >>>> -1
>> > > > > >>>> -2
>> > > > > >>>> -3
>> > > > > >>>> -4
>> > > > > >>>>
>> > > > > >>>> users_items_prefs
>> > > > > >>>> --inverted item_id
>> > > > > >>>> -1 1 1.0
>> > > > > >>>> -2 2 1.0
>> > > > > >>>> -3 3 1.0
>> > > > > >>>> -4 4 1.0
>> > > > > >>>> --user_id item_id pref_value
>> > > > > >>>> 11   1 1.6
>> > > > > >>>> 11   2 1.6
>> > > > > >>>> 123 3 2.0
>> > > > > >>>> 123 4 2.0
>> > > > > >>>> 333 1 2.0
>> > > > > >>>> 333 2 1.6
>> > > > > >>>> --e.t.c.
>> > > > > >>>>
>> > > > > >>>> if I set --booleanData true
>> > > > > >>>> then mahout returns the result.
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>> > > > > andrew.musselman@gmail.com
>> > > > > >>> :
>> > > > > >>>>
>> > > > > >>>> I'm confused about how you're constructing the user file, and
>> > why
>> > > > > there
>> > > > > >>>>> are negated item ids here.
>> > > > > >>>>>
>> > > > > >>>>> Can you post some more details please, including Mahout
>> version
>> > > and
>> > > > > >> some
>> > > > > >>>>> sample data sets?
>> > > > > >>>>>
>> > > > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
>> > > > > >> serega.sheypak@gmail.com>
>> > > > > >>>>> wrote:
>> > > > > >>>>>>
>> > > > > >>>>>> Hi, I'm trying to create item similarity.
>> > > > > >>>>>> I gather items which users visit during shopping and then
>> > > create a
>> > > > > >> file:
>> > > > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6,
>> > 1.9],
>> > > > > >> depends
>> > > > > >>>>> on
>> > > > > >>>>>> user action type and data source)
>> > > > > >>>>>> UNION
>> > > > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
>> > > > > >>>>>>
>> > > > > >>>>>> and I do provide a userFile, where user_id = -item_id
>> > > > > >>>>>>
>> > > > > >>>>>> The idea is to get item similary. If any user visits item
>> > named
>> > > > > "A", i
>> > > > > >>>>> want
>> > > > > >>>>>> to show him items "B", "c", "xxx" using preferences of
>> other
>> > > > users.
>> > > > > >>>>>>
>> > > > > >>>>>> The problem is that the last (???) mapreduce job returns 0
>> > rows:
>> > > > > >>>>>>
>> > > > > >>>>>> Here are my settings:
>> > > > > >>>>>>
>> > > > > >>>>>>
>> > > > > >>>>>> sudo -u oozie mahout recommenditembased \
>> > > > > >>>>>>                  --input visited_items_with_inverted_items
>> \
>> > > > > >>>>>>
>> > > > > >>>>>>                  --output result \
>> > > > > >>>>>>                  --similarityClassname
>> > SIMILARITY_LOGLIKELIHOOD
>> > > \
>> > > > > >>>>>>                  --usersFile inverted_items \
>> > > > > >>>>>>                  --numRecommendations 500 \
>> > > > > >>>>>>                  --booleanData false \
>> > > > > >>>>>>                  --maxPrefsPerUser 100 \
>> > > > > >>>>>>                  --maxSimilaritiesPerItem 500 \
>> > > > > >>>>>>                  --minPrefsPerUser 0\
>> > > > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
>> > > > > >>>>>>                  --threshold 0.91 \
>> > > > > >>>>>>                  --tempDir  temp \
>> > > > > >>>>>>
>> > > > > >>>>>> Some counters... I don't get what do they mean....
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
>> > > > > >>>>>>
>> > > > org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>> > > > > >>>>>>
>> > > > > >>>>>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>> > > > > >>>>>>  USER_RATINGS_NEGLECTED=1,798,738
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>> > > > > >>>>> USER_RATINGS_USED=12,429,693
>> > > > > >>>>>>
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
>> > > > > >>>>>>
>> > > > > >>>>>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>> > > > > >>>>>>
>> > > > > >>>>>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>> > > > COOCCURRENCES=35882374
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>> > > > PRUNED_COOCCURRENCES=0
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input
>> > > > > records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output
>> > > > > >> records=17570268
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input
>> > > > > >>>>> records=5221907
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output
>> > > > > >>>>> records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>> > > > > >>>>> records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>> > > > > >>>>> records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>> > > > > >>>>> records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>> > > > > >>>>> records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input
>> > > > > records=7528530
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output
>> > > > > >> records=3313251
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input
>> > > > > >>>>> records=3313251
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output
>> > > > > >>>>> records=3313251
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input
>> > > > > records=6626130
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output
>> > > > > >> records=6626130
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input
>> > > > > >>>>> records=6626130
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output
>> > > > > >>>>> records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input
>> > > > > records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output
>> > > > > >> records=3313251
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input
>> > > > > >>>>> records=3313251
>> > > > > >>>>>>
>> > > > > >>>>>> --------
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output
>> > > > records=0
>> > > > > >>>>>> --------
>> > > > > >>>>>>
>> > > > > >>>>>> why 0???
>> > > > > >>>>>
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>
>> > > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

OK, I have recompiled Mahout 0.9 for CDH 4.7. I'll try it this evening.
Right now I don't see how it can help me. As far as I know, the stuff I'm
trying to use is pretty old and stable.
It looks like I'm applying it in the wrong way.

There is an option for recommenditembased named "--threshold". I provide
data for recommenditembased with preference values in the range [1.1..2.0].
I set --threshold to 1.2.
Is --threshold absolute, taking values from [1.1..2+], or is it relative,
taking values in [0.0..0.99999]?
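
On the --threshold question: as far as I can tell from the option's help text, --threshold is compared against item-item similarity values, not against the preference values in the input, so it is a cutoff on the similarity scale. The sketch below is an illustration only. It assumes the log-likelihood similarity is mapped into [0, 1) as 1 - 1/(1 + LLR); the function names and the co-occurrence counts are hypothetical, not Mahout's API. Under that assumption a threshold of 1.1 or 1.2 can never be reached, so every item pair would be discarded.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy term used in the log-likelihood ratio."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 co-occurrence table.

    k11: users with both items, k12: only item A,
    k21: only item B,           k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by rounding.
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(k11, k12, k21, k22):
    """Assumed mapping of LLR into [0, 1)."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


def keep_pair(similarity, threshold):
    """A pair survives only if its similarity reaches the threshold."""
    return similarity >= threshold


# Hypothetical counts for a strongly associated item pair.
sim = llr_similarity(100, 10, 10, 1000)
for threshold in (0.91, 1.1, 1.2):
    print(threshold, keep_pair(sim, threshold))
```

With a mapping like this, only thresholds below 1.0 make sense for SIMILARITY_LOGLIKELIHOOD; other similarity measures have their own ranges.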


2014-07-22 3:54 GMT+04:00 Ted Dunning <te...@gmail.com>:

> That version is no longer supported.  You should upgrade to 0.9
>
>
>
>
> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <serega.sheypak@gmail.com>
> wrote:
>
> > 0.7-cdh4.7.0
> > Anyway, recommenditembased does produce these catalogs:
> >
> > /recommenditembased/temp/maxValues.bin
> > /recommenditembased/temp/norms.bin
> > /recommenditembased/temp/numNonZeroEntries.bin
> > /recommenditembased/temp/pairwiseSimilarity
> > /recommenditembased/temp/partialMultiply
> > /recommenditembased/temp/prePartialMultiply1
> > /recommenditembased/temp/prePartialMultiply2
> > /recommenditembased/temp/preparePreferenceMatrix
> > /recommenditembased/temp/similarityMatrix
> > /recommenditembased/temp/weights
> >
> > I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
> > need. Right now I try to read it using
> >
> > matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
> >  com.twitter.elephantbird.pig.load.SequenceFileLoader(
> >     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
> >     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> > )  as (intId: int, vector:tuple(cardinality:int,
> > entries:bag{t:tuple(some_id:long, some_value:double)}));
> >
> >
> > Looks like the vector is empty... Or i do something wrong.
> >
> >
> >
> > 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
> >
> > > Which version of Mahout?
> > >
> > >
> > > On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.sheypak@gmail.com>
> > > wrote:
> > >
> > > > Hi, I've tried it and got: Unexpected --outputPathForSimilarityMatrix
> > > > while processing Job-Specific
> > > >
> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> > > > sudo -u oozie mahout recommenditembased \
> > > >                     --input \
> > > >                     hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
> > > >                     --output \
> > > >                     hdfs://nameservice1/recommenditembased/output \
> > > >                     --similarityClassname \
> > > >                     SIMILARITY_LOGLIKELIHOOD \
> > > >                    --numRecommendations \
> > > >                     500 \
> > > >                     --booleanData \
> > > >                     false \
> > > >                     --maxPrefsPerUser \
> > > >                     1000 \
> > > >                     --maxSimilaritiesPerItem \
> > > >                     1000 \
> > > >                     --minPrefsPerUser \
> > > >                     5 \
> > > >                     --maxPrefsPerUserInItemSimilarity \
> > > >                     30 \
> > > >                     --threshold \
> > > >                    1.1 \
> > > >                     --tempDir \
> > > >                     hdfs://nameservice1/recommenditembased/temp \
> > > >                     --outputPathForSimilarityMatrix \
> > > >                     hdfs://nameservice1/recommenditembased/sim_matrix
> > > >
> > > >
> > > > I'm on Cloudera cdh 4.7, looks like this feature is not supported.
> > > >
> > > >
> > > > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> > > >
> > > > > Serega,
> > > > >
> > > > > See the last line on how to pass the outputPathForSimilarityMatrix
> > > > > option to the recommenditembased command:
> > > > >
> > > > > sudo -u oozie mahout recommenditembased \
> > > > >                    --input visited_items_with_inverted_items \
> > > > >
> > > > >                    --output result \
> > > > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> > > > >                    --usersFile inverted_items \
> > > > >                    --numRecommendations 500 \
> > > > >                    --booleanData false \
> > > > >                    --maxPrefsPerUser 100 \
> > > > >                    --maxSimilaritiesPerItem 500 \
> > > > >                    --minPrefsPerUser 0\
> > > > >                    --maxPrefsPerUserInItemSimilarity 30 \
> > > > >                    --threshold 0.91 \
> > > > >                    --tempDir  temp \
> > > > >                    --outputPathForSimilarityMatrix similarityMatri \
> > > > >
> > > > >
> > > > > Peng Zhang
> > > > > pzhang.xjtu@gmail.com
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.sheypak@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I've inspected the code; our approach won't work with
> > > > > > booleanData=false.
> > > > > > We calculate item similarity in the wrong way...(((
> > > > > > Thank you
> > > > > > 1. We provide "fake" user_ids and pass --usersFile in order to get
> > > > > > recommendations for each "fake" user_id, where the user_id is a
> > > > > > negative item_id. It worked when we provided user_id->item_id pairs
> > > > > > without preferences.
> > > > > > 2. Our target is to get item similarities. We tried
> > > > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob,
> > > > > > but it returns bad results compared to RecommenderJob with our
> > > > > > "fake" user_id (inverted item_id).
> > > > > >
> > > > > > 1. I'll try the option you provided.
> > > > > > 2. I will remove the input with fake user_ids and the usersFile
> > > > > > with these fake ids.
> > > > > >
> > > > > > 3.
> > > > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> > > > > > I don't understand how to pass the --outputPathForSimilarityMatrix
> > > > > > option to RecommenderJob
> > > > > >
> > > > > >
> > > > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> > > > > >
> > > > > >> Serega,
> > > > > >>
> > > > > >> I have two comments:
> > > > > >> 1. Don’t use negative user ids. Since Mahout uses the user id as well
> > > > > >> as the item id as the row/column index, you’d better use 0, 1, 2,
> > > > > >> etc. as ids.
> > > > > >> 2. If you want to get the item similarity information, you can use
> > > > > >> --outputPathForSimilarityMatrix in the command.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Peng Zhang
> > > > > >> M: +86 186-1658-7856
> > > > > >> pzhang.xjtu@gmail.com
> > > > > >>
> > > > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.sheypak@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> All bad things happen here:
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> Name
> > > > > >>>
> > > > > >>> RecommenderJob-PartialMultiplyMapper-Reducer
> > > > > >>>
> > > > > >>> User
> > > > > >>>
> > > > > >>> oozie
> > > > > >>>
> > > > > >>> Process User
> > > > > >>>
> > > > > >>> oozie
> > > > > >>>
> > > > > >>> Group
> > > > > >>>
> > > > > >>> oozie
> > > > > >>>
> > > > > >>> Mapper Class
> > > > > >>>
> > > > > >>> PartialMultiplyMapper
> > > > > >>>
> > > > > >>> Reducer Class
> > > > > >>>
> > > > > >>> AggregateAndRecommendReducer
> > > > > >>>
> > > > > >>>
> > > > > >>> Job Input Directory
> > > > > >>>
> > > > > >>> hdfs://nameservice1/itemrec/temp/partialMultiply
> > > > > >>>
> > > > > >>> Job Output Directory
> > > > > >>>
> > > > > >>> hdfs://nameservice1/itemrec/output/
> > > > > >>>
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> > > > > >>>
> > > > > >>> Why does Mahout return 0 rows? It works when booleanData=true
> > > > > >>> (preferences are ignored...?)
> > > > > >>>
> > > > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
> > > > > >>>
> > > > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> > > > > >>>> users_file:
> > > > > >>>> --inverted_item_id
> > > > > >>>> -1
> > > > > >>>> -2
> > > > > >>>> -3
> > > > > >>>> -4
> > > > > >>>>
> > > > > >>>> users_items_prefs
> > > > > >>>> --inverted item_id
> > > > > >>>> -1 1 1.0
> > > > > >>>> -2 2 1.0
> > > > > >>>> -3 3 1.0
> > > > > >>>> -4 4 1.0
> > > > > >>>> --user_id item_id pref_value
> > > > > >>>> 11   1 1.6
> > > > > >>>> 11   2 1.6
> > > > > >>>> 123 3 2.0
> > > > > >>>> 123 4 2.0
> > > > > >>>> 333 1 2.0
> > > > > >>>> 333 2 1.6
> > > > > >>>> --e.t.c.
> > > > > >>>>
> > > > > >>>> if I set --booleanData true
> > > > > >>>> then mahout returns the result.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
> > > > > >>>>
> > > > > >>>>> I'm confused about how you're constructing the user file, and why
> > > > > >>>>> there are negated item ids here.
> > > > > >>>>>
> > > > > >>>>> Can you post some more details please, including Mahout version and
> > > > > >>>>> some sample data sets?
> > > > > >>>>>
> > > > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>> Hi, I'm trying to create item similarity.
> > > > > >>>>>> I gather items which users visit during shopping and then create a file:
> > > > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
> > > > > >>>>>> user action type and data source)
> > > > > >>>>>> UNION
> > > > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
> > > > > >>>>>>
> > > > > >>>>>> and I do provide a userFile, where user_id = -item_id
> > > > > >>>>>>
> > > > > >>>>>> The idea is to get item similarity. If any user visits an item named "A",
> > > > > >>>>>> I want to show him items "B", "c", "xxx" using preferences of other users.
> > > > > >>>>>>
> > > > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> > > > > >>>>>>
> > > > > >>>>>> Here are my settings:
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> sudo -u oozie mahout recommenditembased \
> > > > > >>>>>>                  --input visited_items_with_inverted_items \
> > > > > >>>>>>
> > > > > >>>>>>                  --output result \
> > > > > >>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> > > > > >>>>>>                  --usersFile inverted_items \
> > > > > >>>>>>                  --numRecommendations 500 \
> > > > > >>>>>>                  --booleanData false \
> > > > > >>>>>>                  --maxPrefsPerUser 100 \
> > > > > >>>>>>                  --maxSimilaritiesPerItem 500 \
> > > > > >>>>>>                  --minPrefsPerUser 0\
> > > > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
> > > > > >>>>>>                  --threshold 0.91 \
> > > > > >>>>>>                  --tempDir  temp \
> > > > > >>>>>>
> > > > > >>>>>> Some counters... I don't get what they mean....
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> > > > > >>>>>>
> > > > > >>>>>> --------
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> > > > > >>>>>> --------
> > > > > >>>>>>
> > > > > >>>>>> why 0???

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Ted Dunning <te...@gmail.com>.

That version is no longer supported.  You should upgrade to 0.9




On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <se...@gmail.com>
wrote:

> 0.7-cdh4.7.0
> Anyway, recommenditembased does produce these catalogs:
>
> /recommenditembased/temp/maxValues.bin
> /recommenditembased/temp/norms.bin
> /recommenditembased/temp/numNonZeroEntries.bin
> /recommenditembased/temp/pairwiseSimilarity
> /recommenditembased/temp/partialMultiply
> /recommenditembased/temp/prePartialMultiply1
> /recommenditembased/temp/prePartialMultiply2
> /recommenditembased/temp/preparePreferenceMatrix
> /recommenditembased/temp/similarityMatrix
> /recommenditembased/temp/weights
>
> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
> need. Right now I try to read it using
>
> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> )  as (intId: int, vector:tuple(cardinality:int,
> entries:bag{t:tuple(some_id:long, some_value:double)}));
>
>
> Looks like the vector is empty... Or i do something wrong.
>
>
>
> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
>
> > Which version of Mahout?
> >
> >
> > On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.sheypak@gmail.com>
> > wrote:
> >
> > > Hi, I've tried it and got: Unexpected --outputPathForSimilarityMatrix
> > > while processing Job-Specific
> > >
> > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> > > sudo -u oozie mahout recommenditembased \
> > >                     --input \
> > >                     hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
> > >                     --output \
> > >                     hdfs://nameservice1/recommenditembased/output \
> > >                     --similarityClassname \
> > >                     SIMILARITY_LOGLIKELIHOOD \
> > >                    --numRecommendations \
> > >                     500 \
> > >                     --booleanData \
> > >                     false \
> > >                     --maxPrefsPerUser \
> > >                     1000 \
> > >                     --maxSimilaritiesPerItem \
> > >                     1000 \
> > >                     --minPrefsPerUser \
> > >                     5 \
> > >                     --maxPrefsPerUserInItemSimilarity \
> > >                     30 \
> > >                     --threshold \
> > >                    1.1 \
> > >                     --tempDir \
> > >                     hdfs://nameservice1/recommenditembased/temp \
> > >                     --outputPathForSimilarityMatrix \
> > >                     hdfs://nameservice1/recommenditembased/sim_matrix
> > >
> > >
> > > I'm on Cloudera cdh 4.7, looks like this feature is not supported.
> > >
> > >
> > > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> > >
> > > > Serega,
> > > >
> > > > See the last line on how to pass the outputPathForSimilarityMatrix
> > > > option to the recommenditembased command:
> > > >
> > > > sudo -u oozie mahout recommenditembased \
> > > >                    --input visited_items_with_inverted_items \
> > > >
> > > >                    --output result \
> > > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> > > >                    --usersFile inverted_items \
> > > >                    --numRecommendations 500 \
> > > >                    --booleanData false \
> > > >                    --maxPrefsPerUser 100 \
> > > >                    --maxSimilaritiesPerItem 500 \
> > > >                    --minPrefsPerUser 0\
> > > >                    --maxPrefsPerUserInItemSimilarity 30 \
> > > >                    --threshold 0.91 \
> > > >                    --tempDir  temp \
> > > >                    --outputPathForSimilarityMatrix similarityMatri \
> > > >
> > > >
> > > > Peng Zhang
> > > > pzhang.xjtu@gmail.com
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.sheypak@gmail.com>
> > > > wrote:
> > > >
> > > > > I've inspected the code; our approach won't work with
> > > > > booleanData=false.
> > > > > We calculate item similarity in the wrong way...(((
> > > > > Thank you
> > > > > 1. We provide "fake" user_ids and pass --usersFile in order to get
> > > > > recommendations for each "fake" user_id, where the user_id is a
> > > > > negative item_id. It worked when we provided user_id->item_id pairs
> > > > > without preferences.
> > > > > 2. Our target is to get item similarities. We tried
> > > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob,
> > > > > but it returns bad results compared to RecommenderJob with our
> > > > > "fake" user_id (inverted item_id).
> > > > >
> > > > > 1. I'll try the option you provided.
> > > > > 2. I will remove the input with fake user_ids and the usersFile
> > > > > with these fake ids.
> > > > >
> > > > > 3.
> > > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> > > > > I don't understand how to pass the --outputPathForSimilarityMatrix
> > > > > option to RecommenderJob
> > > > >
> > > > >
> > > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> > > > >
> > > > >> Serega,
> > > > >>
> > > > >> I have two comments:
> > > > >> 1. Don’t use negative user ids. Since Mahout uses the user id as well
> > > > >> as the item id as the row/column index, you’d better use 0, 1, 2,
> > > > >> etc. as ids.
> > > > >> 2. If you want to get the item similarity information, you can use
> > > > >> --outputPathForSimilarityMatrix in the command.
> > > > >>
> > > > >> Regards,
> > > > >> Peng Zhang
> > > > >> M: +86 186-1658-7856
> > > > >> pzhang.xjtu@gmail.com
> > > > >>
> > > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.sheypak@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> All bad things happen here:
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Name
> > > > >>>
> > > > >>> RecommenderJob-PartialMultiplyMapper-Reducer
> > > > >>>
> > > > >>> User
> > > > >>>
> > > > >>> oozie
> > > > >>>
> > > > >>> Process User
> > > > >>>
> > > > >>> oozie
> > > > >>>
> > > > >>> Group
> > > > >>>
> > > > >>> oozie
> > > > >>>
> > > > >>> Mapper Class
> > > > >>>
> > > > >>> PartialMultiplyMapper
> > > > >>>
> > > > >>> Reducer Class
> > > > >>>
> > > > >>> AggregateAndRecommendReducer
> > > > >>>
> > > > >>>
> > > > >>> Job Input Directory
> > > > >>>
> > > > >>> hdfs://nameservice1/itemrec/temp/partialMultiply
> > > > >>>
> > > > >>> Job Output Directory
> > > > >>>
> > > > >>> hdfs://nameservice1/itemrec/output/
> > > > >>>
> > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> > > > >>>
> > > > >>> Why does Mahout return 0 rows? It works when booleanData=true
> > > > >>> (preferences are ignored...?)
> > > > >>>
> > > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
> > > > >>>
> > > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> > > > >>>> users_file:
> > > > >>>> --inverted_item_id
> > > > >>>> -1
> > > > >>>> -2
> > > > >>>> -3
> > > > >>>> -4
> > > > >>>>
> > > > >>>> users_items_prefs
> > > > >>>> --inverted item_id
> > > > >>>> -1 1 1.0
> > > > >>>> -2 2 1.0
> > > > >>>> -3 3 1.0
> > > > >>>> -4 4 1.0
> > > > >>>> --user_id item_id pref_value
> > > > >>>> 11   1 1.6
> > > > >>>> 11   2 1.6
> > > > >>>> 123 3 2.0
> > > > >>>> 123 4 2.0
> > > > >>>> 333 1 2.0
> > > > >>>> 333 2 1.6
> > > > >>>> --e.t.c.
> > > > >>>>
> > > > >>>> if I set --booleanData true
> > > > >>>> then mahout returns the result.
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
> > > > >>>>
> > > > >>>>> I'm confused about how you're constructing the user file, and why
> > > > >>>>> there are negated item ids here.
> > > > >>>>>
> > > > >>>>> Can you post some more details please, including Mahout version and
> > > > >>>>> some sample data sets?
> > > > >>>>>
> > > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>> Hi, I'm trying to create item similarity.
> > > > >>>>>> I gather items which users visit during shopping and then create a file:
> > > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
> > > > >>>>>> user action type and data source)
> > > > >>>>>> UNION
> > > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
> > > > >>>>>>
> > > > >>>>>> and I do provide a userFile, where user_id = -item_id
> > > > >>>>>>
> > > > >>>>>> The idea is to get item similarity. If any user visits an item named "A",
> > > > >>>>>> I want to show him items "B", "c", "xxx" using preferences of other users.
> > > > >>>>>>
> > > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> > > > >>>>>>
> > > > >>>>>> Here are my settings:
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> sudo -u oozie mahout recommenditembased \
> > > > >>>>>>                  --input visited_items_with_inverted_items \
> > > > >>>>>>
> > > > >>>>>>                  --output result \
> > > > >>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> > > > >>>>>>                  --usersFile inverted_items \
> > > > >>>>>>                  --numRecommendations 500 \
> > > > >>>>>>                  --booleanData false \
> > > > >>>>>>                  --maxPrefsPerUser 100 \
> > > > >>>>>>                  --maxSimilaritiesPerItem 500 \
> > > > >>>>>>                  --minPrefsPerUser 0\
> > > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
> > > > >>>>>>                  --threshold 0.91 \
> > > > >>>>>>                  --tempDir  temp \
> > > > >>>>>>
> > > > >>>>>> Some counters... I don't get what they mean....
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> > > > >>>>>>
> > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> > > > >>>>>>
> > > > >>>>>> --------
> > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> > > > >>>>>> --------
> > > > >>>>>>
> > > > >>>>>> why 0???

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Hm.. I read a sample from ratingMatrix,
and none of these ids

key[265946039] idx 1278942761 get 1.600000023841858
key[266002133] idx 242466370 get 1.600000023841858
key[266024933] idx 335624517 get 1.600000023841858
key[266024933] idx 291527196 get 1.600000023841858
key[266024933] idx 1406499341 get 1.600000023841858
key[266024933] idx 836009310 get 1.600000023841858
key[266024933] idx 331659103 get 1.600000023841858
key[266106533] idx 689552069 get 1.600000023841858

are among my user_ids or item_ids.
1.600000023841858 looks like the preference value from a (user_id,
item_id, pref) relation.

code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.SequenceFile
import org.apache.mahout.math.VectorWritable

// pathToFile is a Hadoop Path pointing at one part file of ratingMatrix
def reader = new SequenceFile.Reader(new Configuration(),
        SequenceFile.Reader.file(pathToFile))
IntWritable key = new IntWritable()
VectorWritable value = new VectorWritable()

while (reader.next(key, value)) {
    // print every non-zero element of the row vector
    def itr = value.get().iterateNonZero()
    while (itr.hasNext()) {
        def elem = itr.next()
        println "key[$key] idx ${elem.index()} get ${elem.get()}"
    }
}
reader.close()
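
One possible explanation for keys and indexes that match neither user_id nor
item_id: the Hadoop recommender appears to store vectors under int indexes
derived from the long ids, not under the ids themselves. A minimal Java
sketch, assuming the mapping is the one in Mahout's
TasteHadoopUtils.idToIndex (believed to be 0x7FFFFFFF & hash of the long id);
the formula is re-implemented here rather than called from Mahout, and
IdIndexSketch and the sample ids are hypothetical:

```java
// Hypothetical re-implementation for illustration; not the Mahout class itself.
class IdIndexSketch {

    // Assumed formula: fold the 64-bit id into 32 bits (the same folding that
    // Long.hashCode uses) and clear the sign bit, giving a non-negative index.
    static int idToIndex(long id) {
        return 0x7FFFFFFF & (int) (id ^ (id >>> 32));
    }

    public static void main(String[] args) {
        long[] sampleIds = {42L, -42L, 5_000_000_000L}; // made-up ids
        for (long id : sampleIds) {
            System.out.println(id + " -> " + idToIndex(id));
        }
    }
}
```

Under this assumption a small positive id keeps its value, but a negative id
or an id that does not fit in 31 bits maps to a different number, and a
negative "fake" id can land on the same index as a genuine positive id.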




2014-07-21 23:57 GMT+04:00 Serega Sheypak <se...@gmail.com>:

> temp/preparePreferenceMatrix/ratingMatrix
> has data; it looks like it's the similarity between items...
> I'm confused. How can I get item similarity?
>
>
> 2014-07-21 23:48 GMT+04:00 Serega Sheypak <se...@gmail.com>:
>
> The code snippet:
>>
>>  @Test//(enabled = false)
>>     void testReadAll(){
>>         (0..5).each {
>>
>>             def pathToFile = new Path("matrixSim/part-r-0000$it")
>>             println pathToFile
>>             def reader = new SequenceFile.Reader(new Configuration(),
>> SequenceFile.Reader.file(pathToFile));
>>             IntWritable key = new IntWritable();
>>             VectorWritable value = new VectorWritable();
>>             while(reader.next(key, value)){
>>                 def itr = value.get().iterateNonZero()
>>                 while(itr.hasNext()){
>>                     println itr.next()
>>                 }
>>             }
>>             reader.close();
>>         }
>>     }
>>
>>
>>  2014-07-21 23:46 GMT+04:00 Serega Sheypak <se...@gmail.com>:
>>
>> I've parsed it via Java; the matrix is empty. Why?
>>>
>>>
>>> 2014-07-21 22:41 GMT+04:00 Serega Sheypak <se...@gmail.com>:
>>>
>>> 0.7-cdh4.7.0
>>>> Anyway, recommenditembased does produce these catalogs:
>>>>
>>>> /recommenditembased/temp/maxValues.bin
>>>> /recommenditembased/temp/norms.bin
>>>> /recommenditembased/temp/numNonZeroEntries.bin
>>>> /recommenditembased/temp/pairwiseSimilarity
>>>> /recommenditembased/temp/partialMultiply
>>>> /recommenditembased/temp/prePartialMultiply1
>>>> /recommenditembased/temp/prePartialMultiply2
>>>> /recommenditembased/temp/preparePreferenceMatrix
>>>> /recommenditembased/temp/similarityMatrix
>>>> /recommenditembased/temp/weights
>>>>
>>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the
>>>> thing I need. Right now I try to read it using
>>>>
>>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>>>  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>>     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>>     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>>> )  as (intId: int, vector:tuple(cardinality:int,
>>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>>>
>>>>
>>>> Looks like the vector is empty... Or i do something wrong.
>>>>
>>>>
>>>>
>>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>>>
>>>> Which version of Mahout?
>>>>>
>>>>>
>>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>>>>> serega.sheypak@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>>>>> processing
>>>>> > Job-Specific
>>>>> >
>>>>> > sudo -u hdfs hadoop fs -rm -r
>>>>> hdfs://nameservice1/recommenditembased/output
>>>>> > sudo -u hdfs hadoop fs -rm -r
>>>>> hdfs://nameservice1/recommenditembased/temp
>>>>> > sudo -u oozie mahout recommenditembased \
>>>>> >                     --input \
>>>>> >
>>>>> >
>>>>> >
>>>>> hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks
>>>>> > \
>>>>> >                     --output \
>>>>> >                     hdfs://nameservice1/recommenditembased/output \
>>>>> >                     --similarityClassname \
>>>>> >                     SIMILARITY_LOGLIKELIHOOD \
>>>>> >                    --numRecommendations \
>>>>> >                     500 \
>>>>> >                     --booleanData \
>>>>> >                     false \
>>>>> >                     --maxPrefsPerUser \
>>>>> >                     1000 \
>>>>> >                     --maxSimilaritiesPerItem \
>>>>> >                     1000 \
>>>>> >                     --minPrefsPerUser \
>>>>> >                     5 \
>>>>> >                     --maxPrefsPerUserInItemSimilarity \
>>>>> >                     30 \
>>>>> >                     --threshold \
>>>>> >                    1.1 \
>>>>> >                     --tempDir \
>>>>> >                     hdfs://nameservice1/recommenditembased/temp \
>>>>> >                     --outputPathForSimilarityMatrix \
>>>>> >                     hdfs://nameservice1/recommenditembased/sim_matrix
>>>>> >
>>>>> >
>>>>> > I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>>>> >
>>>>> >
>>>>> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>> >
>>>>> > > Serega,
>>>>> > >
>>>>> > > See the last line on how to pass outputPathForSimilarityMatrix
>>>>> options to
>>>>> > > the recommenditembased command:
>>>>> > >
>>>>> > > sudo -u oozie mahout recommenditembased \
>>>>> > >                    --input visited_items_with_inverted_items \
>>>>> > >
>>>>> > >                    --output result \
>>>>> > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>> > >                    --usersFile inverted_items \
>>>>> > >                    --numRecommendations 500 \
>>>>> > >                    --booleanData false \
>>>>> > >                    --maxPrefsPerUser 100 \
>>>>> > >                    --maxSimilaritiesPerItem 500 \
>>>>> > >                    --minPrefsPerUser 0\
>>>>> > >                    --maxPrefsPerUserInItemSimilarity 30 \
>>>>> > >                    --threshold 0.91 \
>>>>> > >                    --tempDir  temp \
>>>>> > >                    --outputPathForSimilarityMatrix similarityMatri
>>>>> \
>>>>> > >
>>>>> > >
>>>>> > > Peng Zhang
>>>>> > > pzhang.xjtu@gmail.com
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>>>>> serega.sheypak@gmail.com>
>>>>> > > wrote:
>>>>> > >
>>>>> > > > I've inspected the code, our approach wouldn't work with
>>>>> > > booleanData=false.
>>>>> > > > We do calcualte imte similarity in the wrong way...(((
>>>>> > > > Thank you
>>>>> > > > 1. We provide "fake" user_id and provide --usersFile in order to
>>>>> get
>>>>> > > > recommendations for "fake user_id, where user_id is a negative
>>>>> item_id.
>>>>> > > It
>>>>> > > > worked when we did provide user_id->item_id pairs without
>>>>> preference.
>>>>> > > > 2. Our target is to get item similarities. We tried
>>>>> > > >
>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
>>>>> > > it
>>>>> > > > returns bad result comparing to RecommenderJob with our "fake"
>>>>> user_id
>>>>> > > > (inverted item_id)
>>>>> > > >
>>>>> > > > 1. I'll try the option you provided.
>>>>> > > > 2. I will remove input with fake user_id and usersFile with
>>>>> these fake
>>>>> > > ids
>>>>> > > >
>>>>> > > > 3.
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>> > > > I don't understand how to pass ---outputPathForSimilarityMatrix
>>>>> option
>>>>> > to
>>>>> > > > RecommenderJob
>>>>> > > >
>>>>> > > >
>>>>> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>>>> > > >
>>>>> > > >> Seraga,
>>>>> > > >>
>>>>> > > >> I have two comments:
>>>>> > > >> 1. Don’t use negative user ids. Since Mahout uses user id as
>>>>> well as
>>>>> > > item
>>>>> > > >> id as the row/column index, you’d better use 0, 1, 2, etc as ids
>>>>> > > >> 2. If you want to get the item similarity information, you can
>>>>> use
>>>>> > > >> --outputPathForSimilarityMatrix in the command
>>>>> > > >>
>>>>> > > >> Regards,
>>>>> > > >> Peng Zhang
>>>>> > > >> M: +86 186-1658-7856
>>>>> > > >> pzhang.xjtu@gmail.com
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>>>>> serega.sheypak@gmail.com
>>>>> > >
>>>>> > > >> wrote:
>>>>> > > >>
>>>>> > > >>> All bad things happen here:
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> Name
>>>>> > > >>>
>>>>> > > >>> RecommenderJob-PartialMultiplyMapper-Reducer
>>>>> > > >>>
>>>>> > > >>> User
>>>>> > > >>>
>>>>> > > >>> oozie
>>>>> > > >>>
>>>>> > > >>> Process User
>>>>> > > >>>
>>>>> > > >>> oozie
>>>>> > > >>>
>>>>> > > >>> Group
>>>>> > > >>>
>>>>> > > >>> oozie
>>>>> > > >>>
>>>>> > > >>> Mapper Class
>>>>> > > >>>
>>>>> > > >>> PartialMultiplyMapper
>>>>> > > >>>
>>>>> > > >>> Reducer Class
>>>>> > > >>>
>>>>> > > >>> AggregateAndRecommendReducer
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> Job Input Directory
>>>>> > > >>>
>>>>> > > >>> hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>> > > >>>
>>>>> > > >>> Job Output Directory
>>>>> > > >>>
>>>>> > > >>> hdfs://nameservice1/itemrec/output/
>>>>> > > >>>
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input
>>>>> > records=3312879
>>>>> > > >>>
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output
>>>>> > records=3313251
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input
>>>>> > > records=3313251
>>>>> > > >>>
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output
>>>>> records=0
>>>>> > > >>>
>>>>> > > >>> Why does mahout returns 0 rows? it works when booleanData=true
>>>>> > > >> (preferences
>>>>> > > >>> are ignored...?)
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>>>>> serega.sheypak@gmail.com
>>>>> > >:
>>>>> > > >>>
>>>>> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>> > > >>>> users_file:
>>>>> > > >>>> --inverted_item_id
>>>>> > > >>>> -1
>>>>> > > >>>> -2
>>>>> > > >>>> -3
>>>>> > > >>>> -4
>>>>> > > >>>>
>>>>> > > >>>> users_items_prefs
>>>>> > > >>>> --inverted item_id
>>>>> > > >>>> -1 1 1.0
>>>>> > > >>>> -2 2 1.0
>>>>> > > >>>> -3 3 1.0
>>>>> > > >>>> -4 4 1.0
>>>>> > > >>>> --user_id item_id pref_value
>>>>> > > >>>> 11   1 1.6
>>>>> > > >>>> 11   2 1.6
>>>>> > > >>>> 123 3 2.0
>>>>> > > >>>> 123 4 2.0
>>>>> > > >>>> 333 1 2.0
>>>>> > > >>>> 333 2 1.6
>>>>> > > >>>> --e.t.c.
>>>>> > > >>>>
>>>>> > > >>>> if I set --booleanData true
>>>>> > > >>>> then mahout returns the result.
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>>>>> > > andrew.musselman@gmail.com
>>>>> > > >>> :
>>>>> > > >>>>
>>>>> > > >>>> I'm confused about how you're constructing the user file, and
>>>>> why
>>>>> > > there
>>>>> > > >>>>> are negated item ids here.
>>>>> > > >>>>>
>>>>> > > >>>>> Can you post some more details please, including Mahout
>>>>> version and
>>>>> > > >> some
>>>>> > > >>>>> sample data sets?
>>>>> > > >>>>>
>>>>> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
>>>>> > > >> serega.sheypak@gmail.com>
>>>>> > > >>>>> wrote:
>>>>> > > >>>>>>
>>>>> > > >>>>>> Hi, I'm trying to create item similarity.
>>>>> > > >>>>>> I gather items which users visit during shopping and then
>>>>> create a
>>>>> > > >> file:
>>>>> > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6,
>>>>> 1.9],
>>>>> > > >> depends
>>>>> > > >>>>> on
>>>>> > > >>>>>> user action type and data source)
>>>>> > > >>>>>> UNION
>>>>> > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>> > > >>>>>>
>>>>> > > >>>>>> and I do provide a userFile, where user_id = -item_id
>>>>> > > >>>>>>
>>>>> > > >>>>>> The idea is to get item similary. If any user visits item
>>>>> named
>>>>> > > "A", i
>>>>> > > >>>>> want
>>>>> > > >>>>>> to show him items "B", "c", "xxx" using preferences of other
>>>>> > users.
>>>>> > > >>>>>>
>>>>> > > >>>>>> The problem is that the last (???) mapreduce job returns 0
>>>>> rows:
>>>>> > > >>>>>>
>>>>> > > >>>>>> Here are my settings:
>>>>> > > >>>>>>
>>>>> > > >>>>>>
>>>>> > > >>>>>> sudo -u oozie mahout recommenditembased \
>>>>> > > >>>>>>                  --input visited_items_with_inverted_items \
>>>>> > > >>>>>>
>>>>> > > >>>>>>                  --output result \
>>>>> > > >>>>>>                  --similarityClassname
>>>>> SIMILARITY_LOGLIKELIHOOD \
>>>>> > > >>>>>>                  --usersFile inverted_items \
>>>>> > > >>>>>>                  --numRecommendations 500 \
>>>>> > > >>>>>>                  --booleanData false \
>>>>> > > >>>>>>                  --maxPrefsPerUser 100 \
>>>>> > > >>>>>>                  --maxSimilaritiesPerItem 500 \
>>>>> > > >>>>>>                  --minPrefsPerUser 0\
>>>>> > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
>>>>> > > >>>>>>                  --threshold 0.91 \
>>>>> > > >>>>>>                  --tempDir  temp \
>>>>> > > >>>>>>
>>>>> > > >>>>>> Some counters... I don't get what do they mean....
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
>>>>> > > >>>>>>
>>>>> > org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>> > > >>>>>>
>>>>> > > >>>>>
>>>>> > > >>
>>>>> > >
>>>>> >
>>>>> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>> > > >>>>>>  USER_RATINGS_NEGLECTED=1,798,738
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>>>> > > >>>>> USER_RATINGS_USED=12,429,693
>>>>> > > >>>>>>
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
>>>>> > > >>>>>>
>>>>> > > >>>>>
>>>>> > > >>
>>>>> > >
>>>>> >
>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>> > > >>>>>>
>>>>> > > >>>>>
>>>>> > > >>
>>>>> > >
>>>>> >
>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>> > COOCCURRENCES=35882374
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>>>> > PRUNED_COOCCURRENCES=0
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input
>>>>> > > records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output
>>>>> > > >> records=17570268
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input
>>>>> > > >>>>> records=5221907
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output
>>>>> > > >>>>> records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>> > > >>>>> records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>> > > >>>>> records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>>>> > > >>>>> records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>>>> > > >>>>> records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input
>>>>> > > records=7528530
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output
>>>>> > > >> records=3313251
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input
>>>>> > > >>>>> records=3313251
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output
>>>>> > > >>>>> records=3313251
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input
>>>>> > > records=6626130
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output
>>>>> > > >> records=6626130
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input
>>>>> > > >>>>> records=6626130
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output
>>>>> > > >>>>> records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input
>>>>> > > records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output
>>>>> > > >> records=3313251
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input
>>>>> > > >>>>> records=3313251
>>>>> > > >>>>>>
>>>>> > > >>>>>> --------
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output
>>>>> > records=0
>>>>> > > >>>>>> --------
>>>>> > > >>>>>>
>>>>> > > >>>>>> why 0???
>>>>> > > >>>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>
>>>>> > > >>
>>>>> > >
>>>>> > >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

temp/preparePreferenceMatrix/ratingMatrix
has data. It looks like it's the similarity between items...
I'm confused. How can I get item similarity?
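
A minimal sketch of the distinction in question, assuming that ratingMatrix holds one row per item with the users' preference values as entries (an item-by-user matrix). Under that assumption it is the input from which similarities are computed, not the similarities themselves. The class and method names are hypothetical, the rows are the sample (user_id, item_id, pref_value) data posted in this thread on 2014-07-20 without the inverted-id rows, and plain co-occurrence counting stands in for SIMILARITY_LOGLIKELIHOOD. This is not Mahout's code.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Mahout code.
public class RatingMatrixSketch {

    // Item-by-user rows: itemId -> (userId -> preference value).
    static Map<Integer, Map<Integer, Double>> sampleRatingMatrix() {
        Map<Integer, Map<Integer, Double>> rows = new LinkedHashMap<>();
        put(rows, 1, 11, 1.6);
        put(rows, 2, 11, 1.6);
        put(rows, 3, 123, 2.0);
        put(rows, 4, 123, 2.0);
        put(rows, 1, 333, 2.0);
        put(rows, 2, 333, 1.6);
        return rows;
    }

    // Stores one preference in the row of the given item.
    static void put(Map<Integer, Map<Integer, Double>> rows, int item, int user, double pref) {
        rows.computeIfAbsent(item, k -> new LinkedHashMap<>()).put(user, pref);
    }

    // Number of users who have a preference for both items.
    static int cooccurrence(Map<Integer, Map<Integer, Double>> rows, int itemA, int itemB) {
        Map<Integer, Double> a = rows.getOrDefault(itemA, Collections.emptyMap());
        Map<Integer, Double> b = rows.getOrDefault(itemB, Collections.emptyMap());
        int count = 0;
        for (Integer user : a.keySet()) {
            if (b.containsKey(user)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> rows = sampleRatingMatrix();
        // Compare every pair of item rows once.
        for (int a = 1; a <= 4; a++) {
            for (int b = a + 1; b <= 4; b++) {
                System.out.println("items " + a + "," + b + " -> " + cooccurrence(rows, a, b));
            }
        }
    }
}
```

In this toy data items 1 and 2 share two users (11 and 333) and items 3 and 4 share one (123). A row of the rating matrix only lists who rated an item, while a similarity value always comes from comparing two such rows.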



Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

The code snippet:

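    // Imports needed at the top of the test class:
    // import org.apache.hadoop.conf.Configuration
    // import org.apache.hadoop.fs.Path
    // import org.apache.hadoop.io.IntWritable
    // import org.apache.hadoop.io.SequenceFile
    // import org.apache.mahout.math.VectorWritable

 @Test//(enabled = false)
    void testReadAll(){
        (0..5).each {

            // Double quotes are required so that Groovy interpolates $it
            def pathToFile = new Path("matrixSim/part-r-0000$it")
            println pathToFile
            def reader = new SequenceFile.Reader(new Configuration(), SequenceFile.Reader.file(pathToFile));
            IntWritable key = new IntWritable();
            VectorWritable value = new VectorWritable();
            // Print every non-zero element of each row vector
            while(reader.next(key, value)){
                def itr = value.get().iterateNonZero()
                while(itr.hasNext()){
                    println itr.next()
                }
            }
            reader.close();
        }
    }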
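
A small illustration of the file names the loop in the snippet is intended to visit. Reducer output files are named part-r- followed by a five-digit, zero-padded index, so the pattern part-r-0000 plus one digit only covers indexes 0 to 9. The Java sketch below just builds those names. The class and method names are hypothetical and not part of Hadoop or Mahout, and the matrixSim directory is taken from the snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hadoop or Mahout.
public class PartFileNames {

    // Builds "<dir>/part-r-NNNNN" for reducer indexes 0..lastIndex (inclusive).
    static List<String> partFiles(String dir, int lastIndex) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i <= lastIndex; i++) {
            // %05d zero-pads the index to five digits, so it also works for 10 and above.
            names.add(String.format(Locale.ROOT, "%s/part-r-%05d", dir, i));
        }
        return names;
    }

    public static void main(String[] args) {
        // Same range as (0..5) in the snippet.
        for (String name : partFiles("matrixSim", 5)) {
            System.out.println(name);
        }
    }
}
```

Formatting the index instead of appending it to a fixed prefix keeps the names correct if the job is ever run with more than ten reducers.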


2014-07-21 23:46 GMT+04:00 Serega Sheypak <se...@gmail.com>:

> I've parsed it via Java; the matrix is empty. Why?
>
>
> 2014-07-21 22:41 GMT+04:00 Serega Sheypak <se...@gmail.com>:
>
>> 0.7-cdh4.7.0
>> Anyway, recommenditembased does produce these directories:
>>
>> /recommenditembased/temp/maxValues.bin
>> /recommenditembased/temp/norms.bin
>> /recommenditembased/temp/numNonZeroEntries.bin
>> /recommenditembased/temp/pairwiseSimilarity
>> /recommenditembased/temp/partialMultiply
>> /recommenditembased/temp/prePartialMultiply1
>> /recommenditembased/temp/prePartialMultiply2
>> /recommenditembased/temp/preparePreferenceMatrix
>> /recommenditembased/temp/similarityMatrix
>> /recommenditembased/temp/weights
>>
>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
>> In eed. Right now I try to read it using
>>
>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> )  as (intId: int, vector:tuple(cardinality:int,
>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>
>>
>> Looks like the vector is empty... Or i do something wrong.
>>
>>
>>
>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:
>>
>> Which version of Mahout?
>>>
>>>
>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>>> serega.sheypak@gmail.com>
>>> wrote:
>>>
>>> > Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>>> processing
>>> > Job-Specific
>>> >
>>> > sudo -u hdfs hadoop fs -rm -r
>>> hdfs://nameservice1/recommenditembased/output
>>> > sudo -u hdfs hadoop fs -rm -r
>>> hdfs://nameservice1/recommenditembased/temp
>>> > sudo -u oozie mahout recommenditembased \
>>> >                     --input \
>>> >
>>> >
>>> >
>>> hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks
>>> > \
>>> >                     --output \
>>> >                     hdfs://nameservice1/recommenditembased/output \
>>> >                     --similarityClassname \
>>> >                     SIMILARITY_LOGLIKELIHOOD \
>>> >                    --numRecommendations \
>>> >                     500 \
>>> >                     --booleanData \
>>> >                     false \
>>> >                     --maxPrefsPerUser \
>>> >                     1000 \
>>> >                     --maxSimilaritiesPerItem \
>>> >                     1000 \
>>> >                     --minPrefsPerUser \
>>> >                     5 \
>>> >                     --maxPrefsPerUserInItemSimilarity \
>>> >                     30 \
>>> >                     --threshold \
>>> >                    1.1 \
>>> >                     --tempDir \
>>> >                     hdfs://nameservice1/recommenditembased/temp \
>>> >                     --outputPathForSimilarityMatrix \
>>> >                     hdfs://nameservice1/recommenditembased/sim_matrix
>>> >
>>> >
>>> > I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>> >
>>> >
>>> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>> >
>>> > > Serega,
>>> > >
>>> > > See the last line on how to pass outputPathForSimilarityMatrix
>>> options to
>>> > > the recommenditembased command:
>>> > >
>>> > > sudo -u oozie mahout recommenditembased \
>>> > >                    --input visited_items_with_inverted_items \
>>> > >
>>> > >                    --output result \
>>> > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>> > >                    --usersFile inverted_items \
>>> > >                    --numRecommendations 500 \
>>> > >                    --booleanData false \
>>> > >                    --maxPrefsPerUser 100 \
>>> > >                    --maxSimilaritiesPerItem 500 \
>>> > >                    --minPrefsPerUser 0\
>>> > >                    --maxPrefsPerUserInItemSimilarity 30 \
>>> > >                    --threshold 0.91 \
>>> > >                    --tempDir  temp \
>>> > >                    --outputPathForSimilarityMatrix similarityMatri \
>>> > >
>>> > >
>>> > > Peng Zhang
>>> > > pzhang.xjtu@gmail.com
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>>> serega.sheypak@gmail.com>
>>> > > wrote:
>>> > >
>>> > > > I've inspected the code, our approach wouldn't work with
>>> > > booleanData=false.
>>> > > > We do calcualte imte similarity in the wrong way...(((
>>> > > > Thank you
>>> > > > 1. We provide "fake" user_id and provide --usersFile in order to
>>> get
>>> > > > recommendations for "fake user_id, where user_id is a negative
>>> item_id.
>>> > > It
>>> > > > worked when we did provide user_id->item_id pairs without
>>> preference.
>>> > > > 2. Our target is to get item similarities. We tried
>>> > > >
>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
>>> > > it
>>> > > > returns bad result comparing to RecommenderJob with our "fake"
>>> user_id
>>> > > > (inverted item_id)
>>> > > >
>>> > > > 1. I'll try the option you provided.
>>> > > > 2. I will remove input with fake user_id and usersFile with these
>>> fake
>>> > > ids
>>> > > >
>>> > > > 3.
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>> > > > I don't understand how to pass ---outputPathForSimilarityMatrix
>>> option
>>> > to
>>> > > > RecommenderJob
>>> > > >
>>> > > >
>>> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
>>> > > >
>>> > > >> Seraga,
>>> > > >>
>>> > > >> I have two comments:
>>> > > >> 1. Don’t use negative user ids. Since Mahout uses user id as well
>>> as
>>> > > item
>>> > > >> id as the row/column index, you’d better use 0, 1, 2, etc as ids
>>> > > >> 2. If you want to get the item similarity information, you can use
>>> > > >> --outputPathForSimilarityMatrix in the command
>>> > > >>
>>> > > >> Regards,
>>> > > >> Peng Zhang
>>> > > >> M: +86 186-1658-7856
>>> > > >> pzhang.xjtu@gmail.com
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>>> serega.sheypak@gmail.com
>>> > >
>>> > > >> wrote:
>>> > > >>
>>> > > >>> All bad things happen here:
>>> > > >>>
>>> > > >>>
>>> > > >>>
>>> > > >>> Name
>>> > > >>>
>>> > > >>> RecommenderJob-PartialMultiplyMapper-Reducer
>>> > > >>>
>>> > > >>> User
>>> > > >>>
>>> > > >>> oozie
>>> > > >>>
>>> > > >>> Process User
>>> > > >>>
>>> > > >>> oozie
>>> > > >>>
>>> > > >>> Group
>>> > > >>>
>>> > > >>> oozie
>>> > > >>>
>>> > > >>> Mapper Class
>>> > > >>>
>>> > > >>> PartialMultiplyMapper
>>> > > >>>
>>> > > >>> Reducer Class
>>> > > >>>
>>> > > >>> AggregateAndRecommendReducer
>>> > > >>>
>>> > > >>>
>>> > > >>> Job Input Directory
>>> > > >>>
>>> > > >>> hdfs://nameservice1/itemrec/temp/partialMultiply
>>> > > >>>
>>> > > >>> Job Output Directory
>>> > > >>>
>>> > > >>> hdfs://nameservice1/itemrec/output/
>>> > > >>>
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input
>>> > records=3312879
>>> > > >>>
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output
>>> > records=3313251
>>> > > >>>
>>> > > >>>
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input
>>> > > records=3313251
>>> > > >>>
>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output
>>> records=0
>>> > > >>>
>>> > > >>> Why does mahout returns 0 rows? it works when booleanData=true
>>> > > >> (preferences
>>> > > >>> are ignored...?)
>>> > > >>>
>>> > > >>>
>>> > > >>>
>>> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>>> serega.sheypak@gmail.com
>>> > >:
>>> > > >>>
>>> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>> > > >>>> users_file:
>>> > > >>>> --inverted_item_id
>>> > > >>>> -1
>>> > > >>>> -2
>>> > > >>>> -3
>>> > > >>>> -4
>>> > > >>>>
>>> > > >>>> users_items_prefs
>>> > > >>>> --inverted item_id
>>> > > >>>> -1 1 1.0
>>> > > >>>> -2 2 1.0
>>> > > >>>> -3 3 1.0
>>> > > >>>> -4 4 1.0
>>> > > >>>> --user_id item_id pref_value
>>> > > >>>> 11   1 1.6
>>> > > >>>> 11   2 1.6
>>> > > >>>> 123 3 2.0
>>> > > >>>> 123 4 2.0
>>> > > >>>> 333 1 2.0
>>> > > >>>> 333 2 1.6
>>> > > >>>> --e.t.c.
>>> > > >>>>
>>> > > >>>> if I set --booleanData true
>>> > > >>>> then mahout returns the result.
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>>> > > andrew.musselman@gmail.com
>>> > > >>> :
>>> > > >>>>
>>> > > >>>> I'm confused about how you're constructing the user file, and
>>> why
>>> > > there
>>> > > >>>>> are negated item ids here.
>>> > > >>>>>
>>> > > >>>>> Can you post some more details please, including Mahout
>>> version and
>>> > > >> some
>>> > > >>>>> sample data sets?
>>> > > >>>>>
>>> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
>>> > > >> serega.sheypak@gmail.com>
>>> > > >>>>> wrote:
>>> > > >>>>>>
>>> > > >>>>>> Hi, I'm trying to create item similarity.
>>> > > >>>>>> I gather items which users visit during shopping and then
>>> create a
>>> > > >> file:
>>> > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6,
>>> 1.9],
>>> > > >> depends
>>> > > >>>>> on
>>> > > >>>>>> user action type and data source)
>>> > > >>>>>> UNION
>>> > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>> > > >>>>>>
>>> > > >>>>>> and I do provide a userFile, where user_id = -item_id
>>> > > >>>>>>
>>> > > >>>>>> The idea is to get item similary. If any user visits item
>>> named
>>> > > "A", i
>>> > > >>>>> want
>>> > > >>>>>> to show him items "B", "c", "xxx" using preferences of other
>>> > users.
>>> > > >>>>>>
>>> > > >>>>>> The problem is that the last (???) mapreduce job returns 0
>>> rows:
>>> > > >>>>>>
>>> > > >>>>>> Here are my settings:
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>> sudo -u oozie mahout recommenditembased \
>>> > > >>>>>>                  --input visited_items_with_inverted_items \
>>> > > >>>>>>
>>> > > >>>>>>                  --output result \
>>> > > >>>>>>                  --similarityClassname
>>> SIMILARITY_LOGLIKELIHOOD \
>>> > > >>>>>>                  --usersFile inverted_items \
>>> > > >>>>>>                  --numRecommendations 500 \
>>> > > >>>>>>                  --booleanData false \
>>> > > >>>>>>                  --maxPrefsPerUser 100 \
>>> > > >>>>>>                  --maxSimilaritiesPerItem 500 \
>>> > > >>>>>>                  --minPrefsPerUser 0\
>>> > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
>>> > > >>>>>>                  --threshold 0.91 \
>>> > > >>>>>>                  --tempDir  temp \
>>> > > >>>>>>
>>> > > >>>>>> Some counters... I don't get what do they mean....
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
>>> > > >>>>>>
>>> > org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>> > > >>>>>>
>>> > > >>>>>
>>> > > >>
>>> > >
>>> >
>>> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>> > > >>>>>>  USER_RATINGS_NEGLECTED=1,798,738
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
>>> > > >>>>> USER_RATINGS_USED=12,429,693
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
>>> > > >>>>>>
>>> > > >>>>>
>>> > > >>
>>> > >
>>> >
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>> > > >>>>>>
>>> > > >>>>>
>>> > > >>
>>> > >
>>> >
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>> > COOCCURRENCES=35882374
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
>>> > PRUNED_COOCCURRENCES=0
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input
>>> > > records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output
>>> > > >> records=17570268
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input
>>> > > >>>>> records=5221907
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output
>>> > > >>>>> records=3312879
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>> > > >>>>> records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>> > > >>>>> records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
>>> > > >>>>> records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
>>> > > >>>>> records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input
>>> > > records=7528530
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output
>>> > > >> records=3313251
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input
>>> > > >>>>> records=3313251
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output
>>> > > >>>>> records=3313251
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input
>>> > > records=6626130
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output
>>> > > >> records=6626130
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input
>>> > > >>>>> records=6626130
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output
>>> > > >>>>> records=3312879
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input
>>> > > records=3312879
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output
>>> > > >> records=3313251
>>> > > >>>>>>
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input
>>> > > >>>>> records=3313251
>>> > > >>>>>>
>>> > > >>>>>> --------
>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output
>>> > records=0
>>> > > >>>>>> --------
>>> > > >>>>>>
>>> > > >>>>>> why 0???
>>> > > >>>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>
>>> > > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

I've parsed it via Java; the matrix is empty. Why?


2014-07-21 22:41 GMT+04:00 Serega Sheypak <se...@gmail.com>:

> 0.7-cdh4.7.0
> Anyway, recommenditembased does produce these directories:
>
> /recommenditembased/temp/maxValues.bin
> /recommenditembased/temp/norms.bin
> /recommenditembased/temp/numNonZeroEntries.bin
> /recommenditembased/temp/pairwiseSimilarity
> /recommenditembased/temp/partialMultiply
> /recommenditembased/temp/prePartialMultiply1
> /recommenditembased/temp/prePartialMultiply2
> /recommenditembased/temp/preparePreferenceMatrix
> /recommenditembased/temp/similarityMatrix
> /recommenditembased/temp/weights
>
> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
> I need. Right now I am trying to read it using
>
> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> )  as (intId: int, vector:tuple(cardinality:int,
> entries:bag{t:tuple(some_id:long, some_value:double)}));
>
>
> It looks like the vector is empty... or I'm doing something wrong.

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

0.7-cdh4.7.0
Anyway, recommenditembased does produce these directories:

/recommenditembased/temp/maxValues.bin
/recommenditembased/temp/norms.bin
/recommenditembased/temp/numNonZeroEntries.bin
/recommenditembased/temp/pairwiseSimilarity
/recommenditembased/temp/partialMultiply
/recommenditembased/temp/prePartialMultiply1
/recommenditembased/temp/prePartialMultiply2
/recommenditembased/temp/preparePreferenceMatrix
/recommenditembased/temp/similarityMatrix
/recommenditembased/temp/weights

I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
need. Right now I am trying to read it using

matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
 com.twitter.elephantbird.pig.load.SequenceFileLoader(
    '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
    '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
)  as (intId: int, vector:tuple(cardinality:int,
entries:bag{t:tuple(some_id:long, some_value:double)}));


It looks like the vector is empty... or I'm doing something wrong.
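
A note on --threshold with SIMILARITY_LOGLIKELIHOOD (a hedged sketch, not a confirmed diagnosis for this thread): if the similarity is derived from the log-likelihood ratio as 1 - 1/(1 + LLR), which is an assumption about this Mahout build and not something verified here, then the score can never exceed 1. A threshold such as 1.1 would then reject every item pair and could by itself leave the similarity matrix empty. The Java below is self-contained and does not call Mahout; the class name, method names and the counts are made up for illustration.

```java
public class LlrSimilaritySketch {

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    public static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a list of counts.
    public static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    // Log-likelihood ratio of a 2x2 contingency table:
    // k11 = users with both items, k12 / k21 = users with only one of them,
    // k22 = users with neither.
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // ASSUMPTION: the similarity is 1 - 1/(1 + LLR). Since LLR >= 0,
    // the result lies between 0 and 1 and can never exceed 1.
    public static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        // Hypothetical counts for a strongly co-occurring item pair.
        double sim = similarity(100, 10, 10, 10000);
        System.out.println("similarity          = " + sim);
        System.out.println("passes threshold 0.91: " + (sim >= 0.91));
        System.out.println("passes threshold 1.1 : " + (sim >= 1.1));
    }
}
```

If the assumption holds, re-running with a threshold below 1 would be a cheap way to check whether the similarity matrix is still empty.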



2014-07-21 22:09 GMT+04:00 Ted Dunning <te...@gmail.com>:

> Which version of Mahout?
>
>
> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.sheypak@gmail.com>
> wrote:
>
> > Hi, I've tried it and got: Unexpected --outputPathForSimilarityMatrix while
> > processing Job-Specific
> >
> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> > sudo -u oozie mahout recommenditembased \
> >                     --input \
> >                     hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
> >                     --output \
> >                     hdfs://nameservice1/recommenditembased/output \
> >                     --similarityClassname \
> >                     SIMILARITY_LOGLIKELIHOOD \
> >                    --numRecommendations \
> >                     500 \
> >                     --booleanData \
> >                     false \
> >                     --maxPrefsPerUser \
> >                     1000 \
> >                     --maxSimilaritiesPerItem \
> >                     1000 \
> >                     --minPrefsPerUser \
> >                     5 \
> >                     --maxPrefsPerUserInItemSimilarity \
> >                     30 \
> >                     --threshold \
> >                    1.1 \
> >                     --tempDir \
> >                     hdfs://nameservice1/recommenditembased/temp \
> >                     --outputPathForSimilarityMatrix \
> >                     hdfs://nameservice1/recommenditembased/sim_matrix
> >
> >
> > I'm on Cloudera CDH 4.7; it looks like this option is not supported there.
> >
> >
> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> >
> > > Serega,
> > >
> > > See the last line for how to pass the outputPathForSimilarityMatrix option
> > > to the recommenditembased command:
> > >
> > > sudo -u oozie mahout recommenditembased \
> > >                    --input visited_items_with_inverted_items \
> > >                    --output result \
> > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> > >                    --usersFile inverted_items \
> > >                    --numRecommendations 500 \
> > >                    --booleanData false \
> > >                    --maxPrefsPerUser 100 \
> > >                    --maxSimilaritiesPerItem 500 \
> > >                    --minPrefsPerUser 0 \
> > >                    --maxPrefsPerUserInItemSimilarity 30 \
> > >                    --threshold 0.91 \
> > >                    --tempDir temp \
> > >                    --outputPathForSimilarityMatrix similarityMatri
> > >
> > >
> > > Peng Zhang
> > > pzhang.xjtu@gmail.com
> > >
> > >
> > >
> > >
> > >
> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <se...@gmail.com>
> > > wrote:
> > >
> > > > I've inspected the code, our approach wouldn't work with
> > > booleanData=false.
> > > > We do calcualte imte similarity in the wrong way...(((
> > > > Thank you
> > > > 1. We provide "fake" user_id and provide --usersFile in order to get
> > > > recommendations for "fake user_id, where user_id is a negative
> item_id.
> > > It
> > > > worked when we did provide user_id->item_id pairs without preference.
> > > > 2. Our target is to get item similarities. We tried
> > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> but
> > > it
> > > > returns bad result comparing to RecommenderJob with our "fake"
> user_id
> > > > (inverted item_id)
> > > >
> > > > 1. I'll try the option you provided.
> > > > 2. I will remove input with fake user_id and usersFile with these
> fake
> > > ids
> > > >
> > > > 3.
> > > >
> > >
> >
> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> > > > I don't understand how to pass ---outputPathForSimilarityMatrix
> option
> > to
> > > > RecommenderJob
> > > >
> > > >
> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:
> > > >
> > > >> Seraga,
> > > >>
> > > >> I have two comments:
> > > >> 1. Don’t use negative user ids. Since Mahout uses user id as well as
> > > item
> > > >> id as the row/column index, you’d better use 0, 1, 2, etc as ids
> > > >> 2. If you want to get the item similarity information, you can use
> > > >> --outputPathForSimilarityMatrix in the command
> > > >>
> > > >> Regards,
> > > >> Peng Zhang
> > > >> M: +86 186-1658-7856
> > > >> pzhang.xjtu@gmail.com
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
> serega.sheypak@gmail.com
> > >
> > > >> wrote:
> > > >>
> > > >>> All bad things happen here:
> > > >>>
> > > >>>
> > > >>>
> > > >>> Name
> > > >>>
> > > >>> RecommenderJob-PartialMultiplyMapper-Reducer
> > > >>>
> > > >>> User
> > > >>>
> > > >>> oozie
> > > >>>
> > > >>> Process User
> > > >>>
> > > >>> oozie
> > > >>>
> > > >>> Group
> > > >>>
> > > >>> oozie
> > > >>>
> > > >>> Mapper Class
> > > >>>
> > > >>> PartialMultiplyMapper
> > > >>>
> > > >>> Reducer Class
> > > >>>
> > > >>> AggregateAndRecommendReducer
> > > >>>
> > > >>>
> > > >>> Job Input Directory
> > > >>>
> > > >>> hdfs://nameservice1/itemrec/temp/partialMultiply
> > > >>>
> > > >>> Job Output Directory
> > > >>>
> > > >>> hdfs://nameservice1/itemrec/output/
> > > >>>
> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input
> > records=3312879
> > > >>>
> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output
> > records=3313251
> > > >>>
> > > >>>
> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input
> > > records=3313251
> > > >>>
> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output
> records=0
> > > >>>
> > > >>> Why does mahout returns 0 rows? it works when booleanData=true
> > > >> (preferences
> > > >>> are ignored...?)
> > > >>>
> > > >>>
> > > >>>
> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
> serega.sheypak@gmail.com
> > >:
> > > >>>
> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> > > >>>> users_file:
> > > >>>> --inverted_item_id
> > > >>>> -1
> > > >>>> -2
> > > >>>> -3
> > > >>>> -4
> > > >>>>
> > > >>>> users_items_prefs
> > > >>>> --inverted item_id
> > > >>>> -1 1 1.0
> > > >>>> -2 2 1.0
> > > >>>> -3 3 1.0
> > > >>>> -4 4 1.0
> > > >>>> --user_id item_id pref_value
> > > >>>> 11   1 1.6
> > > >>>> 11   2 1.6
> > > >>>> 123 3 2.0
> > > >>>> 123 4 2.0
> > > >>>> 333 1 2.0
> > > >>>> 333 2 1.6
> > > >>>> --e.t.c.
> > > >>>>
> > > >>>> if I set --booleanData true
> > > >>>> then mahout returns the result.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
> > > andrew.musselman@gmail.com
> > > >>> :
> > > >>>>
> > > >>>> I'm confused about how you're constructing the user file, and why
> > > there
> > > >>>>> are negated item ids here.
> > > >>>>>
> > > >>>>> Can you post some more details please, including Mahout version
> and
> > > >> some
> > > >>>>> sample data sets?
> > > >>>>>
> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <
> > > >> serega.sheypak@gmail.com>
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>> Hi, I'm trying to create item similarity.
> > > >>>>>> I gather items which users visit during shopping and then
> create a
> > > >> file:
> > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9],
> > > >> depends
> > > >>>>> on
> > > >>>>>> user action type and data source)
> > > >>>>>> UNION
> > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
> > > >>>>>>
> > > >>>>>> and I do provide a userFile, where user_id = -item_id
> > > >>>>>>
> > > >>>>>> The idea is to get item similary. If any user visits item named
> > > "A", i
> > > >>>>> want
> > > >>>>>> to show him items "B", "c", "xxx" using preferences of other
> > users.
> > > >>>>>>
> > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> > > >>>>>>
> > > >>>>>> Here are my settings:
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> sudo -u oozie mahout recommenditembased \
> > > >>>>>>                  --input visited_items_with_inverted_items \
> > > >>>>>>
> > > >>>>>>                  --output result \
> > > >>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD
> \
> > > >>>>>>                  --usersFile inverted_items \
> > > >>>>>>                  --numRecommendations 500 \
> > > >>>>>>                  --booleanData false \
> > > >>>>>>                  --maxPrefsPerUser 100 \
> > > >>>>>>                  --maxSimilaritiesPerItem 500 \
> > > >>>>>>                  --minPrefsPerUser 0\
> > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
> > > >>>>>>                  --threshold 0.91 \
> > > >>>>>>                  --tempDir  temp \
> > > >>>>>>
> > > >>>>>> Some counters... I don't get what do they mean....
> > > >>>>>>
> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:
> > > >>>>>>
> > org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> > > >>>>>>
> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> > > >>>>>>
> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
> > > >>>>>>
> > > >>>>>
> > > >>
> > >
> >
> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> > > >>>>>>
> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
> > > >>>>>>  USER_RATINGS_NEGLECTED=1,798,738
> > > >>>>>>
> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:
> > > >>>>> USER_RATINGS_USED=12,429,693
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:
> > > >>>>>>
> > > >>>>>
> > > >>
> > >
> >
> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > >>>>>>
> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> > > >>>>>>
> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
> > > >>>>>>
> > > >>>>>
> > > >>
> > >
> >
> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > >>>>>>
> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
> > COOCCURRENCES=35882374
> > > >>>>>>
> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:
> > PRUNED_COOCCURRENCES=0
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input
> > > records=3312879
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output
> > > >> records=17570268
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input
> > > >>>>> records=5221907
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output
> > > >>>>> records=3312879
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
> > > >>>>> records=3312879
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
> > > >>>>> records=3312879
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input
> > > >>>>> records=3312879
> > > >>>>>>
> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output
> > > >>>>> records=3312879
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input
> > > records=7528530
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output
> > > >> records=3313251
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input
> > > >>>>> records=3313251
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output
> > > >>>>> records=3313251
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input
> > > records=6626130
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output
> > > >> records=6626130
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input
> > > >>>>> records=6626130
> > > >>>>>>
> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output
> > > >>>>> records=3312879
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input
> > > records=3312879
> > > >>>>>>
> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output
> > > >> records=3313251
> > > >>>>>>
> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input
> > > >>>>> records=3313251
> > > >>>>>>
> > > >>>>>> --------
> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output
> > records=0
> > > >>>>>> --------
> > > >>>>>>
> > > >>>>>> why 0???
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> >
>

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Ted Dunning <te...@gmail.com>.

Which version of Mahout?


On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <se...@gmail.com>
wrote:

> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing
> Job-Specific
>
> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> sudo -u oozie mahout recommenditembased \
>                     --input hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>                     --output hdfs://nameservice1/recommenditembased/output \
>                     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>                     --numRecommendations 500 \
>                     --booleanData false \
>                     --maxPrefsPerUser 1000 \
>                     --maxSimilaritiesPerItem 1000 \
>                     --minPrefsPerUser 5 \
>                     --maxPrefsPerUserInItemSimilarity 30 \
>                     --threshold 1.1 \
>                     --tempDir hdfs://nameservice1/recommenditembased/temp \
>                     --outputPathForSimilarityMatrix hdfs://nameservice1/recommenditembased/sim_matrix
>
>
> I'm on Cloudera cdh 4.7, looks like this feature is not supported.

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

Hi, I tried it and got this error: Unexpected --outputPathForSimilarityMatrix
while processing Job-Specific

sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
sudo -u oozie mahout recommenditembased \
                    --input hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
                    --output hdfs://nameservice1/recommenditembased/output \
                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
                    --numRecommendations 500 \
                    --booleanData false \
                    --maxPrefsPerUser 1000 \
                    --maxSimilaritiesPerItem 1000 \
                    --minPrefsPerUser 5 \
                    --maxPrefsPerUserInItemSimilarity 30 \
                    --threshold 1.1 \
                    --tempDir hdfs://nameservice1/recommenditembased/temp \
                    --outputPathForSimilarityMatrix hdfs://nameservice1/recommenditembased/sim_matrix


I'm on Cloudera CDH 4.7; it looks like this option is not supported there.
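
One way to confirm whether an installed build accepts the option is to search the job's help output. This is only a minimal sketch: it assumes that `mahout recommenditembased --help` prints the supported options on this build, and `supports_option` is a hypothetical helper, not part of Mahout.

```shell
# Hypothetical helper (not part of Mahout): succeeds if the help text
# read from stdin mentions the given long option name.
supports_option() {
    grep -q -e "--$1"
}

# Usage sketch (assumes --help lists the job's options on this build):
#   mahout recommenditembased --help 2>&1 \
#       | supports_option outputPathForSimilarityMatrix \
#       && echo "supported" || echo "not supported"
```

If the option is missing from the help text, the "Unexpected ... while processing Job-Specific" error above is what one would expect from the argument parser.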


2014-07-21 11:18 GMT+04:00 Peng Zhang <pz...@gmail.com>:

> Serega,
>
> See the last line on how to pass outputPathForSimilarityMatrix options to
> the recommenditembased command:
>
> sudo -u oozie mahout recommenditembased \
>                    --input visited_items_with_inverted_items \
>                    --output result \
>                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>                    --usersFile inverted_items \
>                    --numRecommendations 500 \
>                    --booleanData false \
>                    --maxPrefsPerUser 100 \
>                    --maxSimilaritiesPerItem 500 \
>                    --minPrefsPerUser 0 \
>                    --maxPrefsPerUserInItemSimilarity 30 \
>                    --threshold 0.91 \
>                    --tempDir temp \
>                    --outputPathForSimilarityMatrix similarityMatri
>
>
> Peng Zhang
> pzhang.xjtu@gmail.com

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Peng Zhang <pz...@gmail.com>.

Serega,

See the last line for how to pass the --outputPathForSimilarityMatrix option to the recommenditembased command:

sudo -u oozie mahout recommenditembased \
                   --input visited_items_with_inverted_items \
                   --output result \
                   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
                   --usersFile inverted_items \
                   --numRecommendations 500 \
                   --booleanData false \
                   --maxPrefsPerUser 100 \
                   --maxSimilaritiesPerItem 500 \
                   --minPrefsPerUser 0 \
                   --maxPrefsPerUserInItemSimilarity 30 \
                   --threshold 0.91 \
                   --tempDir temp \
                   --outputPathForSimilarityMatrix similarityMatri
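
Once the similarity matrix has been written, something like the following could turn it into "visitors of A also liked B, C" lists. This is only a sketch: it assumes the output files are plain text with one tab-separated line per pair (itemA, itemB, similarity), which should be checked against the actual files, and it treats every pair as symmetric. The function name top_similar is made up for illustration.

```python
from collections import defaultdict
import heapq


def top_similar(lines, n=3):
    """Return {item: [(other_item, similarity), ...]} with the n best neighbours.

    Assumes each line looks like "itemA<TAB>itemB<TAB>similarity" (an assumption
    about the output format, not something confirmed in this thread).
    """
    neighbours = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item_a, item_b, sim = line.split("\t")
        sim = float(sim)
        # Record the pair in both directions so either item can be looked up.
        neighbours[item_a].append((sim, item_b))
        neighbours[item_b].append((sim, item_a))
    return {
        item: [(other, sim) for sim, other in heapq.nlargest(n, pairs)]
        for item, pairs in neighbours.items()
    }


if __name__ == "__main__":
    sample = ["1\t2\t0.95", "1\t3\t0.92", "2\t3\t0.97"]
    for item, similar in sorted(top_similar(sample, n=2).items()):
        print(item, similar)
```

With this, the "fake" users and the --usersFile are not needed at all: the lookup for an item is just its entry in the returned dictionary.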


Peng Zhang
pzhang.xjtu@gmail.com





On Jul 21, 2014, at 3:09 PM, Serega Sheypak <se...@gmail.com> wrote:

> I've inspected the code; our approach won't work with booleanData=false.
> We calculate item similarity in the wrong way...(((
> Thank you
> 1. We provide a "fake" user_id and pass --usersFile in order to get
> recommendations for the "fake" user_id, where user_id is a negative item_id. It
> worked when we provided user_id->item_id pairs without a preference.
> 2. Our target is to get item similarities. We tried
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
> returns bad results compared to RecommenderJob with our "fake" user_id
> (inverted item_id)
> 
> 1. I'll try the option you provided.
> 2. I will remove the input with fake user_id and the usersFile with these fake ids
> 
> 3.
> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> I don't understand how to pass the --outputPathForSimilarityMatrix option to
> RecommenderJob

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

I've inspected the code; our approach won't work with booleanData=false.
We calculate item similarity in the wrong way...(((
Thank you
1. We provide a "fake" user_id and pass --usersFile in order to get
recommendations for the "fake" user_id, where user_id is a negative item_id. It
worked when we provided user_id->item_id pairs without a preference.
2. Our target is to get item similarities. We tried
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
returns bad results compared to RecommenderJob with our "fake" user_id
(inverted item_id)

1. I'll try the option you provided.
2. I will remove the input with fake user_id and the usersFile with these fake ids

3.
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
I don't understand how to pass the --outputPathForSimilarityMatrix option to
RecommenderJob
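
One possible reading of why the non-boolean path produces nothing, sketched in Python rather than taken from Mahout itself: suppose the final reducer estimates a preference as a similarity-weighted average and throws away any estimate that is backed by fewer than two of the user's own preferences. That minimum of two is an assumption here (it matches my reading of AggregateAndRecommendReducer, but it is not confirmed in this thread and should be checked against the source of the version in use). The names predict and min_datapoints are made up for illustration.

```python
def predict(user_prefs, similarities, min_datapoints=2):
    """Estimate preferences for items the user has not rated.

    user_prefs:    {item_id: preference_value}
    similarities:  {(item_a, item_b): similarity}, treated as symmetric
    min_datapoints: hypothetical cut-off; estimates supported by fewer of the
                    user's own preferences are dropped (assumed behaviour).
    """
    numerators, denominators, datapoints = {}, {}, {}
    for (item_a, item_b), sim in similarities.items():
        # Look at each similarity from both ends.
        for rated, candidate in ((item_a, item_b), (item_b, item_a)):
            if rated not in user_prefs or candidate in user_prefs:
                continue
            numerators[candidate] = (
                numerators.get(candidate, 0.0) + sim * user_prefs[rated]
            )
            denominators[candidate] = denominators.get(candidate, 0.0) + abs(sim)
            datapoints[candidate] = datapoints.get(candidate, 0) + 1
    return {
        item: numerators[item] / denominators[item]
        for item in numerators
        if datapoints[item] >= min_datapoints and denominators[item] > 0
    }


if __name__ == "__main__":
    sims = {(1, 2): 0.95, (1, 3): 0.92, (2, 3): 0.97}
    print(predict({1: 1.0}, sims))            # "fake" user with one preference
    print(predict({1: 1.6, 2: 2.0}, sims))    # real user with two preferences
```

Under that assumption a "fake" user such as -1, who has exactly one preference (item 1), never reaches two datapoints for any candidate item, so predict({1: 1.0}, sims) returns an empty dict, while the user with two preferences gets an estimate for item 3.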


2014-07-21 4:58 GMT+04:00 Peng Zhang <pz...@gmail.com>:

> Serega,
>
> I have two comments:
> 1. Don’t use negative user ids. Mahout uses the user id as well as the item
> id as the row/column index, so you’d better use 0, 1, 2, etc. as ids.
> 2. If you want to get the item similarity information, you can use
> --outputPathForSimilarityMatrix in the command.
>
> Regards,
> Peng Zhang
> M: +86 186-1658-7856
> pzhang.xjtu@gmail.com

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Peng Zhang <pz...@gmail.com>.

Serega,

I have two comments:
1. Don’t use negative user ids. Mahout uses the user id as well as the item id as the row/column index, so you’d better use 0, 1, 2, etc. as ids.
2. If you want to get the item similarity information, you can use --outputPathForSimilarityMatrix in the command.
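
For the first point, a minimal sketch of remapping arbitrary ids to 0, 1, 2, ... before writing the input file. The helper names build_index and remap_prefs are hypothetical, and the sketch assumes the preferences are already available as (user_id, item_id, pref) tuples.

```python
def build_index(raw_ids):
    """Assign 0, 1, 2, ... to raw ids in order of first appearance."""
    index = {}
    for raw in raw_ids:
        if raw not in index:
            index[raw] = len(index)
    return index


def remap_prefs(rows):
    """Rewrite (user, item, pref) rows with non-negative integer ids.

    Returns the remapped rows plus both lookup tables, so results can be
    translated back to the original ids afterwards.
    """
    rows = list(rows)
    user_index = build_index(user for user, _, _ in rows)
    item_index = build_index(item for _, item, _ in rows)
    remapped = [(user_index[u], item_index[i], pref) for u, i, pref in rows]
    return remapped, user_index, item_index


if __name__ == "__main__":
    prefs = [("u-17", "sku-9", 1.6), ("u-3", "sku-9", 2.0), ("u-17", "sku-4", 1.0)]
    remapped, users, items = remap_prefs(prefs)
    print(remapped)
    print(users, items)
```

Keeping the two index dictionaries around matters: the job output will contain the remapped ids, so the reverse lookup is needed to show real item ids to users.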

Regards,
Peng Zhang
M: +86 186-1658-7856
pzhang.xjtu@gmail.com





On Jul 21, 2014, at 4:00 AM, Serega Sheypak <se...@gmail.com> wrote:

> All bad things happen here:
> 
> Name:                 RecommenderJob-PartialMultiplyMapper-Reducer
> User:                 oozie
> Process User:         oozie
> Group:                oozie
> Mapper Class:         PartialMultiplyMapper
> Reducer Class:        AggregateAndRecommendReducer
> Job Input Directory:  hdfs://nameservice1/itemrec/temp/partialMultiply
> Job Output Directory: hdfs://nameservice1/itemrec/output/
> 
> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> 
> Why does Mahout return 0 rows? It works when booleanData=true (are preferences
> ignored then?)

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

All bad things happen here:

Name:                 RecommenderJob-PartialMultiplyMapper-Reducer
User:                 oozie
Process User:         oozie
Group:                oozie
Mapper Class:         PartialMultiplyMapper
Reducer Class:        AggregateAndRecommendReducer
Job Input Directory:  hdfs://nameservice1/itemrec/temp/partialMultiply
Job Output Directory: hdfs://nameservice1/itemrec/output/

14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0

Why does Mahout return 0 rows? It works when booleanData=true (are preferences
ignored then?)



2014-07-20 23:19 GMT+04:00 Serega Sheypak <se...@gmail.com>:

> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> users_file:
> --inverted_item_id
> -1
> -2
> -3
> -4
>
> users_items_prefs
> --inverted item_id
> -1 1 1.0
> -2 2 1.0
> -3 3 1.0
> -4 4 1.0
> --user_id item_id pref_value
> 11   1 1.6
> 11   2 1.6
> 123 3 2.0
> 123 4 2.0
> 333 1 2.0
> 333 2 1.6
> --e.t.c.
>
> if I set --booleanData true
> then mahout returns the result.

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Serega Sheypak <se...@gmail.com>.

The version is: CDH-4.7.0-1.cdh4.7.0.p0.40

users_file:
--inverted_item_id
-1
-2
-3
-4

users_items_prefs:
--inverted item_id
-1 1 1.0
-2 2 1.0
-3 3 1.0
-4 4 1.0
--user_id item_id pref_value
11 1 1.6
11 2 1.6
123 3 2.0
123 4 2.0
333 1 2.0
333 2 1.6
--etc.

If I set --booleanData true, Mahout returns a result.
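
For reference, a sketch of how the real (non-fake) part of such a preference file could be produced from raw events. The weights 1.0, 1.6 and 1.9 are the ones mentioned in the original post; the action names ("view", "cart", "purchase"), the function name to_pref_lines, the rule of keeping the highest weight per user/item pair, and the comma-separated output are all assumptions made for illustration. As far as I know Mahout's Hadoop jobs split input lines on commas or tabs, but that should be verified for the version in use.

```python
# Hypothetical action names; only the three weight values come from the thread.
ACTION_WEIGHTS = {"view": 1.0, "cart": 1.6, "purchase": 1.9}


def to_pref_lines(events):
    """Turn (user_id, item_id, action) events into "user,item,pref" lines.

    When a user touched the same item several times, keep the highest weight
    (an assumed policy; summing or averaging would be other options).
    """
    best = {}
    for user_id, item_id, action in events:
        weight = ACTION_WEIGHTS[action]
        key = (user_id, item_id)
        if weight > best.get(key, 0.0):
            best[key] = weight
    return [f"{user},{item},{weight}" for (user, item), weight in sorted(best.items())]


if __name__ == "__main__":
    events = [(11, 1, "view"), (11, 1, "cart"), (11, 2, "cart"), (333, 2, "purchase")]
    for line in to_pref_lines(events):
        print(line)
```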




2014-07-20 23:12 GMT+04:00 Andrew Musselman <an...@gmail.com>:

> I'm confused about how you're constructing the user file, and why there
> are negated item ids here.
>
> Can you post some more details please, including Mahout version and some
> sample data sets?
>

Re: recommenditembased returns 0 records from last map-reduce job

Posted by Andrew Musselman <an...@gmail.com>.

I'm confused about how you're constructing the user file, and why there are negated item ids here.

Can you post some more details please, including Mahout version and some sample data sets?
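For readers following the thread, here is a minimal sketch of the input layout as the quoted message below describes it: real (user_id, item_id, weight) visits, plus one pseudo-user per item whose id is the negated item id. The sample ids and the helper names (build_preferences, to_csv_lines, users_file_lines) are hypothetical, not taken from the poster's data. It assumes the comma-separated userID,itemID,preference text input and a usersFile with one user id per line.

```python
def build_preferences(visits, item_ids):
    """Return (user_id, item_id, weight) rows: the real visits plus one
    pseudo-user row per item, with user_id = -item_id and weight 1.0."""
    rows = list(visits)
    rows.extend((-item_id, item_id, 1.0) for item_id in item_ids)
    return rows


def to_csv_lines(rows):
    """Format rows as comma-separated text, one preference per line."""
    return [f"{user_id},{item_id},{weight}" for user_id, item_id, weight in rows]


def users_file_lines(item_ids):
    """One pseudo-user id per line, i.e. the negated item ids."""
    return [str(-item_id) for item_id in item_ids]


# Hypothetical sample data; weights follow the [1.0, 1.6, 1.9] scheme.
visits = [(1001, 10, 1.0), (1001, 20, 1.6), (1002, 10, 1.9)]
item_ids = [10, 20]

input_lines = to_csv_lines(build_preferences(visits, item_ids))
users_lines = users_file_lines(item_ids)
```

A small sample like this, together with the Mahout version, is the kind of detail that would make the setup reproducible.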

> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <se...@gmail.com> wrote:
> 
> Hi, I'm trying to compute item similarity.
> I gather the items that users visit while shopping and then create a file:
> user_id, item_id, weight (where weight can be one of [1.0, 1.6, 1.9], depending
> on the user action type and data source)
> UNION
> -item_id, item_id, 1.0 (from the items dictionary)
> 
> I also provide a usersFile, where user_id = -item_id.
> 
> The idea is to get item similarity. If any user visits an item named "A", I want
> to show him items "B", "c", "xxx" using the preferences of other users.
> 
> The problem is that what appears to be the last MapReduce job returns 0 rows:
> 
> Here are my settings:
> 
> 
> sudo -u oozie mahout recommenditembased \
>                    --input visited_items_with_inverted_items \
>                    --output result \
>                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>                    --usersFile inverted_items \
>                    --numRecommendations 500 \
>                    --booleanData false \
>                    --maxPrefsPerUser 100 \
>                    --maxSimilaritiesPerItem 500 \
>                    --minPrefsPerUser 0 \
>                    --maxPrefsPerUserInItemSimilarity 30 \
>                    --threshold 0.91 \
>                    --tempDir temp
> 
> Here are some of the counters. I don't understand what they mean:
> 
> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> 
> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> 
> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> 
> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> 
> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> 
> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> 
> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> 
> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> 
> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> 
> --------
> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> --------
> 
> why 0???
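
When reading a long JobClient log like the one quoted above, it can help to extract the counters mechanically and list where a value drops to zero. The following is a rough sketch, assuming each counter sits on a single line in the "NAME=value" form shown in the log. The function names are hypothetical, and the regular expression only covers the line shapes that appear in this thread.

```python
import re

# Matches lines such as:
#   14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
# An email quote prefix ("> ") in front of the line is tolerated because
# the pattern is searched for, not anchored at the start of the line.
COUNTER_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO mapred\.JobClient:\s+"
    r"(?P<name>[A-Za-z_ ]+?)=(?P<value>[\d,]+)\s*$"
)


def parse_counters(log_text):
    """Return a list of (timestamp, counter_name, value) tuples."""
    counters = []
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            # Some values in the log carry thousands separators (1,798,738).
            value = int(match["value"].replace(",", ""))
            counters.append((match["ts"], match["name"], value))
    return counters


def zero_counters(counters):
    """Return (timestamp, counter_name) for every counter equal to zero."""
    return [(ts, name) for ts, name, value in counters if value == 0]
```

Not every zero is a problem (a pruning counter can legitimately be 0), so the output is only a starting point for deciding which job to inspect.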