You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2011/10/11 18:34:26 UTC

RecommenderJob and NaN

I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.

The data is user id's mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread)  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.

Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)

Thanks,
Grant

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 11, 2011, at 2:54 PM, Sean Owen wrote:

> NaN is added for all user item pairs that already exist in the input, to
> make them ineligible for recommendation. That's normal - could this be the
> case?

Trying to track down.  I don't think it is the self case, but not 100% sure.  

> On Oct 11, 2011 7:49 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> 
>> 
>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>> 
>>> Where is the NaN coming up -- what has this value?
>> 
>> simColumn seems to be the originator in the Aggregate step.  For instance,
>> my current breakpoint shows:
>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>> 
>> I can also see some in the PartialMultiplyMapper via the
>> similarityMatrixColumn.
>> 
>> Is that set by SimilarityMatrixRowWrapperMapper?
>> <code>
>> /* remove self similarity */
>>   similarityMatrixRow.set(key.get(), Double.NaN);
>> </code>
>> 
>> 
>> 
>>> It should be propagated in some cases but not others. I'm not aware of
>>> any changes here.
>> 
>> yeah, me neither.  This is all related to MAHOUT-798.
>> 
>>> 
>>> Generally small data sets will have this problem of not being able to
>>> compute much of anything useful, so NaN might be right here.
>>> But you say it was different recently, which seems to rule that out.
>> 
>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's
>> just that's a whole lot harder to debug.
>> 
>>> 
>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>> getting any recommendations due to NaNs being calculated in the
>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>> like this was working as little as two weeks ago (post Sebastian's big
>> change to RecJob), but I don't see a whole lot of changes in that part of
>> the code.
>>>> 
>>>> The data is user id's mapping to email thread ids.  My input data is
>> simply a triple of user id, thread id, 1 (meaning that user participated in
>> that thread)  It seems like I will have a lot of good values in the inputs
>> to the AggregateAndRecommend step, except one id will be NaN and this then
>> seems to get added in and makes everything NaN (I realize this is a very
>> naive understanding).  I sense that I should be looking upstream in the
>> process for a fix, but I am not sure where that is.
>>>> 
>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>> want to try this with a small data set, you can get it here:
>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>> 
>>>> Thanks,
>>>> Grant
>> 
>> 
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Sean Owen <sr...@gmail.com>.

NaN is added for all user item pairs that already exist in the input, to
make them ineligible for recommendation. That's normal - could this be the
case?
On Oct 11, 2011 7:49 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

>
> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>
> > Where is the NaN coming up -- what has this value?
>
> simColumn seems to be the originator in the Aggregate step.  For instance,
> my current breakpoint shows:
> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>
> I can also see some in the PartialMultiplyMapper via the
> similarityMatrixColumn.
>
> Is that set by SimilarityMatrixRowWrapperMapper?
> <code>
> /* remove self similarity */
>    similarityMatrixRow.set(key.get(), Double.NaN);
> </code>
>
>
>
> > It should be propagated in some cases but not others. I'm not aware of
> > any changes here.
>
> yeah, me neither.  This is all related to MAHOUT-798.
>
> >
> > Generally small data sets will have this problem of not being able to
> > compute much of anything useful, so NaN might be right here.
> > But you say it was different recently, which seems to rule that out.
>
> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's
> just that's a whole lot harder to debug.
>
> >
> > On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
> >> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
> getting any recommendations due to NaNs being calculated in the
> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
> like this was working as little as two weeks ago (post Sebastian's big
> change to RecJob), but I don't see a whole lot of changes in that part of
> the code.
> >>
> >> The data is user id's mapping to email thread ids.  My input data is
> simply a triple of user id, thread id, 1 (meaning that user participated in
> that thread)  It seems like I will have a lot of good values in the inputs
> to the AggregateAndRecommend step, except one id will be NaN and this then
> seems to get added in and makes everything NaN (I realize this is a very
> naive understanding).  I sense that I should be looking upstream in the
> process for a fix, but I am not sure where that is.
> >>
> >> Any ideas where I should be looking to eliminate these NaNs?  If you
> want to try this with a small data set, you can get it here:
> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
> >>
> >> Thanks,
> >> Grant
>
>
>

Re: RecommenderJob and NaN

Posted by Lance Norskog <go...@gmail.com>.

I meant running with real data.

On Wed, Oct 12, 2011 at 11:37 PM, Sean Owen <sr...@gmail.com> wrote:

> RecommenderJob? The unit tests run it all the time.
> There should not be any glitches with static variables -- don't think
> there are any.
>
> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
> > Is this job working well for anyone now?
> > When was the last time this job worked for someone?
> >
> > On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gsingers@apache.org
> >wrote:
> >
> >> Both local and on EC2
> >>
> >> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> >>
> >> > Hi Grant,
> >> >
> >> > Just curious, are you running this locally or distributed?
> >> >
> >> > I'd run into a similar issue, though in a completely different
> algorithm
> >> (Jimmy Lin's PageRank implementation) due to the use of a static
> variable.
> >> >
> >> > When running locally, this wasn't getting cleared between loops, and
> thus
> >> I got wonky results.
> >> >
> >> > The same thing would have happened with JVM reuse enabled.
> >> >
> >> > -- Ken
> >> >
> >> > On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >> >
> >> >> Digging some more:
> >> >>
> >> >> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
> >> simColumn of:
> >> >>
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >> >>
> >> >> Which then becomes the numerator and the denom.
> >> >>
> >> >> Looping, my next simCol is:
> >> >>
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >> >>
> >> >> and then
> >> >>
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >> >>
> >> >> ...
> >> >>
> >> >> Each time, those are getting added into the numerators/denoms value,
> >> such that by the time we are done looping (line 161), we have:
> >> >> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >> >> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >> >>
> >> >> numberOfSimilarItemsUsed:
> >> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >> >>
> >> >> Not sure on how to interpret this as I haven't dug into the math here
> >> yet or figured out where those NaN are coming from originally.
> >> >>
> >> >> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >> >>
> >> >>>
> >> >>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >> >>>
> >> >>>>
> >> >>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >> >>>>
> >> >>>>> Where is the NaN coming up -- what has this value?
> >> >>>>
> >> >>>> simColumn seems to be the originator in the Aggregate step.  For
> >> instance, my current breakpoint shows:
> >> >>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >> >>>>
> >> >>>> I can also see some in the PartialMultiplyMapper via the
> >> similarityMatrixColumn.
> >> >>>>
> >> >>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >> >>>> <code>
> >> >>>> /* remove self similarity */
> >> >>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >> >>>> </code>
> >> >>>
> >> >>> Ah, but that is just taking care of itself, so maybe not the issue.
> >> >>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>> It should be propagated in some cases but not others. I'm not
> aware
> >> of
> >> >>>>> any changes here.
> >> >>>>
> >> >>>> yeah, me neither.  This is all related to MAHOUT-798.
> >> >>>>
> >> >>>>>
> >> >>>>> Generally small data sets will have this problem of not being able
> to
> >> >>>>> compute much of anything useful, so NaN might be right here.
> >> >>>>> But you say it was different recently, which seems to rule that
> out.
> >> >>>>
> >> >>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
> >> it's just that's a whole lot harder to debug.
> >> >>>>
> >> >>>>>
> >> >>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
> >> gsingers@apache.org> wrote:
> >> >>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am
> not
> >> getting any recommendations due to NaNs being calculated in the
> >> AggregateAndRecommend step.  I'm not quite sure what is going on as it
> seems
> >> like this was working as little as two weeks ago (post Sebastian's big
> >> change to RecJob), but I don't see a whole lot of changes in that part
> of
> >> the code.
> >> >>>>>>
> >> >>>>>> The data is user id's mapping to email thread ids.  My input data
> is
> >> simply a triple of user id, thread id, 1 (meaning that user participated
> in
> >> that thread)  It seems like I will have a lot of good values in the
> inputs
> >> to the AggregateAndRecommend step, except one id will be NaN and this
> then
> >> seems to get added in and makes everything NaN (I realize this is a very
> >> naive understanding).  I sense that I should be looking upstream in the
> >> process for a fix, but I am not sure where that is.
> >> >>>>>>
> >> >>>>>> Any ideas where I should be looking to eliminate these NaNs?  If
> you
> >> want to try this with a small data set, you can get it here:
> >>
> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(butnote the companion article is not published yet.)
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> Grant
> >> >>>>
> >> >>>>
> >> >>>
> >> >>> --------------------------------------------
> >> >>> Grant Ingersoll
> >> >>> http://www.lucidimagination.com
> >> >>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >> >>>
> >> >>
> >> >> --------------------------------------------
> >> >> Grant Ingersoll
> >> >> http://www.lucidimagination.com
> >> >> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >> >>
> >> >
> >> > --------------------------
> >> > Ken Krugler
> >> > +1 530-210-6378
> >> > http://bixolabs.com
> >> > custom big data solutions & training
> >> > Hadoop, Cascading, Mahout & Solr
> >> >
> >> >
> >> >
> >>
> >> --------------------------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com
> >> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>
> >>
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>



-- 
Lance Norskog
goksron@gmail.com

Re: RecommenderJob and NaN

Posted by Lance Norskog <go...@gmail.com>.

Bingo, I'm getting recs now.

On Fri, Oct 14, 2011 at 8:10 AM, Grant Ingersoll <gs...@apache.org>wrote:

> OK, I believe I checked in a fix.  The issue came down to me generalizing
> the SeqFilesFromMailArchives in terms of the metadata extraction (from, to,
> references, etc.) and the fact that the code I use to extract preferences
> (MailToRecMapper) depended on things being in a specific order.
>
> On Oct 14, 2011, at 2:28 AM, Lance Norskog wrote:
>
> > cd mahout/examples/bin
> > ./build-asf-email.sh content/ out/ over/
> > select 1 for recommender
> >
> > where content/ is
> > content/coccoon.apache.org
> > content/commons.apache.org
> >
> > and out/ and over/ are output directories. Run the shell script with -x
> as
> > you will probably have to tweak it.
> >
> > Lance
> >
> > On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
> >
> >> Only got the raw data, how did you convert it to our standard
> >> recommender input?
> >>
> >> --sebastian
> >>
> >>
> >> On 14.10.2011 01:17, Grant Ingersoll wrote:
> >>> Were you able to get the data, Sebastian?
> >>>
> >>> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> >>>
> >>>> Grant,
> >>>>
> >>>> Can you share a little more details about the results, do you get any
> >>>> exceptions? Or do you just get no results?
> >>>>
> >>>> Using the NaNs inside the similarity matrix vectors has been included
> in
> >>>> the job for a very long time and should not cause any problems. As
> Sean
> >>>> already mentioned we have unit tests with toy data that should catch
> the
> >>>> very obvious errors in this code.
> >>>>
> >>>> Can you share the dataset? I can do a testrun on my research cluster.
> >>>>
> >>>> --sebastian
> >>>>
> >>>> On 13.10.2011 08:37, Sean Owen wrote:
> >>>>> RecommenderJob? The unit tests run it all the time.
> >>>>> There should not be any glitches with static variables -- don't think
> >>>>> there are any.
> >>>>>
> >>>>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com>
> >> wrote:
> >>>>>> Is this job working well for anyone now?
> >>>>>> When was the last time this job worked for someone?
> >>>>>>
> >>>>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <
> >> gsingers@apache.org>wrote:
> >>>>>>
> >>>>>>> Both local and on EC2
> >>>>>>>
> >>>>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> >>>>>>>
> >>>>>>>> Hi Grant,
> >>>>>>>>
> >>>>>>>> Just curious, are you running this locally or distributed?
> >>>>>>>>
> >>>>>>>> I'd run into a similar issue, though in a completely different
> >> algorithm
> >>>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static
> >> variable.
> >>>>>>>>
> >>>>>>>> When running locally, this wasn't getting cleared between loops,
> and
> >> thus
> >>>>>>> I got wonky results.
> >>>>>>>>
> >>>>>>>> The same thing would have happened with JVM reuse enabled.
> >>>>>>>>
> >>>>>>>> -- Ken
> >>>>>>>>
> >>>>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >>>>>>>>
> >>>>>>>>> Digging some more:
> >>>>>>>>>
> >>>>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0,
> a
> >>>>>>> simColumn of:
> >>>>>>>>>
> >>>>>>>
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>>>>>>>>
> >>>>>>>>> Which then becomes the numerator and the denom.
> >>>>>>>>>
> >>>>>>>>> Looping, my next simCol is:
> >>>>>>>>>
> >>>>>>>
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>>>>>>>>
> >>>>>>>>> and then
> >>>>>>>>>
> >>>>>>>
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>>>>>>>>
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>> Each time, those are getting added into the numerators/denoms
> >> value,
> >>>>>>> such that by the time we are done looping (line 161), we have:
> >>>>>>>>> numerators:
> {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>>>>
> >>>>>>>>> numberOfSimilarItemsUsed:
> >>>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>>>>>>>>
> >>>>>>>>> Not sure on how to interpret this as I haven't dug into the math
> >> here
> >>>>>>> yet or figured out where those NaN are coming from originally.
> >>>>>>>>>
> >>>>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Where is the NaN coming up -- what has this value?
> >>>>>>>>>>>
> >>>>>>>>>>> simColumn seems to be the originator in the Aggregate step.
>  For
> >>>>>>> instance, my current breakpoint shows:
> >>>>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>>>>>>>>
> >>>>>>>>>>> I can also see some in the PartialMultiplyMapper via the
> >>>>>>> similarityMatrixColumn.
> >>>>>>>>>>>
> >>>>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>>>>>>>>> <code>
> >>>>>>>>>>> /* remove self similarity */
> >>>>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>>>>>>>>> </code>
> >>>>>>>>>>
> >>>>>>>>>> Ah, but that is just taking care of itself, so maybe not the
> >> issue.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> It should be propagated in some cases but not others. I'm not
> >> aware
> >>>>>>> of
> >>>>>>>>>>>> any changes here.
> >>>>>>>>>>>
> >>>>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Generally small data sets will have this problem of not being
> >> able to
> >>>>>>>>>>>> compute much of anything useful, so NaN might be right here.
> >>>>>>>>>>>> But you say it was different recently, which seems to rule
> that
> >> out.
> >>>>>>>>>>>
> >>>>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on
> >> Hadoop,
> >>>>>>> it's just that's a whole lot harder to debug.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
> >>>>>>> gsingers@apache.org> wrote:
> >>>>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and
> >> am not
> >>>>>>> getting any recommendations due to NaNs being calculated in the
> >>>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as
> >> it seems
> >>>>>>> like this was working as little as two weeks ago (post Sebastian's
> >> big
> >>>>>>> change to RecJob), but I don't see a whole lot of changes in that
> >> part of
> >>>>>>> the code.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The data is user id's mapping to email thread ids.  My input
> >> data is
> >>>>>>> simply a triple of user id, thread id, 1 (meaning that user
> >> participated in
> >>>>>>> that thread)  It seems like I will have a lot of good values in the
> >> inputs
> >>>>>>> to the AggregateAndRecommend step, except one id will be NaN and
> this
> >> then
> >>>>>>> seems to get added in and makes everything NaN (I realize this is a
> >> very
> >>>>>>> naive understanding).  I sense that I should be looking upstream in
> >> the
> >>>>>>> process for a fix, but I am not sure where that is.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?
> >> If you
> >>>>>>> want to try this with a small data set, you can get it here:
> >>>>>>>
> >>
> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(butnotethe companion article is not published yet.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Grant
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --------------------------------------------
> >>>>>>>>>> Grant Ingersoll
> >>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --------------------------------------------
> >>>>>>>>> Grant Ingersoll
> >>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --------------------------
> >>>>>>>> Ken Krugler
> >>>>>>>> +1 530-210-6378
> >>>>>>>> http://bixolabs.com
> >>>>>>>> custom big data solutions & training
> >>>>>>>> Hadoop, Cascading, Mahout & Solr
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> --------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.lucidimagination.com
> >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Lance Norskog
> >>>>>> goksron@gmail.com
> >>>>>>
> >>>>
> >>>
> >>> --------------------------------------------
> >>> Grant Ingersoll
> >>> http://www.lucidimagination.com
> >>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>
> >>>
> >>
> >>
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>
>


-- 
Lance Norskog
goksron@gmail.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

OK, I believe I checked in a fix.  The issue came down to me generalizing the SeqFilesFromMailArchives in terms of the metadata extraction (from, to, references, etc.) and the fact that the code I use to extract preferences (MailToRecMapper) depended on things being in a specific order.

On Oct 14, 2011, at 2:28 AM, Lance Norskog wrote:

> cd mahout/examples/bin
> ./build-asf-email.sh content/ out/ over/
> select 1 for recommender
> 
> where content/ is
> content/coccoon.apache.org
> content/commons.apache.org
> 
> and out/ and over/ are output directories. Run the shell script with -x as
> you will probably have to tweak it.
> 
> Lance
> 
> On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> Only got the raw data, how did you convert it to our standard
>> recommender input?
>> 
>> --sebastian
>> 
>> 
>> On 14.10.2011 01:17, Grant Ingersoll wrote:
>>> Were you able to get the data, Sebastian?
>>> 
>>> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
>>> 
>>>> Grant,
>>>> 
>>>> Can you share a little more details about the results, do you get any
>>>> exceptions? Or do you just get no results?
>>>> 
>>>> Using the NaNs inside the similarity matrix vectors has been included in
>>>> the job for a very long time and should not cause any problems. As Sean
>>>> already mentioned we have unit tests with toy data that should catch the
>>>> very obvious errors in this code.
>>>> 
>>>> Can you share the dataset? I can do a testrun on my research cluster.
>>>> 
>>>> --sebastian
>>>> 
>>>> On 13.10.2011 08:37, Sean Owen wrote:
>>>>> RecommenderJob? The unit tests run it all the time.
>>>>> There should not be any glitches with static variables -- don't think
>>>>> there are any.
>>>>> 
>>>>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com>
>> wrote:
>>>>>> Is this job working well for anyone now?
>>>>>> When was the last time this job worked for someone?
>>>>>> 
>>>>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <
>> gsingers@apache.org>wrote:
>>>>>> 
>>>>>>> Both local and on EC2
>>>>>>> 
>>>>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>>>>> 
>>>>>>>> Hi Grant,
>>>>>>>> 
>>>>>>>> Just curious, are you running this locally or distributed?
>>>>>>>> 
>>>>>>>> I'd run into a similar issue, though in a completely different
>> algorithm
>>>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static
>> variable.
>>>>>>>> 
>>>>>>>> When running locally, this wasn't getting cleared between loops, and
>> thus
>>>>>>> I got wonky results.
>>>>>>>> 
>>>>>>>> The same thing would have happened with JVM reuse enabled.
>>>>>>>> 
>>>>>>>> -- Ken
>>>>>>>> 
>>>>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>>>>> 
>>>>>>>>> Digging some more:
>>>>>>>>> 
>>>>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>>>>> simColumn of:
>>>>>>>>> 
>>>>>>> 
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>>>>> 
>>>>>>>>> Which then becomes the numerator and the denom.
>>>>>>>>> 
>>>>>>>>> Looping, my next simCol is:
>>>>>>>>> 
>>>>>>> 
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>>>>> 
>>>>>>>>> and then
>>>>>>>>> 
>>>>>>> 
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>>>>> 
>>>>>>>>> ...
>>>>>>>>> 
>>>>>>>>> Each time, those are getting added into the numerators/denoms
>> value,
>>>>>>> such that by the time we are done looping (line 161), we have:
>>>>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>>>> 
>>>>>>>>> numberOfSimilarItemsUsed:
>>>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>>>>> 
>>>>>>>>> Not sure on how to interpret this as I haven't dug into the math
>> here
>>>>>>> yet or figured out where those NaN are coming from originally.
>>>>>>>>> 
>>>>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>>>>> 
>>>>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>>>>> instance, my current breakpoint shows:
>>>>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>>>>> 
>>>>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>>>>> similarityMatrixColumn.
>>>>>>>>>>> 
>>>>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>>>>> <code>
>>>>>>>>>>> /* remove self similarity */
>>>>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>>>>> </code>
>>>>>>>>>> 
>>>>>>>>>> Ah, but that is just taking care of itself, so maybe not the
>> issue.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> It should be propagated in some cases but not others. I'm not
>> aware
>>>>>>> of
>>>>>>>>>>>> any changes here.
>>>>>>>>>>> 
>>>>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Generally small data sets will have this problem of not being
>> able to
>>>>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>>>>> But you say it was different recently, which seems to rule that
>> out.
>>>>>>>>>>> 
>>>>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on
>> Hadoop,
>>>>>>> it's just that's a whole lot harder to debug.
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>>>>> gsingers@apache.org> wrote:
>>>>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and
>> am not
>>>>>>> getting any recommendations due to NaNs being calculated in the
>>>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as
>> it seems
>>>>>>> like this was working as little as two weeks ago (post Sebastian's
>> big
>>>>>>> change to RecJob), but I don't see a whole lot of changes in that
>> part of
>>>>>>> the code.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The data is user id's mapping to email thread ids.  My input
>> data is
>>>>>>> simply a triple of user id, thread id, 1 (meaning that user
>> participated in
>>>>>>> that thread)  It seems like I will have a lot of good values in the
>> inputs
>>>>>>> to the AggregateAndRecommend step, except one id will be NaN and this
>> then
>>>>>>> seems to get added in and makes everything NaN (I realize this is a
>> very
>>>>>>> naive understanding).  I sense that I should be looking upstream in
>> the
>>>>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?
>> If you
>>>>>>> want to try this with a small data set, you can get it here:
>>>>>>> 
>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(butnote the companion article is not published yet.)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Grant
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --------------------------------------------
>>>>>>>>>> Grant Ingersoll
>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --------------------------------------------
>>>>>>>>> Grant Ingersoll
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --------------------------
>>>>>>>> Ken Krugler
>>>>>>>> +1 530-210-6378
>>>>>>>> http://bixolabs.com
>>>>>>>> custom big data solutions & training
>>>>>>>> Hadoop, Cascading, Mahout & Solr
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>>>>> 
>>>> 
>>> 
>>> --------------------------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

FYI, I think I see the problem.  Working on a fix.

On Oct 14, 2011, at 2:28 AM, Lance Norskog wrote:

> cd mahout/examples/bin
> ./build-asf-email.sh content/ out/ over/
> select 1 for recommender
> 
> where content/ is
> content/coccoon.apache.org
> content/commons.apache.org
> 
> and out/ and over/ are output directories. Run the shell script with -x as
> you will probably have to tweak it.
> 
> Lance
> 
> On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> Only got the raw data, how did you convert it to our standard
>> recommender input?
>> 
>> --sebastian
>> 
>> 
>> On 14.10.2011 01:17, Grant Ingersoll wrote:
>>> Were you able to get the data, Sebastian?
>>> 
>>> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
>>> 
>>>> Grant,
>>>> 
>>>> Can you share a little more details about the results, do you get any
>>>> exceptions? Or do you just get no results?
>>>> 
>>>> Using the NaNs inside the similarity matrix vectors has been included in
>>>> the job for a very long time and should not cause any problems. As Sean
>>>> already mentioned we have unit tests with toy data that should catch the
>>>> very obvious errors in this code.
>>>> 
>>>> Can you share the dataset? I can do a testrun on my research cluster.
>>>> 
>>>> --sebastian
>>>> 
>>>> On 13.10.2011 08:37, Sean Owen wrote:
>>>>> RecommenderJob? The unit tests run it all the time.
>>>>> There should not be any glitches with static variables -- don't think
>>>>> there are any.
>>>>> 
>>>>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com>
>> wrote:
>>>>>> Is this job working well for anyone now?
>>>>>> When was the last time this job worked for someone?
>>>>>> 
>>>>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <
>> gsingers@apache.org>wrote:
>>>>>> 
>>>>>>> Both local and on EC2
>>>>>>> 
>>>>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>>>>> 
>>>>>>>> Hi Grant,
>>>>>>>> 
>>>>>>>> Just curious, are you running this locally or distributed?
>>>>>>>> 
>>>>>>>> I'd run into a similar issue, though in a completely different
>> algorithm
>>>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static
>> variable.
>>>>>>>> 
>>>>>>>> When running locally, this wasn't getting cleared between loops, and
>> thus
>>>>>>> I got wonky results.
>>>>>>>> 
>>>>>>>> The same thing would have happened with JVM reuse enabled.
>>>>>>>> 
>>>>>>>> -- Ken
>>>>>>>> 
>>>>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>>>>> 
>>>>>>>>> Digging some more:
>>>>>>>>> 
>>>>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>>>>> simColumn of:
>>>>>>>>> 
>>>>>>> 
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>>>>> 
>>>>>>>>> Which then becomes the numerator and the denom.
>>>>>>>>> 
>>>>>>>>> Looping, my next simCol is:
>>>>>>>>> 
>>>>>>> 
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>>>>> 
>>>>>>>>> and then
>>>>>>>>> 
>>>>>>> 
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>>>>> 
>>>>>>>>> ...
>>>>>>>>> 
>>>>>>>>> Each time, those are getting added into the numerators/denoms
>> value,
>>>>>>> such that by the time we are done looping (line 161), we have:
>>>>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>>>> 
>>>>>>>>> numberOfSimilarItemsUsed:
>>>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>>>>> 
>>>>>>>>> Not sure on how to interpret this as I haven't dug into the math
>> here
>>>>>>> yet or figured out where those NaN are coming from originally.
>>>>>>>>> 
>>>>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>>>>> 
>>>>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>>>>> instance, my current breakpoint shows:
>>>>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>>>>> 
>>>>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>>>>> similarityMatrixColumn.
>>>>>>>>>>> 
>>>>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>>>>> <code>
>>>>>>>>>>> /* remove self similarity */
>>>>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>>>>> </code>
>>>>>>>>>> 
>>>>>>>>>> Ah, but that is just taking care of itself, so maybe not the
>> issue.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> It should be propagated in some cases but not others. I'm not
>> aware
>>>>>>> of
>>>>>>>>>>>> any changes here.
>>>>>>>>>>> 
>>>>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Generally small data sets will have this problem of not being
>> able to
>>>>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>>>>> But you say it was different recently, which seems to rule that
>> out.
>>>>>>>>>>> 
>>>>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on
>> Hadoop,
>>>>>>> it's just that's a whole lot harder to debug.
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>>>>> gsingers@apache.org> wrote:
>>>>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and
>> am not
>>>>>>> getting any recommendations due to NaNs being calculated in the
>>>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as
>> it seems
>>>>>>> like this was working as little as two weeks ago (post Sebastian's
>> big
>>>>>>> change to RecJob), but I don't see a whole lot of changes in that
>> part of
>>>>>>> the code.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The data is user id's mapping to email thread ids.  My input
>> data is
>>>>>>> simply a triple of user id, thread id, 1 (meaning that user
>> participated in
>>>>>>> that thread)  It seems like I will have a lot of good values in the
>> inputs
>>>>>>> to the AggregateAndRecommend step, except one id will be NaN and this
>> then
>>>>>>> seems to get added in and makes everything NaN (I realize this is a
>> very
>>>>>>> naive understanding).  I sense that I should be looking upstream in
>> the
>>>>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?
>> If you
>>>>>>> want to try this with a small data set, you can get it here:
>>>>>>> 
>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(butnote the companion article is not published yet.)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Grant
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --------------------------------------------
>>>>>>>>>> Grant Ingersoll
>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --------------------------------------------
>>>>>>>>> Grant Ingersoll
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --------------------------
>>>>>>>> Ken Krugler
>>>>>>>> +1 530-210-6378
>>>>>>>> http://bixolabs.com
>>>>>>>> custom big data solutions & training
>>>>>>>> Hadoop, Cascading, Mahout & Solr
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>>>>> 
>>>> 
>>> 
>>> --------------------------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Lance Norskog <go...@gmail.com>.

cd mahout/examples/bin
./build-asf-email.sh content/ out/ over/
select 1 for recommender

where content/ is
content/coccoon.apache.org
content/commons.apache.org

and out/ and over/ are output directories. Run the shell script with -x as
you will probably have to tweak it.

Lance

On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Only got the raw data, how did you convert it to our standard
> recommender input?
>
> --sebastian
>
>
> On 14.10.2011 01:17, Grant Ingersoll wrote:
> > Were you able to get the data, Sebastian?
> >
> > On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> >
> >> Grant,
> >>
> >> Can you share a little more details about the results, do you get any
> >> exceptions? Or do you just get no results?
> >>
> >> Using the NaNs inside the similarity matrix vectors has been included in
> >> the job for a very long time and should not cause any problems. As Sean
> >> already mentioned we have unit tests with toy data that should catch the
> >> very obvious errors in this code.
> >>
> >> Can you share the dataset? I can do a testrun on my research cluster.
> >>
> >> --sebastian
> >>
> >> On 13.10.2011 08:37, Sean Owen wrote:
> >>> RecommenderJob? The unit tests run it all the time.
> >>> There should not be any glitches with static variables -- don't think
> >>> there are any.
> >>>
> >>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com>
> wrote:
> >>>> Is this job working well for anyone now?
> >>>> When was the last time this job worked for someone?
> >>>>
> >>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <
> gsingers@apache.org>wrote:
> >>>>
> >>>>> Both local and on EC2
> >>>>>
> >>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> >>>>>
> >>>>>> Hi Grant,
> >>>>>>
> >>>>>> Just curious, are you running this locally or distributed?
> >>>>>>
> >>>>>> I'd run into a similar issue, though in a completely different
> algorithm
> >>>>> (Jimmy Lin's PageRank implementation) due to the use of a static
> variable.
> >>>>>>
> >>>>>> When running locally, this wasn't getting cleared between loops, and
> thus
> >>>>> I got wonky results.
> >>>>>>
> >>>>>> The same thing would have happened with JVM reuse enabled.
> >>>>>>
> >>>>>> -- Ken
> >>>>>>
> >>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >>>>>>
> >>>>>>> Digging some more:
> >>>>>>>
> >>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
> >>>>> simColumn of:
> >>>>>>>
> >>>>>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>>>>>>
> >>>>>>> Which then becomes the numerator and the denom.
> >>>>>>>
> >>>>>>> Looping, my next simCol is:
> >>>>>>>
> >>>>>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>>>>>>
> >>>>>>> and then
> >>>>>>>
> >>>>>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>>>>>>
> >>>>>>> ...
> >>>>>>>
> >>>>>>> Each time, those are getting added into the numerators/denoms
> value,
> >>>>> such that by the time we are done looping (line 161), we have:
> >>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>>
> >>>>>>> numberOfSimilarItemsUsed:
> >>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>>>>>>
> >>>>>>> Not sure on how to interpret this as I haven't dug into the math
> here
> >>>>> yet or figured out where those NaN are coming from originally.
> >>>>>>>
> >>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>>>>>>
> >>>>>>>>>> Where is the NaN coming up -- what has this value?
> >>>>>>>>>
> >>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
> >>>>> instance, my current breakpoint shows:
> >>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>>>>>>
> >>>>>>>>> I can also see some in the PartialMultiplyMapper via the
> >>>>> similarityMatrixColumn.
> >>>>>>>>>
> >>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>>>>>>> <code>
> >>>>>>>>> /* remove self similarity */
> >>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>>>>>>> </code>
> >>>>>>>>
> >>>>>>>> Ah, but that is just taking care of itself, so maybe not the
> issue.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> It should be propagated in some cases but not others. I'm not
> aware
> >>>>> of
> >>>>>>>>>> any changes here.
> >>>>>>>>>
> >>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Generally small data sets will have this problem of not being
> able to
> >>>>>>>>>> compute much of anything useful, so NaN might be right here.
> >>>>>>>>>> But you say it was different recently, which seems to rule that
> out.
> >>>>>>>>>
> >>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on
> Hadoop,
> >>>>> it's just that's a whole lot harder to debug.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
> >>>>> gsingers@apache.org> wrote:
> >>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and
> am not
> >>>>> getting any recommendations due to NaNs being calculated in the
> >>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as
> it seems
> >>>>> like this was working as little as two weeks ago (post Sebastian's
> big
> >>>>> change to RecJob), but I don't see a whole lot of changes in that
> part of
> >>>>> the code.
> >>>>>>>>>>>
> >>>>>>>>>>> The data is user id's mapping to email thread ids.  My input
> data is
> >>>>> simply a triple of user id, thread id, 1 (meaning that user
> participated in
> >>>>> that thread)  It seems like I will have a lot of good values in the
> inputs
> >>>>> to the AggregateAndRecommend step, except one id will be NaN and this
> then
> >>>>> seems to get added in and makes everything NaN (I realize this is a
> very
> >>>>> naive understanding).  I sense that I should be looking upstream in
> the
> >>>>> process for a fix, but I am not sure where that is.
> >>>>>>>>>>>
> >>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?
>  If you
> >>>>> want to try this with a small data set, you can get it here:
> >>>>>
> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(butnote the companion article is not published yet.)
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Grant
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --------------------------------------------
> >>>>>>>> Grant Ingersoll
> >>>>>>>> http://www.lucidimagination.com
> >>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>>
> >>>>>>>
> >>>>>>> --------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.lucidimagination.com
> >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>
> >>>>>>
> >>>>>> --------------------------
> >>>>>> Ken Krugler
> >>>>>> +1 530-210-6378
> >>>>>> http://bixolabs.com
> >>>>>> custom big data solutions & training
> >>>>>> Hadoop, Cascading, Mahout & Solr
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --------------------------------------------
> >>>>> Grant Ingersoll
> >>>>> http://www.lucidimagination.com
> >>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goksron@gmail.com
> >>>>
> >>
> >
> > --------------------------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >
> >
>
>


-- 
Lance Norskog
goksron@gmail.com

Re: RecommenderJob and NaN

Posted by Sebastian Schelter <ss...@apache.org>.

Only got the raw data, how did you convert it to our standard
recommender input?

--sebastian


On 14.10.2011 01:17, Grant Ingersoll wrote:
> Were you able to get the data, Sebastian?
> 
> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> 
>> Grant,
>>
>> Can you share a little more details about the results, do you get any
>> exceptions? Or do you just get no results?
>>
>> Using the NaNs inside the similarity matrix vectors has been included in
>> the job for a very long time and should not cause any problems. As Sean
>> already mentioned we have unit tests with toy data that should catch the
>> very obvious errors in this code.
>>
>> Can you share the dataset? I can do a testrun on my research cluster.
>>
>> --sebastian
>>
>> On 13.10.2011 08:37, Sean Owen wrote:
>>> RecommenderJob? The unit tests run it all the time.
>>> There should not be any glitches with static variables -- don't think
>>> there are any.
>>>
>>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
>>>> Is this job working well for anyone now?
>>>> When was the last time this job worked for someone?
>>>>
>>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>>>>
>>>>> Both local and on EC2
>>>>>
>>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>>>
>>>>>> Hi Grant,
>>>>>>
>>>>>> Just curious, are you running this locally or distributed?
>>>>>>
>>>>>> I'd run into a similar issue, though in a completely different algorithm
>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>>>
>>>>>> When running locally, this wasn't getting cleared between loops, and thus
>>>>> I got wonky results.
>>>>>>
>>>>>> The same thing would have happened with JVM reuse enabled.
>>>>>>
>>>>>> -- Ken
>>>>>>
>>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>>>
>>>>>>> Digging some more:
>>>>>>>
>>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>>> simColumn of:
>>>>>>>
>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>>>
>>>>>>> Which then becomes the numerator and the denom.
>>>>>>>
>>>>>>> Looping, my next simCol is:
>>>>>>>
>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>>>
>>>>>>> and then
>>>>>>>
>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> Each time, those are getting added into the numerators/denoms value,
>>>>> such that by the time we are done looping (line 161), we have:
>>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>>
>>>>>>> numberOfSimilarItemsUsed:
>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>>>
>>>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>>>> yet or figured out where those NaN are coming from originally.
>>>>>>>
>>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>>>
>>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>>>
>>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>>> instance, my current breakpoint shows:
>>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>>>
>>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>>> similarityMatrixColumn.
>>>>>>>>>
>>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>>> <code>
>>>>>>>>> /* remove self similarity */
>>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>>> </code>
>>>>>>>>
>>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> It should be propagated in some cases but not others. I'm not aware
>>>>> of
>>>>>>>>>> any changes here.
>>>>>>>>>
>>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>>>
>>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>>>> it's just that's a whole lot harder to debug.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>>> gsingers@apache.org> wrote:
>>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>>> getting any recommendations due to NaNs being calculated in the
>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>>>>> like this was working as little as two weeks ago (post Sebastian's big
>>>>> change to RecJob), but I don't see a whole lot of changes in that part of
>>>>> the code.
>>>>>>>>>>>
>>>>>>>>>>> The data is user id's mapping to email thread ids.  My input data is
>>>>> simply a triple of user id, thread id, 1 (meaning that user participated in
>>>>> that thread)  It seems like I will have a lot of good values in the inputs
>>>>> to the AggregateAndRecommend step, except one id will be NaN and this then
>>>>> seems to get added in and makes everything NaN (I realize this is a very
>>>>> naive understanding).  I sense that I should be looking upstream in the
>>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>>>> want to try this with a small data set, you can get it here:
>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Grant
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com
>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>
>>>>>>
>>>>>> --------------------------
>>>>>> Ken Krugler
>>>>>> +1 530-210-6378
>>>>>> http://bixolabs.com
>>>>>> custom big data solutions & training
>>>>>> Hadoop, Cascading, Mahout & Solr
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --------------------------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com
>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>>>
>>
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> 
>

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

Were you able to get the data, Sebastian?

On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:

> Grant,
> 
> Can you share a little more details about the results, do you get any
> exceptions? Or do you just get no results?
> 
> Using the NaNs inside the similarity matrix vectors has been included in
> the job for a very long time and should not cause any problems. As Sean
> already mentioned we have unit tests with toy data that should catch the
> very obvious errors in this code.
> 
> Can you share the dataset? I can do a testrun on my research cluster.
> 
> --sebastian
> 
> On 13.10.2011 08:37, Sean Owen wrote:
>> RecommenderJob? The unit tests run it all the time.
>> There should not be any glitches with static variables -- don't think
>> there are any.
>> 
>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
>>> Is this job working well for anyone now?
>>> When was the last time this job worked for someone?
>>> 
>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>>> 
>>>> Both local and on EC2
>>>> 
>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>> 
>>>>> Hi Grant,
>>>>> 
>>>>> Just curious, are you running this locally or distributed?
>>>>> 
>>>>> I'd run into a similar issue, though in a completely different algorithm
>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>> 
>>>>> When running locally, this wasn't getting cleared between loops, and thus
>>>> I got wonky results.
>>>>> 
>>>>> The same thing would have happened with JVM reuse enabled.
>>>>> 
>>>>> -- Ken
>>>>> 
>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>> 
>>>>>> Digging some more:
>>>>>> 
>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>> simColumn of:
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>> 
>>>>>> Which then becomes the numerator and the denom.
>>>>>> 
>>>>>> Looping, my next simCol is:
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>> 
>>>>>> and then
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>> 
>>>>>> ...
>>>>>> 
>>>>>> Each time, those are getting added into the numerators/denoms value,
>>>> such that by the time we are done looping (line 161), we have:
>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>> 
>>>>>> numberOfSimilarItemsUsed:
>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>> 
>>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>>> yet or figured out where those NaN are coming from originally.
>>>>>> 
>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>> 
>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>> 
>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>> instance, my current breakpoint shows:
>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>> 
>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>> similarityMatrixColumn.
>>>>>>>> 
>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>> <code>
>>>>>>>> /* remove self similarity */
>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>> </code>
>>>>>>> 
>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> It should be propagated in some cases but not others. I'm not aware
>>>> of
>>>>>>>>> any changes here.
>>>>>>>> 
>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>> 
>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>>> it's just that's a whole lot harder to debug.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>> gsingers@apache.org> wrote:
>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>> getting any recommendations due to NaNs being calculated in the
>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>>>> like this was working as little as two weeks ago (post Sebastian's big
>>>> change to RecJob), but I don't see a whole lot of changes in that part of
>>>> the code.
>>>>>>>>>> 
>>>>>>>>>> The data is user id's mapping to email thread ids.  My input data is
>>>> simply a triple of user id, thread id, 1 (meaning that user participated in
>>>> that thread)  It seems like I will have a lot of good values in the inputs
>>>> to the AggregateAndRecommend step, except one id will be NaN and this then
>>>> seems to get added in and makes everything NaN (I realize this is a very
>>>> naive understanding).  I sense that I should be looking upstream in the
>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>> 
>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>>> want to try this with a small data set, you can get it here:
>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Grant
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>> 
>>>>>> 
>>>>>> --------------------------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com
>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>> 
>>>>> 
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://bixolabs.com
>>>>> custom big data solutions & training
>>>>> Hadoop, Cascading, Mahout & Solr
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --------------------------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com
>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Ted Dunning <te...@gmail.com>.

Usage within AWS is a neighborly thing to do.

But yes, Amazon donates this bandwidth.

On Thu, Oct 13, 2011 at 8:11 PM, Lance Norskog <go...@gmail.com> wrote:

> Is the Apache public download bandwidth donated by Amazon? Or should we try
> to keep usage within AWS?
>
> On Thu, Oct 13, 2011 at 3:47 AM, Grant Ingersoll <gsingers@apache.org
> >wrote:
>
> >
> > On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> >
> > > Grant,
> > >
> > > Can you share a little more details about the results, do you get any
> > > exceptions? Or do you just get no results?
> >
> > No results.
> >
> > >
> > > Using the NaNs inside the similarity matrix vectors has been included
> in
> > > the job for a very long time and should not cause any problems. As Sean
> > > already mentioned we have unit tests with toy data that should catch
> the
> > > very obvious errors in this code.
> >
> > Yeah, I don't know what happened.  I know I was getting results as little
> > as two weeks ago.  I will try rolling back to an earlier commit.
> >
> > >
> > > Can you share the dataset? I can do a testrun on my research cluster.
> >
> > I already have earlier in this thread.  There is a small set via the link
> > below or you can use the ASF email public dataset on Amazon or any subset
> of
> > it.
> >
> >
> > >
> > > --sebastian
> > >
> > > On 13.10.2011 08:37, Sean Owen wrote:
> > >> RecommenderJob? The unit tests run it all the time.
> > >> There should not be any glitches with static variables -- don't think
> > >> there are any.
> > >>
> > >> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com>
> > wrote:
> > >>> Is this job working well for anyone now?
> > >>> When was the last time this job worked for someone?
> > >>>
> > >>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <
> gsingers@apache.org
> > >wrote:
> > >>>
> > >>>> Both local and on EC2
> > >>>>
> > >>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> > >>>>
> > >>>>> Hi Grant,
> > >>>>>
> > >>>>> Just curious, are you running this locally or distributed?
> > >>>>>
> > >>>>> I'd run into a similar issue, though in a completely different
> > algorithm
> > >>>> (Jimmy Lin's PageRank implementation) due to the use of a static
> > variable.
> > >>>>>
> > >>>>> When running locally, this wasn't getting cleared between loops,
> and
> > thus
> > >>>> I got wonky results.
> > >>>>>
> > >>>>> The same thing would have happened with JVM reuse enabled.
> > >>>>>
> > >>>>> -- Ken
> > >>>>>
> > >>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> > >>>>>
> > >>>>>> Digging some more:
> > >>>>>>
> > >>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0,
> a
> > >>>> simColumn of:
> > >>>>>>
> > >>>>
> >
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> > >>>>>>
> > >>>>>> Which then becomes the numerator and the denom.
> > >>>>>>
> > >>>>>> Looping, my next simCol is:
> > >>>>>>
> > >>>>
> >
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> > >>>>>>
> > >>>>>> and then
> > >>>>>>
> > >>>>
> >
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> > >>>>>>
> > >>>>>> ...
> > >>>>>>
> > >>>>>> Each time, those are getting added into the numerators/denoms
> value,
> > >>>> such that by the time we are done looping (line 161), we have:
> > >>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> > >>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> > >>>>>>
> > >>>>>> numberOfSimilarItemsUsed:
> > >>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> > >>>>>>
> > >>>>>> Not sure on how to interpret this as I haven't dug into the math
> > here
> > >>>> yet or figured out where those NaN are coming from originally.
> > >>>>>>
> > >>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> > >>>>>>>>
> > >>>>>>>>> Where is the NaN coming up -- what has this value?
> > >>>>>>>>
> > >>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
> > >>>> instance, my current breakpoint shows:
> > >>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> > >>>>>>>>
> > >>>>>>>> I can also see some in the PartialMultiplyMapper via the
> > >>>> similarityMatrixColumn.
> > >>>>>>>>
> > >>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> > >>>>>>>> <code>
> > >>>>>>>> /* remove self similarity */
> > >>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> > >>>>>>>> </code>
> > >>>>>>>
> > >>>>>>> Ah, but that is just taking care of itself, so maybe not the
> issue.
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> It should be propagated in some cases but not others. I'm not
> > aware
> > >>>> of
> > >>>>>>>>> any changes here.
> > >>>>>>>>
> > >>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Generally small data sets will have this problem of not being
> > able to
> > >>>>>>>>> compute much of anything useful, so NaN might be right here.
> > >>>>>>>>> But you say it was different recently, which seems to rule that
> > out.
> > >>>>>>>>
> > >>>>>>>> I also _believe_ I'm seeing it in a much larger data set on
> > Hadoop,
> > >>>> it's just that's a whole lot harder to debug.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
> > >>>> gsingers@apache.org> wrote:
> > >>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and
> am
> > not
> > >>>> getting any recommendations due to NaNs being calculated in the
> > >>>> AggregateAndRecommend step.  I'm not quite sure what is going on as
> it
> > seems
> > >>>> like this was working as little as two weeks ago (post Sebastian's
> big
> > >>>> change to RecJob), but I don't see a whole lot of changes in that
> part
> > of
> > >>>> the code.
> > >>>>>>>>>>
> > >>>>>>>>>> The data is user id's mapping to email thread ids.  My input
> > data is
> > >>>> simply a triple of user id, thread id, 1 (meaning that user
> > participated in
> > >>>> that thread)  It seems like I will have a lot of good values in the
> > inputs
> > >>>> to the AggregateAndRecommend step, except one id will be NaN and
> this
> > then
> > >>>> seems to get added in and makes everything NaN (I realize this is a
> > very
> > >>>> naive understanding).  I sense that I should be looking upstream in
> > the
> > >>>> process for a fix, but I am not sure where that is.
> > >>>>>>>>>>
> > >>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?
>  If
> > you
> > >>>> want to try this with a small data set, you can get it here:
> > >>>>
> >
> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(butnotethe companion article is not published yet.)
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> Grant
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> --------------------------------------------
> > >>>>>>> Grant Ingersoll
> > >>>>>>> http://www.lucidimagination.com
> > >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>>>>
> > >>>>>>
> > >>>>>> --------------------------------------------
> > >>>>>> Grant Ingersoll
> > >>>>>> http://www.lucidimagination.com
> > >>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>>>
> > >>>>>
> > >>>>> --------------------------
> > >>>>> Ken Krugler
> > >>>>> +1 530-210-6378
> > >>>>> http://bixolabs.com
> > >>>>> custom big data solutions & training
> > >>>>> Hadoop, Cascading, Mahout & Solr
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> --------------------------------------------
> > >>>> Grant Ingersoll
> > >>>> http://www.lucidimagination.com
> > >>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Lance Norskog
> > >>> goksron@gmail.com
> > >>>
> > >
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >
> >
> >
> >
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: RecommenderJob and NaN

Posted by Lance Norskog <go...@gmail.com>.

Is the Apache public download bandwidth donated by Amazon? Or should we try
to keep usage within AWS?

On Thu, Oct 13, 2011 at 3:47 AM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
>
> > Grant,
> >
> > Can you share a little more details about the results, do you get any
> > exceptions? Or do you just get no results?
>
> No results.
>
> >
> > Using the NaNs inside the similarity matrix vectors has been included in
> > the job for a very long time and should not cause any problems. As Sean
> > already mentioned we have unit tests with toy data that should catch the
> > very obvious errors in this code.
>
> Yeah, I don't know what happened.  I know I was getting results as little
> as two weeks ago.  I will try rolling back to an earlier commit.
>
> >
> > Can you share the dataset? I can do a testrun on my research cluster.
>
> I already have earlier in this thread.  There is a small set via the link
> below or you can use the ASF email public dataset on Amazon or any subset of
> it.
>
>
> >
> > --sebastian
> >
> > On 13.10.2011 08:37, Sean Owen wrote:
> >> RecommenderJob? The unit tests run it all the time.
> >> There should not be any glitches with static variables -- don't think
> >> there are any.
> >>
> >> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com>
> wrote:
> >>> Is this job working well for anyone now?
> >>> When was the last time this job worked for someone?
> >>>
> >>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gsingers@apache.org
> >wrote:
> >>>
> >>>> Both local and on EC2
> >>>>
> >>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> >>>>
> >>>>> Hi Grant,
> >>>>>
> >>>>> Just curious, are you running this locally or distributed?
> >>>>>
> >>>>> I'd run into a similar issue, though in a completely different
> algorithm
> >>>> (Jimmy Lin's PageRank implementation) due to the use of a static
> variable.
> >>>>>
> >>>>> When running locally, this wasn't getting cleared between loops, and
> thus
> >>>> I got wonky results.
> >>>>>
> >>>>> The same thing would have happened with JVM reuse enabled.
> >>>>>
> >>>>> -- Ken
> >>>>>
> >>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >>>>>
> >>>>>> Digging some more:
> >>>>>>
> >>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
> >>>> simColumn of:
> >>>>>>
> >>>>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>>>>>
> >>>>>> Which then becomes the numerator and the denom.
> >>>>>>
> >>>>>> Looping, my next simCol is:
> >>>>>>
> >>>>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>>>>>
> >>>>>> and then
> >>>>>>
> >>>>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>> Each time, those are getting added into the numerators/denoms value,
> >>>> such that by the time we are done looping (line 161), we have:
> >>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>
> >>>>>> numberOfSimilarItemsUsed:
> >>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>>>>>
> >>>>>> Not sure on how to interpret this as I haven't dug into the math
> here
> >>>> yet or figured out where those NaN are coming from originally.
> >>>>>>
> >>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>>>>>
> >>>>>>>>> Where is the NaN coming up -- what has this value?
> >>>>>>>>
> >>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
> >>>> instance, my current breakpoint shows:
> >>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>>>>>
> >>>>>>>> I can also see some in the PartialMultiplyMapper via the
> >>>> similarityMatrixColumn.
> >>>>>>>>
> >>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>>>>>> <code>
> >>>>>>>> /* remove self similarity */
> >>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>>>>>> </code>
> >>>>>>>
> >>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> It should be propagated in some cases but not others. I'm not
> aware
> >>>> of
> >>>>>>>>> any changes here.
> >>>>>>>>
> >>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Generally small data sets will have this problem of not being
> able to
> >>>>>>>>> compute much of anything useful, so NaN might be right here.
> >>>>>>>>> But you say it was different recently, which seems to rule that
> out.
> >>>>>>>>
> >>>>>>>> I also _believe_ I'm seeing it in a much larger data set on
> Hadoop,
> >>>> it's just that's a whole lot harder to debug.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
> >>>> gsingers@apache.org> wrote:
> >>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am
> not
> >>>> getting any recommendations due to NaNs being calculated in the
> >>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it
> seems
> >>>> like this was working as little as two weeks ago (post Sebastian's big
> >>>> change to RecJob), but I don't see a whole lot of changes in that part
> of
> >>>> the code.
> >>>>>>>>>>
> >>>>>>>>>> The data is user id's mapping to email thread ids.  My input
> data is
> >>>> simply a triple of user id, thread id, 1 (meaning that user
> participated in
> >>>> that thread)  It seems like I will have a lot of good values in the
> inputs
> >>>> to the AggregateAndRecommend step, except one id will be NaN and this
> then
> >>>> seems to get added in and makes everything NaN (I realize this is a
> very
> >>>> naive understanding).  I sense that I should be looking upstream in
> the
> >>>> process for a fix, but I am not sure where that is.
> >>>>>>>>>>
> >>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If
> you
> >>>> want to try this with a small data set, you can get it here:
> >>>>
> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(butnote the companion article is not published yet.)
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Grant
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> --------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.lucidimagination.com
> >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>
> >>>>>>
> >>>>>> --------------------------------------------
> >>>>>> Grant Ingersoll
> >>>>>> http://www.lucidimagination.com
> >>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>
> >>>>>
> >>>>> --------------------------
> >>>>> Ken Krugler
> >>>>> +1 530-210-6378
> >>>>> http://bixolabs.com
> >>>>> custom big data solutions & training
> >>>>> Hadoop, Cascading, Mahout & Solr
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --------------------------------------------
> >>>> Grant Ingersoll
> >>>> http://www.lucidimagination.com
> >>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Lance Norskog
> >>> goksron@gmail.com
> >>>
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>
>
>
>


-- 
Lance Norskog
goksron@gmail.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:

> Grant,
> 
> Can you share a little more details about the results, do you get any
> exceptions? Or do you just get no results?

No results.

> 
> Using the NaNs inside the similarity matrix vectors has been included in
> the job for a very long time and should not cause any problems. As Sean
> already mentioned we have unit tests with toy data that should catch the
> very obvious errors in this code.

Yeah, I don't know what happened.  I know I was getting results as little as two weeks ago.  I will try rolling back to an earlier commit.

> 
> Can you share the dataset? I can do a testrun on my research cluster.

I already have earlier in this thread.  There is a small set via the link below or you can use the ASF email public dataset on Amazon or any subset of it.


> 
> --sebastian
> 
> On 13.10.2011 08:37, Sean Owen wrote:
>> RecommenderJob? The unit tests run it all the time.
>> There should not be any glitches with static variables -- don't think
>> there are any.
>> 
>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
>>> Is this job working well for anyone now?
>>> When was the last time this job worked for someone?
>>> 
>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>>> 
>>>> Both local and on EC2
>>>> 
>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>> 
>>>>> Hi Grant,
>>>>> 
>>>>> Just curious, are you running this locally or distributed?
>>>>> 
>>>>> I'd run into a similar issue, though in a completely different algorithm
>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>> 
>>>>> When running locally, this wasn't getting cleared between loops, and thus
>>>> I got wonky results.
>>>>> 
>>>>> The same thing would have happened with JVM reuse enabled.
>>>>> 
>>>>> -- Ken
>>>>> 
>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>> 
>>>>>> Digging some more:
>>>>>> 
>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>> simColumn of:
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>> 
>>>>>> Which then becomes the numerator and the denom.
>>>>>> 
>>>>>> Looping, my next simCol is:
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>> 
>>>>>> and then
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>> 
>>>>>> ...
>>>>>> 
>>>>>> Each time, those are getting added into the numerators/denoms value,
>>>> such that by the time we are done looping (line 161), we have:
>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>> 
>>>>>> numberOfSimilarItemsUsed:
>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>> 
>>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>>> yet or figured out where those NaN are coming from originally.
>>>>>> 
>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>> 
>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>> 
>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>> instance, my current breakpoint shows:
>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>> 
>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>> similarityMatrixColumn.
>>>>>>>> 
>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>> <code>
>>>>>>>> /* remove self similarity */
>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>> </code>
>>>>>>> 
>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> It should be propagated in some cases but not others. I'm not aware
>>>> of
>>>>>>>>> any changes here.
>>>>>>>> 
>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>> 
>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>>> it's just that's a whole lot harder to debug.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>> gsingers@apache.org> wrote:
>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>> getting any recommendations due to NaNs being calculated in the
>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>>>> like this was working as little as two weeks ago (post Sebastian's big
>>>> change to RecJob), but I don't see a whole lot of changes in that part of
>>>> the code.
>>>>>>>>>> 
>>>>>>>>>> The data is user id's mapping to email thread ids.  My input data is
>>>> simply a triple of user id, thread id, 1 (meaning that user participated in
>>>> that thread)  It seems like I will have a lot of good values in the inputs
>>>> to the AggregateAndRecommend step, except one id will be NaN and this then
>>>> seems to get added in and makes everything NaN (I realize this is a very
>>>> naive understanding).  I sense that I should be looking upstream in the
>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>> 
>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>>> want to try this with a small data set, you can get it here:
>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Grant
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>> 
>>>>>> 
>>>>>> --------------------------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com
>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>> 
>>>>> 
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://bixolabs.com
>>>>> custom big data solutions & training
>>>>> Hadoop, Cascading, Mahout & Solr
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --------------------------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com
>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>> 
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

Looks like it is me.  Still not sure why, but getting there.

On Oct 13, 2011, at 10:35 PM, Grant Ingersoll wrote:

> Note, the next version (13df29e4fe97b4370f24d7e91ab5909de76f0f3b) doesn't work.  Debugging.  
> 
> 
> 
> On Oct 13, 2011, at 9:31 PM, Grant Ingersoll wrote:
> 
>> OK, I can confirm that an earlier version (54300025dbdd6e688a4eb3d043016eb641067c7e in github/lucidimagination/mahout) worked.  Now, to figure out why.
>> 
>> -Grant
>> 
>> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
>> 
>>> Grant,
>>> 
>>> Can you share a little more details about the results, do you get any
>>> exceptions? Or do you just get no results?
>>> 
>>> Using the NaNs inside the similarity matrix vectors has been included in
>>> the job for a very long time and should not cause any problems. As Sean
>>> already mentioned we have unit tests with toy data that should catch the
>>> very obvious errors in this code.
>>> 
>>> Can you share the dataset? I can do a testrun on my research cluster.
>>> 
>>> --sebastian
>>> 
>>> On 13.10.2011 08:37, Sean Owen wrote:
>>>> RecommenderJob? The unit tests run it all the time.
>>>> There should not be any glitches with static variables -- don't think
>>>> there are any.
>>>> 
>>>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
>>>>> Is this job working well for anyone now?
>>>>> When was the last time this job worked for someone?
>>>>> 
>>>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>>>>> 
>>>>>> Both local and on EC2
>>>>>> 
>>>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>>>> 
>>>>>>> Hi Grant,
>>>>>>> 
>>>>>>> Just curious, are you running this locally or distributed?
>>>>>>> 
>>>>>>> I'd run into a similar issue, though in a completely different algorithm
>>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>>>> 
>>>>>>> When running locally, this wasn't getting cleared between loops, and thus
>>>>>> I got wonky results.
>>>>>>> 
>>>>>>> The same thing would have happened with JVM reuse enabled.
>>>>>>> 
>>>>>>> -- Ken
>>>>>>> 
>>>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>>>> 
>>>>>>>> Digging some more:
>>>>>>>> 
>>>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>>>> simColumn of:
>>>>>>>> 
>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>>>> 
>>>>>>>> Which then becomes the numerator and the denom.
>>>>>>>> 
>>>>>>>> Looping, my next simCol is:
>>>>>>>> 
>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>>>> 
>>>>>>>> and then
>>>>>>>> 
>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>>>> 
>>>>>>>> ...
>>>>>>>> 
>>>>>>>> Each time, those are getting added into the numerators/denoms value,
>>>>>> such that by the time we are done looping (line 161), we have:
>>>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>>> 
>>>>>>>> numberOfSimilarItemsUsed:
>>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>>>> 
>>>>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>>>>> yet or figured out where those NaN are coming from originally.
>>>>>>>> 
>>>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>>>> 
>>>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>>>> 
>>>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>>>> instance, my current breakpoint shows:
>>>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>>>> 
>>>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>>>> similarityMatrixColumn.
>>>>>>>>>> 
>>>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>>>> <code>
>>>>>>>>>> /* remove self similarity */
>>>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>>>> </code>
>>>>>>>>> 
>>>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> It should be propagated in some cases but not others. I'm not aware
>>>>>> of
>>>>>>>>>>> any changes here.
>>>>>>>>>> 
>>>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>>>> 
>>>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>>>>> it's just that's a whole lot harder to debug.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>>>> gsingers@apache.org> wrote:
>>>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>>>> getting any recommendations due to NaNs being calculated in the
>>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>>>>>> like this was working as little as two weeks ago (post Sebastian's big
>>>>>> change to RecJob), but I don't see a whole lot of changes in that part of
>>>>>> the code.
>>>>>>>>>>>> 
>>>>>>>>>>>> The data is user id's mapping to email thread ids.  My input data is
>>>>>> simply a triple of user id, thread id, 1 (meaning that user participated in
>>>>>> that thread)  It seems like I will have a lot of good values in the inputs
>>>>>> to the AggregateAndRecommend step, except one id will be NaN and this then
>>>>>> seems to get added in and makes everything NaN (I realize this is a very
>>>>>> naive understanding).  I sense that I should be looking upstream in the
>>>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>>>> 
>>>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>>>>> want to try this with a small data set, you can get it here:
>>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Grant
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --------------------------------------------
>>>>>>>>> Grant Ingersoll
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --------------------------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com
>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------
>>>>>>> Ken Krugler
>>>>>>> +1 530-210-6378
>>>>>>> http://bixolabs.com
>>>>>>> custom big data solutions & training
>>>>>>> Hadoop, Cascading, Mahout & Solr
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --------------------------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com
>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>>>>> 
>>> 
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

Note, the next version (13df29e4fe97b4370f24d7e91ab5909de76f0f3b) doesn't work.  Debugging.  



On Oct 13, 2011, at 9:31 PM, Grant Ingersoll wrote:

> OK, I can confirm that an earlier version (54300025dbdd6e688a4eb3d043016eb641067c7e in github/lucidimagination/mahout) worked.  Now, to figure out why.
> 
> -Grant
> 
> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> 
>> Grant,
>> 
>> Can you share a little more details about the results, do you get any
>> exceptions? Or do you just get no results?
>> 
>> Using the NaNs inside the similarity matrix vectors has been included in
>> the job for a very long time and should not cause any problems. As Sean
>> already mentioned we have unit tests with toy data that should catch the
>> very obvious errors in this code.
>> 
>> Can you share the dataset? I can do a testrun on my research cluster.
>> 
>> --sebastian
>> 
>> On 13.10.2011 08:37, Sean Owen wrote:
>>> RecommenderJob? The unit tests run it all the time.
>>> There should not be any glitches with static variables -- don't think
>>> there are any.
>>> 
>>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
>>>> Is this job working well for anyone now?
>>>> When was the last time this job worked for someone?
>>>> 
>>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>>>> 
>>>>> Both local and on EC2
>>>>> 
>>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>>> 
>>>>>> Hi Grant,
>>>>>> 
>>>>>> Just curious, are you running this locally or distributed?
>>>>>> 
>>>>>> I'd run into a similar issue, though in a completely different algorithm
>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>>> 
>>>>>> When running locally, this wasn't getting cleared between loops, and thus
>>>>> I got wonky results.
>>>>>> 
>>>>>> The same thing would have happened with JVM reuse enabled.
>>>>>> 
>>>>>> -- Ken
>>>>>> 
>>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>>> 
>>>>>>> Digging some more:
>>>>>>> 
>>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>>> simColumn of:
>>>>>>> 
>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>>> 
>>>>>>> Which then becomes the numerator and the denom.
>>>>>>> 
>>>>>>> Looping, my next simCol is:
>>>>>>> 
>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>>> 
>>>>>>> and then
>>>>>>> 
>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>>> 
>>>>>>> ...
>>>>>>> 
>>>>>>> Each time, those are getting added into the numerators/denoms value,
>>>>> such that by the time we are done looping (line 161), we have:
>>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>> 
>>>>>>> numberOfSimilarItemsUsed:
>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>>> 
>>>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>>>> yet or figured out where those NaN are coming from originally.
>>>>>>> 
>>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>>> 
>>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>>> 
>>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>>> instance, my current breakpoint shows:
>>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>>> 
>>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>>> similarityMatrixColumn.
>>>>>>>>> 
>>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>>> <code>
>>>>>>>>> /* remove self similarity */
>>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>>> </code>
>>>>>>>> 
>>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> It should be propagated in some cases but not others. I'm not aware
>>>>> of
>>>>>>>>>> any changes here.
>>>>>>>>> 
>>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>>> 
>>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>>>> it's just that's a whole lot harder to debug.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>>> gsingers@apache.org> wrote:
>>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>>> getting any recommendations due to NaNs being calculated in the
>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>>>>> like this was working as little as two weeks ago (post Sebastian's big
>>>>> change to RecJob), but I don't see a whole lot of changes in that part of
>>>>> the code.
>>>>>>>>>>> 
>>>>>>>>>>> The data is user id's mapping to email thread ids.  My input data is
>>>>> simply a triple of user id, thread id, 1 (meaning that user participated in
>>>>> that thread)  It seems like I will have a lot of good values in the inputs
>>>>> to the AggregateAndRecommend step, except one id will be NaN and this then
>>>>> seems to get added in and makes everything NaN (I realize this is a very
>>>>> naive understanding).  I sense that I should be looking upstream in the
>>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>>> 
>>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>>>> want to try this with a small data set, you can get it here:
>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Grant
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --------------------------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com
>>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>> 
>>>>>> 
>>>>>> --------------------------
>>>>>> Ken Krugler
>>>>>> +1 530-210-6378
>>>>>> http://bixolabs.com
>>>>>> custom big data solutions & training
>>>>>> Hadoop, Cascading, Mahout & Solr
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --------------------------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com
>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>>> 
>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

OK, I can confirm that an earlier version (54300025dbdd6e688a4eb3d043016eb641067c7e in github/lucidimagination/mahout) worked.  Now, to figure out why.

-Grant

On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:

> Grant,
> 
> Can you share a little more details about the results, do you get any
> exceptions? Or do you just get no results?
> 
> Using the NaNs inside the similarity matrix vectors has been included in
> the job for a very long time and should not cause any problems. As Sean
> already mentioned we have unit tests with toy data that should catch the
> very obvious errors in this code.
> 
> Can you share the dataset? I can do a testrun on my research cluster.
> 
> --sebastian
> 
> On 13.10.2011 08:37, Sean Owen wrote:
>> RecommenderJob? The unit tests run it all the time.
>> There should not be any glitches with static variables -- don't think
>> there are any.
>> 
>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
>>> Is this job working well for anyone now?
>>> When was the last time this job worked for someone?
>>> 
>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>>> 
>>>> Both local and on EC2
>>>> 
>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>> 
>>>>> Hi Grant,
>>>>> 
>>>>> Just curious, are you running this locally or distributed?
>>>>> 
>>>>> I'd run into a similar issue, though in a completely different algorithm
>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>> 
>>>>> When running locally, this wasn't getting cleared between loops, and thus
>>>> I got wonky results.
>>>>> 
>>>>> The same thing would have happened with JVM reuse enabled.
>>>>> 
>>>>> -- Ken
>>>>> 
>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>> 
>>>>>> Digging some more:
>>>>>> 
>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>>> simColumn of:
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>> 
>>>>>> Which then becomes the numerator and the denom.
>>>>>> 
>>>>>> Looping, my next simCol is:
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>> 
>>>>>> and then
>>>>>> 
>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>> 
>>>>>> ...
>>>>>> 
>>>>>> Each time, those are getting added into the numerators/denoms value,
>>>> such that by the time we are done looping (line 161), we have:
>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>> 
>>>>>> numberOfSimilarItemsUsed:
>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>> 
>>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>>> yet or figured out where those NaN are coming from originally.
>>>>>> 
>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>> 
>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>> 
>>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>>> instance, my current breakpoint shows:
>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>> 
>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>> similarityMatrixColumn.
>>>>>>>> 
>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>> <code>
>>>>>>>> /* remove self similarity */
>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>> </code>
>>>>>>> 
>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> It should be propagated in some cases but not others. I'm not aware
>>>> of
>>>>>>>>> any changes here.
>>>>>>>> 
>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>> 
>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>>> it's just that's a whole lot harder to debug.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>>> gsingers@apache.org> wrote:
>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>> getting any recommendations due to NaNs being calculated in the
>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>>>> like this was working as little as two weeks ago (post Sebastian's big
>>>> change to RecJob), but I don't see a whole lot of changes in that part of
>>>> the code.
>>>>>>>>>> 
>>>>>>>>>> The data is user id's mapping to email thread ids.  My input data is
>>>> simply a triple of user id, thread id, 1 (meaning that user participated in
>>>> that thread)  It seems like I will have a lot of good values in the inputs
>>>> to the AggregateAndRecommend step, except one id will be NaN and this then
>>>> seems to get added in and makes everything NaN (I realize this is a very
>>>> naive understanding).  I sense that I should be looking upstream in the
>>>> process for a fix, but I am not sure where that is.
>>>>>>>>>> 
>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>>> want to try this with a small data set, you can get it here:
>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Grant
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>> 
>>>>>> 
>>>>>> --------------------------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com
>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>> 
>>>>> 
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://bixolabs.com
>>>>> custom big data solutions & training
>>>>> Hadoop, Cascading, Mahout & Solr
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --------------------------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com
>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Sebastian Schelter <ss...@apache.org>.

Grant,

Can you share a little more details about the results, do you get any
exceptions? Or do you just get no results?

Using the NaNs inside the similarity matrix vectors has been included in
the job for a very long time and should not cause any problems. As Sean
already mentioned we have unit tests with toy data that should catch the
very obvious errors in this code.

Can you share the dataset? I can do a testrun on my research cluster.

--sebastian

On 13.10.2011 08:37, Sean Owen wrote:
> RecommenderJob? The unit tests run it all the time.
> There should not be any glitches with static variables -- don't think
> there are any.
> 
> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
>> Is this job working well for anyone now?
>> When was the last time this job worked for someone?
>>
>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>>
>>> Both local and on EC2
>>>
>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> Just curious, are you running this locally or distributed?
>>>>
>>>> I'd run into a similar issue, though in a completely different algorithm
>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>
>>>> When running locally, this wasn't getting cleared between loops, and thus
>>> I got wonky results.
>>>>
>>>> The same thing would have happened with JVM reuse enabled.
>>>>
>>>> -- Ken
>>>>
>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>
>>>>> Digging some more:
>>>>>
>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>>> simColumn of:
>>>>>
>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>
>>>>> Which then becomes the numerator and the denom.
>>>>>
>>>>> Looping, my next simCol is:
>>>>>
>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>
>>>>> and then
>>>>>
>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>
>>>>> ...
>>>>>
>>>>> Each time, those are getting added into the numerators/denoms value,
>>> such that by the time we are done looping (line 161), we have:
>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>
>>>>> numberOfSimilarItemsUsed:
>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>
>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>> yet or figured out where those NaN are coming from originally.
>>>>>
>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>
>>>>>>
>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>
>>>>>>>
>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>
>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>
>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
>>> instance, my current breakpoint shows:
>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>
>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>> similarityMatrixColumn.
>>>>>>>
>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>> <code>
>>>>>>> /* remove self similarity */
>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>> </code>
>>>>>>
>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> It should be propagated in some cases but not others. I'm not aware
>>> of
>>>>>>>> any changes here.
>>>>>>>
>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>>>>>
>>>>>>>>
>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>
>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>> it's just that's a whole lot harder to debug.
>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>>> gsingers@apache.org> wrote:
>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>> getting any recommendations due to NaNs being calculated in the
>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>>> like this was working as little as two weeks ago (post Sebastian's big
>>> change to RecJob), but I don't see a whole lot of changes in that part of
>>> the code.
>>>>>>>>>
>>>>>>>>> The data is user id's mapping to email thread ids.  My input data is
>>> simply a triple of user id, thread id, 1 (meaning that user participated in
>>> that thread)  It seems like I will have a lot of good values in the inputs
>>> to the AggregateAndRecommend step, except one id will be NaN and this then
>>> seems to get added in and makes everything NaN (I realize this is a very
>>> naive understanding).  I sense that I should be looking upstream in the
>>> process for a fix, but I am not sure where that is.
>>>>>>>>>
>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>> want to try this with a small data set, you can get it here:
>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Grant
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --------------------------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com
>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>
>>>>>
>>>>> --------------------------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com
>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>
>>>>
>>>> --------------------------
>>>> Ken Krugler
>>>> +1 530-210-6378
>>>> http://bixolabs.com
>>>> custom big data solutions & training
>>>> Hadoop, Cascading, Mahout & Solr
>>>>
>>>>
>>>>
>>>
>>> --------------------------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>
>>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>

Re: RecommenderJob and NaN

Posted by Sean Owen <sr...@gmail.com>.

RecommenderJob? The unit tests run it all the time.
There should not be any glitches with static variables -- don't think
there are any.

On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <go...@gmail.com> wrote:
> Is this job working well for anyone now?
> When was the last time this job worked for someone?
>
> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> Both local and on EC2
>>
>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>
>> > Hi Grant,
>> >
>> > Just curious, are you running this locally or distributed?
>> >
>> > I'd run into a similar issue, though in a completely different algorithm
>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>> >
>> > When running locally, this wasn't getting cleared between loops, and thus
>> I got wonky results.
>> >
>> > The same thing would have happened with JVM reuse enabled.
>> >
>> > -- Ken
>> >
>> > On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>> >
>> >> Digging some more:
>> >>
>> >> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
>> simColumn of:
>> >>
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>> >>
>> >> Which then becomes the numerator and the denom.
>> >>
>> >> Looping, my next simCol is:
>> >>
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>> >>
>> >> and then
>> >>
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>> >>
>> >> ...
>> >>
>> >> Each time, those are getting added into the numerators/denoms value,
>> such that by the time we are done looping (line 161), we have:
>> >> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> >> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> >>
>> >> numberOfSimilarItemsUsed:
>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>> >>
>> >> Not sure on how to interpret this as I haven't dug into the math here
>> yet or figured out where those NaN are coming from originally.
>> >>
>> >> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>> >>
>> >>>
>> >>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>> >>>
>> >>>>
>> >>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>> >>>>
>> >>>>> Where is the NaN coming up -- what has this value?
>> >>>>
>> >>>> simColumn seems to be the originator in the Aggregate step.  For
>> instance, my current breakpoint shows:
>> >>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>> >>>>
>> >>>> I can also see some in the PartialMultiplyMapper via the
>> similarityMatrixColumn.
>> >>>>
>> >>>> Is that set by SimilarityMatrixRowWrapperMapper?
>> >>>> <code>
>> >>>> /* remove self similarity */
>> >>>> similarityMatrixRow.set(key.get(), Double.NaN);
>> >>>> </code>
>> >>>
>> >>> Ah, but that is just taking care of itself, so maybe not the issue.
>> >>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>> It should be propagated in some cases but not others. I'm not aware
>> of
>> >>>>> any changes here.
>> >>>>
>> >>>> yeah, me neither.  This is all related to MAHOUT-798.
>> >>>>
>> >>>>>
>> >>>>> Generally small data sets will have this problem of not being able to
>> >>>>> compute much of anything useful, so NaN might be right here.
>> >>>>> But you say it was different recently, which seems to rule that out.
>> >>>>
>> >>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>> it's just that's a whole lot harder to debug.
>> >>>>
>> >>>>>
>> >>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
>> gsingers@apache.org> wrote:
>> >>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>> getting any recommendations due to NaNs being calculated in the
>> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
>> like this was working as little as two weeks ago (post Sebastian's big
>> change to RecJob), but I don't see a whole lot of changes in that part of
>> the code.
>> >>>>>>
>> >>>>>> The data is user id's mapping to email thread ids.  My input data is
>> simply a triple of user id, thread id, 1 (meaning that user participated in
>> that thread)  It seems like I will have a lot of good values in the inputs
>> to the AggregateAndRecommend step, except one id will be NaN and this then
>> seems to get added in and makes everything NaN (I realize this is a very
>> naive understanding).  I sense that I should be looking upstream in the
>> process for a fix, but I am not sure where that is.
>> >>>>>>
>> >>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>> want to try this with a small data set, you can get it here:
>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Grant
>> >>>>
>> >>>>
>> >>>
>> >>> --------------------------------------------
>> >>> Grant Ingersoll
>> >>> http://www.lucidimagination.com
>> >>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> >>>
>> >>
>> >> --------------------------------------------
>> >> Grant Ingersoll
>> >> http://www.lucidimagination.com
>> >> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> >>
>> >
>> > --------------------------
>> > Ken Krugler
>> > +1 530-210-6378
>> > http://bixolabs.com
>> > custom big data solutions & training
>> > Hadoop, Cascading, Mahout & Solr
>> >
>> >
>> >
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>
>>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: RecommenderJob and NaN

Posted by Lance Norskog <go...@gmail.com>.

Is this job working well for anyone now?
When was the last time this job worked for someone?

On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gs...@apache.org>wrote:

> Both local and on EC2
>
> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>
> > Hi Grant,
> >
> > Just curious, are you running this locally or distributed?
> >
> > I'd run into a similar issue, though in a completely different algorithm
> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
> >
> > When running locally, this wasn't getting cleared between loops, and thus
> I got wonky results.
> >
> > The same thing would have happened with JVM reuse enabled.
> >
> > -- Ken
> >
> > On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >
> >> Digging some more:
> >>
> >> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
> simColumn of:
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>
> >> Which then becomes the numerator and the denom.
> >>
> >> Looping, my next simCol is:
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>
> >> and then
> >>
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>
> >> ...
> >>
> >> Each time, those are getting added into the numerators/denoms value,
> such that by the time we are done looping (line 161), we have:
> >> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>
> >> numberOfSimilarItemsUsed:
> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>
> >> Not sure on how to interpret this as I haven't dug into the math here
> yet or figured out where those NaN are coming from originally.
> >>
> >> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>
> >>>
> >>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>
> >>>>
> >>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>
> >>>>> Where is the NaN coming up -- what has this value?
> >>>>
> >>>> simColumn seems to be the originator in the Aggregate step.  For
> instance, my current breakpoint shows:
> >>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>
> >>>> I can also see some in the PartialMultiplyMapper via the
> similarityMatrixColumn.
> >>>>
> >>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>> <code>
> >>>> /* remove self similarity */
> >>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>> </code>
> >>>
> >>> Ah, but that is just taking care of itself, so maybe not the issue.
> >>>
> >>>>
> >>>>
> >>>>
> >>>>> It should be propagated in some cases but not others. I'm not aware
> of
> >>>>> any changes here.
> >>>>
> >>>> yeah, me neither.  This is all related to MAHOUT-798.
> >>>>
> >>>>>
> >>>>> Generally small data sets will have this problem of not being able to
> >>>>> compute much of anything useful, so NaN might be right here.
> >>>>> But you say it was different recently, which seems to rule that out.
> >>>>
> >>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
> it's just that's a whole lot harder to debug.
> >>>>
> >>>>>
> >>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
> gsingers@apache.org> wrote:
> >>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
> getting any recommendations due to NaNs being calculated in the
> AggregateAndRecommend step.  I'm not quite sure what is going on as it seems
> like this was working as little as two weeks ago (post Sebastian's big
> change to RecJob), but I don't see a whole lot of changes in that part of
> the code.
> >>>>>>
> >>>>>> The data is user id's mapping to email thread ids.  My input data is
> simply a triple of user id, thread id, 1 (meaning that user participated in
> that thread)  It seems like I will have a lot of good values in the inputs
> to the AggregateAndRecommend step, except one id will be NaN and this then
> seems to get added in and makes everything NaN (I realize this is a very
> naive understanding).  I sense that I should be looking upstream in the
> process for a fix, but I am not sure where that is.
> >>>>>>
> >>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
> want to try this with a small data set, you can get it here:
> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout(but note the companion article is not published yet.)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Grant
> >>>>
> >>>>
> >>>
> >>> --------------------------------------------
> >>> Grant Ingersoll
> >>> http://www.lucidimagination.com
> >>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>
> >>
> >> --------------------------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com
> >> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > custom big data solutions & training
> > Hadoop, Cascading, Mahout & Solr
> >
> >
> >
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>
>


-- 
Lance Norskog
goksron@gmail.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

Both local and on EC2

On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:

> Hi Grant,
> 
> Just curious, are you running this locally or distributed?
> 
> I'd run into a similar issue, though in a completely different algorithm (Jimmy Lin's PageRank implementation) due to the use of a static variable.
> 
> When running locally, this wasn't getting cleared between loops, and thus I got wonky results.
> 
> The same thing would have happened with JVM reuse enabled.
> 
> -- Ken
> 
> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> 
>> Digging some more:
>> 
>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a simColumn of:
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>> 
>> Which then becomes the numerator and the denom.
>> 
>> Looping, my next simCol is:
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>> 
>> and then
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>> 
>> ...
>> 
>> Each time, those are getting added into the numerators/denoms value, such that by the time we are done looping (line 161), we have:
>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> 
>> numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>> 
>> Not sure on how to interpret this as I haven't dug into the math here yet or figured out where those NaN are coming from originally.
>> 
>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>> 
>>> 
>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>> 
>>>> 
>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>> 
>>>>> Where is the NaN coming up -- what has this value?
>>>> 
>>>> simColumn seems to be the originator in the Aggregate step.  For instance, my current breakpoint shows:
>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>> 
>>>> I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.  
>>>> 
>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>> <code>
>>>> /* remove self similarity */
>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>> </code>
>>> 
>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>> 
>>>> 
>>>> 
>>>> 
>>>>> It should be propagated in some cases but not others. I'm not aware of
>>>>> any changes here.
>>>> 
>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>> 
>>>>> 
>>>>> Generally small data sets will have this problem of not being able to
>>>>> compute much of anything useful, so NaN might be right here.
>>>>> But you say it was different recently, which seems to rule that out.
>>>> 
>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.
>>>> 
>>>>> 
>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>>>>>> 
>>>>>> The data is user id's mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread)  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>>>>>> 
>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>>>>>> 
>>>>>> Thanks,
>>>>>> Grant
>>>> 
>>>> 
>>> 
>>> --------------------------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>> 
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
> 
> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Ken Krugler <kk...@transpac.com>.

Hi Grant,

Just curious, are you running this locally or distributed?

I'd run into a similar issue, though in a completely different algorithm (Jimmy Lin's PageRank implementation) due to the use of a static variable.

When running locally, this wasn't getting cleared between loops, and thus I got wonky results.

The same thing would have happened with JVM reuse enabled.

-- Ken

On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:

> Digging some more:
> 
> In AggregateAndRecommend, around lines 143, I have, for userId 0, a simColumn of:
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> 
> Which then becomes the numerator and the denom.
> 
> Looping, my next simCol is:
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> 
> and then
> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> 
> ...
> 
> Each time, those are getting added into the numerators/denoms value, such that by the time we are done looping (line 161), we have:
> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> 
> numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> 
> Not sure on how to interpret this as I haven't dug into the math here yet or figured out where those NaN are coming from originally.
> 
> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> 
>> 
>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>> 
>>> 
>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>> 
>>>> Where is the NaN coming up -- what has this value?
>>> 
>>> simColumn seems to be the originator in the Aggregate step.  For instance, my current breakpoint shows:
>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>> 
>>> I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.  
>>> 
>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>> <code>
>>> /* remove self similarity */
>>>  similarityMatrixRow.set(key.get(), Double.NaN);
>>> </code>
>> 
>> Ah, but that is just taking care of itself, so maybe not the issue.
>> 
>>> 
>>> 
>>> 
>>>> It should be propagated in some cases but not others. I'm not aware of
>>>> any changes here.
>>> 
>>> yeah, me neither.  This is all related to MAHOUT-798.
>>> 
>>>> 
>>>> Generally small data sets will have this problem of not being able to
>>>> compute much of anything useful, so NaN might be right here.
>>>> But you say it was different recently, which seems to rule that out.
>>> 
>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.
>>> 
>>>> 
>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>>>>> 
>>>>> The data is user id's mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread)  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>>>>> 
>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>>>>> 
>>>>> Thanks,
>>>>> Grant
>>> 
>>> 
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

Digging some more:

In AggregateAndRecommend, around lines 143, I have, for userId 0, a simColumn of:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}

Which then becomes the numerator and the denom.

Looping, my next simCol is:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}

and then
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}

...

Each time, those are getting added into the numerators/denoms value, such that by the time we are done looping (line 161), we have:
numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}

numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}

Not sure on how to interpret this as I haven't dug into the math here yet or figured out where those NaN are coming from originally.

On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:

> 
> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> 
>> 
>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>> 
>>> Where is the NaN coming up -- what has this value?
>> 
>> simColumn seems to be the originator in the Aggregate step.  For instance, my current breakpoint shows:
>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>> 
>> I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.  
>> 
>> Is that set by SimilarityMatrixRowWrapperMapper?
>> <code>
>> /* remove self similarity */
>>   similarityMatrixRow.set(key.get(), Double.NaN);
>> </code>
> 
> Ah, but that is just taking care of itself, so maybe not the issue.
> 
>> 
>> 
>> 
>>> It should be propagated in some cases but not others. I'm not aware of
>>> any changes here.
>> 
>> yeah, me neither.  This is all related to MAHOUT-798.
>> 
>>> 
>>> Generally small data sets will have this problem of not being able to
>>> compute much of anything useful, so NaN might be right here.
>>> But you say it was different recently, which seems to rule that out.
>> 
>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.
>> 
>>> 
>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>>>> 
>>>> The data is user id's mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread)  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>>>> 
>>>> Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>>>> 
>>>> Thanks,
>>>> Grant
>> 
>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:

> 
> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> 
>> Where is the NaN coming up -- what has this value?
> 
> simColumn seems to be the originator in the Aggregate step.  For instance, my current breakpoint shows:
> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> 
> I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.  
> 
> Is that set by SimilarityMatrixRowWrapperMapper?
> <code>
> /* remove self similarity */
>    similarityMatrixRow.set(key.get(), Double.NaN);
> </code>

Ah, but that is just taking care of itself, so maybe not the issue.

> 
> 
> 
>> It should be propagated in some cases but not others. I'm not aware of
>> any changes here.
> 
> yeah, me neither.  This is all related to MAHOUT-798.
> 
>> 
>> Generally small data sets will have this problem of not being able to
>> compute much of anything useful, so NaN might be right here.
>> But you say it was different recently, which seems to rule that out.
> 
> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.
> 
>> 
>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>>> 
>>> The data is user id's mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread)  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>>> 
>>> Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>>> 
>>> Thanks,
>>> Grant
> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: RecommenderJob and NaN

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:

> Where is the NaN coming up -- what has this value?

simColumn seems to be the originator in the Aggregate step.  For instance, my current breakpoint shows:
{309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}

I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.  

Is that set by SimilarityMatrixRowWrapperMapper?
<code>
/* remove self similarity */
    similarityMatrixRow.set(key.get(), Double.NaN);
</code>



> It should be propagated in some cases but not others. I'm not aware of
> any changes here.

yeah, me neither.  This is all related to MAHOUT-798.

> 
> Generally small data sets will have this problem of not being able to
> compute much of anything useful, so NaN might be right here.
> But you say it was different recently, which seems to rule that out.

I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.

> 
> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>> 
>> The data is user id's mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread)  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>> 
>> Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>> 
>> Thanks,
>> Grant

Re: RecommenderJob and NaN

Posted by Sean Owen <sr...@gmail.com>.

Where is the NaN coming up -- what has this value?
It should be propagated in some cases but not others. I'm not aware of
any changes here.

Generally small data sets will have this problem of not being able to
compute much of anything useful, so NaN might be right here.
But you say it was different recently, which seems to rule that out.

On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gs...@apache.org> wrote:
> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>
> The data is user id's mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread)  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>
> Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>
> Thanks,
> Grant