Posted to user@mahout.apache.org by Charly Lizarralde <ch...@gmail.com> on 2010/08/12 00:15:44 UTC

ItemSimilarityJob

Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB) and I have
just run out of disk space (160 GB) in my mapred.local.dir while running
RowSimilarityJob.

Is this normal behaviour? How can I improve this?

Thanks!
Charly

Re: ItemSimilarityJob

Posted by Ted Dunning <te...@gmail.com>.
LSH and minhash clustering both find cases of very large overlap.  We need
more subtle overlaps to be retained.

Thus my inclination, which is as yet unsupported by any evidence other
than what has worked for me in the past, would be to optimize the system we
have.

On Thu, Aug 12, 2010 at 12:45 PM, alex lin <al...@gmail.com> wrote:

> Hi,
>
> Below seems interesting?
> SIGIR 2010 http://www.lsdsir.org/wp-content/uploads/2010/05/lsdsir10-2.pdf
> which uses LSH + Hadoop
>
> Alex
>
> On Thu, Aug 12, 2010 at 3:32 PM, Gökhan Çapan <gk...@gmail.com> wrote:
>
> > Hi Sebastian,
> >
> > I remember the birth of this "co-occurrence based distributed
> > recommendation" idea, I was there:). That's why I know the general
> concepts
> > behind the idea. However, I haven't looked at the Mahout code lately, and
> I
> > don't know the details about how co-occurrence computation is
> implemented.
> > Choosing either pairs or stripes approach is a design choice while
> > implementing a co-occurrence computation (and other similar ones).
> Although
> > these have slight differences in implementation phase, size of
> intermediate
> > data and the time complexity of the algorithms are very different from
> each
> > other. If the problem is the size of intermediate data produced while
> > computing the co-occurrence matrix, it's worth to try the "other"
> approach.
> >
> > In addition, MinHashing may be another option to produce all candidate
> > similar pairs according to Jaccard similarity without looking all pairs.
> It
> > may approximately be implemented using MapReduce. Actually I have
> > implemented one, to detect near duplicate documents in a large document
> > set,
> > as well as to find similar items according to users' binary ratings; and
> it
> > yields to very good results, in terms of  both accuracy and performance.
> > This method is also a part of Google's news recommendation engine.
> >
> > On Thu, Aug 12, 2010 at 9:11 PM, Sebastian Schelter <
> > ssc.open@googlemail.com
> > > wrote:
> >
> > > Hi Gökhan,
> > >
> > > I had a quick look at the paper and the "stripes" approach looks
> > > interesting though I'm not sure whether we can apply it to the
> > > RowSimilarityJob.
> > >
> > > I think when we want to improve the performance of the
> ItemSimilarityJob
> > > we should go another path that follows some thoughts by Ted Dunning.
> > > IIRC he said that you don't really learn anything new about an item
> > > after seeing a certain number of preferences and thus it should be
> > > sufficient to look at a fixed number of them at maximum per item.
> > > RecommenderJob is already limiting the number of preferences considered
> > > per item with MaybePruneRowsMapper and ItemSimilarityJob should adapt
> > > that option too.
> > >
> > > The algorithm is based on the paper "Elsayed et al: Pairwise Document
> > > Similarity in Large Collections with MapReduce"
> > > (
> > >
> >
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> > > )
> > > which seems to use the "pairs" approach described in your paper as far
> > > as I can judge that. IIRC the authors if the papers remove frequently
> > > occurring words to achieve a linear runtime which would correspond the
> > > "maxNumberOfPreferencesPerItem" approach MAHOUT-460 suggests.
> > >
> > > --sebastian
> > >
> > >
> > >
> > > Am 12.08.2010 16:45, schrieb Gökhan Çapan:
> > > > Hi,
> > > > I haven't seen the code, but maybe Mahout needs some optimization
> while
> > > > computing item-item co-occurrences. It may be re-implemented using
> > > "stripes"
> > > > approach using in-mapper combining if it is not. It can be found at:
> > > >
> > > >    1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> > > >
> > > > If it already is, sorry for the post.
> > > >
> > > > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> > > > charly.lizarralde@gmail.com> wrote:
> > > >
> > > >
> > > >> Sebastian, thank's for the reply.  The step name is*
> > > >> :*RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.  and each
> > > >> map task
> > > >> takes around 10 hours to finish.
> > > >>
> > > >> Reduce task dir
> > > >>
> > > >>
> > >
> >
> (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > > >> has map output files ( files like map_2.out) and each one is 5GB in
> > > size.
> > > >>
> > > >> I have been looking at the code and saw what you describe in the
> > e-mail.
> > > It
> > > >> makes sense. But still 160 GB of intermediate info from a 2.6 GB
> input
> > > file
> > > >> still makes me wonder if something is wrong.
> > > >>
> > > >> Should I just wait for the patch?
> > > >> Thanks again!
> > > >> Charly
> > > >>
> > > >> On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> > > >> ssc.open@googlemail.com
> > > >>
> > > >>> wrote:
> > > >>>
> > > >>
> > > >>> Hi Charly,
> > > >>>
> > > >>> can you tell which Map/Reduce step was executed last before you ran
> > out
> > > >>> of disk space?
> > > >>>
> > > >>> I'm not familiar with the Netflix dataset and can only guess what
> > > >>> happened, but I would say that you ran out of diskspace because
> > > >>> ItemSimilarityJob currently uses all preferences to compute the
> > > >>> similarities. This makes it scale in the square of the number of
> > > >>> occurrences of the most popular item, which is a bad thing if that
> > > >>> number is huge. We need a way to limit the number of preferences
> > > >>> considered per item, there is already a ticket for this (
> > > >>> https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
> > > provide
> > > >>> a patch in the next days.
> > > >>>
> > > >>> --sebastian
> > > >>>
> > > >>>
> > > >>>
> > > >>> Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> > > >>>
> > > >>>> Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB) and
> I
> > > >>>>
> > > >> have
> > > >>
> > > >>>> just ran out of disk space (160 GB) in my mapred.local.dir when
> > > running
> > > >>>> RowSimilarityJob.
> > > >>>>
> > > >>>> Is this normal behaviour? How can I improve this?
> > > >>>>
> > > >>>> Thanks!
> > > >>>> Charly
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > Gökhan Çapan
> >
>

Re: ItemSimilarityJob

Posted by alex lin <al...@gmail.com>.
Hi,

The paper below (SIGIR 2010), which uses LSH + Hadoop, seems interesting:
http://www.lsdsir.org/wp-content/uploads/2010/05/lsdsir10-2.pdf

Alex


Re: ItemSimilarityJob

Posted by Gökhan Çapan <gk...@gmail.com>.
Hi Sebastian,

I remember the birth of this "co-occurrence based distributed
recommendation" idea; I was there :). That's why I know the general
concepts behind it. However, I haven't looked at the Mahout code lately,
and I don't know the details of how the co-occurrence computation is
implemented. Choosing either the pairs or the stripes approach is a design
choice when implementing a co-occurrence computation (and other similar
ones). Although the two differ only slightly at the implementation level,
the size of the intermediate data and the time complexity of the
algorithms are very different. If the problem is the size of the
intermediate data produced while computing the co-occurrence matrix, it is
worth trying the "other" approach.
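
To make the design choice concrete, here is a purely illustrative snippet
(not Mahout code; the class name and item IDs are invented) showing what a
mapper would emit for one user's items {A, B, C} under each approach:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the key-value shapes emitted for one user's items
// under the "pairs" approach versus the "stripes" approach.
public class PairsVsStripes {
  public static void main(String[] args) {
    List<String> items = Arrays.asList("A", "B", "C");

    // Pairs: one key-value pair per co-occurring (item, otherItem) combination.
    for (String a : items) {
      for (String b : items) {
        if (!a.equals(b)) {
          System.out.println("pairs   emit: (" + a + "," + b + ") -> 1");
        }
      }
    }

    // Stripes: one key-value pair per item, whose value is an associative
    // array holding the counts of all co-occurring items.
    for (String a : items) {
      Map<String, Integer> stripe = new HashMap<String, Integer>();
      for (String b : items) {
        if (!a.equals(b)) {
          stripe.put(b, 1);
        }
      }
      System.out.println("stripes emit: " + a + " -> " + stripe);
    }
  }
}

The pairs form emits many small records that only shrink when identical
keys meet in a combiner or reducer; the stripes form emits one associative
array per item, which combiners can merge element-wise.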

In addition, MinHashing may be another option to produce all candidate
similar pairs according to Jaccard similarity without looking at all
pairs. It can be implemented approximately using MapReduce. I have
actually implemented one, to detect near-duplicate documents in a large
document set as well as to find similar items according to users' binary
ratings, and it yields very good results in terms of both accuracy and
performance. This method is also part of Google's news recommendation
engine.
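
For illustration only (this is not the implementation mentioned above; the
class name, hash family and parameters are invented), a minimal MinHash
signature sketch in plain Java could look like this:

import java.util.Arrays;
import java.util.Random;

// Illustrative sketch: a small MinHash signature per item, computed from
// the IDs of the users who rated it. Items whose signatures agree in many
// positions are candidate pairs under Jaccard similarity.
public class MinHashSignature {

  private static final long PRIME = 2147483647L; // 2^31 - 1, simple modulus

  private final long[] seedsA;
  private final long[] seedsB;

  public MinHashSignature(int numHashes, long randomSeed) {
    Random rnd = new Random(randomSeed);
    seedsA = new long[numHashes];
    seedsB = new long[numHashes];
    for (int i = 0; i < numHashes; i++) {
      seedsA[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // non-zero multiplier
      seedsB[i] = rnd.nextInt(Integer.MAX_VALUE);
    }
  }

  /** userIDs: IDs of all users who expressed a preference for the item. */
  public long[] signature(long[] userIDs) {
    long[] sig = new long[seedsA.length];
    Arrays.fill(sig, Long.MAX_VALUE);
    for (long user : userIDs) {
      for (int i = 0; i < sig.length; i++) {
        // keep the result non-negative even if the product overflows
        long h = ((seedsA[i] * user + seedsB[i]) % PRIME + PRIME) % PRIME;
        if (h < sig[i]) {
          sig[i] = h;
        }
      }
    }
    return sig;
  }
}

Two items whose signatures agree in a large fraction of positions are
likely to have a high Jaccard similarity, so only those candidate pairs
need to be checked exactly.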


-- 
Gökhan Çapan

Re: ItemSimilarityJob

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Gökhan,

I had a quick look at the paper, and the "stripes" approach looks
interesting, though I'm not sure whether we can apply it to
RowSimilarityJob.

I think if we want to improve the performance of ItemSimilarityJob,
we should go down another path that follows some thoughts by Ted Dunning.
IIRC he said that you don't really learn anything new about an item
after seeing a certain number of preferences, and thus it should be
sufficient to look at no more than a fixed number of them per item.
RecommenderJob already limits the number of preferences considered
per item with MaybePruneRowsMapper, and ItemSimilarityJob should adopt
that option too.
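
As a purely hypothetical sketch of such a cap (this is not the MAHOUT-460
patch; the class name and the limit are invented), a reducer could keep a
bounded random sample of the preferences per item via reservoir sampling:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: keep at most a fixed-size random sample of the
// preferences seen per item, so that very popular items cannot blow up
// the pairwise step.
public class CapPreferencesPerItemReducer
    extends Reducer<LongWritable, Text, LongWritable, Text> {

  private static final int MAX_PREFS_PER_ITEM = 500; // illustrative limit
  private final Random random = new Random();

  @Override
  protected void reduce(LongWritable itemID, Iterable<Text> prefs, Context ctx)
      throws IOException, InterruptedException {
    List<String> sample = new ArrayList<String>(MAX_PREFS_PER_ITEM);
    int seen = 0;
    for (Text pref : prefs) {
      seen++;
      if (sample.size() < MAX_PREFS_PER_ITEM) {
        sample.add(pref.toString()); // copy: Hadoop reuses the Text instance
      } else {
        int slot = random.nextInt(seen); // classic reservoir sampling step
        if (slot < MAX_PREFS_PER_ITEM) {
          sample.set(slot, pref.toString());
        }
      }
    }
    for (String pref : sample) {
      ctx.write(itemID, new Text(pref));
    }
  }
}

Capping at a few hundred preferences per item bounds the pairwise work for
even the most popular items while barely affecting the long tail.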

The algorithm is based on the paper "Elsayed et al.: Pairwise Document
Similarity in Large Collections with MapReduce"
(http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf),
which seems to use the "pairs" approach described in your paper, as far
as I can judge. IIRC the authors of that paper remove frequently
occurring words to achieve a linear runtime, which would correspond to
the "maxNumberOfPreferencesPerItem" approach MAHOUT-460 suggests.

--sebastian



Re: ItemSimilarityJob

Posted by Charly Lizarralde <ch...@gmail.com>.
I really must read the papers in order to keep commenting on this
thread... if by any chance I can dive into the Job "internals", maybe I
can try to write an optimization.

For now, I will buy another disk and retry the tests.

Thanks for the replies!
Charly


Re: ItemSimilarityJob

Posted by Ted Dunning <te...@gmail.com>.
I am sympathetic with the goals of stripes and have not analyzed the
situation in detail.  Instead, I am simply reporting that at least one guy
with very deep knowledge of the Hadoop map-reduce framework felt that
similar results could be achieved without quite so much fanciness.


Re: ItemSimilarityJob

Posted by Gökhan Çapan <gk...@gmail.com>.
Hi Ted,
A combining phase is also applicable to the stripes approach (as is
in-mapper combining). The numbers I remembered are from the paper;
according to that experiment, though, the stripes approach produces larger
intermediate data before the combiners. Below is from the paper:

"Results demonstrate that the stripes approach is far more efficient than
the pairs approach: 666 seconds (11m 6s) compared to 3758 seconds (62m 38s)
for the entire APW sub-corpus (improvement by a factor of 5.7). On the
entire sub-corpus, the mappers in the pairs approach generated 2.6 billion
intermediate key-value pairs totally 31.2 GB. After the combiners, this was
reduced to 1.1 billion key-value pairs, which roughly quantifies the amount
of data involved in the shuffling and sorting of the keys. On the other
hand, the mappers in the stripes approach generated 653 million
intermediate key-value pairs totally 48.1 GB; after the combiners, only
28.8 million key-value pairs were left. The stripes approach provides more
opportunities for combiners to aggregate intermediate results, thus greatly
reducing network traffic in the sort and shuffle phase."


-- 
Gökhan Çapan

Re: ItemSimilarityJob

Posted by Gökhan Çapan <gk...@gmail.com>.
Hi Ted,

I have seen some benchmark results comparing different versions of the
co-occurrence computation; I will share them if I can find them, today or
tomorrow.

-- 
Gökhan Çapan

Re: ItemSimilarityJob

Posted by Ted Dunning <te...@gmail.com>.
Jimmy Lin's stripes work was presented at the last Summit and there was
heated (well, warm and cordial at least) discussion with the Map-reduce
committers about whether good use of a combiner wouldn't do just as well.

My take-away as a spectator is that a combiner

a) is vastly easier to code,

b) would be pretty certain to be within 2x as performant, and likely very
close to the same speed, and

c) would not need changing each time the underlying map-reduce changed.

My conclusion was that combiners were the way to go (for me).  Your mileage,
as always, will vary.
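
For reference, a combiner for plain pairs counting is tiny. The sketch
below is illustrative (not RowSimilarityJob code) and uses only standard
Hadoop classes:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative combiner for plain "pairs" co-occurrence counting: sums the
// partial counts for an "itemA,itemB" key on the map side, before the data
// is shuffled over the network.
public class CooccurrenceCountCombiner
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable sum = new IntWritable();

  @Override
  protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable count : counts) {
      total += count.get();
    }
    sum.set(total);
    ctx.write(pair, sum);
  }
}

Because the integer sum is associative and commutative, the same class can
be registered as both the combiner and the reducer of a counting job.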


Re: ItemSimilarityJob

Posted by Gökhan Çapan <gk...@gmail.com>.
Hi,
I haven't seen the code, but maybe Mahout needs some optimization while
computing item-item co-occurrences. It could be re-implemented using the
"stripes" approach with in-mapper combining, if it is not already. The
approach is described in:

   1. www.aclweb.org/anthology/D/D08/D08-1044.pdf

If it already is, sorry for the post.
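
A minimal sketch of stripes with in-mapper combining, assuming the input
has already been grouped so that each record holds the items of one user
(the class name and input format are invented for the example):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch of "stripes" with in-mapper combining. Input is
// assumed to be one line per user, holding a comma-separated list of the
// item IDs that user has interacted with.
public class StripesCooccurrenceMapper
    extends Mapper<LongWritable, Text, Text, MapWritable> {

  // One partial co-occurrence row ("stripe") per item, accumulated in memory.
  private final Map<String, Map<String, Integer>> stripes =
      new HashMap<String, Map<String, Integer>>();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx) {
    String[] items = line.toString().split(",");
    for (String a : items) {
      Map<String, Integer> stripe = stripes.get(a);
      if (stripe == null) {
        stripe = new HashMap<String, Integer>();
        stripes.put(a, stripe);
      }
      for (String b : items) {
        if (!a.equals(b)) {
          Integer count = stripe.get(b);
          stripe.put(b, count == null ? 1 : count + 1);
        }
      }
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // Emit each accumulated stripe once per mapper (the in-mapper combine).
    for (Map.Entry<String, Map<String, Integer>> e : stripes.entrySet()) {
      MapWritable stripe = new MapWritable();
      for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
        stripe.put(new Text(c.getKey()), new IntWritable(c.getValue()));
      }
      ctx.write(new Text(e.getKey()), stripe);
    }
  }
}

The per-mapper HashMap plays the role of a combiner; the trade-off is the
memory needed to hold one partial stripe per distinct item seen by that
mapper.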

-- 
Gökhan Çapan

Re: ItemSimilarityJob

Posted by Charly Lizarralde <ch...@gmail.com>.
Sebastian, thanks for the reply. The step name is
RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map task
takes around 10 hours to finish.

The reduce task dir
(var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
has map output files (files like map_2.out), and each one is 5 GB in size.

I have been looking at the code and saw what you describe in the e-mail.
It makes sense, but 160 GB of intermediate data from a 2.6 GB input file
still makes me wonder if something is wrong.
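
One stop-gap that is independent of any Mahout-level fix: on Hadoop 0.20
the map output spilled to mapred.local.dir can be compressed, which buys
some disk headroom but does not remove the quadratic blow-up. A minimal
old-API sketch (the class name is invented for illustration):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

// Illustrative stop-gap for Hadoop 0.20 (old API): compress map output so
// the spill files in mapred.local.dir shrink.
public class CompressMapOutput {
  public static void configure(JobConf conf) {
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
  }
}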

Should I just wait for the patch?
Thanks again!
Charly


Re: ItemSimilarityJob

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Charly,

can you tell which Map/Reduce step was executed last before you ran out
of disk space?

I'm not familiar with the Netflix dataset and can only guess what
happened, but I would say that you ran out of disk space because
ItemSimilarityJob currently uses all preferences to compute the
similarities. This makes it scale with the square of the number of
occurrences of the most popular item, which is a bad thing if that
number is huge. We need a way to limit the number of preferences
considered per item; there is already a ticket for this
(https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
provide a patch in the next few days.
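
To see why the square matters, with purely illustrative numbers: a single
item with 100,000 preferences contributes on the order of
100,000 * 100,000 / 2 = 5 * 10^9 co-occurrence pairs by itself, while 100
items with 1,000 preferences each contribute only about
100 * 1,000 * 1,000 / 2 = 5 * 10^7 pairs in total.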

--sebastian


