Posted to user@mahout.apache.org by james q <ja...@gmail.com> on 2011/02/04 06:05:22 UTC

Memory Issue with KMeans clustering

Hello,

New user to mahout and hadoop here. Isabel Drost suggested to a colleague I
should post to the mahout user list, as I am having some general
difficulties with memory consumption and KMeans clustering.

So a general question first and foremost: what determines how much memory a
map task consumes during a KMeans clustering job? Increasing the number of
map tasks by adjusting dfs.block.size and mapred.max.split.size doesn't seem
to make each map task consume less memory, or at least not by a very
noticeable amount. I figured that with more map tasks, each individual map
task evaluates fewer input keys and hence would consume less memory. Is
there any way to predict the memory usage of map tasks in KMeans?

The cluster I am running consists of 10 machines, each with 8 cores and 68G
of RAM. I've configured the cluster so that each machine runs at most 7 map
or reduce tasks. I set the map and reduce tasks to have virtually no limit
on memory consumption ... so with 7 processes each at around 9 - 10G per
process, the machines will crap out. I can reduce the number of map tasks
per machine, but something tells me that that level of memory consumption
is wrong.
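
(Not part of the original question, just a sketch for context: on a Hadoop
0.20-era cluster the per-task memory is usually capped through the child JVM
options rather than left unbounded. The 4g heap and slot counts below are
illustrative values, not settings from this thread.)

import org.apache.hadoop.mapred.JobConf;

public class TaskMemoryConfig {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Cap each map/reduce child JVM rather than leaving it effectively unbounded.
        conf.set("mapred.child.java.opts", "-Xmx4g");
        // Slot counts are a TaskTracker-wide setting (normally in mapred-site.xml);
        // shown here only to mirror the 7-task limit described above.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 7);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 7);
    }
}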

If any more information is needed to help debug this, please let me know!
Thanks!

-- james

Re: Memory Issue with KMeans clustering

Posted by Robin Anil <ro...@gmail.com>.
Nearest point to the centroid instead of average of points*

On Tue, Feb 8, 2011 at 12:35 AM, Robin Anil <ro...@gmail.com> wrote:

> We can probably find the nearest centroid, instead of averaging it out. This
> way the centroid vector won't grow big? What do you think about that, Ted, Jeff?

Re: Memory Issue with KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Feb 7, 2011 at 11:35 AM, Robin Anil <ro...@gmail.com> wrote:

> On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <te...@gmail.com> wrote:
>
> > The problem is that the centroids are the average of many documents.  This
> > means that the number of non-zero elements in each centroid vector
> > increases as the number of documents increases.
>
> If we approximate the centroid by the point nearest to the centroid,
> considering we have a lot of input data, I see the centroids being real
> points (part of the input dataset) instead of imaginary ones (averages).
> Some loss is incurred here.
>

This also becomes much more computationally intensive because you can't use
combiners.  Averages are really well behaved here: partial sums and counts
can be merged locally before the reduce step.
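
(A minimal sketch, not from the thread, of why averages play well with
combiners: a partial result is just a (sum, count) pair and merging two
partials is associative, so Hadoop can fold points together map-side before
the reducer runs. Class and method names are made up for illustration.)

// A partial centroid is just an element-wise sum plus a point count.
public class PartialCentroid {
    final double[] sum;
    long count;

    PartialCentroid(int dimension) {
        this.sum = new double[dimension];
    }

    // Mapper side: fold one point into the running total.
    void add(double[] point) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += point[i];
        }
        count++;
    }

    // Combiner/reducer side: merging partials is associative and lossless.
    void merge(PartialCentroid other) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += other.sum[i];
        }
        count += other.count;
    }

    // Final centroid: divide once at the very end.
    double[] mean() {
        double[] mean = new double[sum.length];
        for (int i = 0; i < sum.length; i++) {
            mean[i] = sum[i] / count;
        }
        return mean;
    }
}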


>
> Hashed encoding would be an easier solution. The same or similar loss is
> incurred here as well due to collisions.
>
>
Actually not.  If you have multiple probes, then hashed encoding is a form
of random projection and you typically will not lose any expressivity.
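
(A rough sketch of the multi-probe hashed encoding described above, assuming
each term is hashed into a few buckets of a fixed-size vector with a
hash-derived sign; the hash choice, sizes, and names are arbitrary and not
Mahout's implementation.)

import java.util.List;

public class HashedEncoder {
    private static final int NUM_FEATURES = 1 << 18;  // fixed output dimensionality (~262k)
    private static final int NUM_PROBES = 2;          // probes per term

    // Each term contributes to NUM_PROBES buckets with a +/-1 sign taken from
    // the hash, so collisions tend to cancel rather than always accumulate.
    public static double[] encode(List<String> terms) {
        double[] vector = new double[NUM_FEATURES];
        for (String term : terms) {
            for (int probe = 0; probe < NUM_PROBES; probe++) {
                int h = (term + "#" + probe).hashCode();
                int bucket = (h & 0x7fffffff) % NUM_FEATURES;
                double sign = (h >>> 31) == 0 ? 1.0 : -1.0;
                vector[bucket] += sign;
            }
        }
        return vector;
    }
}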

Re: Memory Issue with KMeans clustering

Posted by Robin Anil <ro...@gmail.com>.
On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <te...@gmail.com> wrote:

> The problem is that the centroids are the average of many documents.  This
> means that the number of non-zero elements in each centroid vector increases
> as the number of documents increases.
>
If we approximate the centroid by the point nearest to the centroid,
considering we have a lot of input data, I see the centroids being real
points (part of the input dataset) instead of imaginary ones (averages).
Some loss is incurred here.

Hashed encoding would be an easier solution. The same or similar loss is
incurred here as well due to collisions.

Re: Memory Issue with KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
The problem is that the centroids are the average of many documents.  This
means that the number of non-zero elements in each centroid vector increases
as the number of documents increases.

This can be handled in a few ways:

- do the averaging in a sparsity preserving way.  LLR is one such animal.
 It is probably possible to do an L_1 regularized centroid as well (but I
would have to think that through a while).

- use fixed size vectors as with hashed encodings.  Then we don't care (as
much) that the centroids are dense.
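
(As an illustration of the first option only: one L_1-style way to keep an
averaged centroid sparse is to soft-threshold it, so near-zero weights become
exactly zero. The lambda value is arbitrary and this is a sketch, not Mahout
code.)

public class SoftThreshold {
    // Shrink every weight toward zero by lambda; anything within
    // [-lambda, lambda] becomes exactly zero, so the centroid stays sparse.
    public static double[] shrink(double[] centroid, double lambda) {
        double[] result = new double[centroid.length];
        for (int i = 0; i < centroid.length; i++) {
            double v = centroid[i];
            if (v > lambda) {
                result[i] = v - lambda;
            } else if (v < -lambda) {
                result[i] = v + lambda;
            } else {
                result[i] = 0.0;
            }
        }
        return result;
    }
}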

On Mon, Feb 7, 2011 at 11:05 AM, Robin Anil <ro...@gmail.com> wrote:

> We can probably find the nearest centroid, instead of averaging it out. This
> way the centroid vector won't grow big? What do you think about that, Ted,
> Jeff?
>

Re: Memory Issue with KMeans clustering

Posted by Robin Anil <ro...@gmail.com>.
We can probably find the nearest centroid, instead of averaging it out. This
way the centroid vector won't grow big? What do you think about that, Ted, Jeff?

On Fri, Feb 4, 2011 at 9:23 PM, Ted Dunning <te...@gmail.com> wrote:

> 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will
> tend
> to become dense)
>
> I recommend you decrease your input dimensionality to 10^5 - 10^6.  This
> could decrease your memory needs to 4GB at the low end.
>
> What kind of input do you have?
>

RE: Memory Issue with KMeans clustering

Posted by Jeff Eastman <je...@Narus.com>.
This is intriguing. Can you say a bit more about "more stages per iteration"?

-----Original Message-----
From: Severance, Steve [mailto:sseverance@ebay.com] 
Sent: Friday, February 04, 2011 2:45 PM
To: user@mahout.apache.org
Subject: RE: Memory Issue with KMeans clustering

At eBay we moved all clustering off mahout to our own implementation. It
took more stages per iteration, but we could use our high-dimensional
feature spaces with our chosen number of targets. We also used sparse
vectors as opposed to dense vectors.

Steve


RE: Memory Issue with KMeans clustering

Posted by "Severance, Steve" <ss...@ebay.com>.
At eBay we moved all clustering off mahout to our own implementation. It
took more stages per iteration, but we could use our high-dimensional
feature spaces with our chosen number of targets. We also used sparse
vectors as opposed to dense vectors.

Steve

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Friday, February 04, 2011 7:54 AM
To: user@mahout.apache.org
Subject: Re: Memory Issue with KMeans clustering

5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will tend
to become dense)

I recommend you decrease your input dimensionality to 10^5 - 10^6.  This
could decrease your memory needs to 4GB at the low end.

What kind of input do you have?


Re: recommendation help

Posted by gustavo salazar <gu...@gmail.com>.
Could I use Spearman correlation without preference values?

2011/2/4 Sean Owen <sr...@gmail.com>

> If you are referring to implementations like PearsonCorrelationSimilarity
> --
> no. Those algorithms only make sense with preference values.
>



-- 
Gustavo Salazar Loor
Telematics Engineering student - Espol

Re: recommendation help

Posted by Sean Owen <sr...@gmail.com>.
If you are referring to implementations like PearsonCorrelationSimilarity --
no. Those algorithms only make sense with preference values.
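
(A minimal sketch of the boolean-preference setup being discussed, assuming
the Taste API of this Mahout era and a GenericBooleanPrefUserBasedRecommender
on top of the TanimotoCoefficientSimilarity already mentioned; the file name,
user id, and neighborhood size are invented for the example.)

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BooleanPrefExample {
    public static void main(String[] args) throws Exception {
        // Input lines of "userID,itemID" with no preference column.
        DataModel model = new FileDataModel(new File("interactions.csv"));

        // Tanimoto (like LogLikelihood) is computed from co-occurrence alone,
        // so it needs no preference values.
        UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);

        List<RecommendedItem> recommendations = recommender.recommend(123L, 5);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}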

On Fri, Feb 4, 2011 at 10:19 PM, Paul, Seby <Se...@searshc.com> wrote:

>
> There is no user preference in our data file.  I can create
> recommendations with mahout using LogLikelihoodSimilarity and
> TanimotoCoefficientSimilarity.
>
> Is it possible to use any other similarity metrics without the
> preference value?
>
>
> Thank you
> Seby Paul
>
>
>
>

recommendation help

Posted by "Paul, Seby" <Se...@searshc.com>.
There is no user preference in our data file.  I can create
recommendations with mahout using LogLikelihoodSimilarity and
TanimotoCoefficientSimilarity.

Is it possible to use any other similarity metrics without the
preference value?


Thank you
Seby Paul




Re: Memory Issue with KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
The problem is that any average of multiple vectors is going to have lots of
non-zero values.

A model-based approach could use gradient descent with regularization to
build classifiers that then define the training data for the next round of
classifier building.  I have seen lots of over-fitting with that kind of
approach, however.  Strong regularization might help.

On Fri, Feb 4, 2011 at 8:42 AM, Jeff Eastman <je...@narus.com> wrote:

> That's really the big challenge using kmeans (and probably any of the other
> clustering algorithms too) for text clustering: the centroids tend to become
> dense and the memory consumption skyrockets. I wonder if the centroid
> calculation could be made smarter by setting an underflow limit and forcing
> close-to-zero terms to be exactly zero? I guess the challenge would be to
> dynamically select this limit. Or, perhaps implementing an approximating
> vector which only retains its n most significant terms? Thin ice here...
>

RE: Memory Issue with KMeans clustering

Posted by Jeff Eastman <je...@Narus.com>.
That's really the big challenge using kmeans (and probably any of the other
clustering algorithms too) for text clustering: the centroids tend to become
dense and the memory consumption skyrockets. I wonder if the centroid
calculation could be made smarter by setting an underflow limit and forcing
close-to-zero terms to be exactly zero? I guess the challenge would be to
dynamically select this limit. Or, perhaps implementing an approximating
vector which only retains its n most significant terms? Thin ice here...
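
(A rough sketch of the "retain only the n most significant terms" idea,
assuming significance means largest absolute weight; the helper below is
illustrative only and not part of Mahout.)

import java.util.Arrays;

public class CentroidTruncation {
    // Keep the n entries with the largest magnitude and zero out the rest.
    public static double[] keepTopN(double[] centroid, int n) {
        if (n >= centroid.length) {
            return centroid;
        }
        double[] magnitudes = new double[centroid.length];
        for (int i = 0; i < centroid.length; i++) {
            magnitudes[i] = Math.abs(centroid[i]);
        }
        Arrays.sort(magnitudes);
        // Magnitude of the n-th largest entry becomes the cutoff.
        double threshold = magnitudes[centroid.length - n];

        double[] truncated = new double[centroid.length];
        int kept = 0;
        for (int i = 0; i < centroid.length; i++) {
            // The kept-counter guards against keeping more than n entries on ties.
            if (Math.abs(centroid[i]) >= threshold && kept < n) {
                truncated[i] = centroid[i];
                kept++;
            }
        }
        return truncated;
    }
}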

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Friday, February 04, 2011 7:54 AM
To: user@mahout.apache.org
Subject: Re: Memory Issue with KMeans clustering

5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will tend
to become dense)

I recommend you decrease your input dimensionality to 10^5 - 10^6.  This
could decrease your memory needs to 4GB at the low end.

What kind of input do you have?


Re: Memory Issue with KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will tend
to become dense)

I recommend you decrease your input dimensionality to 10^5 - 10^6.  This
could decrease your memory needs to 4GB at the low end.

What kind of input do you have?
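
(For reference, the arithmetic behind that figure as a small sketch, assuming
k dense centroids of 8-byte doubles and ignoring JVM object overhead.)

public class CentroidMemoryEstimate {
    public static void main(String[] args) {
        long k = 5000L;            // number of clusters
        long dims = 6838856L;      // input dimensionality
        long bytesPerElement = 8L; // double precision

        long bytes = k * dims * bytesPerElement;
        System.out.printf("%d x %d x %d = %.1f GB for dense centroids%n",
                k, dims, bytesPerElement, bytes / 1e9);
        // Prints roughly 273.6 GB, matching the estimate above.
    }
}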

On Fri, Feb 4, 2011 at 7:50 AM, james q <ja...@gmail.com> wrote:

> I think the job had 5000 - 6000 clusters. The input (sparse) vectors had a
> dimension of 6838856.
>
> -- james
>

Re: Memory Issue with KMeans clustering

Posted by james q <ja...@gmail.com>.
I think the job had 5000 - 6000 clusters. The input (sparse) vectors had a
dimension of 6838856.

-- james

On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <te...@gmail.com> wrote:

> How many clusters?
>
> How large is the dimension of your input data?
>

Re: Memory Issue with KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
How many clusters?

How large is the dimension of your input data?
