Posted to user@mahout.apache.org by Amir Mohammad Saied <am...@gmail.com> on 2013/12/04 19:18:43 UTC

Avoiding OOM for large datasets

Hi,

I've been trying to run Mahout (with Hadoop) on our data for quite some time
now. Everything is fine on relatively small data sets, but when I try to do
K-Means clustering with the aid of Canopy on roughly 300,000 documents, I can't
even get past the canopy generation because of an OOM error. We're clustering
similar news stories, so T1 and T2 are set to 0.84 and 0.6 (those values lead
to the desired results on sample data).

I tried setting both "mapred.map.child.java.opts" and
"mapred.reduce.child.java.opts" to "-Xmx4096M", and I also
exported HADOOP_HEAPSIZE=4000, but I'm still having issues.
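For reference, this is roughly the shape of the invocation with the per-job
heap overrides passed via -D (paths and distance measure here are illustrative
placeholders, not my exact command; the property names are the classic-MapReduce
ones from above, and newer Hadoop renames them to mapreduce.map.java.opts /
mapreduce.reduce.java.opts):

```shell
# Illustrative sketch only: per-job heap overrides via -D generic options.
# Property names are the pre-YARN ones; verify against your Hadoop version.
mahout canopy \
  -Dmapred.map.child.java.opts=-Xmx4096M \
  -Dmapred.reduce.child.java.opts=-Xmx4096M \
  -i /path/to/tfidf-vectors \
  -o /path/to/canopy-centroids \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 0.84 -t2 0.6
```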

I'm running all of this in Hadoop's single node, pseudo-distributed mode on
a machine with 16GB of RAM.

Searching the Internet for solutions, I found this[1]. One of the bullet
points states:

    "In all of the algorithms, all clusters are retained in memory by the
mappers and reducers"

So my question is: does Mahout on Hadoop only help with distributing CPU-bound
operations? What should one do with a large dataset and only a handful of
low-RAM commodity nodes?

I'm obviously a newbie, thanks for bearing with me.

[1]
http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3C506307EB.3090004@windwardsolutions.com%3E

Cheers,

Amir

Re: Avoiding OOM for large datasets

Posted by Amir Mohammad Saied <am...@gmail.com>.
I tried it again with K=1000 and KM=12610, and it finished after about 16
hours. I'm running the MapReduce version on top of a single-node,
pseudo-distributed Hadoop.

How can I calculate a reasonable K for my clustering needs?



Re: Avoiding OOM for large datasets

Posted by Ted Dunning <te...@gmail.com>.
This is not right. The sequential version would have finished long before
this for any reasonable value of k.

I do note, however, that you have set k = 200,000 where you only have
300,000 documents. Depending on which value you set (I don't have the code
handy), this may actually be increased inside streaming k-means when it
computes the number of sketch centroids, by a factor of roughly 2 log N
≈ 2 × 18. That gives far more clusters than you have data points,
which is silly.
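A quick back-of-the-envelope check of that estimate (the 2 log N factor is a
rough figure from the paragraph above; the exact expression inside Mahout's
streaming k-means may differ):

```shell
# Rough sketch-size estimate: k * 2 * log2(N), per the reply above.
# Only meant to show the order of magnitude, not Mahout's exact formula.
n=300000
k=200000
awk -v n="$n" -v k="$k" 'BEGIN { printf "%.0f\n", k * 2 * (log(n) / log(2)) }'
# roughly 7.3 million sketch centroids -- far more than the 300,000
# input points, which is why this choice of k is pathological.
```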

Try again with a more reasonable value of k, such as 1000.

Re: Avoiding OOM for large datasets

Posted by Amir Mohammad Saied <am...@gmail.com>.
Hi,

I first tried Streaming K-Means with about 5,000 news stories, and it worked
just fine. Then I tried it on 300,000 news stories and gave it 10GB of
RAM. After more than 43 hours, it was still in the last merge pass, so I
eventually decided to stop it.

I set K to 200000 and KM to 2522308 (it's for detecting similar/related news
stories). With these values, is it expected to take this long?

Cheers,

Amir



Re: Avoiding OOM for large datasets

Posted by Amir Mohammad Saied <am...@gmail.com>.
Suneel,

Thanks!

I tried Streaming K-Means, and now I have two naive questions:

1) If I understand correctly, to use the results of streaming k-means I need
to iterate over all of my vectors again and assign each one to the cluster
with the closest centroid, right?

2) In clustering news, the number of clusters isn't known beforehand. We
used to use Canopy as a fast approximate clustering technique, but as I
understand it, streaming k-means requires "K" in advance. How can I avoid
guessing K?

Regards,

Amir




Re: Avoiding OOM for large datasets

Posted by Suneel Marthi <su...@yahoo.com>.
Amir,


This has been reported before by several others (and has been my experience too). The OOM happens during the canopy-generation phase of Canopy clustering because that phase runs with a single reducer.

If you are using Mahout 0.8 (or trunk), I suggest you look at the new Streaming KMeans clustering, which is quicker and more efficient than the traditional Canopy -> KMeans pipeline.

See the following link for how to run Streaming KMeans.

http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
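A sketch of the invocation, for reference (option names as I recall them from
the 0.8 CLI, and the K/KM values are purely illustrative; verify everything
against `mahout streamingkmeans --help` before relying on it):

```shell
# Illustrative sketch only -- option names recalled from the Mahout 0.8 CLI;
# verify with: mahout streamingkmeans --help
# -k  : final number of clusters
# -km : estimated number of map-side (sketch) clusters
mahout streamingkmeans \
  -i /path/to/tfidf-vectors \
  -o /path/to/centroids \
  -k 1000 \
  -km 12610 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -ow
```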










