Posted to user@mahout.apache.org by Colum Foley <co...@gmail.com> on 2013/03/08 15:46:08 UTC

KMeans Throwing Hadoop write errors for large values of K

Hi All,

When I run KMeans clustering on a cluster, I notice that when I have
"large" values of k (approx. >1000) I get loads of Hadoop write
errors:

 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting
for channel to be ready for read. ch : java.nio.channels.SocketChannel

This continues indefinitely, and lots of part-0xxxxx files of around
30 KB each are produced.

If I reduce the value of k, it runs fine. Furthermore, if I run it in
local mode with high values of k, it also runs fine.

The command I am using is as follows:

mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
--clusters tmp -dm
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
1.0 -x 20 -cl -k 10000

I am running Mahout 0.7.

Are there performance parameters I need to tune for Mahout when
dealing with large volumes of data?
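
For example, I wondered whether simply raising the HDFS client socket
timeouts would make the symptom go away. This is an untested guess on my
part (the property names are the Hadoop 1.x ones, and I am assuming the
Mahout driver passes -D generic options through to Hadoop):

mahout kmeans -Ddfs.socket.timeout=180000 \
  -Ddfs.datanode.socket.write.timeout=180000 \
  -i FeatureVectorsMahoutFormat -o ClusterResults --clusters tmp \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -cd 1.0 -x 20 -cl -k 10000

Even if that worked, though, it would presumably only hide whatever is
making the writes so slow in the first place.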

Thanks,
Colum

Re: KMeans Throwing Hadoop write errors for large values of K

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I don't know where the timeout is happening, but each mapper and each 
reducer writes all its clusters out at the end of its run. With a large 
number of clusters, and with the non-sparse center and radius vectors 
that tend to accumulate, this could take a while...
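
As a rough worst-case illustration (my own numbers, assuming fully dense
double-precision centers of dimension d):

  bytes written per mapper/reducer ~ k * d * 8

So k = 10,000 clusters over even a modest d = 1,000,000 dimensions is
already about 80 GB from every writer, at the end of every iteration.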

Re: KMeans Throwing Hadoop write errors for large values of K

Posted by Colum Foley <co...@gmail.com>.
Thanks for the insights, Ted.

Re: KMeans Throwing Hadoop write errors for large values of K

Posted by Ted Dunning <te...@gmail.com>.
SVD techniques probably won't actually help that much given your current
sparsity.  There are two issues:

First, your data is already quite small.  SVD will only make it larger
because the average number of non-zero elements will increase dramatically.

Second, given your sparsity, SVD will have very little to work with.  Very
sparse data elements are inherently nearly orthogonal.

I think you need to find more features so that your average number of
non-zeros goes up.
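
To put a number on that (a back-of-envelope estimate, assuming the
non-zeros land independently and uniformly at random): with about 1.5
non-zeros per row out of 30 million dimensions, the probability that two
rows share even one non-zero coordinate is roughly

  1.5 * 1.5 / 30,000,000 ~ 7.5e-8

so nearly every pair of rows has a dot product, and hence a cosine
similarity, of exactly zero.  There is simply no overlap for SVD (or for
k-means) to exploit.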

Re: KMeans Throwing Hadoop write errors for large values of K

Posted by Colum Foley <co...@gmail.com>.
Thanks a lot, Ted. I think there's some preprocessing I can do to remove
some outliers, which may reduce my matrix size considerably. I'll also
check out some SVD techniques.
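
The kind of preprocessing I have in mind is roughly this (an untested
sketch against the Mahout vector API; columnCounts would come from a
prior counting pass over the data):

import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class FeaturePruner {
  // Keep only features that occur in at least minCount rows, so one-off
  // "outlier" features are dropped and the effective matrix shrinks.
  public static Vector prune(Vector row, int[] columnCounts, int minCount) {
    Vector pruned = new RandomAccessSparseVector(row.size());
    for (Iterator<Vector.Element> it = row.iterateNonZero(); it.hasNext(); ) {
      Vector.Element e = it.next();
      if (columnCounts[e.index()] >= minCount) {
        pruned.setQuick(e.index(), e.get());
      }
    }
    return pruned;
  }
}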

Re: KMeans Throwing Hadoop write errors for large values of K

Posted by Ted Dunning <te...@gmail.com>.
The new streaming k-means should be able to handle that data pretty
efficiently.  My guess is that on a single 16-core machine it should be
able to complete the clustering in 10 minutes or so.  That is extrapolation
and thus could be wildly off, of course.
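
In case it helps to see the shape of it, here is a toy illustration of
the general streaming idea (my own sketch, not the actual Mahout code):
one pass over the data maintains a small budget of weighted "sketch"
centroids, and an ordinary in-memory k-means over those centroids then
produces the final k clusters.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketch {
  static final class Centroid {
    final double[] point;
    double weight;
    Centroid(double[] point) { this.point = point; this.weight = 1.0; }
  }

  private final List<Centroid> sketch = new ArrayList<Centroid>();
  private final Random random = new Random();
  private double facilityCost;  // cost of opening a new centroid

  public StreamingSketch(double initialFacilityCost) {
    this.facilityCost = initialFacilityCost;
  }

  public void add(double[] x, int sketchBudget) {
    Centroid nearest = null;
    double nearestDistance = Double.POSITIVE_INFINITY;
    for (Centroid c : sketch) {
      double d = squaredDistance(c.point, x);
      if (d < nearestDistance) { nearestDistance = d; nearest = c; }
    }
    // Open a new centroid with probability proportional to the squared
    // distance to the nearest existing one; otherwise fold the point in.
    if (nearest == null
        || random.nextDouble() < Math.min(nearestDistance / facilityCost, 1.0)) {
      sketch.add(new Centroid(x.clone()));
    } else {
      // Fold the point into its nearest centroid (weighted running mean).
      nearest.weight++;
      for (int i = 0; i < x.length; i++) {
        nearest.point[i] += (x[i] - nearest.point[i]) / nearest.weight;
      }
    }
    // When the sketch outgrows its budget, make new centroids more
    // expensive.  (A real implementation would also re-collapse the
    // sketch at this point; omitted here for brevity.)
    if (sketch.size() > sketchBudget) {
      facilityCost *= 2.0;
    }
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}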

You definitely mean sparse.  30 M / 20 M = 1.5 non-zero features per row.
That may be a problem.  Or it might make the clustering fairly trivial.

Dan,

That code isn't checked into trunk yet, I think.  Can you comment on
where working code can be found on GitHub?

Re: KMeans Throwing Hadoop write errors for large values of K

Posted by Colum Foley <co...@gmail.com>.
I have approximately 20 million items and a feature vector approximately 30 million in length, very sparse.
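
For concreteness, each row is a 30-million-wide vector with only a
handful of non-zeros, i.e. something along these lines (an illustrative
sketch, not my actual pipeline):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteSparseRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        new Path("FeatureVectorsMahoutFormat/part-00000"),
        Text.class, VectorWritable.class);

    // Cardinality is 30 million, but only the touched cells are stored.
    Vector row = new RandomAccessSparseVector(30000000);
    row.setQuick(1234567, 1.0);
    row.setQuick(9876543, 1.0);

    writer.append(new Text("item-0"), new VectorWritable(row));
    writer.close();
  }
}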

Would you have any suggestions for other clustering algorithms I should look at?

Thanks,
Colum 

Re: KMeans Throwing Hadoop write errors for large values of K

Posted by Ted Dunning <te...@gmail.com>.
You are beginning to exit the realm of reasonable applicability for normal
k-means algorithms here.

How much data do you have?
