Posted to user@mahout.apache.org by Mahesh Balija <ba...@gmail.com> on 2014/04/04 08:31:17 UTC

Re: problems with running K-means on hadoop's pseudo-distributed mode

Hi Wei Zhang,

Can you check whether this path exists in your Hadoop HDFS?
 /tmp/mahout-work-weiz/reuters-kmeans-clusters/part-randomSeed

Instead of using the cluster-reuters.sh script, can you run k-means manually
on your cluster?

BTW, what command are you using to run the cluster-reuters.sh
script?

Best,
Mahesh.B.
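
A minimal sketch of the existence check suggested above, assuming the `hadoop` command is on the PATH. It relies on `hadoop fs -test -e <path>`, which exits with status 0 when the path exists. The function names and the injectable `run` parameter are illustrative, not part of Hadoop or Mahout.

```python
import subprocess


def hdfs_test_command(path, hadoop_bin="hadoop"):
    """Build the argv for `hadoop fs -test -e <path>`."""
    return [hadoop_bin, "fs", "-test", "-e", path]


def hdfs_path_exists(path, run=subprocess.call):
    """Return True if `path` exists on HDFS.

    `run` takes an argv list and returns an exit status. It defaults to
    subprocess.call and can be swapped out to try the function without
    a Hadoop installation.
    """
    return run(hdfs_test_command(path)) == 0
```

Calling `hdfs_path_exists("/tmp/mahout-work-weiz/reuters-kmeans-clusters/part-randomSeed")` on a machine with Hadoop configured would report whether the seed file is present. Existence alone does not show that the file has usable content, so its size and contents are worth inspecting as well.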


On Wed, Apr 2, 2014 at 12:24 AM, Wei Zhang <we...@us.ibm.com> wrote:

>
> Hello,
>
> I am new to Mahout. I have installed Mahout 0.9.
>
> I have configured Hadoop (1.0.3) on my laptop (Red Hat 6, Lenovo W530). I
> am experimenting with the k-means test (by running
> mahout-distribution-0.9/examples/bin/cluster-reuters.sh).
>
> I am able to run the k-means test out of the box on Hadoop in local mode
> successfully.
>
> However, when I run Hadoop in pseudo-distributed mode, the k-means test
> fails (after successfully running 9 Map-Reduce jobs) with the following
> stack trace:
>
> Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /tmp/mahout-work-weiz/reuters-kmeans-clusters/part-randomSeed. Check your -c argument.
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:206)
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:140)
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:103)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:47)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:76)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:607)
>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:76)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:607)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> I tried to google the reason for this failure, but couldn't get a clear
> understanding. Could you help with some pointers?
>
> Thanks!
>
> Wei
>

Re: problems with running K-means on hadoop's pseudo-distributed mode

Posted by Ted Dunning <te...@gmail.com>.

Heh?

Can you say more about what you are trying to do and how you are doing it?

Also, can you say how this matters to the community?

And how it relates to the recent clustering work done in Mahout?




On Mon, Jun 9, 2014 at 9:02 AM, Ajay Sharma <aj...@gmail.com> wrote:

> K-means Clustering
> [...]

Re: problems with running K-means on hadoop's pseudo-distributed mode

Posted by Ajay Sharma <aj...@gmail.com>.

K-means Clustering

K-means: widely used clustering technique.
Initialization: blind random selection from the input data.
Drawback: very sensitive to the choice of initial cluster centers (seeds).
A local optimum can be arbitrarily bad with respect to the objective
function, compared to the globally optimal clustering.

Idea (k-means++): spread the k initial cluster centers away from each other.
O(log k)-competitive with the optimal clustering; substantial convergence
time speedups (empirical).

C <- {a point sampled uniformly at random from X}
while |C| < k do
    sample x ∈ X with probability proportional to D(x)^2
    C <- C ∪ {x}
end while

c ∈ C: cluster center
x ∈ X: data point
D(x): distance between x and the nearest center in C that has already been
chosen
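
The seeding loop above can be sketched in plain Python as follows. This is a minimal illustration and not Mahout's implementation; it assumes points are tuples of numbers and Euclidean distance, and the names `squared_distance` and `kmeanspp_seeds` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Choose k initial centers from `points` by D(x)^2-weighted sampling."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    # First center: sampled uniformly at random from X.
    centers = [rng.choice(points)]
    # d2[i] is D(x_i)^2, the squared distance to the nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Sample x with probability proportional to D(x)^2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Only the newly added center can reduce a point's nearest distance.
        d2 = [min(old, squared_distance(p, nxt)) for old, p in zip(d2, points)]
    return centers
```

A point that is already a center has weight zero, so it is not drawn again while other distinct points remain. That is what pushes the seeds apart.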

Implementation test dataset: 4 squares (n=16)

Expected: 4 nice clusters

Evaluation on test dataset
• 200 clustering runs, each with and without k-means++ initialization
• Measure RSS (intra-class variance)
• K-means: optimal clustering 115 times (57.5%)
• K-means++: optimal clustering 182 times (91%)
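
For reference, a small sketch of how such a measure could be computed, assuming RSS here means the residual sum of squares: the sum, over all points, of the squared Euclidean distance to the nearest cluster center. The post does not spell the abbreviation out, so this reading and the function names are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rss(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# One unit square with its centroid as the only center:
# each corner contributes 0.5^2 + 0.5^2 = 0.5.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
print(rss(square, [(0.5, 0.5)]))  # prints 2.0
```

A lower value means tighter clusters, so comparing RSS across runs is one way to tell which runs reached the best clustering.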

Comparison of the frequency distribution of RSS values between k-means and
k-means++ on the evaluation dataset (n=200)

Comparison of the frequency distribution of RSS values between k-means and
k-means++ on the UCI real-world dataset (n=500)

On Mon, Jun 9, 2014 at 10:50 AM, sumit sharma <pr...@gmail.com> wrote:

> Naïve Bayes can be used for text clustering effectively in Mahout.

Re: problems with running K-means on hadoop's pseudo-distributed mode

Posted by sumit sharma <pr...@gmail.com>.

Naïve Bayes can be used for text clustering effectively in Mahout.

On Mon, Jun 9, 2014 at 7:07 PM, Eeti Jain <ee...@gmail.com> wrote:

> Sir, I have been working on the Hadoop/Mahout platform, clustering Twitter
> data for my thesis. I would like to know whether Mahout can handle text
> documents in other languages. Could you please help me?

-- 

Best Regards:
Sumit Sharma

Re: problems with running K-means on hadoop's pseudo-distributed mode

Posted by Eeti Jain <ee...@gmail.com>.

Sir, I have been working on the Hadoop/Mahout platform, clustering Twitter
data for my thesis. I would like to know whether Mahout can handle text
documents in other languages. Could you please help me?

Re: problems with running K-means on hadoop's pseudo-distributed mode

Posted by Wei Zhang <we...@us.ibm.com>.

Hi Mahesh,

I formatted my HDFS, and now it works out of the box in pseudo-distributed
mode. My HDFS must have been behaving abnormally.

Thanks!

Wei

From:	Wei Zhang/Watson/IBM@IBMUS
To:	user@mahout.apache.org,
Date:	04/04/2014 10:33 AM
Subject:	Re: problems with running K-means on hadoop's
            pseudo-distributed mode

Hi Mahesh,

Thanks a lot for the response!

(1) I am using cluster-reuters.sh to run k-means. I simply started the
script and selected k-means as the clustering algorithm. My HADOOP_HOME
environment variable is set, and my Hadoop runs in pseudo-distributed mode.

(2) A /tmp/mahout-work-weiz/reuters-kmeans-clusters/part-randomSeed file
was generated on HDFS, but it is nearly empty (hundreds of bytes).
Compared to the files generated in my local Mahout working directories,
most (if not all) of the files generated on HDFS are nearly empty. That is,
the output files were created but have no content (when I inspect them with
hadoop dfs -text).
I suspect the HDFS writes had some issues.

(3) The cluster-reuters.sh script successfully finishes the text
vectorization Map-Reduce jobs, but again, the output seems to be empty.
I am trying to run it manually in pseudo-distributed mode to investigate
further. I am not sure whether cluster-reuters.sh is intended to run on a
pseudo-distributed Hadoop.

Thanks a lot; any pointers would be greatly appreciated.

Wei


Re: problems with running K-means on hadoop's pseudo-distributed mode

Posted by Wei Zhang <we...@us.ibm.com>.

Hi Mahesh,

Thanks a lot for the response!

(1) I am using cluster-reuters.sh to run k-means. I simply started the
script and selected k-means as the clustering algorithm. My HADOOP_HOME
environment variable is set, and my Hadoop runs in pseudo-distributed mode.

(2) A /tmp/mahout-work-weiz/reuters-kmeans-clusters/part-randomSeed file
was generated on HDFS, but it is nearly empty (hundreds of bytes).
Compared to the files generated in my local Mahout working directories,
most (if not all) of the files generated on HDFS are nearly empty. That is,
the output files were created but have no content (when I inspect them with
hadoop dfs -text).
I suspect the HDFS writes had some issues.

(3) The cluster-reuters.sh script successfully finishes the text
vectorization Map-Reduce jobs, but again, the output seems to be empty.
I am trying to run it manually in pseudo-distributed mode to investigate
further. I am not sure whether cluster-reuters.sh is intended to run on a
pseudo-distributed Hadoop.

Thanks a lot; any pointers would be greatly appreciated.

Wei
