Posted to user@mahout.apache.org by nfantone <nf...@gmail.com> on 2009/07/01 15:37:38 UTC

Re: Clustering from DB

Ok, so I managed to write a VectorIterable implementation to draw data
from my database. Now, I'm in the process of understanding the output
file that kMeans (with a Canopy input) produces. Someone, please,
correct me if I'm mistaken. At first, my thought was that there were
as many "cluster-i" directories as clusters detected from the dataset
by the algorithm(s), until I printed out the content of the
"part-00000" file in them. It seems as though it stores a <Writable>
cluster ID and then a <Writable> Cluster, each line. Are those all the
actual clusters detected? If so, what's the reason behind the
directory nomenclature and its consecutive enumeration? Does every
"part-00000", in different "cluster-i" directories, hold different
clusters? And, what about the "points" directory? I can tell it
follows a <VectorID, Value> register format. What's that value
supposed to represent? The ID of the cluster it belongs to, perhaps?
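
For reference, this is roughly the reader loop I used to print those files
(a quick sketch; exception handling is omitted, and the casts are my guesses
based on what getKeyClass()/getValueClass() reported):

    JobConf conf = new JobConf();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path("output/clusters-0/part-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Writable key = (Writable) reader.getKeyClass().newInstance();
    Cluster value = (Cluster) reader.getValueClass().newInstance();
    while (reader.next(key, value)) {
      // one record per cluster: <cluster ID, Cluster>
      System.out.println(key + " -> " + value.getCenter());
    }
    reader.close();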

There really ought to be documentation about this somewhere. I don't
know if I need some kind of permission, but I'm offering to write it
and upload it to the Mahout wiki, or wherever it should go, once I've
finished my project.

Thanks in advance.

On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen<sr...@gmail.com> wrote:
> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of an
> exception since it has a core that is independent of Hadoop and can
> use data from files, databases, etc. It also happens to have some
> clustering logic. So you can use, say, TreeClusteringRecommender to
> generate user clusters, based on data in a database. This isn't
> Mahout's primary clustering support, but, if it fits what you need, at
> least it is there.
>
> On Fri, Jun 26, 2009 at 12:21 PM, nfantone<nf...@gmail.com> wrote:
>> Thanks for the fast response, Grant.
>>
>> I am aware of what you pointed out about Taste. I just mentioned it to
>> make a reference to something similar to what I needed to
>> implement/use, namely the "DataModel" interface.
>>
>> I'm going to try the solution you suggested and write an
>> implementation of VectorIterable. Expect me to come back here for
>> feedback.
>

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It does appear that recent changes have broken KMeans. Removing the "/*" 
makes the example work, but that won't work in general if there is more 
than one reducer specified. For that, the "part-00000" in the calling 
location needs to be removed so that isConverged() can iterate over all 
of the cluster part files produced by the reducers. The unit test 
'testKMeansMRJob' throws the same exception, but something must be 
swallowing it, because the test still passes.
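
A minimal sketch (untested) of the iteration I have in mind, using
listStatus() with a PathFilter to pick up every reducer's output under,
say, "output/clusters-0"; it assumes isConverged() is changed to read the
path it is given as-is:

    FileStatus[] parts = fs.listStatus(new Path(clustersOut), new PathFilter() {
      public boolean accept(Path path) {
        // keep only the reducer outputs: part-00000, part-00001, ...
        return path.getName().startsWith("part-");
      }
    });
    boolean converged = true;
    for (FileStatus part : parts) {
      // every cluster in every part file must have converged
      converged = converged && isConverged(part.getPath().toString(), conf, fs);
    }
    return converged;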

I will have some time to look into this in a day or so, but in the 
meantime perhaps the folks who have been making all the KMeans changes 
can figure it out.

Jeff

nfantone wrote:
> This error is still bugging me. The exception:
>
> WARNING: java.io.FileNotFoundException: File
> output/clusters-0/part-00000/* does not exist.
> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
> does not exist.
>
> occurs first at:
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>
> which corresponds to:
>
>   private static boolean isConverged(String filePath, JobConf conf,
> FileSystem fs)
>       throws IOException {
>     Path outPart = new Path(filePath + "/*");
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
> conf);  <-- THIS
>     ...
>   }
>
> where isConverged() is called in this fashion:
>
> return isConverged(clustersOut + "/part-00000", conf, fs);
>
> by runIteration(), which is previously invoked by runJob() like:
>
>      String clustersOut = output + "/clusters-" + iteration;
>       converged = runIteration(input, clustersIn, clustersOut, measureClass,
>           delta, numReduceTasks, iteration);
>
> Consequently, assuming it's the first iteration and the output folder
> has been named "output" by the user, the SequenceFile.Reader receives
> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
> believe the path should end in "part-00000" and the + "/*" should be
> removed... although someone, evidently, thought otherwise.
>
> Any feedback?
>
> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>   
>> I was using Canopy to create input clusters, but the error appeared
>> while running kMeans (if I run kMeans' job only with previously
>> created clusters from Canopy placed in output/canopies as initial
>> clusters, it still fails). I noticed no other problems. I was using
>> revision 790979 before updating.  Strangely, there were no changes in
>> the job and driver classes from that revision. svn diff shows that the
>> only classes that changed in the org.apache.mahout.clustering.kmeans
>> package were KMeansInfo.java and RandomSeedGenerator.java.
>>
>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>>     
>>> Hum, no, it's looking for the output of the first iteration. Were there
>>> other errors? What was the last revision you were running? It does look like
>>> something got horked, as it should be looking for output/clusters-0/*. Can
>>> you diff the job and driver classes to see what changed?
>>>
>>> Jeff
>>>
>>> nfantone wrote:
>>>       
>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>> got the following exception:
>>>>
>>>> WARNING: java.io.FileNotFoundException: File
>>>> output/clusters-0/part-00000/* does not exist.
>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>> does not exist.
>>>>
>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>> thrown before the update and, to me, its message is not quite clear.
>>>> It seems as if it's looking for any file inside a "part-00000" directory,
>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>> names for output files.
>>>>
>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>
>>>>
>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>
>>>>         
>>>>> Thanks for the feedback, Jeff.
>>>>>
>>>>>
>>>>>           
>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>> sequence
>>>>>> file format, but the Key is never used. To my knowledge, there is no
>>>>>> requirement to assign identifiers to the input points*. Users are free
>>>>>> to
>>>>>> associate an arbitrary name field with each vector - also label mappings
>>>>>> may
>>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>>> other
>>>>>> clustering applications. The name field is now used as a vector
>>>>>> identifier
>>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>>> only.
>>>>>>
>>>>>>             
>>>>> The key may not be used internally, but externally it can prove to
>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>> a user's historical behavior. Being able to collect the output
>>>>> information as <UserID, ClusterID> is quite neat, as it allows me to,
>>>>> for instance, retrieve user information using data directly from an
>>>>> HDFS file's field.
>>>>>
>>>>>
>>>>>           
>>>>
>>>>         
>>>       
>
>
>   


Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Continued in:
http://www.nabble.com/Distance-calculation-performance-issue-td24700418.html

On Mon, Jul 27, 2009 at 3:38 PM, Grant Ingersoll<gs...@apache.org> wrote:
> I think the bigger issue here is we are doing extra work to calculate
> distance.  I'd suggest hanging on a few days to see if we can get that
> straightened out.
>
> On Jul 27, 2009, at 2:33 PM, nfantone wrote:
>
>>> Well, it does matter to some degree since picking random vectors tends to
>>> give you dense vectors whereas text gives you very sparse vectors.
>>
>>> Different patterns of sparsity can cause radically different time
>>> complexity for the clustering.
>>
>> I have yet to find a random combination of vectors that actually
>> benefits the performance of kMeans substantially. I have also tried
>> real datasets (like the one I was initially using, with large amounts
>> of data describing consumers' buying habits) to no avail. How should a
>> collection of vectors be created so as, say, not to compromise the
>> algorithm's functionality significantly?
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
I think the bigger issue here is we are doing extra work to calculate  
distance.  I'd suggest hanging on a few days to see if we can get that  
straightened out.

On Jul 27, 2009, at 2:33 PM, nfantone wrote:

>> Well, it does matter to some degree since picking random vectors  
>> tends to give you dense vectors whereas text gives you very sparse  
>> vectors.
>
>> Different patterns of sparsity can cause radically different time
>> complexity for the clustering.
>
> I have yet to find a random combination of vectors that actually
> benefits the performance of kMeans substantially. I have also tried
> real datasets (like the one I was initially using, with large amounts
> of data describing consumers' buying habits) to no avail. How should a
> collection of vectors be created so as, say, not to compromise the
> algorithm's functionality significantly?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by Ted Dunning <te...@gmail.com>.
I don't think that any particular kind of vector will "compromise" the code.

It is just that your mileage will vary with different patterns of sparsity.
It isn't just the input file size.


On Mon, Jul 27, 2009 at 11:33 AM, nfantone <nf...@gmail.com> wrote:

> How should a collection of vectors be created so as, say, not to compromise
> the algorithm's functionality significantly?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
> Well, it does matter to some degree since picking random vectors tends to give you dense vectors whereas text gives you very sparse vectors.

> Different patterns of sparsity can cause radically different time complexity
> for the clustering.

I have yet to find a random combination of vectors that actually
benefits the performance of kMeans substantially. I have also tried
real datasets (like the one I was initially using, with large amounts
of data describing consumers' buying habits) to no avail. How should a
collection of vectors be created so as, say, not to compromise the
algorithm's functionality significantly?

Re: Clustering from DB

Posted by Ted Dunning <te...@gmail.com>.
Well, it does matter to some degree since picking random vectors tends to
give you dense vectors whereas text gives you very sparse vectors.

Another issue is that raw text without a kill list gives you sparse vectors
with common words always non-zero.
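
By "kill list" I just mean a stop list applied before vectorizing, so that
ubiquitous terms never contribute non-zero entries. A toy sketch (the word
list is purely illustrative):

    private static final Set<String> KILL_LIST =
        new HashSet<String>(Arrays.asList("the", "a", "an", "and", "of", "to"));

    private static List<String> filterTokens(List<String> tokens) {
      List<String> kept = new ArrayList<String>(tokens.size());
      for (String token : tokens) {
        // drop kill-listed terms before they ever reach the vectorizer
        if (!KILL_LIST.contains(token.toLowerCase())) {
          kept.add(token);
        }
      }
      return kept;
    }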

Different patterns of sparsity can cause radically different time complexity
for the clustering.

On Mon, Jul 27, 2009 at 11:05 AM, nfantone <nf...@gmail.com> wrote:

> > I'm not sure why testing with Random vectors would be all that useful
> > other than it shows it runs.  I wouldn't expect anything useful to come
> > out of it, though.
>
> Well... my point was that it really doesn't matter how you create the
> Vectors: it's the size of the final file/s that's relevant. Then
> again, that IS the problem behind it all: it runs - and that's about all
> it does, for now.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering from DB

Posted by Ted Dunning <te...@gmail.com>.
Picking random vectors from a mixture of normal distributions would,
however, be a very useful test.
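
Something along these lines (a rough sketch of mine; DenseVector fits here
since the samples are dense, and the constants are arbitrary) would give
data with known cluster structure that k-means ought to recover:

    private static final Random RND = new Random();
    private static final int CARDINALITY = 1200;
    private static final int K = 20;           // mixture components
    private static final double SIGMA = 0.05;  // within-cluster std deviation

    // one random mean per mixture component
    private static Vector[] sampleMeans() {
      Vector[] means = new Vector[K];
      for (int k = 0; k < K; k++) {
        means[k] = new DenseVector(CARDINALITY);
        for (int i = 0; i < CARDINALITY; i++) {
          means[k].setQuick(i, RND.nextDouble());
        }
      }
      return means;
    }

    // pick a component uniformly, then add spherical Gaussian noise to its mean
    private static Vector sampleFromMixture(Vector[] means) {
      Vector mean = means[RND.nextInt(means.length)];
      Vector v = new DenseVector(CARDINALITY);
      for (int i = 0; i < CARDINALITY; i++) {
        v.setQuick(i, mean.getQuick(i) + RND.nextGaussian() * SIGMA);
      }
      return v;
    }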

On Mon, Jul 27, 2009 at 6:08 AM, Grant Ingersoll <gs...@apache.org> wrote:

> I'm not sure why testing with Random vectors would be all that useful other
> than it shows it runs.  I wouldn't expect anything useful to come out of it,
> though.




-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
> I'm not sure why testing with Random vectors would be all that useful other than it shows it runs.  I wouldn't expect anything useful to come out of it, though.

Well... my point was that it really doesn't matter how you create the
Vectors: it's the size of the final file/s that's relevant. Then
again, that IS the problem behind it all: it runs - and that's about all
it does, for now.

> How did you create your SeqFile?  From what I can tell from Ted, it is important to get the norms and distance measures lined up.

I created the file by using the random-vector-generator methods above
and the ClusteringUtils class in the project. Do the vectors have to be
normalized? If so, I can tell mine aren't. Should normalize() be called
before appending a vector to the output?

> Hmm, some profiling shows the pain is in the distance calculation for emitPointToNearestCluster.

I may be wrong, but I think that's the only method being called during
a map phase (once per vector in the file/s). From a quick glance at
it, may I suggest these simple changes?

    Cluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    double distance = 0;
    for (Cluster cluster : clusters) {
      distance = measure.distance(point, cluster.getCenter());
      if (distance < nearestDistance) {
        nearestCluster = cluster;
        nearestDistance = distance;
      }
    }

Extract the distance variable outside the loop, initialize it to 0,
and eliminate the null comparison. That is one less check to perform
on each iteration.

On Mon, Jul 27, 2009 at 1:55 PM, Shashikant Kore<sh...@gmail.com> wrote:
> On Mon, Jul 27, 2009 at 10:11 PM, Grant Ingersoll<gs...@apache.org> wrote:
>>
>> Not following.  The distance calc stuff is irrespective of the type of
>> Vector.  I was referring to the centroid length square (I think you called
>> it the triangle inequality) stuff that Shashikant added on MAHOUT-121.  We
>> use it for testing convergence, but not for other distance calculations.  I
>> haven't looked to see if it is applicable yet, but it seems like it should
>> be.
>>
>
> Grant,
>
> Yes, that part of the patch is missing.  In my original patch, I had
> modified emitPointToNearestCluster() in kmeans/Cluster.java to
> calculate the distance between a document and the centroids of various clusters.
>  (There is no triangle inequality code, though.)  In the later patches
> I don't see that code.
>
> I had reviewed the final patch, but I missed out on this one.  I
> think I only ran Canopy and not K-means. Incidentally, I am
> hopelessly out of date with trunk as recently I have not worked on
> this.  BTW, I haven't really followed this thread in depth. So, I
> might be speaking out of context here. Apologies.
>
> --shashi
>

Re: Clustering from DB

Posted by Ted Dunning <te...@gmail.com>.
I think I was the one who didn't follow.

I thought you meant the optimizations to use sparse techniques for the
distance computation.

Another candidate for the problem is that the centroids may be filling in
and becoming dense.

On Mon, Jul 27, 2009 at 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:

> That explains why Jeff didn't see the slow down with dense vectors.
>>
>
> Not following.  The distance calc stuff is irrespective of the type of
> Vector.  I was referring to the centroid length square (I think you called
> it the triangle inequality) stuff that Shashikant added on MAHOUT-121.  We
> use it for testing convergence, but not for other distance calculations.  I
> haven't looked to see if it is applicable yet, but it seems like it should
> be.




-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 27, 2009, at 12:55 PM, Shashikant Kore wrote:

> On Mon, Jul 27, 2009 at 10:11 PM, Grant  
> Ingersoll<gs...@apache.org> wrote:
>>
>> Not following.  The distance calc stuff is irrespective of the type  
>> of
>> Vector.  I was referring to the centroid length square (I think you  
>> called
>> it the triangle inequality) stuff that Shashikant added on  
>> MAHOUT-121.  We
>> use it for testing convergence, but not for other distance  
>> calculations.  I
>> haven't looked to see if it is applicable yet, but it seems like it  
>> should
>> be.
>>
>
> Grant,
>
> Yes, that part of the patch is missing.  In my original patch, I had
> modified emitPointToNearestCluster() in kmeans/Cluster.java to
> calculate the distance between a document and the centroids of various clusters.
> (There is no triangle inequality code, though.)  In the later patches
> I don't see that code.
>
> I had reviewed the final patch, but I missed out on this one.  I
> think I only ran Canopy and not K-means. Incidentally, I am
> hopelessly out of date with trunk as recently I have not worked on
> this.  BTW, I haven't really followed this thread in depth. So, I
> might be speaking out of context here. Apologies.

I'll be on a plane tomorrow, will see if I can track down the  
differences.

-Grant

Re: Clustering from DB

Posted by Shashikant Kore <sh...@gmail.com>.
On Mon, Jul 27, 2009 at 10:11 PM, Grant Ingersoll<gs...@apache.org> wrote:
>
> Not following.  The distance calc stuff is irrespective of the type of
> Vector.  I was referring to the centroid length square (I think you called
> it the triangle inequality) stuff that Shashikant added on MAHOUT-121.  We
> use it for testing convergence, but not for other distance calculations.  I
> haven't looked to see if it is applicable yet, but it seems like it should
> be.
>

Grant,

Yes, that part of the patch is missing.  In my original patch, I had
modified emitPointToNearestCluster() in kmeans/Cluster.java to
calculate the distance between a document and the centroids of various clusters.
 (There is no triangle inequality code, though.)  In the later patches
I don't see that code.

I had reviewed the final patch, but I missed out on this one.  I
think I only ran Canopy and not K-means. Incidentally, I am
hopelessly out of date with trunk as recently I have not worked on
this.  BTW, I haven't really followed this thread in depth. So, I
might be speaking out of context here. Apologies.

--shashi

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 27, 2009, at 12:03 PM, Ted Dunning wrote:

> Yes.
>
> That explains why Jeff didn't see the slow down with dense vectors.

Not following.  The distance calc stuff is irrespective of the type of  
Vector.  I was referring to the centroid length square (I think you  
called it the triangle inequality) stuff that Shashikant added on  
MAHOUT-121.  We use it for testing convergence, but not for other  
distance calculations.  I haven't looked to see if it is applicable  
yet, but it seems like it should be.

>
> On Mon, Jul 27, 2009 at 8:03 AM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>
>> Hmm, some profiling shows the pain is in the distance calculation for
>> emitPointToNearestCluster.  Seems that we only use the optimized  
>> distance
>> calculations for testing convergence, but shouldn't we also use it  
>> for
>> calculating the distances to the cluster, too?
>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by Ted Dunning <te...@gmail.com>.
Yes.

That explains why Jeff didn't see the slow down with dense vectors.

On Mon, Jul 27, 2009 at 8:03 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Hmm, some profiling shows the pain is in the distance calculation for
> emitPointToNearestCluster.  Seems that we only use the optimized distance
> calculations for testing convergence, but shouldn't we also use it for
> calculating the distances to the cluster, too?




-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
Hmm, some profiling shows the pain is in the distance calculation for  
emitPointToNearestCluster.  Seems that we only use the optimized  
distance calculations for testing convergence, but shouldn't we also  
use it for calculating the distances to the cluster, too?
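
A hedged sketch (not a patch) of what that could look like in the
nearest-cluster loop, reusing the distance(double, Vector, Vector) overload
quoted earlier in this thread; the getCenterLengthSquared() accessor is my
invention and would need the squared norm cached per cluster:

    Cluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    for (Cluster cluster : clusters) {
      // hypothetical accessor: squared norm of the center, computed once
      double centerLengthSquared = cluster.getCenterLengthSquared();
      double distance = measure.distance(centerLengthSquared,
          cluster.getCenter(), point);
      if (distance < nearestDistance) {
        nearestCluster = cluster;
        nearestDistance = distance;
      }
    }
    // ... emit point to nearestCluster as before ...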


On Jul 27, 2009, at 10:19 AM, Grant Ingersoll wrote:

> I can confirm it is taking a while.  I spun up the dataset provided  
> and am on the first iteration, the mapper is at 50% and it has been  
> over an hour.
>
> Not a good sign.  I will try profiling.
>
> On Jul 27, 2009, at 10:07 AM, Jeff Eastman wrote:
>
>> It's been over a year since I ran any tests of KMeans on larger  
>> data sets and there has been a lot of refactoring done in the  
>> interim. I was also using only dense vectors. It is entirely  
>> possible it is now doing something really poorly. I'm surprised  
>> that it is taking such a long time to munch such a small dataset  
>> but it sounds like you can reproduce it on a single machine so  
>> profiling should suggest the root cause. I'm going to be away from  
>> the computer for the next two weeks - a real vacation - so  
>> unfortunately I won't be able to contribute to this effort.
>>
>> Jeff
>>
>> Grant Ingersoll wrote:
>>>
>>> On Jul 27, 2009, at 12:00 AM, nfantone wrote:
>>>
>>>> Thanks, Grant. I just updated and noticed the change.
>>>>
>>>> As a side note: do you think someone could run some real tests on
>>>> kMeans,
>>>> in particular, other than the ones already in the project? I bet  
>>>> there
>>>> are other naive (or not so naive) problems like that. After much
>>>> coding, reading and experimenting in the last weeks with  
>>>> clustering in
>>>> Mahout, I am inclined to say something may not fully work with  
>>>> kMeans,
>>>> as of now. Or perhaps it just needs some refactoring/performance
>>>> tweaks. Jeff has claimed to run the job over gigabytes of data,
>>>> using
>>>> a rather small cluster, in minutes. Has anyone tried to accomplish
>>>> this recently (since the hadoop upgrade to 0.20)? Just use
>>>> ClusteringUtils to write a file of some (arguably not so)  
>>>> significant
>>>> number of random Vectors (say, 800,000+) and let that be the
>>>> input of
>>>> a KMeansMRJob (testKMeansMRJob() could very well serve this purpose
>>>> with little change). You'll end up with a file of about ~85MB to
>>>> ~100MB, which can easily fit into memory in any modern computer.  
>>>> Now,
>>>> run the whole thing (I've tried both, locally and using a three
>>>> node-cluster setup - which, frankly, seemed like a bit too much
>>>> computing power for such small number of items in the dataset).  
>>>> It'll
>>>> take forever to complete.
>>>>
>>>
>>> I hope to hit this soon.  I've got some Amazon credits I need to  
>>> use and hope to put them towards this.
>>>
>>> As with any project in open source, we need people to kick the  
>>> tires, give feedback (thank you!) and also poke around the code to  
>>> make it better.
>>>
>>> Have you tried your data with some other clustering code, perhaps  
>>> Weka or something like that?
>>>
>>>
>>>> These simple methods could be used to generate any given number of
>>>> random SparseVectors for testing's sake, if anyone is interested:
>>>>
>>>> private static Random rnd = new Random();
>>>> private static final int CARDINALITY = 1200;
>>>> private static final int MAX_NON_ZEROS = 200;
>>>> private static final int MAX_VECTORS = 850000;
>>>>
>>>> private static Vector getRandomVector() {
>>>>   Integer id = rnd.nextInt(Integer.MAX_VALUE);
>>>>   Vector v = new SparseVector(id.toString(), CARDINALITY);
>>>>   int nonZeros = 0;
>>>>   while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
>>>>   for (int i = 0; i < nonZeros; i++) {
>>>>       v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
>>>>   }
>>>>   return v;
>>>> }
>>>>
>>>> private static List<Vector> getVectors() {
>>>>     List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
>>>>     for (int i = 0; i < MAX_VECTORS; i++){
>>>>         vectors.add(getRandomVector());
>>>>     }
>>>>     return vectors;
>>>> }
>>>>
>>>
>>>
>>> I'm not sure why testing with Random vectors would be all that  
>>> useful other than it shows it runs.  I wouldn't expect anything  
>>> useful to come out of it, though.
>>>
>>>
>>>> On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll<gsingers@apache.org> wrote:
>>>>> Fixed on MAHOUT-152
>>>>>
>>>>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>>>>
>>>>>> That does indeed look like a problem.  I'll fix.
>>>>>>
>>>>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>>>>
>>>>>>> While (still) experiencing performance issues and inspecting  
>>>>>>> kMeans
>>>>>>> code, I found this lying around in
>>>>>>> SquaredEuclideanDistanceMeasure.java:
>>>>>>>
>>>>>>> public double distance(double centroidLengthSquare, Vector  
>>>>>>> centroid,
>>>>>>> Vector v) {
>>>>>>> if (centroid.size() != centroid.size()) {
>>>>>>>  throw new CardinalityException();
>>>>>>> }
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>> I bet someone meant to compare centroid and v sizes and didn't
>>>>>>> notice.
>>>>>>>
>>>>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com>  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Well, as it turned out, it didn't have anything to do with my
>>>>>>>> performance issue but I found out that writing a Cluster  
>>>>>>>> (with a
>>>>>>>> single vector as its center) to a file and then reading it,  
>>>>>>>> requires
>>>>>>>> the center to be added as point; otherwise, you won't be able  
>>>>>>>> to
>>>>>>>> retrieve it as it should. Therefore, one should do:
>>>>>>>>
>>>>>>>> // Writing
>>>>>>>> String id = "someID";
>>>>>>>> Vector v = new SparseVector();
>>>>>>>> Cluster c = new Cluster(v);
>>>>>>>> c.addPoint(v);
>>>>>>>> seqWriter.append(new Text(id), c);
>>>>>>>>
>>>>>>>> // Reading
>>>>>>>> Writable key = (Writable)  
>>>>>>>> seqReader.getKeyClass().newInstance();
>>>>>>>> Cluster value = (Cluster)  
>>>>>>>> seqReader.getValueClass().newInstance();
>>>>>>>> while (seqReader.next(key, value)) {
>>>>>>>> ...
>>>>>>>> Vector centroid = value.getCenter();
>>>>>>>> ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I  
>>>>>>>> think
>>>>>>>> this shouldn't happen. Then again, it's not that relevant, I  
>>>>>>>> guess.
>>>>>>>>
>>>>>>>> Sorry for bringing different subjects to the same thread.
>>>>>>>>
>>>>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com>  
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I've been using RandomSeedGenerator to generate initial  
>>>>>>>>> clusters for
>>>>>>>>> kMeans and while checking its code I stumbled upon this:
>>>>>>>>>
>>>>>>>>>  while (reader.next(key, value)) {
>>>>>>>>>    Cluster newCluster = new Cluster(value);
>>>>>>>>>    newCluster.addPoint(value);
>>>>>>>>>    ....
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> I can see it adds the vector to the newly created cluster,  
>>>>>>>>> even though
>>>>>>>>> it is setting it as its center in the constructor. Wasn't this
>>>>>>>>> corrected in a past revision? I thought this was not necessary
>>>>>>>>> anymore. I'll look into it a little bit more and see if this  
>>>>>>>>> has
>>>>>>>>> something to do with my lack of performance with my dataset.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 23, 2009 at 3:45 PM,  
>>>>>>>>> nfantone<nf...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps a larger convergence value might help (-d, I  
>>>>>>>>>>>>> believe).
>>>>>>>>>>>>
>>>>>>>>>>>> I'll try that.
>>>>>>>>>>
>>>>>>>>>> There was no significant change while modifying the  
>>>>>>>>>> convergence value.
>>>>>>>>>> At least, none was observed during the first three  
>>>>>>>>>> iterations which
>>>>>>>>>> lasted the same amount of time as before, more or less.
>>>>>>>>>>
>>>>>>>>>>>>> Is there any chance your data is publicly shareable?   
>>>>>>>>>>>>> Come to think
>>>>>>>>>>>>> of
>>>>>>>>>>>>> it,
>>>>>>>>>>>>> with the vector representations, as long as you don't  
>>>>>>>>>>>>> publish the
>>>>>>>>>>>>> key
>>>>>>>>>>>>> (which
>>>>>>>>>>>>> terms map to which index), I would think most all data  
>>>>>>>>>>>>> is publicly
>>>>>>>>>>>>> shareable.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm sorry, I don't quite understand what you're asking.  
>>>>>>>>>>>> Publicly
>>>>>>>>>>>> shareable? As in user-permissions to access/read/write  
>>>>>>>>>>>> the data?
>>>>>>>>>>>
>>>>>>>>>>> As in post a copy of the SequenceFile somewhere for  
>>>>>>>>>>> download,
>>>>>>>>>>> assuming you
>>>>>>>>>>> can.  Then others could presumably try it out.
>>>>>>>>>>
>>>>>>>>>> My bad. Of course it is:
>>>>>>>>>>
>>>>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>>>>
>>>>>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>>>>>>> SparseVector> logical format.
>>>>>>>>>>
>>>>>>>>>>> That does seem like an awfully long time for 62 MB on a 6  
>>>>>>>>>>> node
>>>>>>>>>>> cluster. How many iterations are running?
>>>>>>>>>>
>>>>>>>>>> I'm running the whole thing with a 20 iterations cap. Every  
>>>>>>>>>> iteration
>>>>>>>>>> - EXCEPT the first one which, oddly, lasted just two  
>>>>>>>>>> minutes - took
>>>>>>>>>> around 3 hrs to complete:
>>>>>>>>>>
>>>>>>>>>> Hadoop job_200907221734_0001
>>>>>>>>>> Finished in: 1mins, 42sec
>>>>>>>>>>
>>>>>>>>>> Hadoop job_200907221734_0004
>>>>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>>>>
>>>>>>>>>> Hadoop job_200907221734_0005
>>>>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>>>>
>>>>>>>>>>> How did you generate your initial clusters?
>>>>>>>>>>
>>>>>>>>>> I generate the initial clusters via the RandomSeedGenerator  
>>>>>>>>>> setting a
>>>>>>>>>> 'k' value of 200.  This is what I did to initiate the  
>>>>>>>>>> process for the
>>>>>>>>>> first time:
>>>>>>>>>>
>>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>>>> input/user.data
>>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>>>> init/user.data
>>>>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/ 
>>>>>>>>>> user.data -c
>>>>>>>>>> init -o output -r 32 -d 0.01 -k 200
>>>>>>>>>>
>>>>>>>>>>> Where are the iteration jobs spending most of their time  
>>>>>>>>>>> (map vs.
>>>>>>>>>>> reduce)
>>>>>>>>>>
>>>>>>>>>> I'm tempted to say map here, but the time spent is rather
>>>>>>>>>> comparable, actually. Reduce attempts are taking an hour  
>>>>>>>>>> and a half to
>>>>>>>>>> end (average), and so are Map attempts. Here are some  
>>>>>>>>>> representative
>>>>>>>>>> examples from the web UI:
>>>>>>>>>>
>>>>>>>>>> reduce
>>>>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>>>>
>>>>>>>>>> map
>>>>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>>>>
>>>>>>>>>> Perhaps there's some problem in the way I create the
>>>>>>>>>> SequenceFile? I could share the JAVA code as well, if  
>>>>>>>>>> required.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/ 
>>>>>> Droids) using
>>>>>> Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/ 
>>>>> Droids) using
>>>>> Solr/Lucene:
>>>>> http://www.lucidimagination.com/search
>>>>>
>>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
I can confirm it is taking a while.  I spun up the dataset provided  
and am on the first iteration, the mapper is at 50% and it has been  
over an hour.

Not a good sign.  I will try profiling.

On Jul 27, 2009, at 10:07 AM, Jeff Eastman wrote:

> It's been over a year since I ran any tests of KMeans on larger data  
> sets and there has been a lot of refactoring done in the interim. I  
> was also using only dense vectors. It is entirely possible it is now  
> doing something really poorly. I'm surprised that it is taking such  
> a long time to munch such a small dataset but it sounds like you can  
> reproduce it on a single machine so profiling should suggest the  
> root cause. I'm going to be away from the computer for the next two  
> weeks - a real vacation - so unfortunately I won't be able to  
> contribute to this effort.
>
> Jeff
>
> Grant Ingersoll wrote:
>>
>> On Jul 27, 2009, at 12:00 AM, nfantone wrote:
>>
>>> Thanks, Grant. I just updated and noticed the change.
>>>
>>> As a side note: do you think someone could run some real tests on
>>> kMeans,
>>> in particular, other than the ones already in the project? I bet  
>>> there
>>> are other naive (or not so naive) problems like that. After much
>>> coding, reading and experimenting in the last weeks with  
>>> clustering in
>>> Mahout, I am inclined to say something may not fully work with  
>>> kMeans,
>>> as of now. Or perhaps it just needs some refactoring/performance
>>> tweaks. Jeff has claimed to run the job over gigabytes of data,
>>> using
>>> a rather small cluster, in minutes. Has anyone tried to accomplish
>>> this recently (since the hadoop upgrade to 0.20)? Just use
>>> ClusteringUtils to write a file of some (arguably not so)  
>>> significant
>>> number of random Vectors (say, 800,000+) and let that be the input
>>> of
>>> a KMeansMRJob (testKMeansMRJob() could very well serve this purpose
>>> with little change). You'll end up with a file of about ~85MB to
>>> ~100MB, which can easily fit into memory in any modern computer.  
>>> Now,
>>> run the whole thing (I've tried both, locally and using a three
>>> node-cluster setup - which, frankly, seemed like a bit too much
>>> computing power for such a small number of items in the dataset).
>>> It'll
>>> take forever to complete.
>>>
>>
>> I hope to hit this soon.  I've got some Amazon credits I need to  
>> use and hope to put them towards this.
>>
>> As with any project in open source, we need people to kick the  
>> tires, give feedback (thank you!) and also poke around the code to  
>> make it better.
>>
>> Have you tried your data with some other clustering code, perhaps  
>> Weka or something like that?
>>
>>
>>> These simple methods could be used to generate any given number of
>>> random SparseVectors for testing's sake, if anyone is interested:
>>>
>>> private static Random rnd = new Random();
>>> private static final int CARDINALITY = 1200;
>>> private static final int MAX_NON_ZEROS = 200;
>>> private static final int MAX_VECTORS = 850000;
>>>
>>> private static Vector getRandomVector() {
>>>    Integer id = rnd.nextInt(Integer.MAX_VALUE);
>>>    Vector v = new SparseVector(id.toString(), CARDINALITY);
>>>    int nonZeros = 0;
>>>    while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
>>>    for (int i = 0; i < nonZeros; i++) {
>>>        v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
>>>    }
>>>    return v;
>>> }
>>>
>>> private static List<Vector> getVectors() {
>>>      List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
>>>      for (int i = 0; i < MAX_VECTORS; i++){
>>>          vectors.add(getRandomVector());
>>>      }
>>>      return vectors;
>>> }
>>>
>>
>>
>> I'm not sure why testing with Random vectors would be all that  
>> useful other than it shows it runs.  I wouldn't expect anything  
>> useful to come out of it, though.
>>
>>
>>> On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll<gsingers@apache.org> wrote:
>>>> Fixed on MAHOUT-152
>>>>
>>>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>>>
>>>>> That does indeed look like a problem.  I'll fix.
>>>>>
>>>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>>>
>>>>>> While (still) experiencing performance issues and inspecting  
>>>>>> kMeans
>>>>>> code, I found this lying around in
>>>>>> SquaredEuclideanDistanceMeasure.java:
>>>>>>
>>>>>> public double distance(double centroidLengthSquare, Vector  
>>>>>> centroid,
>>>>>> Vector v) {
>>>>>> if (centroid.size() != centroid.size()) {
>>>>>>   throw new CardinalityException();
>>>>>> }
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> I bet someone meant to compare centroid and v sizes and didn't
>>>>>> notice.
>>>>>>
>>>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com>  
>>>>>> wrote:
>>>>>>>
>>>>>>> Well, as it turned out, it didn't have anything to do with my
>>>>>>> performance issue but I found out that writing a Cluster (with a
>>>>>>> single vector as its center) to a file and then reading it,  
>>>>>>> requires
>>>>>>> the center to be added as a point; otherwise, you won't be able to
>>>>>>> retrieve it as it should. Therefore, one should do:
>>>>>>>
>>>>>>> // Writing
>>>>>>> String id = "someID";
>>>>>>> Vector v = new SparseVector();
>>>>>>> Cluster c = new Cluster(v);
>>>>>>> c.addPoint(v);
>>>>>>> seqWriter.append(new Text(id), c);
>>>>>>>
>>>>>>> // Reading
>>>>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>>>>> Cluster value = (Cluster)  
>>>>>>> seqReader.getValueClass().newInstance();
>>>>>>> while (seqReader.next(key, value)) {
>>>>>>> ...
>>>>>>> Vector centroid = value.getCenter();
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I  
>>>>>>> think
>>>>>>> this shouldn't happen. Then again, it's not that relevant, I  
>>>>>>> guess.
>>>>>>>
>>>>>>> Sorry for bringing different subjects to the same thread.
>>>>>>>
>>>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com>  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I've been using RandomSeedGenerator to generate initial  
>>>>>>>> clusters for
>>>>>>>> kMeans and while checking its code I stumbled upon this:
>>>>>>>>
>>>>>>>>   while (reader.next(key, value)) {
>>>>>>>>     Cluster newCluster = new Cluster(value);
>>>>>>>>     newCluster.addPoint(value);
>>>>>>>>     ....
>>>>>>>>   }
>>>>>>>>
>>>>>>>> I can see it adds the vector to the newly created cluster,  
>>>>>>>> even though
>>>>>>>> it is setting it as its center in the constructor. Wasn't this
>>>>>>>> corrected in a past revision? I thought this was not necessary
>>>>>>>> anymore. I'll look into it a little bit more and see if this  
>>>>>>>> has
>>>>>>>> something to do with my lack of performance with my dataset.
>>>>>>>>
>>>>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com>  
>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps a larger convergence value might help (-d, I  
>>>>>>>>>>>> believe).
>>>>>>>>>>>
>>>>>>>>>>> I'll try that.
>>>>>>>>>
>>>>>>>>> There was no significant change while modifying the  
>>>>>>>>> convergence value.
>>>>>>>>> At least, none was observed during the first three  
>>>>>>>>> iterations which
>>>>>>>>> lasted the same amount of time as before, more or less.
>>>>>>>>>
>>>>>>>>>>>> Is there any chance your data is publicly shareable?   
>>>>>>>>>>>> Come to think
>>>>>>>>>>>> of
>>>>>>>>>>>> it,
>>>>>>>>>>>> with the vector representations, as long as you don't  
>>>>>>>>>>>> publish the
>>>>>>>>>>>> key
>>>>>>>>>>>> (which
>>>>>>>>>>>> terms map to which index), I would think most all data is  
>>>>>>>>>>>> publicly
>>>>>>>>>>>> shareable.
>>>>>>>>>>>
>>>>>>>>>>> I'm sorry, I don't quite understand what you're asking.  
>>>>>>>>>>> Publicly
>>>>>>>>>>> shareable? As in user-permissions to access/read/write the  
>>>>>>>>>>> data?
>>>>>>>>>>
>>>>>>>>>> As in post a copy of the SequenceFile somewhere for download,
>>>>>>>>>> assuming you
>>>>>>>>>> can.  Then others could presumably try it out.
>>>>>>>>>
>>>>>>>>> My bad. Of course it is:
>>>>>>>>>
>>>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>>>
>>>>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>>>>>> SparseVector> logical format.
>>>>>>>>>
>>>>>>>>>> That does seem like an awfully long time for 62 MB on a 6  
>>>>>>>>>> node
>>>>>>>>>> cluster. How many iterations are running?
>>>>>>>>>
>>>>>>>>> I'm running the whole thing with a 20 iterations cap. Every  
>>>>>>>>> iteration
>>>>>>>>> - EXCEPT the first one which, oddly, lasted just two  
>>>>>>>>> minutes - took
>>>>>>>>> around 3 hrs to complete:
>>>>>>>>>
>>>>>>>>> Hadoop job_200907221734_0001
>>>>>>>>> Finished in: 1mins, 42sec
>>>>>>>>>
>>>>>>>>> Hadoop job_200907221734_0004
>>>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>>>
>>>>>>>>> Hadoop job_200907221734_0005
>>>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>>>
>>>>>>>>>> How did you generate your initial clusters?
>>>>>>>>>
>>>>>>>>> I generate the initial clusters via the RandomSeedGenerator  
>>>>>>>>> setting a
>>>>>>>>> 'k' value of 200.  This is what I did to initiate the  
>>>>>>>>> process for the
>>>>>>>>> first time:
>>>>>>>>>
>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>>> input/user.data
>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>>> init/user.data
>>>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/ 
>>>>>>>>> user.data -c
>>>>>>>>> init -o output -r 32 -d 0.01 -k 200
>>>>>>>>>
>>>>>>>>>> Where are the iteration jobs spending most of their time  
>>>>>>>>>> (map vs.
>>>>>>>>>> reduce)
>>>>>>>>>
>>>>>>>>> I'm tempted to say map here, but the time spent is rather
>>>>>>>>> comparable, actually. Reduce attempts are taking an hour and  
>>>>>>>>> a half to
>>>>>>>>> end (average), and so are Map attempts. Here are some  
>>>>>>>>> representative
>>>>>>>>> examples from the web UI:
>>>>>>>>>
>>>>>>>>> reduce
>>>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>>>
>>>>>>>>> map
>>>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>>>
>>>>>>>>> Perhaps there's some problem in the way I create the
>>>>>>>>> SequenceFile? I could share the JAVA code as well, if  
>>>>>>>>> required.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/ 
>>>>> Droids) using
>>>>> Solr/Lucene:
>>>>> http://www.lucidimagination.com/search
>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/ 
>>>> Droids) using
>>>> Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It's been over a year since I ran any tests of KMeans on larger data 
sets and there has been a lot of refactoring done in the interim. I was 
also using only dense vectors. It is entirely possible it is now doing 
something really poorly. I'm surprised that it is taking such a long 
time to munch such a small dataset but it sounds like you can reproduce 
it on a single machine so profiling should suggest the root cause. I'm 
going to be away from the computer for the next two weeks - a real 
vacation - so unfortunately I won't be able to contribute to this effort.

Jeff

Grant Ingersoll wrote:
>
> On Jul 27, 2009, at 12:00 AM, nfantone wrote:
>
>> Thanks, Grant. I just updated and noticed the change.
>>
>> As a side note: do you think someone could run some real tests on kMeans,
>> in particular, other than the ones already in the project? I bet there
>> are other naive (or not so naive) problems like that. After much
>> coding, reading and experimenting in the last weeks with clustering in
>> Mahout, I am inclined to say something may not fully work with kMeans,
>> as of now. Or perhaps it just needs some refactoring/performance
>> tweaks. Jeff has claimed to run the job over gigabytes of data, using
>> a rather small cluster, in minutes. Has anyone tried to accomplish
>> this recently (since the hadoop upgrade to 0.20)? Just use
>> ClusteringUtils to write a file of some (arguably not so) significant
>> number of random Vectors (say, 800,000+) and let that be the input of
>> a KMeansMRJob (testKMeansMRJob() could very well serve this purpose
>> with little change). You'll end up with a file of about ~85MB to
>> ~100MB, which can easily fit into memory in any modern computer. Now,
>> run the whole thing (I've tried both, locally and using a three
>> node-cluster setup - which, frankly, seemed like a bit too much
>> computing power for such a small number of items in the dataset). It'll
>> take forever to complete.
>>
>
> I hope to hit this soon.  I've got some Amazon credits I need to use 
> and hope to put them towards this.
>
> As with any project in open source, we need people to kick the tires, 
> give feedback (thank you!) and also poke around the code to make it 
> better.
>
> Have you tried your data with some other clustering code, perhaps Weka 
> or something like that?
>
>
>> These simple methods could be used to generate any given number of
>> random SparseVectors for testing's sake, if anyone is interested:
>>
>>  private static Random rnd = new Random();
>>  private static final int CARDINALITY = 1200;
>>  private static final int MAX_NON_ZEROS = 200;
>>  private static final int MAX_VECTORS = 850000;
>>
>>  private static Vector getRandomVector() {
>>     Integer id = rnd.nextInt(Integer.MAX_VALUE);
>>     Vector v = new SparseVector(id.toString(), CARDINALITY);
>>     int nonZeros = 0;
>>     while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
>>     for (int i = 0; i < nonZeros; i++) {
>>         v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
>>     }
>>     return v;
>>  }
>>
>>  private static List<Vector> getVectors() {
>>       List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
>>       for (int i = 0; i < MAX_VECTORS; i++){
>>           vectors.add(getRandomVector());
>>       }
>>       return vectors;
>>  }
>>
>
>
> I'm not sure why testing with Random vectors would be all that useful 
> other than it shows it runs.  I wouldn't expect anything useful to 
> come out of it, though.
>
>
>> On Sun, Jul 26, 2009 at 10:30 PM, Grant 
>> Ingersoll<gs...@apache.org> wrote:
>>> Fixed on MAHOUT-152
>>>
>>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>>
>>>> That does indeed look like a problem.  I'll fix.
>>>>
>>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>>
>>>>> While (still) experiencing performance issues and inspecting kMeans
>>>>> code, I found this lying around in SquaredEuclideanDistanceMeasure.java:
>>>>>
>>>>> public double distance(double centroidLengthSquare, Vector centroid,
>>>>> Vector v) {
>>>>>  if (centroid.size() != centroid.size()) {
>>>>>    throw new CardinalityException();
>>>>>  }
>>>>>  ...
>>>>>  }
>>>>>
>>>>> I bet someone meant to compare centroid and v sizes and didn't
>>>>> notice.
>>>>>
>>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>
>>>>>> Well, as it turned out, it didn't have anything to do with my
>>>>>> performance issue but I found out that writing a Cluster (with a
>>>>>> single vector as its center) to a file and then reading it, requires
>>>>>> the center to be added as a point; otherwise, you won't be able to
>>>>>> retrieve it as it should. Therefore, one should do:
>>>>>>
>>>>>> // Writing
>>>>>> String id = "someID";
>>>>>> Vector v = new SparseVector();
>>>>>> Cluster c = new Cluster(v);
>>>>>> c.addPoint(v);
>>>>>> seqWriter.append(new Text(id), c);
>>>>>>
>>>>>> // Reading
>>>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>>>> while (seqReader.next(key, value)) {
>>>>>> ...
>>>>>> Vector centroid = value.getCenter();
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>>>>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>>>>>
>>>>>> Sorry for bringing different subjects to the same thread.
>>>>>>
>>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com> wrote:
>>>>>>>
>>>>>>> I've been using RandomSeedGenerator to generate initial clusters 
>>>>>>> for
>>>>>>> kMeans and while checking its code I stumbled upon this:
>>>>>>>
>>>>>>>    while (reader.next(key, value)) {
>>>>>>>      Cluster newCluster = new Cluster(value);
>>>>>>>      newCluster.addPoint(value);
>>>>>>>      ....
>>>>>>>    }
>>>>>>>
>>>>>>> I can see it adds the vector to the newly created cluster, even 
>>>>>>> though
>>>>>>> it is setting it as its center in the constructor. Wasn't this
>>>>>>> corrected in a past revision? I thought this was not necessary
>>>>>>> anymore. I'll look into it a little bit more and see if this has
>>>>>>> something to do with my lack of performance with my dataset.
>>>>>>>
>>>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com> 
>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>>>>
>>>>>>>>>> I'll try that.
>>>>>>>>
>>>>>>>> There was no significant change while modifying the convergence 
>>>>>>>> value.
>>>>>>>> At least, none was observed during the first three iterations 
>>>>>>>> which
>>>>>>>> lasted the same amount of time as before, more or less.
>>>>>>>>
>>>>>>>>>>> Is there any chance your data is publicly shareable?  Come 
>>>>>>>>>>> to think
>>>>>>>>>>> of
>>>>>>>>>>> it,
>>>>>>>>>>> with the vector representations, as long as you don't 
>>>>>>>>>>> publish the
>>>>>>>>>>> key
>>>>>>>>>>> (which
>>>>>>>>>>> terms map to which index), I would think most all data is 
>>>>>>>>>>> publicly
>>>>>>>>>>> shareable.
>>>>>>>>>>
>>>>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>>>>>>
>>>>>>>>> As in post a copy of the SequenceFile somewhere for download,
>>>>>>>>> assuming you
>>>>>>>>> can.  Then others could presumably try it out.
>>>>>>>>
>>>>>>>> My bad. Of course it is:
>>>>>>>>
>>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>>
>>>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>>>>> SparseVector> logical format.
>>>>>>>>
>>>>>>>>> That does seem like an awfully long time for 62 MB on a 6 node
>>>>>>>>> cluster. How many iterations are running?
>>>>>>>>
>>>>>>>> I'm running the whole thing with a 20 iterations cap. Every 
>>>>>>>> iteration
>>>>>>>> - EXCEPT the first one which, oddly, lasted just two minutes -
>>>>>>>> took around 3 hrs to complete:
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0001
>>>>>>>> Finished in: 1mins, 42sec
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0004
>>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0005
>>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>>
>>>>>>>>> How did you generate your initial clusters?
>>>>>>>>
>>>>>>>> I generate the initial clusters via the RandomSeedGenerator 
>>>>>>>> setting a
>>>>>>>> 'k' value of 200.  This is what I did to initiate the process 
>>>>>>>> for the
>>>>>>>> first time:
>>>>>>>>
>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>> input/user.data
>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>> init/user.data
>>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i 
>>>>>>>> input/user.data -c
>>>>>>>> init -o output -r 32 -d 0.01 -k 200
>>>>>>>>
>>>>>>>>> Where are the iteration jobs spending most of their time (map vs.
>>>>>>>>> reduce)
>>>>>>>>
>>>>>>>> I'm tempted to say map here, but the time spent is rather
>>>>>>>> comparable, actually. Reduce attempts are taking an hour and a 
>>>>>>>> half to
>>>>>>>> end (average), and so are Map attempts. Here are some 
>>>>>>>> representative
>>>>>>>> examples from the web UI:
>>>>>>>>
>>>>>>>> reduce
>>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>>
>>>>>>>> map
>>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>>
>>>>>>>> Perhaps there's some problem in the way I create the
>>>>>>>> SequenceFile? I could share the JAVA code as well, if required.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
>>>> using
>>>> Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
>>> using
>>> Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>


Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 27, 2009, at 12:00 AM, nfantone wrote:

> Thanks, Grant. I just updated and noticed the change.
>
> As a side note: do you think someone could run some real tests on kMeans,
> in particular, other than the ones already in the project? I bet there
> are other naive (or not so naive) problems like that. After much
> coding, reading and experimenting in the last weeks with clustering in
> Mahout, I am inclined to say something may not fully work with kMeans,
> as of now. Or perhaps it just needs some refactoring/performance
> tweaks. Jeff has claimed to run the job over gigabytes of data, using
> a rather small cluster, in minutes. Has anyone tried to accomplish
> this recently (since the hadoop upgrade to 0.20)? Just use
> ClusteringUtils to write a file of some (arguably not so) significant
> number of random Vectors (say, 800,000+) and let that be the input of
> a KMeansMRJob (testKMeansMRJob() could very well serve this purpose
> with little change). You'll end up with a file of about ~85MB to
> ~100MB, which can easily fit into memory in any modern computer. Now,
> run the whole thing (I've tried both, locally and using a three
> node-cluster setup - which, frankly, seemed like a bit too much
> computing power for such a small number of items in the dataset). It'll
> take forever to complete.
>

I hope to hit this soon.  I've got some Amazon credits I need to use  
and hope to put them towards this.

As with any project in open source, we need people to kick the tires,  
give feedback (thank you!) and also poke around the code to make it  
better.

Have you tried your data with some other clustering code, perhaps Weka  
or something like that?


> These simple methods could be used to generate any given number of
> random SparseVectors for testing's sake, if anyone is interested:
>
>  private static Random rnd = new Random();
>  private static final int CARDINALITY = 1200;
>  private static final int MAX_NON_ZEROS = 200;
>  private static final int MAX_VECTORS = 850000;
>
>  private static Vector getRandomVector() {
> 	Integer id = rnd.nextInt(Integer.MAX_VALUE);
> 	Vector v = new SparseVector(id.toString(), CARDINALITY);
> 	int nonZeros = 0;
> 	while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
> 	for (int i = 0; i < nonZeros; i++) {
> 		v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
> 	}
> 	return v;
>  }
>
>  private static List<Vector> getVectors() {
> 	  List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
> 	  for (int i = 0; i < MAX_VECTORS; i++){
> 		  vectors.add(getRandomVector());
> 	  }
> 	  return vectors;
>  }
>


I'm not sure why testing with random vectors would be all that useful  
other than showing that it runs.  I wouldn't expect anything useful to  
come out of it, though.


> On Sun, Jul 26, 2009 at 10:30 PM, Grant  
> Ingersoll<gs...@apache.org> wrote:
>> Fixed on MAHOUT-152
>>
>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>
>>> That does indeed look like a problem.  I'll fix.
>>>
>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>
>>>> While (still) experiencing performance issues and inspecting kMeans
>>>> code, I found this lying around  
>>>> SquaredEuclideanDistanceMeasure.java:
>>>>
>>>> public double distance(double centroidLengthSquare, Vector  
>>>> centroid,
>>>> Vector v) {
>>>>  if (centroid.size() != centroid.size()) {
>>>>    throw new CardinalityException();
>>>>  }
>>>>  ...
>>>>  }
>>>>
>>>> I bet someone meant to compare centroid and v sizes and didn't  
>>>> notice.
>>>>
>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com>  
>>>> wrote:
>>>>>
>>>>> Well, as it turned out, it didn't have anything to do with my
>>>>> performance issue, but I found out that writing a Cluster (with a
>>>>> single vector as its center) to a file and then reading it  
>>>>> requires
>>>>> the center to be added as a point; otherwise, you won't be able to
>>>>> retrieve it as expected. Therefore, one should do:
>>>>>
>>>>> // Writing
>>>>> String id = "someID";
>>>>> Vector v = new SparseVector();
>>>>> Cluster c = new Cluster(v);
>>>>> c.addPoint(v);
>>>>> seqWriter.append(new Text(id), c);
>>>>>
>>>>> // Reading
>>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>>> while (seqReader.next(key, value)) {
>>>>> ...
>>>>> Vector centroid = value.getCenter();
>>>>> ...
>>>>> }
>>>>>
>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>>>> this shouldn't happen. Then again, it's not that relevant, I  
>>>>> guess.
>>>>>
>>>>> Sorry for bringing different subjects to the same thread.
>>>>>
>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com>  
>>>>> wrote:
>>>>>>
>>>>>> I've been using RandomSeedGenerator to generate initial  
>>>>>> clusters for
>>>>>> kMeans and while checking its code I stumbled upon this:
>>>>>>
>>>>>>    while (reader.next(key, value)) {
>>>>>>      Cluster newCluster = new Cluster(value);
>>>>>>      newCluster.addPoint(value);
>>>>>>      ....
>>>>>>    }
>>>>>>
>>>>>> I can see it adds the vector to the newly created cluster, even  
>>>>>> though
>>>>>> it is setting it as its center in the constructor. Wasn't this
>>>>>> corrected in a past revision? I thought this was not necessary
>>>>>> anymore. I'll look into it a little bit more and see if this has
>>>>>> something to do with my lack of performance with my dataset.
>>>>>>
>>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com>  
>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Perhaps a larger convergence value might help (-d, I  
>>>>>>>>>> believe).
>>>>>>>>>
>>>>>>>>> I'll try that.
>>>>>>>
>>>>>>> There was no significant change while modifying the  
>>>>>>> convergence value.
>>>>>>> At least, none was observed during the first three iterations,  
>>>>>>> which
>>>>>>> lasted the same amount of time as before, more or less.
>>>>>>>
>>>>>>>>>> Is there any chance your data is publicly shareable?  Come  
>>>>>>>>>> to think
>>>>>>>>>> of
>>>>>>>>>> it,
>>>>>>>>>> with the vector representations, as long as you don't  
>>>>>>>>>> publish the
>>>>>>>>>> key
>>>>>>>>>> (which
>>>>>>>>>> terms map to which index), I would think most all data is  
>>>>>>>>>> publicly
>>>>>>>>>> shareable.
>>>>>>>>>
>>>>>>>>> I'm sorry, I don't quite understand what you're asking.  
>>>>>>>>> Publicly
>>>>>>>>> shareable? As in user-permissions to access/read/write the  
>>>>>>>>> data?
>>>>>>>>
>>>>>>>> As in post a copy of the SequenceFile somewhere for download,
>>>>>>>> assuming you
>>>>>>>> can.  Then others could presumably try it out.
>>>>>>>
>>>>>>> My bad. Of course it is:
>>>>>>>
>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>
>>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>>>> SparseVector> logical format.
>>>>>>>
>>>>>>>> That does seem like an awfully long time for 62 MB on a 6 node
>>>>>>>> cluster. How many iterations are running?
>>>>>>>
>>>>>>> I'm running the whole thing with a 20-iteration cap. Every  
>>>>>>> iteration
>>>>>>> - EXCEPT the first one, which, oddly, lasted just two minutes -  
>>>>>>> took
>>>>>>> around 3 hrs to complete:
>>>>>>>
>>>>>>> Hadoop job_200907221734_0001
>>>>>>> Finished in: 1mins, 42sec
>>>>>>>
>>>>>>> Hadoop job_200907221734_0004
>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>
>>>>>>> Hadoop job_200907221734_0005
>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>
>>>>>>>> How did you generate your initial clusters?
>>>>>>>
>>>>>>> I generate the initial clusters via the RandomSeedGenerator  
>>>>>>> setting a
>>>>>>> 'k' value of 200.  This is what I did to initiate the process  
>>>>>>> for the
>>>>>>> first time:
>>>>>>>
>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>> input/user.data
>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>> init/user.data
>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/ 
>>>>>>> user.data -c
>>>>>>> init -o output -r 32 -d 0.01 -k 200
>>>>>>>
>>>>>>>> Where are the iteration jobs spending most of their time (map  
>>>>>>>> vs.
>>>>>>>> reduce)
>>>>>>>
>>>>>>> I'm tempted to say map here, but the time spent is actually  
>>>>>>> rather comparable. Reduce attempts are taking an hour and a  
>>>>>>> half (on average) to
>>>>>>> finish, and so are Map attempts. Here are some  
>>>>>>> representative
>>>>>>> examples from the web UI:
>>>>>>>
>>>>>>> reduce
>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>
>>>>>>> map
>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>
>>>>>>> Perhaps there's something inconvenient in the way I create the
>>>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>>>
>>>>>>
>>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>>> using
>>> Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Thanks, Grant. I just updated and noticed the change.

As a side note: do you think someone could run some real tests on kMeans,
in particular, other than the ones already in the project? I bet there
are other naive (or not so naive) problems like that. After much
coding, reading and experimenting in the last weeks with clustering in
Mahout, I am inclined to say something may not fully work with kMeans,
as of now. Or perhaps it just needs some refactoring/performance
tweaks. Jeff has claimed to run the job over gigabytes of data, using
a rather small cluster, in minutes. Has anyone tried to accomplish
this recently (since the Hadoop upgrade to 0.20)? Just use
ClusteringUtils to write a file of some (arguably not so) significant
number of random Vectors (say, 800,000+) and let that be the input of
a KMeansMRJob (testKMeansMRJob() could very well serve this purpose
with little change). You'll end up with a file of about ~85MB to
~100MB, which can easily fit into memory in any modern computer. Now,
run the whole thing (I've tried both locally and using a three-node
cluster setup - which, frankly, seemed like a bit too much
computing power for such a small number of items in the dataset). It'll
take forever to complete.

These simple methods can be used to generate any given number of
random SparseVectors for testing's sake, if anyone is interested:

  private static Random rnd = new Random();
  private static final int CARDINALITY = 1200;
  private static final int MAX_NON_ZEROS = 200;
  private static final int MAX_VECTORS = 850000;

  private static Vector getRandomVector() {
	Integer id = rnd.nextInt(Integer.MAX_VALUE);
	Vector v = new SparseVector(id.toString(), CARDINALITY);
	int nonZeros = 0;
	while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
	for (int i = 0; i < nonZeros; i++) {
		v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
	}
	return v;
  }

  private static List<Vector> getVectors() {
	  List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
	  for (int i = 0; i < MAX_VECTORS; i++){
		  vectors.add(getRandomVector());
	  }
	  return vectors;
  }
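
For completeness, here's a minimal sketch of how those vectors could then
be dumped into a SequenceFile (the output path and the plain
SequenceFile.Writer usage below are just an example of mine, not the
actual test code; ClusteringUtils should work too):

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path path = new Path("testdata/points/file1");
  SequenceFile.Writer writer =
      SequenceFile.createWriter(fs, conf, path, Text.class, SparseVector.class);
  try {
	  // key each vector by the random id assigned in getRandomVector(),
	  // assuming Vector.getName() returns the name set in the constructor
	  for (Vector v : getVectors()) {
		  writer.append(new Text(v.getName()), (SparseVector) v);
	  }
  } finally {
	  writer.close();
  }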

On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll<gs...@apache.org> wrote:
> Fixed on MAHOUT-152
>
> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>
>> That does indeed look like a problem.  I'll fix.
>>
>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>
>>> While (still) experiencing performance issues and inspecting kMeans
>>> code, I found this lying around SquaredEuclideanDistanceMeasure.java:
>>>
>>> public double distance(double centroidLengthSquare, Vector centroid,
>>> Vector v) {
>>>  if (centroid.size() != centroid.size()) {
>>>    throw new CardinalityException();
>>>  }
>>>  ...
>>>  }
>>>
>>> I bet someone meant to compare centroid and v sizes and didn't notice.
>>>
>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com> wrote:
>>>>
>>>> Well, as it turned out, it didn't have anything to do with my
>>>> performance issue, but I found out that writing a Cluster (with a
>>>> single vector as its center) to a file and then reading it requires
>>>> the center to be added as a point; otherwise, you won't be able to
>>>> retrieve it as expected. Therefore, one should do:
>>>>
>>>> // Writing
>>>> String id = "someID";
>>>> Vector v = new SparseVector();
>>>> Cluster c = new Cluster(v);
>>>> c.addPoint(v);
>>>> seqWriter.append(new Text(id), c);
>>>>
>>>> // Reading
>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>> while (seqReader.next(key, value)) {
>>>> ...
>>>> Vector centroid = value.getCenter();
>>>> ...
>>>> }
>>>>
>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>>>
>>>> Sorry for bringing different subjects to the same thread.
>>>>
>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com> wrote:
>>>>>
>>>>> I've been using RandomSeedGenerator to generate initial clusters for
>>>>> kMeans and while checking its code I stumbled upon this:
>>>>>
>>>>>    while (reader.next(key, value)) {
>>>>>      Cluster newCluster = new Cluster(value);
>>>>>      newCluster.addPoint(value);
>>>>>      ....
>>>>>    }
>>>>>
>>>>> I can see it adds the vector to the newly created cluster, even though
>>>>> it is setting it as its center in the constructor. Wasn't this
>>>>> corrected in a past revision? I thought this was not necessary
>>>>> anymore. I'll look into it a little bit more and see if this has
>>>>> something to do with my lack of performance with my dataset.
>>>>>
>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>>
>>>>>>>> I'll try that.
>>>>>>
>>>>>> There was no significant change while modifying the convergence value.
>>>>>> At least, none was observed during the first three iterations, which
>>>>>> lasted the same amount of time as before, more or less.
>>>>>>
>>>>>>>>> Is there any chance your data is publicly shareable?  Come to think
>>>>>>>>> of
>>>>>>>>> it,
>>>>>>>>> with the vector representations, as long as you don't publish the
>>>>>>>>> key
>>>>>>>>> (which
>>>>>>>>> terms map to which index), I would think most all data is publicly
>>>>>>>>> shareable.
>>>>>>>>
>>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>>>>
>>>>>>> As in post a copy of the SequenceFile somewhere for download,
>>>>>>> assuming you
>>>>>>> can.  Then others could presumably try it out.
>>>>>>
>>>>>> My bad. Of course it is:
>>>>>>
>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>
>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>>> SparseVector> logical format.
>>>>>>
>>>>>>> That does seem like an awfully long time for 62 MB on a 6 node
>>>>>>> cluster. How many iterations are running?
>>>>>>
>>>>>> I'm running the whole thing with a 20-iteration cap. Every iteration
>>>>>> - EXCEPT the first one, which, oddly, lasted just two minutes - took
>>>>>> around 3 hrs to complete:
>>>>>>
>>>>>> Hadoop job_200907221734_0001
>>>>>> Finished in: 1mins, 42sec
>>>>>>
>>>>>> Hadoop job_200907221734_0004
>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>
>>>>>> Hadoop job_200907221734_0005
>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>
>>>>>>> How did you generate your initial clusters?
>>>>>>
>>>>>> I generate the initial clusters via the RandomSeedGenerator setting a
>>>>>> 'k' value of 200.  This is what I did to initiate the process for the
>>>>>> first time:
>>>>>>
>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>> input/user.data
>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>> init/user.data
>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
>>>>>> init -o output -r 32 -d 0.01 -k 200
>>>>>>
>>>>>>> Where are the iteration jobs spending most of their time (map vs.
>>>>>>> reduce)
>>>>>>
>>>>>> I'm tempted to say map here, but the time spent is actually rather
>>>>>> comparable. Reduce attempts are taking an hour and a half (on average)
>>>>>> to finish, and so are Map attempts. Here are some representative
>>>>>> examples from the web UI:
>>>>>>
>>>>>> reduce
>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>
>>>>>> map
>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>
>>>>>> Perhaps there's something inconvenient in the way I create the
>>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
Fixed on MAHOUT-152

On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:

> That does indeed look like a problem.  I'll fix.
>
> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>
>> While (still) experiencing performance issues and inspecting kMeans
>> code, I found this lying around SquaredEuclideanDistanceMeasure.java:
>>
>> public double distance(double centroidLengthSquare, Vector centroid,
>> Vector v) {
>>   if (centroid.size() != centroid.size()) {
>>     throw new CardinalityException();
>>   }
>>   ...
>>  }
>>
>> I bet someone meant to compare centroid and v sizes and didn't  
>> notice.
>>
>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com> wrote:
>>> Well, as it turned out, it didn't have anything to do with my
>>> performance issue, but I found out that writing a Cluster (with a
>>> single vector as its center) to a file and then reading it requires
>>> the center to be added as a point; otherwise, you won't be able to
>>> retrieve it as expected. Therefore, one should do:
>>>
>>> // Writing
>>> String id = "someID";
>>> Vector v = new SparseVector();
>>> Cluster c = new Cluster(v);
>>> c.addPoint(v);
>>> seqWriter.append(new Text(id), c);
>>>
>>> // Reading
>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>> while (seqReader.next(key, value)) {
>>> ...
>>> Vector centroid = value.getCenter();
>>> ...
>>> }
>>>
>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>>
>>> Sorry for bringing different subjects to the same thread.
>>>
>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com> wrote:
>>>> I've been using RandomSeedGenerator to generate initial clusters  
>>>> for
>>>> kMeans and while checking its code I stumbled upon this:
>>>>
>>>>     while (reader.next(key, value)) {
>>>>       Cluster newCluster = new Cluster(value);
>>>>       newCluster.addPoint(value);
>>>>       ....
>>>>     }
>>>>
>>>> I can see it adds the vector to the newly created cluster, even  
>>>> though
>>>> it is setting it as its center in the constructor. Wasn't this
>>>> corrected in a past revision? I thought this was not necessary
>>>> anymore. I'll look into it a little bit more and see if this has
>>>> something to do with my lack of performance with my dataset.
>>>>
>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com>  
>>>> wrote:
>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>
>>>>>>> I'll try that.
>>>>>
>>>>> There was no significant change while modifying the convergence  
>>>>> value.
>>>>> At least, none was observed during the first three iterations,  
>>>>> which
>>>>> lasted the same amount of time as before, more or less.
>>>>>
>>>>>>>> Is there any chance your data is publicly shareable?  Come to  
>>>>>>>> think of
>>>>>>>> it,
>>>>>>>> with the vector representations, as long as you don't publish  
>>>>>>>> the key
>>>>>>>> (which
>>>>>>>> terms map to which index), I would think most all data is  
>>>>>>>> publicly
>>>>>>>> shareable.
>>>>>>>
>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>>>
>>>>>> As in post a copy of the SequenceFile somewhere for download,  
>>>>>> assuming you
>>>>>> can.  Then others could presumably try it out.
>>>>>
>>>>> My bad. Of course it is:
>>>>>
>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>
>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>> SparseVector> logical format.
>>>>>
>>>>>> That does seem like an awfully long time for 62 MB on a 6 node  
>>>>>> cluster. How many iterations are running?
>>>>>
>>>>> I'm running the whole thing with a 20-iteration cap. Every  
>>>>> iteration
>>>>> - EXCEPT the first one, which, oddly, lasted just two minutes -  
>>>>> took
>>>>> around 3 hrs to complete:
>>>>>
>>>>> Hadoop job_200907221734_0001
>>>>> Finished in: 1mins, 42sec
>>>>>
>>>>> Hadoop job_200907221734_0004
>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>
>>>>> Hadoop job_200907221734_0005
>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>
>>>>>> How did you generate your initial clusters?
>>>>>
>>>>> I generate the initial clusters via the RandomSeedGenerator  
>>>>> setting a
>>>>> 'k' value of 200.  This is what I did to initiate the process  
>>>>> for the
>>>>> first time:
>>>>>
>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data  
>>>>> input/user.data
>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/ 
>>>>> user.data
>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/ 
>>>>> user.data -c
>>>>> init -o output -r 32 -d 0.01 -k 200
>>>>>
>>>>>> Where are the iteration jobs spending most of their time (map  
>>>>>> vs. reduce)
>>>>>
>>>>> I'm tempted to say map here, but the time spent is actually  
>>>>> rather comparable. Reduce attempts are taking an hour and a  
>>>>> half (on average) to
>>>>> finish, and so are Map attempts. Here are some  
>>>>> representative
>>>>> examples from the web UI:
>>>>>
>>>>> reduce
>>>>> attempt_200907221734_0002_r_000006_0
>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>
>>>>> map
>>>>> attempt_200907221734_0002_m_000000_0
>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>
>>>>> Perhaps there's something inconvenient in the way I create the
>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>
>>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
That does indeed look like a problem.  I'll fix.

On Jul 26, 2009, at 2:37 PM, nfantone wrote:

> While (still) experiencing performance issues and inspecting kMeans
> code, I found this lying around SquaredEuclideanDistanceMeasure.java:
>
>  public double distance(double centroidLengthSquare, Vector centroid,
> Vector v) {
>    if (centroid.size() != centroid.size()) {
>      throw new CardinalityException();
>    }
>    ...
>   }
>
> I bet someone meant to compare centroid and v sizes and didn't  
> notice.
>
> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com> wrote:
>> Well, as it turned out, it didn't have anything to do with my
>> performance issue, but I found out that writing a Cluster (with a
>> single vector as its center) to a file and then reading it requires
>> the center to be added as a point; otherwise, you won't be able to
>> retrieve it as expected. Therefore, one should do:
>>
>> // Writing
>> String id = "someID";
>> Vector v = new SparseVector();
>> Cluster c = new Cluster(v);
>> c.addPoint(v);
>> seqWriter.append(new Text(id), c);
>>
>> // Reading
>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>> while (seqReader.next(key, value)) {
>> ...
>> Vector centroid = value.getCenter();
>> ...
>> }
>>
>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>
>> Sorry for bringing different subjects to the same thread.
>>
>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com> wrote:
>>> I've been using RandomSeedGenerator to generate initial clusters for
>>> kMeans and while checking its code I stumbled upon this:
>>>
>>>      while (reader.next(key, value)) {
>>>        Cluster newCluster = new Cluster(value);
>>>        newCluster.addPoint(value);
>>>        ....
>>>      }
>>>
>>> I can see it adds the vector to the newly created cluster, even  
>>> though
>>> it is setting it as its center in the constructor. Wasn't this
>>> corrected in a past revision? I thought this was not necessary
>>> anymore. I'll look into it a little bit more and see if this has
>>> something to do with my lack of performance with my dataset.
>>>
>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>
>>>>>> I'll try that.
>>>>
>>>> There was no significant change while modifying the convergence  
>>>> value.
>>>> At least, none was observed during the first three iterations, which
>>>> lasted the same amount of time as before, more or less.
>>>>
>>>>>>> Is there any chance your data is publicly shareable?  Come to  
>>>>>>> think of
>>>>>>> it,
>>>>>>> with the vector representations, as long as you don't publish  
>>>>>>> the key
>>>>>>> (which
>>>>>>> terms map to which index), I would think most all data is  
>>>>>>> publicly
>>>>>>> shareable.
>>>>>>
>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>>
>>>>> As in post a copy of the SequenceFile somewhere for download,  
>>>>> assuming you
>>>>> can.  Then others could presumably try it out.
>>>>
>>>> My bad. Of course it is:
>>>>
>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>
>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>> SparseVector> logical format.
>>>>
>>>>> That does seem like an awfully long time for 62 MB on a 6 node  
>>>>> cluster. How many iterations are running?
>>>>
>>>> I'm running the whole thing with a 20-iteration cap. Every  
>>>> iteration
>>>> - EXCEPT the first one, which, oddly, lasted just two minutes - took
>>>> around 3 hrs to complete:
>>>>
>>>> Hadoop job_200907221734_0001
>>>> Finished in: 1mins, 42sec
>>>>
>>>> Hadoop job_200907221734_0004
>>>> Finished in: 2hrs, 34mins, 3sec
>>>>
>>>> Hadoop job_200907221734_0005
>>>> Finished in: 2hrs, 59mins, 34sec
>>>>
>>>>> How did you generate your initial clusters?
>>>>
>>>> I generate the initial clusters via the RandomSeedGenerator  
>>>> setting a
>>>> 'k' value of 200.  This is what I did to initiate the process for  
>>>> the
>>>> first time:
>>>>
>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/ 
>>>> user.data
>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/ 
>>>> user.data
>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/ 
>>>> user.data -c
>>>> init -o output -r 32 -d 0.01 -k 200
>>>>
>>>>> Where are the iteration jobs spending most of their time (map  
>>>>> vs. reduce)
>>>>
>>>> I'm tempted to say map here, but the time spent is actually  
>>>> rather comparable. Reduce attempts are taking an hour and a  
>>>> half (on average) to
>>>> finish, and so are Map attempts. Here are some  
>>>> representative
>>>> examples from the web UI:
>>>>
>>>> reduce
>>>> attempt_200907221734_0002_r_000006_0
>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>
>>>> map
>>>> attempt_200907221734_0002_m_000000_0
>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>
>>>> Perhaps there's something inconvenient in the way I create the
>>>> SequenceFile? I could share the Java code as well, if required.
>>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
While (still) experiencing performance issues and inspecting kMeans
code, I found this lying around SquaredEuclideanDistanceMeasure.java:

  public double distance(double centroidLengthSquare, Vector centroid,
Vector v) {
    if (centroid.size() != centroid.size()) {
      throw new CardinalityException();
    }
    ...
   }

I bet someone meant to compare centroid and v sizes and didn't notice.
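
Presumably the fix is just to compare against v's size. Assuming the rest
of the method is the usual |c|^2 - 2*c.v + |v|^2 expansion, and that
Vector's dot() and getLengthSquared() are available (I haven't traced the
full method, so take this as a sketch rather than the actual Mahout code),
it would read:

  public double distance(double centroidLengthSquare, Vector centroid,
Vector v) {
    // compare the centroid's cardinality against the point's,
    // not against itself
    if (centroid.size() != v.size()) {
      throw new CardinalityException();
    }
    // squared Euclidean distance, reusing the precomputed |centroid|^2
    return centroidLengthSquare - 2 * centroid.dot(v) + v.getLengthSquared();
  }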

On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nf...@gmail.com> wrote:
> Well, as it turned out, it didn't have anything to do with my
> performance issue, but I found out that writing a Cluster (with a
> single vector as its center) to a file and then reading it requires
> the center to be added as a point; otherwise, you won't be able to
> retrieve it as expected. Therefore, one should do:
>
> // Writing
> String id = "someID";
> Vector v = new SparseVector();
> Cluster c = new Cluster(v);
> c.addPoint(v);
> seqWriter.append(new Text(id), c);
>
> // Reading
> Writable key = (Writable) seqReader.getKeyClass().newInstance();
> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
> while (seqReader.next(key, value)) {
> ...
> Vector centroid = value.getCenter();
> ...
> }
>
> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
> this shouldn't happen. Then again, it's not that relevant, I guess.
>
> Sorry for bringing different subjects to the same thread.
>
> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com> wrote:
>> I've been using RandomSeedGenerator to generate initial clusters for
>> kMeans and while checking its code I stumbled upon this:
>>
>>      while (reader.next(key, value)) {
>>        Cluster newCluster = new Cluster(value);
>>        newCluster.addPoint(value);
>>        ....
>>      }
>>
>> I can see it adds the vector to the newly created cluster, even though
>> it is setting it as its center in the constructor. Wasn't this
>> corrected in a past revision? I thought this was not necessary
>> anymore. I'll look into it a little bit more and see if this has
>> something to do with my lack of performance with my dataset.
>>
>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com> wrote:
>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>
>>>>> I'll try that.
>>>
>>> There was no significant change while modifying the convergence value.
>>> At least, none was observed during the first three iterations, which
>>> lasted the same amount of time as before, more or less.
>>>
>>>>>> Is there any chance your data is publicly shareable?  Come to think of
>>>>>> it,
>>>>>> with the vector representations, as long as you don't publish the key
>>>>>> (which
>>>>>> terms map to which index), I would think most all data is publicly
>>>>>> shareable.
>>>>>
>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>
>>>> As in post a copy of the SequenceFile somewhere for download, assuming you
>>>> can.  Then others could presumably try it out.
>>>
>>> My bad. Of course it is:
>>>
>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>
>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>> SparseVector> logical format.
>>>
>>>>That does seem like an awfully long time for 62 MB on a 6 node cluster. How many iterations are running?
>>>
>>> I'm running the whole thing with a 20-iteration cap. Every iteration
>>> - EXCEPT the first one, which, oddly, lasted just two minutes - took
>>> around 3 hrs to complete:
>>>
>>> Hadoop job_200907221734_0001
>>> Finished in: 1mins, 42sec
>>>
>>> Hadoop job_200907221734_0004
>>> Finished in: 2hrs, 34mins, 3sec
>>>
>>> Hadoop job_200907221734_0005
>>> Finished in: 2hrs, 59mins, 34sec
>>>
>>>> How did you generate your initial clusters?
>>>
>>> I generate the initial clusters via the RandomSeedGenerator setting a
>>> 'k' value of 200.  This is what I did to initiate the process for the
>>> first time:
>>>
>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
>>> init -o output -r 32 -d 0.01 -k 200
>>>
>>>>Where are the iteration jobs spending most of their time (map vs. reduce)
>>>
>>> I'm tempted to say map here, but the time spent is actually rather
>>> comparable. Reduce attempts are taking an hour and a half (on average)
>>> to finish, and so are Map attempts. Here are some representative
>>> examples from the web UI:
>>>
>>> reduce
>>> attempt_200907221734_0002_r_000006_0
>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>
>>> map
>>> attempt_200907221734_0002_m_000000_0
>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>
>>> Perhaps there's something inconvenient in the way I create the
>>> SequenceFile? I could share the Java code as well, if required.
>>>
>>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Well, as it turned out, it didn't have anything to do with my
performance issue, but I found out that writing a Cluster (with a
single vector as its center) to a file and then reading it requires
the center to be added as a point; otherwise, you won't be able to
retrieve it as expected. Therefore, one should do:

// Writing
String id = "someID";
Vector v = new SparseVector();
Cluster c = new Cluster(v);
c.addPoint(v);
seqWriter.append(new Text(id), c);

// Reading
Writable key = (Writable) seqReader.getKeyClass().newInstance();
Cluster value = (Cluster) seqReader.getValueClass().newInstance();
while (seqReader.next(key, value)) {
...
Vector centroid = value.getCenter();
...
}

This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
this shouldn't happen. Then again, it's not that relevant, I guess.
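
In case it helps to reproduce this, the seqWriter/seqReader setup would be
something along these lines (a minimal sketch; the path here is only a
placeholder, not my actual one):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("clusters/part-00000");
SequenceFile.Writer seqWriter =
    SequenceFile.createWriter(fs, conf, path, Text.class, Cluster.class);
// ... append as shown above, then seqWriter.close() ...
SequenceFile.Reader seqReader = new SequenceFile.Reader(fs, path, conf);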

Sorry for bringing different subjects to the same thread.

On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nf...@gmail.com> wrote:
> I've been using RandomSeedGenerator to generate initial clusters for
> kMeans and while checking its code I stumbled upon this:
>
>      while (reader.next(key, value)) {
>        Cluster newCluster = new Cluster(value);
>        newCluster.addPoint(value);
>        ....
>      }
>
> I can see it adds the vector to the newly created cluster, even though
> it is setting it as its center in the constructor. Wasn't this
> corrected in a past revision? I thought this was not necessary
> anymore. I'll look into it a little bit more and see if this has
> something to do with my lack of performance with my dataset.
>
> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com> wrote:
>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>
>>>> I'll try that.
>>
>> There was no significant change while modifying the convergence value.
>> At least, none was observed during the first three iterations, which
>> lasted the same amount of time as before, more or less.
>>
>>>>> Is there any chance your data is publicly shareable?  Come to think of
>>>>> it,
>>>>> with the vector representations, as long as you don't publish the key
>>>>> (which
>>>>> terms map to which index), I would think most all data is publicly
>>>>> shareable.
>>>>
>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>> shareable? As in user-permissions to access/read/write the data?
>>>
>>> As in post a copy of the SequenceFile somewhere for download, assuming you
>>> can.  Then others could presumably try it out.
>>
>> My bad. Of course it is:
>>
>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>
>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>> SparseVector> logical format.
>>
>>>That does seem like an awfully long time for 62 MB on a 6 node cluster. How many iterations are running?
>>
>> I'm running the whole thing with a 20-iteration cap. Every iteration
>> - EXCEPT the first one, which, oddly, lasted just two minutes - took
>> around 3 hrs to complete:
>>
>> Hadoop job_200907221734_0001
>> Finished in: 1mins, 42sec
>>
>> Hadoop job_200907221734_0004
>> Finished in: 2hrs, 34mins, 3sec
>>
>> Hadoop job_200907221734_0005
>> Finished in: 2hrs, 59mins, 34sec
>>
>>> How did you generate your initial clusters?
>>
>> I generate the initial clusters via the RandomSeedGenerator setting a
>> 'k' value of 200.  This is what I did to initiate the process for the
>> first time:
>>
>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
>> init -o output -r 32 -d 0.01 -k 200
>>
>>>Where are the iteration jobs spending most of their time (map vs. reduce)
>>
>> I'm tempted to say map here, but the time spent is actually rather
>> comparable. Reduce attempts are taking an hour and a half (on average)
>> to finish, and so are Map attempts. Here are some representative
>> examples from the web UI:
>>
>> reduce
>> attempt_200907221734_0002_r_000006_0
>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>
>> map
>> attempt_200907221734_0002_m_000000_0
>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>
>> Perhaps there's something inconvenient in the way I create the
>> SequenceFile? I could share the Java code as well, if required.
>>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
I've been using RandomSeedGenerator to generate initial clusters for
kMeans and while checking its code I stumbled upon this:

      while (reader.next(key, value)) {
        Cluster newCluster = new Cluster(value);
        newCluster.addPoint(value);
        ....
      }

I can see it adds the vector to the newly created cluster, even though
it is setting it as its center in the constructor. Wasn't this
corrected in a past revision? I thought this was not necessary
anymore. I'll look into it a little bit more and see if this has
something to do with my lack of performance with my dataset.

On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nf...@gmail.com> wrote:
>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>
>>> I'll try that.
>
> There was no significant change while modifying the convergence value.
> At least, none was observed during the first three iterations, which
> lasted the same amount of time as before, more or less.
>
>>>> Is there any chance your data is publicly shareable?  Come to think of
>>>> it,
>>>> with the vector representations, as long as you don't publish the key
>>>> (which
>>>> terms map to which index), I would think most all data is publicly
>>>> shareable.
>>>
>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>> shareable? As in user-permissions to access/read/write the data?
>>
>> As in post a copy of the SequenceFile somewhere for download, assuming you
>> can.  Then others could presumably try it out.
>
> My bad. Of course it is:
>
> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>
> That's the ~62MB SequenceFile sample I've been using, in <Text,
> SparseVector> logical format.
>
>>That does seem like an awfully long time for 62 MB on a 6 node cluster. How many iterations are running?
>
> I'm running the whole thing with a 20-iteration cap. Every iteration
> - EXCEPT the first one, which, oddly, lasted just two minutes - took
> around 3 hrs to complete:
>
> Hadoop job_200907221734_0001
> Finished in: 1mins, 42sec
>
> Hadoop job_200907221734_0004
> Finished in: 2hrs, 34mins, 3sec
>
> Hadoop job_200907221734_0005
> Finished in: 2hrs, 59mins, 34sec
>
>> How did you generate your initial clusters?
>
> I generate the initial clusters via the RandomSeedGenerator setting a
> 'k' value of 200.  This is what I did to initiate the process for the
> first time:
>
> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
> ./bin/hadoop jar ~/mahout-core-0.2.jar
> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
> init -o output -r 32 -d 0.01 -k 200
>
>>Where are the iteration jobs spending most of their time (map vs. reduce)
>
> I'm tempted to say map here, but the time spent is actually rather
> comparable. Reduce attempts are taking an hour and a half (on average)
> to finish, and so are Map attempts. Here are some representative
> examples from the web UI:
>
> reduce
> attempt_200907221734_0002_r_000006_0
> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>
> map
> attempt_200907221734_0002_m_000000_0
> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>
> Perhaps there's something inconvenient in the way I create the
> SequenceFile? I could share the Java code as well, if required.
>

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 23, 2009, at 2:45 PM, nfantone wrote:
> Perhaps there's something inconvenient in the way I create the
> SequenceFile? I could share the Java code as well, if required.

How did you create your SeqFile?  From what I can tell from Ted, it is  
important to get the norms and distance measures lined up.

-Grant

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
>>> Perhaps a larger convergence value might help (-d, I believe).
>>
>> I'll try that.

There was no significant change while modifying the convergence value.
At least, none was observed during the first three iterations, which
lasted the same amount of time as before, more or less.

>>> Is there any chance your data is publicly shareable?  Come to think of
>>> it,
>>> with the vector representations, as long as you don't publish the key
>>> (which
>>> terms map to which index), I would think most all data is publicly
>>> shareable.
>>
>> I'm sorry, I don't quite understand what you're asking. Publicly
>> shareable? As in user-permissions to access/read/write the data?
>
> As in post a copy of the SequenceFile somewhere for download, assuming you
> can.  Then others could presumably try it out.

My bad. Of course it is:

http://cringer.3kh.net/web/user-dataset.data.tar.bz2

That's the ~62MB SequenceFile sample I've been using, in <Text,
SparseVector> logical format.

>That does seem like an awfully long time for 62 MB on a 6 node cluster. How many iterations are running?

I'm running the whole thing with a 20-iteration cap. Every iteration
- EXCEPT the first one, which, oddly, lasted just two minutes - took
around 3 hrs to complete:

Hadoop job_200907221734_0001
Finished in: 1mins, 42sec

Hadoop job_200907221734_0004
Finished in: 2hrs, 34mins, 3sec

Hadoop job_200907221734_0005
Finished in: 2hrs, 59mins, 34sec

> How did you generate your initial clusters?

I generate the initial clusters via the RandomSeedGenerator, setting a
'k' value of 200.  This is what I did to initiate the process for the
first time:

./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
./bin/hadoop jar ~/mahout-core-0.2.jar
org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
init -o output -r 32 -d 0.01 -k 200

>Where are the iteration jobs spending most of their time (map vs. reduce)

I'm tempted to say map here, but the time spent is actually rather
comparable. Reduce attempts are taking an hour and a half (on average)
to finish, and so are Map attempts. Here are some representative
examples from the web UI:

reduce
attempt_200907221734_0002_r_000006_0
22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)

map
attempt_200907221734_0002_m_000000_0
22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)

Perhaps there's something inconvenient in the way I create the
SequenceFile? I could share the Java code as well, if required.

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 23, 2009, at 10:20 AM, nfantone wrote:

>> That does seem like a long time.
>>
>> Is your data sparse or dense?
>
> I would say sparse. My vectors are high dimensional and most of their
> values are zero.
>
>> Perhaps a larger convergence value might help (-d, I believe).
>
> I'll try that.
>
>> Is there any chance your data is publicly shareable?  Come to think  
>> of it,
>> with the vector representations, as long as you don't publish the  
>> key (which
>> terms map to which index), I would think most all data is publicly
>> shareable.
>
> I'm sorry, I don't quite understand what you're asking. Publicly
> shareable? As in user-permissions to access/read/write the data?

As in post a copy of the SequenceFile somewhere for download, assuming  
you can.  Then others could presumably try it out.


>
>> Are you on trunk of Mahout?  I think we still need more profiling  
>> to get a
>> better idea of where improvements can be made.
>
> I am. Updated this morning.
>
> I still insist on the configuration issue, and have never considered
> Mahout's algorithm implementations to be the actual cause of poor
> performance. For now, I've been running kMeans exclusively. Perhaps I
> should try different clustering methods and see if they take a
> similar amount of time to complete.

Well, KMeans actually runs two algorithms normally: canopy and then  
KMeans.  You could try the random seed approach, which would skip the  
initial canopy run.

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
nfantone wrote:
>> That does seem like a long time.
>>
>> Is your data sparse or dense?
>>     
>
> I would say sparse. My vectors are high dimensional and most of their
> values are zero.
>
>   
>> Perhaps a larger convergence value might help (-d, I believe).
>>     
>
> I'll try that.
>
>   
>> Is there any chance your data is publicly shareable?  Come to think of it,
>> with the vector representations, as long as you don't publish the key (which
>> terms map to which index), I would think most all data is publicly
>> shareable.
>>     
>
> I'm sorry, I don't quite understand what you're asking. Publicly
> shareable? As in user-permissions to access/read/write the data?
>
>   
>> Are you on trunk of Mahout?  I think we still need more profiling to get a
>> better idea of where improvements can be made.
>>     
>
> I am. Updated this morning.
>
> I still insist on the configuration issue, and have never considered
> Mahout's algorithm implementations to be the actual cause of poor
> performance. For now, I've been running kMeans exclusively. Perhaps I
> should try different clustering methods and see if they take a
> similar amount of time to complete.
>
>
>   
That does seem like an awfully long time for 62 MB on a 6 node cluster. 
How many iterations are running? Were they capped at 32 or did it run 
longer? How did you generate your initial clusters? Where are the 
iteration jobs spending most of their time (map vs. reduce)? Could you 
share a copy of your data file so we can take a look at it? If it is 
just un-annotated vectors there should be no IP issues.

I've run KMeans over gigabytes of data on 10-node clusters and the jobs 
terminate in a few minutes. That is what I would expect from your job.

You could try Canopy on your data. This is a single-pass algorithm that 
should take approximately as long as one iteration of KMeans.

Jeff

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
> That does seem like a long time.
>
> Is your data sparse or dense?

I would say sparse. My vectors are high dimensional and most of their
values are zero.

> Perhaps a larger convergence value might help (-d, I believe).

I'll try that.

> Is there any chance your data is publicly shareable?  Come to think of it,
> with the vector representations, as long as you don't publish the key (which
> terms map to which index), I would think most all data is publicly
> shareable.

I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?

> Are you on trunk of Mahout?  I think we still need more profiling to get a
> better idea of where improvements can be made.

I am. Updated this morning.

I still insist on the configuration issue, and have never considered
Mahout's algorithm implementations to be the actual cause of poor
performance. For now, I've been running kMeans exclusively. Perhaps I
should try different clustering methods and see if they take a
similar amount of time to complete.

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 22, 2009, at 10:22 AM, nfantone wrote:

> After setting the cluster up with 6 computers (two of them being
> QuadCore and the others, DualCore, totaling 16 slave cores) and
> running a KMeansDriver job with 32 reduce tasks and ~80 map tasks
> spawned, it's STILL awfully slow.
>
> ./bin/hadoop jar ~/mahout-core-0.2.jar
> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
> init -o output -r 32 -d 0.001 -k 200
>
> Using a pretty small dataset of 62MB, it took more than a whole day to
> complete. Datanode and jobtracker logs don't show any visible
> errors, either. Would you mind sharing any piece of advice that could
> help me tune this thing up with my settings?
>

That does seem like a long time.

Is your data sparse or dense?

Perhaps a larger convergence value might help (-d, I believe).

Is there any chance your data is publicly shareable?  Come to think of  
it, with the vector representations, as long as you don't publish the  
key (which terms map to which index), I would think most all data is  
publicly shareable.

Are you on trunk of Mahout?  I think we still need more profiling to  
get a better idea of where improvements can be made.

-Grant

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
After setting the cluster up with 6 computers (two of them being
QuadCore and the others, DualCore, totaling 16 slave cores) and
running a KMeansDriver job with 32 reduce tasks and ~80 map tasks
spawned, it's STILL awfully slow.

./bin/hadoop jar ~/mahout-core-0.2.jar
org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
init -o output -r 32 -d 0.001 -k 200

Using a pretty small dataset of 62MB, it took more than a whole day to
complete. Datanode and jobtracker logs don't show any visible
errors, either. Would you mind sharing any piece of advice that could
help me tune this thing up with my settings?



On Tue, Jul 21, 2009 at 9:05 AM, nfantone<nf...@gmail.com> wrote:
> Problem solved: the IP for the troublesome machine wasn't present in
> the DNS. Thanks, anyways.
>
> On Mon, Jul 20, 2009 at 3:58 PM, nfantone<nf...@gmail.com> wrote:
>> Update: I tried running the cluster with two particular nodes, and I
>> got the same errors. So, I'm thinking maybe it has something to do
>> with the connection to that PC (hadoop-slave01, aka 'orco').
>>
>> Here's what the jobtracker log shows from the master:
>>
>> 2009-07-20 15:46:22,366 INFO org.apache.hadoop.mapred.JobInProgress:
>> Failed fetch notification #1 for task
>> attempt_200907201540_0001_m_000001_0
>> 2009-07-20 15:46:28,113 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error from attempt_200907201540_0001_r_000002_0: Shuffle Error:
>> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 2009-07-20 15:46:28,114 INFO org.apache.hadoop.mapred.JobTracker:
>> Adding task (cleanup)'attempt_200907201540_0001_r_000002_0' to tip
>> task_200907201540_0001_r_000002, for tracker
>> 'tracker_orco.3kh.net:localhost/127.0.0.1:59814'
>> 2009-07-20 15:46:31,116 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error from attempt_200907201540_0001_r_000000_0: Shuffle Error:
>> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>
>> Why does it show 'orco.3kh.net:localhost'? I know it's in /etc/hosts/,
>> but I didn't expect it to take into account any other lines apart from
>> the ones specifying IPs for masters and slaves. Is it attempting to
>> connect to itself and failing?
>>
>>
>> On Mon, Jul 20, 2009 at 1:30 PM, nfantone<nf...@gmail.com> wrote:
>>> Ok, here's my failure report:
>>>
>>> I can't get more than two nodes working in the cluster. With just a
>>> master and a slave, everything seems to go smoothly. However, if I add
>>> a third datanode (being the master itself, also a datanode) I keep
>>> getting this error while running the wordcount example, which I'm
>>> using to test the setup:
>>>
>>> 09/07/20 12:51:45 INFO mapred.JobClient:  map 100% reduce 17%
>>> 09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
>>> attempt_200907201251_0001_m_000004_0, Status : FAILED
>>> Too many fetch-failures
>>> 09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
>>> route to host
>>>
>>> While the mapping completes, the reduce task gets stuck at around 16%
>>> every time. I have googled the error message and read some responses
>>> from this list and other related forums, and it seems to be a firewall
>>> issue or something about ports not being opened; yet, this is not my
>>> case: firewall has been disabled on every node and connection between
>>> them (to and from) seems to be fine.
>>>
>>> Here's my /etc/hosts files for each node:
>>>
>>>  (master)
>>> 127.0.0.1       localhost
>>> 127.0.1.1       mauroN-Linux
>>> 192.168.200.20  hadoop-master
>>> 192.168.200.90  hadoop-slave00
>>> 192.168.200.162 hadoop-slave01
>>>
>>> (slave00)
>>> 127.0.0.1       localhost
>>> 127.0.1.1       tagore
>>> 192.168.200.20  hadoop-master
>>> 192.168.200.90  hadoop-slave00
>>> 192.168.200.162 hadoop-slave01
>>>
>>> (slave01)
>>> 127.0.0.1       localhost
>>> 127.0.1.1       orco.3kh.net orco localhost.localdomain
>>> 192.168.200.20  hadoop-master
>>> 192.168.200.90  hadoop-slave00
>>> 192.168.200.162 hadoop-slave01
>>>
>>> And .xml conf files, which are the same for each node (just relevant lines):
>>>
>>> (core-site.xml)
>>> <name>hadoop.tmp.dir</name>
>>> <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
>>>
>>> <name>fs.default.name</name>
>>> <value>hdfs://hadoop-master:54310/</value>
>>> <final>true</final>
>>>
>>> (mapred-site.xml)
>>> <name>mapred.job.tracker</name>
>>> <value>hdfs://hadoop-master:54311/</value>
>>> <final>true</final>
>>>
>>> <name>mapred.map.tasks</name>
>>> <value>31</value>
>>>
>>> <name>mapred.reduce.tasks</name>
>>> <value>6</value>
>>>
>>> (hdfs-site.xml)
>>> <name>dfs.replication</name>
>>> <value>3</value>
>>>
>>> I noticed that if I reduce the number of mapred.reduce.tasks to 2 or
>>> 3, the error does not pop up, but it takes quite a long time to finish
>>> (more than the time it takes for a single machine to finish it). I
>>> have blacklisted ipv6 and enabled ip_forward in every node (sudo echo
>>> 1 > /proc/sys/net/ipv4/ip_forward). Should anyone need some info from
>>> the datanodes logs, I could post it. I'm running out of ideas... and
>>> in need of enlightenment.
>>>
>>> On Thu, Jul 16, 2009 at 9:39 AM, nfantone<nf...@gmail.com> wrote:
>>>> I really appreciate all your suggestions, but from where I am and
>>>> considering the place I work at (a rather small office in Argentina)
>>>> these things aren't that affordable (monetarily and bureaucratically
>>>> speaking). That being said, I managed to get my hands on some more
>>>> equipment and I may be able to set up a small cluster of three or four
>>>> nodes - all running in a local network with Ubuntu. What I should
>>>> learn now is exactly how to configure all that is needed in order to
>>>> create it, as I have virtually no knowledge of, nor experience with,
>>>> this kind of task. Luckily, googling led me to some tutorials and
>>>> documentation on
>>>> the subject. I'll be following this guide for now:
>>>>
>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>>>
>>>> I'll let you know what comes out of this (surely, something on the messy side
>>>> of things). Any more suggestions/ideas are more than welcome. Many
>>>> thanks, again.
>>>>
>>>
>>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Problem solved: the IP for the troublesome machine wasn't present in
the DNS. Thanks, anyways.

On Mon, Jul 20, 2009 at 3:58 PM, nfantone<nf...@gmail.com> wrote:
> Update: I tried running the cluster with two particular nodes, and I
> got the same errors. So, I'm thinking maybe it has something to do
> with the connection to that PC (hadoop-slave01, aka 'orco').
>
> Here's what the jobtracker log shows from the master:
>
> 2009-07-20 15:46:22,366 INFO org.apache.hadoop.mapred.JobInProgress:
> Failed fetch notification #1 for task
> attempt_200907201540_0001_m_000001_0
> 2009-07-20 15:46:28,113 INFO org.apache.hadoop.mapred.TaskInProgress:
> Error from attempt_200907201540_0001_r_000002_0: Shuffle Error:
> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 2009-07-20 15:46:28,114 INFO org.apache.hadoop.mapred.JobTracker:
> Adding task (cleanup)'attempt_200907201540_0001_r_000002_0' to tip
> task_200907201540_0001_r_000002, for tracker
> 'tracker_orco.3kh.net:localhost/127.0.0.1:59814'
> 2009-07-20 15:46:31,116 INFO org.apache.hadoop.mapred.TaskInProgress:
> Error from attempt_200907201540_0001_r_000000_0: Shuffle Error:
> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>
> Why does it show 'orco.3kh.net:localhost'? I know it's in /etc/hosts/,
> but I didn't expect it to take into account any other lines apart from
> the ones specifying IPs for masters and slaves. Is it attempting to
> connect to itself and failing?
>
>
> On Mon, Jul 20, 2009 at 1:30 PM, nfantone<nf...@gmail.com> wrote:
>> Ok, here's my failure report:
>>
>> I can't get more than two nodes working in the cluster. With just a
>> master and a slave, everything seems to go smoothly. However, if I add
>> a third datanode (being the master itself, also a datanode) I keep
>> getting this error while running the wordcount example, which I'm
>> using to test the setup:
>>
>> 09/07/20 12:51:45 INFO mapred.JobClient:  map 100% reduce 17%
>> 09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
>> attempt_200907201251_0001_m_000004_0, Status : FAILED
>> Too many fetch-failures
>> 09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
>> route to host
>>
>> While the mapping completes, the reduce task gets stuck at around 16%
>> every time. I have googled the error message and read some responses
>> from this list and other related forums, and it seems to be a firewall
>> issue or something about ports not being opened; yet, this is not my
>> case: firewall has been disabled on every node and connection between
>> them (to and from) seems to be fine.
>>
>> Here are my /etc/hosts files for each node:
>>
>>  (master)
>> 127.0.0.1       localhost
>> 127.0.1.1       mauroN-Linux
>> 192.168.200.20  hadoop-master
>> 192.168.200.90  hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> (slave00)
>> 127.0.0.1       localhost
>> 127.0.1.1       tagore
>> 192.168.200.20  hadoop-master
>> 192.168.200.90  hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> (slave01)
>> 127.0.0.1       localhost
>> 127.0.1.1       orco.3kh.net orco localhost.localdomain
>> 192.168.200.20  hadoop-master
>> 192.168.200.90  hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> And .xml conf files, which are the same for each node (just relevant lines):
>>
>> (core-site.xml)
>> <name>hadoop.tmp.dir</name>
>> <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
>>
>> <name>fs.default.name</name>
>> <value>hdfs://hadoop-master:54310/</value>
>> <final>true</final>
>>
>> (mapred-site.xml)
>> <name>mapred.job.tracker</name>
>> <value>hdfs://hadoop-master:54311/</value>
>> <final>true</final>
>>
>> <name>mapred.map.tasks</name>
>> <value>31</value>
>>
>> <name>mapred.reduce.tasks</name>
>> <value>6</value>
>>
>> (hdfs-site.xml)
>> <name>dfs.replication</name>
>> <value>3</value>
>>
>> I noticed that if I reduce the number of mapred.reduce.tasks to 2 or
>> 3, the error does not pop up, but it takes quite a long time to finish
>> (more than the time it takes for a single machine to finish it). I
>> have blacklisted ipv6 and enabled ip_forward in every node (sudo echo
>> 1 > /proc/sys/net/ipv4/ip_forward). Should anyone need some info from
>> the datanodes logs, I could post it. I'm running out of ideas... and
>> in need of enlightenment.
>>
>> On Thu, Jul 16, 2009 at 9:39 AM, nfantone<nf...@gmail.com> wrote:
>>> I really appreciate all your suggestions, but from where I am and
>>> considering the place I work at (a rather small office in Argentina)
>>> these things aren't that affordable (monetarily and bureaucratically
>>> speaking). That being said, I managed to get my hands on some more
>>> equipment and I may be able to set up a small cluster of three or four
>>> nodes - all running in a local network with Ubuntu. What I should
>>> learn now is exactly how to configure all that is needed in order to
>>> create it, as I have virtually no idea of, nor experience in, this kind
>>> of task. Luckily, googling led me to some tutorials and documentation on
>>> the subject. I'll be following this guide for now:
>>>
>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>>
>>> I'll let you know what comes out of this (surely, something on the
>>> messy side of things). Any more suggestions/ideas are more than
>>> welcome. Many thanks again.
>>>
>>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Update: I tried running the cluster with two particular nodes, and I
got the same errors. So, I'm thinking maybe it has something to do
with the connection to that PC (hadoop-slave01, aka 'orco').

Here's what the jobtracker log shows from the master:

2009-07-20 15:46:22,366 INFO org.apache.hadoop.mapred.JobInProgress:
Failed fetch notification #1 for task
attempt_200907201540_0001_m_000001_0
2009-07-20 15:46:28,113 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from attempt_200907201540_0001_r_000002_0: Shuffle Error:
Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
2009-07-20 15:46:28,114 INFO org.apache.hadoop.mapred.JobTracker:
Adding task (cleanup)'attempt_200907201540_0001_r_000002_0' to tip
task_200907201540_0001_r_000002, for tracker
'tracker_orco.3kh.net:localhost/127.0.0.1:59814'
2009-07-20 15:46:31,116 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from attempt_200907201540_0001_r_000000_0: Shuffle Error:
Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Why does it show 'orco.3kh.net:localhost'? I know it's in /etc/hosts,
but I didn't expect it to take into account any other lines apart from
the ones specifying IPs for masters and slaves. Is it attempting to
connect to itself and failing?


On Mon, Jul 20, 2009 at 1:30 PM, nfantone<nf...@gmail.com> wrote:
> Ok, here's my failure report:
>
> I can't get more than two nodes working in the cluster. With just a
> master and a slave, everything seems to go smoothly. However, if I add
> a third datanode (being the master itself, also a datanode) I keep
> getting this error while running the wordcount example, which I'm
> using to test the setup:
>
> 09/07/20 12:51:45 INFO mapred.JobClient:  map 100% reduce 17%
> 09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
> attempt_200907201251_0001_m_000004_0, Status : FAILED
> Too many fetch-failures
> 09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
> route to host
>
> While the mapping completes, the reduce task gets stuck at around 16%
> every time. I have googled the error message and read some responses
> from this list and other related forums, and it seems to be a firewall
> issue or something about ports not being opened; yet, this is not my
> case: firewall has been disabled on every node and connection between
> them (to and from) seems to be fine.
>
> Here are my /etc/hosts files for each node:
>
>  (master)
> 127.0.0.1       localhost
> 127.0.1.1       mauroN-Linux
> 192.168.200.20  hadoop-master
> 192.168.200.90  hadoop-slave00
> 192.168.200.162 hadoop-slave01
>
> (slave00)
> 127.0.0.1       localhost
> 127.0.1.1       tagore
> 192.168.200.20  hadoop-master
> 192.168.200.90  hadoop-slave00
> 192.168.200.162 hadoop-slave01
>
> (slave01)
> 127.0.0.1       localhost
> 127.0.1.1       orco.3kh.net orco localhost.localdomain
> 192.168.200.20  hadoop-master
> 192.168.200.90  hadoop-slave00
> 192.168.200.162 hadoop-slave01
>
> And .xml conf files, which are the same for each node (just relevant lines):
>
> (core-site.xml)
> <name>hadoop.tmp.dir</name>
> <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
>
> <name>fs.default.name</name>
> <value>hdfs://hadoop-master:54310/</value>
> <final>true</final>
>
> (mapred-site.xml)
> <name>mapred.job.tracker</name>
> <value>hdfs://hadoop-master:54311/</value>
> <final>true</final>
>
> <name>mapred.map.tasks</name>
> <value>31</value>
>
> <name>mapred.reduce.tasks</name>
> <value>6</value>
>
> (hdfs-site.xml)
> <name>dfs.replication</name>
> <value>3</value>
>
> I noticed that if I reduce the number of mapred.reduce.tasks to 2 or
> 3, the error does not pop up, but it takes quite a long time to finish
> (more than the time it takes for a single machine to finish it). I
> have blacklisted ipv6 and enabled ip_forward in every node (sudo echo
> 1 > /proc/sys/net/ipv4/ip_forward). Should anyone need some info from
> the datanodes logs, I could post it. I'm running out of ideas... and
> in need of enlightenment.
>
> On Thu, Jul 16, 2009 at 9:39 AM, nfantone<nf...@gmail.com> wrote:
>> I really appreciate all your suggestions, but from where I am and
>> considering the place I work at (a rather small office in Argentina)
>> these things aren't that affordable (monetarily and bureaucratically
>> speaking). That being said, I managed to get my hands on some more
>> equipment and I may be able to set up a small cluster of three or four
>> nodes - all running in a local network with Ubuntu. What I should
>> learn now is exactly how to configure all that is needed in order to
>> create it, as I have virtually no idea of, nor experience in, this kind
>> of task. Luckily, googling led me to some tutorials and documentation on
>> the subject. I'll be following this guide for now:
>>
>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>
>> I'll let you know what comes out of this (surely, something on the
>> messy side of things). Any more suggestions/ideas are more than
>> welcome. Many thanks again.
>>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Ok, here's my failure report:

I can't get more than two nodes working in the cluster. With just a
master and a slave, everything seems to go smoothly. However, if I add
a third datanode (being the master itself, also a datanode) I keep
getting this error while running the wordcount example, which I'm
using to test the setup:

09/07/20 12:51:45 INFO mapred.JobClient:  map 100% reduce 17%
09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
attempt_200907201251_0001_m_000004_0, Status : FAILED
Too many fetch-failures
09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
route to host

While the mapping completes, the reduce task gets stuck at around 16%
every time. I have googled the error message and read some responses
from this list and other related forums, and it seems to be a firewall
issue or something about ports not being opened; yet, this is not my
case: firewall has been disabled on every node and connection between
them (to and from) seems to be fine.

Here are my /etc/hosts files for each node:

 (master)
127.0.0.1	localhost
127.0.1.1	mauroN-Linux
192.168.200.20  hadoop-master
192.168.200.90  hadoop-slave00
192.168.200.162 hadoop-slave01

(slave00)
127.0.0.1	localhost
127.0.1.1	tagore
192.168.200.20  hadoop-master
192.168.200.90  hadoop-slave00
192.168.200.162 hadoop-slave01

(slave01)
127.0.0.1	localhost
127.0.1.1	orco.3kh.net orco localhost.localdomain
192.168.200.20  hadoop-master
192.168.200.90  hadoop-slave00
192.168.200.162 hadoop-slave01

And .xml conf files, which are the same for each node (just relevant lines):

(core-site.xml)
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>

<name>fs.default.name</name>
<value>hdfs://hadoop-master:54310/</value>
<final>true</final>

(mapred-site.xml)
<name>mapred.job.tracker</name>
<value>hdfs://hadoop-master:54311/</value>
<final>true</final>

<name>mapred.map.tasks</name>
<value>31</value>

<name>mapred.reduce.tasks</name>
<value>6</value>

(hdfs-site.xml)
<name>dfs.replication</name>
<value>3</value>

I noticed that if I reduce the number of mapred.reduce.tasks to 2 or
3, the error does not pop up, but it takes quite a long time to finish
(more than the time it takes for a single machine to finish it). I
have blacklisted ipv6 and enabled ip_forward in every node (sudo echo
1 > /proc/sys/net/ipv4/ip_forward). Should anyone need some info from
the datanodes logs, I could post it. I'm running out of ideas... and
in need of enlightenment.
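
For what it's worth, mapred.map.tasks and mapred.reduce.tasks don't
have to live cluster-wide in mapred-site.xml: Hadoop also lets you set
them per job through the JobConf API. A minimal sketch (the class name
is a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class PerJobTaskSettings {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // mapred.map.tasks is only a hint; the real number of maps
    // is driven by the number of input splits
    conf.setNumMapTasks(31);
    // mapred.reduce.tasks set here applies to this job only and
    // overrides the cluster-wide default
    conf.setNumReduceTasks(6);
    System.out.println(conf.getNumMapTasks() + " maps (hint), "
        + conf.getNumReduceTasks() + " reduces");
  }
}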

On Thu, Jul 16, 2009 at 9:39 AM, nfantone<nf...@gmail.com> wrote:
> I really appreciate all your suggestions, but from where I am and
> considering the place I work at (a rather small office in Argentina)
> these things aren't that affordable (monetarily and bureaucratically
> speaking). That being said, I managed to get my hands on some more
> equipment and I may be able to set up a small cluster of three or four
> nodes - all running in a local network with Ubuntu. What I should
> learn now is exactly how to configure all that is needed in order to
> create it, as I have virtually no idea of, nor experience in, this kind
> of task. Luckily, googling led me to some tutorials and documentation on
> the subject. I'll be following this guide for now:
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>
> I'll let you know what comes out of this (surely, something on the
> messy side of things). Any more suggestions/ideas are more than
> welcome. Many thanks again.
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
I really appreciate all your suggestions, but from where I am and
considering the place I work at (a rather small office in Argentina)
these things aren't that affordable (monetarily and bureaucratically
speaking). That being said, I managed to get my hands on some more
equipment and I may be able to set up a small cluster of three or four
nodes - all running in a local network with Ubuntu. What I should
learn now is exactly how to configure all that is needed in order to
create it, as I have virtually no idea of, nor experience in, this kind
of task. Luckily, googling led me to some tutorials and documentation on
the subject. I'll be following this guide for now:

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

I'll let you know what comes out of this (surely, something on the
messy side of things). Any more suggestions/ideas are more than
welcome. Many thanks again.

Re: Clustering from DB

Posted by Ted Dunning <te...@gmail.com>.
I have always had better luck using a standard AMI and injecting a startup
script that handles all of the software installs.  It takes 20-30 seconds to
boot Ubuntu and 20-30 seconds more to install Java, Hadoop, application
software and so on.  I use the AMIs from alestic.com.

On Wed, Jul 15, 2009 at 2:46 PM, zaki rahaman <za...@gmail.com>wrote:

> I'd be more
> than happy to write a script to run a Job or work on a mahout AMI config.
>

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
Very cool!  Would love to hear more if you can share.  Getting use  
cases and powered-by info out to the public is one of the key things  
we can do to drive adoption and increase Mahout's capabilities.


On Jul 15, 2009, at 5:46 PM, zaki rahaman wrote:

> I'm still prototyping something to make sure it works before I start  
> working
> on rolling it out for a large (~500GB) backlog of server data that I  
> want to
> work with. As such, I haven't looked seriously into using EC2 until  
> the test
> runs work well, but plan on doing so in the next couple days. I'd be  
> more
> than happy to write a script to run a Job or work on a mahout AMI  
> config.
>
> On Wed, Jul 15, 2009 at 5:40 PM, Grant Ingersoll  
> <gs...@apache.org>wrote:
>
>>
>> On Jul 15, 2009, at 5:25 PM, zaki rahaman wrote:
>>
>> I hope I'm understanding your setup correctly but by running on one
>>> machine,
>>> you're not fully exploiting the capabilities of Hadoop's Map/ 
>>> Reduce. Gains
>>> in computation time will only be seen by increasing the number of  
>>> cores or
>>> nodes.
>>>
>>
>> Yep.
>>
>> If you need access to more computing power, you might want to
>>> consider using Amazon's EC2 (they have preconfigured AMIs for  
>>> Hadoop but
>>> you'd have to configure and install Mahout, a process which I'm not  
>>> totally
>>> familiar with as of yet as I'm still trying to do it myself).
>>>
>>
>> Please add to http://cwiki.apache.org/MAHOUT/mahoutec2.html if you  
>> can.
>> Given a Hadoop AMI, it shouldn't be all that hard to set up a Job, I
>> wouldn't think.  Would be good to have a script that does it, though.
>>
>> -Grant
>>
>
>
>
> -- 
> Zaki Rahaman

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by zaki rahaman <za...@gmail.com>.
I'm still prototyping something to make sure it works before I start working
on rolling it out for a large (~500GB) backlog of server data that I want to
work with. As such, I haven't looked seriously into using EC2 until the test
runs work well, but plan on doing so in the next couple days. I'd be more
than happy to write a script to run a Job or work on a mahout AMI config.

On Wed, Jul 15, 2009 at 5:40 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Jul 15, 2009, at 5:25 PM, zaki rahaman wrote:
>
>  I hope I'm understanding your setup correctly but by running on one
>> machine,
>> you're not fully exploiting the capabilities of Hadoop's Map/Reduce. Gains
>> in computation time will only be seen by increasing the number of cores or
>> nodes.
>>
>
> Yep.
>
>  If you need access to more computing power, you might want to
>> consider using Amazon's EC2 (they have preconfigured AMIs for Hadoop but
>> you'd have to configure and install Mahout, a process which I'm not totally
>> familiar with as of yet as I'm still trying to do it myself).
>>
>
> Please add to http://cwiki.apache.org/MAHOUT/mahoutec2.html if you can.
>  Given a Hadoop AMI, it shouldn't be all that hard to set up a Job, I
> wouldn't think.  Would be good to have a script that does it, though.
>
> -Grant
>



-- 
Zaki Rahaman

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 15, 2009, at 5:25 PM, zaki rahaman wrote:

> I hope I'm understanding your setup correctly but by running on one  
> machine,
> you're not fully exploiting the capabilities of Hadoop's Map/Reduce.  
> Gains
> in computation time will only be seen by increasing the number of  
> cores or
> nodes.

Yep.

> If you need access to more computing power, you might want to
> consider using Amazon's EC2 (they have preconfigured AMIs for Hadoop  
> but
> you'd have to configure and install Mahout, a process which I'm not  
> totally
> familiar with as of yet as I'm still trying to do it myself).

Please add to http://cwiki.apache.org/MAHOUT/mahoutec2.html if you  
can.  Given a Hadoop AMI, it shouldn't be all that hard to set up a  
Job, I wouldn't think.  Would be good to have a script that does it,  
though.

-Grant

Re: Clustering from DB

Posted by zaki rahaman <za...@gmail.com>.
I hope I'm understanding your setup correctly but by running on one machine,
you're not fully exploiting the capabilities of Hadoop's Map/Reduce. Gains
in computation time will only be seen by increasing the number of cores or
nodes. If you need access to more computing power, you might want to
consider using Amazon's EC2 (they have preconfigured AMIs for Hadoop but
you'd have to configure and install Mahout, a process which I'm not totally
familiar with as of yet as I'm still trying to do it myself).

On Wed, Jul 15, 2009 at 4:24 PM, nfantone <nf...@gmail.com> wrote:

> Well, I grew tired of watching the whole thing run and stopped it. I,
> then, started another test, this time around using a smaller dataset
> of 3Gb and it is still taking way too long.
> See inline comments.
>
> > You are only specifying a single reducer. Try increasing that as below.
>
> I did. I set it to my K value (200).
>
> > No, number of nodes is the number of nodes (computers) in your cluster.
> You
> > did not say how many nodes you are running on.
>
> I'm running and compiling the application on one simple desktop
> computer at work, and that isn't likely to change after the
> development process is finished.
>
> > Yes, Hadoop allocates this automatically. How many map tasks are being
> > spawned?
>
> Being uncertain of where (and when) Hadoop computes the adequate
> number of map tasks, what I did was inspect the following 'numMaps'
> variable while debugging:
>
>   if (name.startsWith("part") && !name.endsWith(".crc")) {
>     SequenceFile.Reader reader =
>         new SequenceFile.Reader(fs, part.getPath(), conf);
>     int numMaps = conf.getNumMapTasks();
>     ... }
>
> which is at the beginning of the isConverged() method. Its current
> value is, in every iteration, 2. I suspect this isn't right at all,
> either because this is not the proper place to ask for the number of
> maps or because it's not being set the way it should be.
> From the Hadoop Javadoc:
>
> "The number of maps is usually driven by the total size of the inputs
> i.e. total number of blocks of the input files."
>
> In my input file each block represents a vector that corresponds to
> some computed user behavior. The number of users to be clustered (i.e.
> the number of blocks) is a parameter expected by the application.
> Perhaps, I should -somehow- change the block size of my HDFS file? Or
> tweak something in the Configuration/FileSystem instance I'm using to
> write it?
>
> >> For now, the clustering is STILL running in the background, ha.
> >>
> >> On Wed, Jul 15, 2009 at 12:30 PM, Jeff
> >> Eastman<jd...@windwardsolutions.com> wrote:
> >>
> >>>
> >>> Glad to hear KMeans is working reliably now. Your performance problems
> >>> will
> >>> require some additional tuning. Here are some suggestions:
> >>> - You did not mention how many mappers are running in your job. With
> 60gb
> >>> in
> >>> a single input file, I would think Hadoop would allocate multiple
> mapper
> >>> tasks automatically, since there are thousands of potential splits. If
> >>> this
> >>> is not happening (is the file compressed?), then breaking it into
> >>> multiple
> >>> parts in a preprocessing step would allow you to get more concurrency
> in
> >>> the
> >>> map phase.
> >>> - Same with the reducers; how many are you running and what is your K?
> >>> The
> >>> default number of reducers is 2, but you can increase this up to the
> >>> number
> >>> of clusters to increase parallelism. Unlike Canopy and Mean Shift,
> KMeans
> >>> can use multiple reducers up to that limit.
> >>> - Finally, what is the size of your cluster? Adding machines would be
> >>> another way to increase concurrency, since map and reduce tasks are
> >>> spread
> >>> across the entire cluster.
> >>>
> >>> 60 gb is a small dataset for Hadoop. I don't think it should be taking
> >>> that
> >>> long.
> >>> Jeff
> >>>
> >>> nfantone wrote:
> >>>
> >>>>
> >>>> After updating to the latest revision, everything seems to be working
> >>>> just fine. However, the task I set up to do, user clustering by
> >>>> KMeans, is taking forever to complete: I initiated the job yesterday
> >>>> morning and it's still running today (an elapsed time of nearly 18hs
> >>>> and counting...). Of course, the main reason behind it is the huge
> >>>> size of the data set I'm trying to process (a ~60Gb HDFS file), but
> >>>> I'm looking for ways to improve the performance. Would splitting the
> >>>> input file into smaller parts make any difference? Is it even possible
> >>>> to set the Driver in order to use more than one input (right now, I'm
> >>>> specifying a full path to a single file, including its filename)? What
> >>>> about setting a higher number of reducers? Are there any drawbacks to
> >>>> that? Running multiple KMeans jobs in several threads?
> >>>>
> >>>> Or perhaps, I'm just doing something wrong and should not be taking
> >>>> this long. Surely, I'm not the first one to encounter this running
> >>>> time issue with large datasets. Ideas, anyone?
> >>>>
> >>>>
> >>>> On Mon, Jul 13, 2009 at 2:39 PM, nfantone<nf...@gmail.com> wrote:
> >>>>
> >>>>
> >>>>>
> >>>>> Great work. It works like a charm now. Thank you very much.
> >>>>>
> >>>>> On Mon, Jul 13, 2009 at 1:41 PM, Jeff
> >>>>> Eastman<jd...@windwardsolutions.com>
> >>>>> wrote:
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> r793620 fixes the KMeansDriver.isConverged() method to iterate over
> >>>>>> all
> >>>>>> cluster part files. Unit test now runs without error and the
> synthetic
> >>>>>> control job completes too.
> >>>>>>
> >>>>>>
> >>>>>> Jeff Eastman wrote:
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> In this case, the code should be reading all of the clusters into
> >>>>>>> memory
> >>>>>>> to see if they have all converged. These may be split into multiple
> >>>>>>> part
> >>>>>>> files if more than one reducer is specified. So /* is the correct
> >>>>>>> file
> >>>>>>> pattern and it is the calling site that should remove the
> /part-0000
> >>>>>>> reference. The code in isConverged should loop through all the
> parts,
> >>>>>>> returning if they have all converged or not.
> >>>>>>>
> >>>>>>> I'll take a detailed look tomorrow.
> >>>>>>>
> >>>>>>>
> >>>>>>> Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Hmm, that might be a mistake on my part when trying to resolve how
> >>>>>>>> Hadoop
> >>>>>>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where
> >>>>>>>> needed, but
> >>>>>>>> I think it is likely worth revisiting here where a specific file
> is
> >>>>>>>> needed?
> >>>>>>>>
> >>>>>>>> -Grant
> >>>>>>>>
> >>>>>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This error is still bugging me. The exception:
> >>>>>>>>>
> >>>>>>>>> WARNING: java.io.FileNotFoundException: File
> >>>>>>>>> output/clusters-0/part-00000/* does not exist.
> >>>>>>>>> java.io.FileNotFoundException: File
> output/clusters-0/part-00000/*
> >>>>>>>>> does not exist.
> >>>>>>>>>
> >>>>>>>>> occurs first at:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
> >>>>>>>>>
> >>>>>>>>> which corresponds to:
> >>>>>>>>>
> >>>>>>>>>  private static boolean isConverged(String filePath, JobConf
> conf,
> >>>>>>>>> FileSystem fs)
> >>>>>>>>>   throws IOException {
> >>>>>>>>>  Path outPart = new Path(filePath + "/*");
> >>>>>>>>>  SequenceFile.Reader reader = new SequenceFile.Reader(fs,
> outPart,
> >>>>>>>>> conf);  <-- THIS
> >>>>>>>>>  ...
> >>>>>>>>>  }
> >>>>>>>>>
> >>>>>>>>> where isConverged() is called in this fashion:
> >>>>>>>>>
> >>>>>>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
> >>>>>>>>>
> >>>>>>>>> by runIteration(), which is previously invoked by runJob() like:
> >>>>>>>>>
> >>>>>>>>>  String clustersOut = output + "/clusters-" + iteration;
> >>>>>>>>>   converged = runIteration(input, clustersIn, clustersOut,
> >>>>>>>>> measureClass,
> >>>>>>>>>       delta, numReduceTasks, iteration);
> >>>>>>>>>
> >>>>>>>>> Consequently, assuming it's the first iteration and the output
> >>>>>>>>> folder
> >>>>>>>>> has been named "output" by the user, the SequenceFile.Reader
> >>>>>>>>> receives
> >>>>>>>>> "output/clusters-0/part-00000/*" as a path, which is
> non-existent.
> >>>>>>>>> I
> >>>>>>>>> believe the path should end in "part-00000" and the  + "/*"
> should
> >>>>>>>>> be
> >>>>>>>>> removed... although someone, evidently, thought otherwise.
> >>>>>>>>>
> >>>>>>>>> Any feedback?
> >>>>>>>>>
> >>>>>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com>
> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I was using Canopy to create input clusters, but the error
> >>>>>>>>>> appeared
> >>>>>>>>>> while running kMeans (if I run kMeans' job only with previously
> >>>>>>>>>> created clusters from Canopy placed in output/canopies as
> initial
> >>>>>>>>>> clusters, it still fails). I noticed no other problems. I was
> >>>>>>>>>> using
> >>>>>>>>>> revision 790979 before updating.  Strangely, there were no
> changes
> >>>>>>>>>> in
> >>>>>>>>>> the job and drivers class from that revision. svn diff shows
> that
> >>>>>>>>>> the
> >>>>>>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
> >>>>>>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
> >>>>>>>>>> Eastman<jd...@windwardsolutions.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hum, no, it's looking for the output of the first iteration.
> Were
> >>>>>>>>>>> there
> >>>>>>>>>>> other errors? What was the last revision you were running? It
> >>>>>>>>>>> does
> >>>>>>>>>>> look like
> >>>>>>>>>>> something got horked, as it should be looking for
> >>>>>>>>>>> output/clusters-0/*.
> >>>>>>>>>>> Can
> >>>>>>>>>>> you diff the job and driver class to see what changed?
> >>>>>>>>>>>
> >>>>>>>>>>> Jeff
> >>>>>>>>>>>
> >>>>>>>>>>> nfantone wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Fellows, today I updated to revision 791558 and while running
> >>>>>>>>>>>> kMeans
> >>>>>>>>>>>> I
> >>>>>>>>>>>> got the following exception:
> >>>>>>>>>>>>
> >>>>>>>>>>>> WARNING: java.io.FileNotFoundException: File
> >>>>>>>>>>>> output/clusters-0/part-00000/* does not exist.
> >>>>>>>>>>>> java.io.FileNotFoundException: File
> >>>>>>>>>>>> output/clusters-0/part-00000/*
> >>>>>>>>>>>> does not exist.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The algorithm isn't interrupted, though. But this exception
> >>>>>>>>>>>> wasn't
> >>>>>>>>>>>> thrown before the update and, to me, its message is not quite
> >>>>>>>>>>>> clear.
> >>>>>>>>>>>> It seems as if it's looking for any file inside a "part-00000"
> >>>>>>>>>>>> directory,
> >>>>>>>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are
> >>>>>>>>>>>> default
> >>>>>>>>>>>> names for output files.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the feedback, Jeff.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>> sequence
> >>>>>>>>>>>>>> file format, but the Key is never used. To my knowledge,
> there
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>> no
> >>>>>>>>>>>>>> requirement to assign identifiers to the input points*.
> Users
> >>>>>>>>>>>>>> are
> >>>>>>>>>>>>>> free
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>> associate an arbitrary name field with each vector - also
> >>>>>>>>>>>>>> label
> >>>>>>>>>>>>>> mappings
> >>>>>>>>>>>>>> may
> >>>>>>>>>>>>>> be assigned - but these are not manipulated by KMeans or any
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> other
> >>>>>>>>>>>>>> clustering applications. The name field is now used as a
> >>>>>>>>>>>>>> vector
> >>>>>>>>>>>>>> identifier
> >>>>>>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the
> output
> >>>>>>>>>>>>>> step
> >>>>>>>>>>>>>> only.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The key may not be used internally, but externally they can
> >>>>>>>>>>>>> prove
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
> >>>>>>>>>>>>> represents
> >>>>>>>>>>>>> his/her historical behavior. Being able to collect the output
> >>>>>>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows
> >>>>>>>>>>>>> me
> >>>>>>>>>>>>> to,
> >>>>>>>>>>>>> for instance, retrieve user information using data directly
> >>>>>>>>>>>>> from
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>> HDFS file's field.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>> --------------------------
> >>>>>>>> Grant Ingersoll
> >>>>>>>> http://www.lucidimagination.com/
> >>>>>>>>
> >>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >>>>>>>> using
> >>>>>>>> Solr/Lucene:
> >>>>>>>> http://www.lucidimagination.com/search
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
>



-- 
Zaki Rahaman

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Um... Here I am bringing news that is somewhat inconsistent with your
suggestion: CanopyDriver runs its job just fine with the very same
dataset. It sure takes a while, but it finished in an acceptable time.
Unless the convergence conditions for the algorithms are radically
different, I'd say there's something odd going on. Of course, I'll
take into consideration what you mentioned about adding nodes to my
cluster, although it doesn't depend entirely on me.

On Wed, Jul 15, 2009 at 5:55 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> nfantone wrote:
>>
>> Well, I grew tired of watching the whole thing run and stopped it. I,
>> then, started another test, this time around using a smaller dataset
>> of 3Gb and it is still taking way too long.
>> See inline comments.
>>
>>
>>>
>>> You are only specifying a single reducer. Try increasing that as below.
>>>
>>
>> I did. I set it to my K value (200).
>>
>
> Way too big given your single node operation. See below.
>>
>>
>>>
>>> No, number of nodes is the number of nodes (computers) in your cluster.
>>> You
>>> did not say how many nodes you are running on.
>>>
>>
>> I'm running and compiling the application on one simple desktop
>> computer at work, and that isn't likely to change after the
>> development process is finished.
>>
>>
>
> This is the root of your problem: You only have a single node in your
> cluster. Running Hadoop in this configuration is possible, but it will be
> much slower than if you had more machines. Perhaps you can get some interest
> from some of your other colleagues in donating some storage and cycles on
> their machines to your effort. When I was at CollabNet, I got a dozen
> developers' machines running in a cluster so I could test out the early
> clustering stuff. These machines typically had gigs of free storage and were
> not heavily utilized in CPU capacity, so nobody ever noticed I was running
> jobs on them at all.
>
> Alternatively, for a couple of dollars on AWS you can run the job on a
> cluster of your own. For your job I would expect the cost to be literally in
> the couple of dollars range.
>
> You will find KMeans will scale almost linearly with the number of boxes you
> throw at it.
>

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
nfantone wrote:
> Well, I grew tired of watching the whole thing run and stopped it. I,
> then, started another test, this time around using a smaller dataset
> of 3Gb and it is still taking way too long.
> See inline comments.
>
>   
>> You are only specifying a single reducer. Try increasing that as below.
>>     
>
> I did. I set it to my K value (200).
>   
Way too big given your single node operation. See below.
>   
>> No, number of nodes is the number of nodes (computers) in your cluster. You
>> did not say how many nodes you are running on.
>>     
>
> I'm running and compiling the application on one simple desktop
> computer at work, and that isn't likely to change after the
> development process is finished.
>
>   
This is the root of your problem: You only have a single node in your 
cluster. Running Hadoop in this configuration is possible, but it will 
be much slower than if you had more machines. Perhaps you can get some 
interest from some of your other colleagues in donating some storage and 
cycles on their machines to your effort. When I was at CollabNet, I got 
a dozen developers' machines running in a cluster so I could test out 
the early clustering stuff. These machines typically had gigs of free 
storage and were not heavily utilized in CPU capacity, so nobody ever 
noticed I was running jobs on them at all.

Alternatively, for a couple of dollars on AWS you can run the job on a 
cluster of your own. For your job I would expect the cost to be 
literally in the couple of dollars range.

You will find KMeans will scale almost linearly with the number of boxes 
you throw at it.

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Well, I grew tired of watching the whole thing run and stopped it. I,
then, started another test, this time around using a smaller dataset
of 3Gb and it is still taking way too long.
See inline comments.

> You are only specifying a single reducer. Try increasing that as below.

I did. I set it to my K value (200).

> No, number of nodes is the number of nodes (computers) in your cluster. You
> did not say how many nodes you are running on.

I'm running and compiling the application on one simple desktop
computer at work, and that isn't likely to change after the
development process is finished.

> Yes, Hadoop allocates this automatically. How many map tasks are being
> spawned?

Being uncertain of where (and when) Hadoop computes the adequate
number of map tasks, what I did was inspect the following 'numMaps'
variable while debugging:

   if (name.startsWith("part") && !name.endsWith(".crc")) {
     SequenceFile.Reader reader =
         new SequenceFile.Reader(fs, part.getPath(), conf);
     int numMaps = conf.getNumMapTasks();
     ... }

which is at the beginning of the isConverged() method. Its current
value is, in every iteration, 2. I suspect this isn't right at all,
either because this is not the proper place to ask for the number of
maps or because it's not being set the way it should be.
From the Hadoop Javadoc:

"The number of maps is usually driven by the total size of the inputs
i.e. total number of blocks of the input files."

In my input file each block represents a vector that corresponds to
some computed user behavior. The number of users to be clustered (i.e.
the number of blocks) is a parameter expected by the application.
Perhaps, I should -somehow- change the block size of my HDFS file? Or
tweak something in the Configuration/FileSystem instance I'm using to
write it?
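
If it comes to that, a smaller block size can be requested through the
Configuration used to write the file. A minimal sketch, assuming the
records are written as <Text, SparseVector> pairs (the 16 MB figure,
key type and class name are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;

public class SmallBlockWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // ask for 16 MB blocks instead of the 64 MB default, so a 3 GB
    // file would span ~192 blocks (and hence ~192 potential splits)
    conf.setLong("dfs.block.size", 16 * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path("input/user.data"), Text.class, SparseVector.class);
    // ... write the <userID, vector> records here ...
    writer.close();
  }
}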

>> For now, the clustering is STILL running in the background, ha.
>>
>> On Wed, Jul 15, 2009 at 12:30 PM, Jeff
>> Eastman<jd...@windwardsolutions.com> wrote:
>>
>>>
>>> Glad to hear KMeans is working reliably now. Your performance problems
>>> will
>>> require some additional tuning. Here are some suggestions:
>>> - You did not mention how many mappers are running in your job. With 60gb
>>> in
>>> a single input file, I would think Hadoop would allocate multiple mapper
>>> tasks automatically, since there are thousands of potential splits. If
>>> this
>>> is not happening (is the file compressed?), then breaking it into
>>> multiple
>>> parts in a preprocessing step would allow you to get more concurrency in
>>> the
>>> map phase.
>>> - Same with the reducers; how many are you running and what is your K?
>>> The
>>> default number of reducers is 2, but you can increase this up to the
>>> number
>>> of clusters to increase parallelism. Unlike Canopy and Mean Shift, KMeans
>>> can use multiple reducers up to that limit.
>>> - Finally, what is the size of your cluster? Adding machines would be
>>> another way to increase concurrency, since map and reduce tasks are
>>> spread
>>> across the entire cluster.
>>>
>>> 60 gb is a small dataset for Hadoop. I don't think it should be taking
>>> that
>>> long.
>>> Jeff
>>>
>>> nfantone wrote:
>>>
>>>>
>>>> After updating to the latest revision, everything seems to be working
>>>> just fine. However, the task I set up to do, user clustering by
>>>> KMeans, is taking forever to complete: I initiated the job yesterday
>>>> morning and it's still running today (an elapsed time of nearly 18hs
>>>> and counting...). Of course, the main reason behind it is the huge
>>>> size of the data set I'm trying to process (a ~60Gb HDFS file), but
>>>> I'm looking for ways to improve the performance. Would splitting the
>>>> input file into smaller parts make any difference? Is it even possible
>>>> to set the Driver in order to use more than one input (right now, I'm
>>>> specifying a full path to a single file, including its filename)? What
>>>> about setting a higher number of reducers? Are there any drawbacks to
>>>> that? Running multiple KMeans jobs in several threads?
>>>>
>>>> Or perhaps, I'm just doing something wrong and should not be taking
>>>> this long. Surely, I'm not the first one to encounter this running
>>>> time issue with large datasets. Ideas, anyone?
>>>>
>>>>
>>>> On Mon, Jul 13, 2009 at 2:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>
>>>>
>>>>>
>>>>> Great work. It works like a charm now. Thank you very much.
>>>>>
>>>>> On Mon, Jul 13, 2009 at 1:41 PM, Jeff
>>>>> Eastman<jd...@windwardsolutions.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> r793620 fixes the KMeansDriver.isConverged() method to iterate over
>>>>>> all
>>>>>> cluster part files. Unit test now runs without error and the synthetic
>>>>>> control job completes too.
>>>>>>
>>>>>>
>>>>>> Jeff Eastman wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> In this case, the code should be reading all of the clusters into
>>>>>>> memory
>>>>>>> to see if they have all converged. These may be split into multiple
>>>>>>> part
>>>>>>> files if more than one reducer is specified. So /* is the correct
>>>>>>> file
>>>>>>> pattern and it is the calling site that should remove the /part-0000
>>>>>>> reference. The code in isConverged should loop through all the parts,
>>>>>>> returning if they have all converged or not.
>>>>>>>
>>>>>>> I'll take a detailed look tomorrow.
>>>>>>>
>>>>>>>
>>>>>>> Grant Ingersoll wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Hmm, that might be a mistake on my part when trying to resolve how
>>>>>>>> Hadoop
>>>>>>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where
>>>>>>>> needed, but
>>>>>>>> I think it is likely worth revisiting here where a specific file is
>>>>>>>> needed?
>>>>>>>>
>>>>>>>> -Grant
>>>>>>>>
>>>>>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This error is still bugging me. The exception:
>>>>>>>>>
>>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>>>> does not exist.
>>>>>>>>>
>>>>>>>>> occurs first at:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>>>>>>
>>>>>>>>> which corresponds to:
>>>>>>>>>
>>>>>>>>>  private static boolean isConverged(String filePath, JobConf conf,
>>>>>>>>> FileSystem fs)
>>>>>>>>>   throws IOException {
>>>>>>>>>  Path outPart = new Path(filePath + "/*");
>>>>>>>>>  SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>>>>>>> conf);  <-- THIS
>>>>>>>>>  ...
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> where isConverged() is called in this fashion:
>>>>>>>>>
>>>>>>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>>>>>>
>>>>>>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>>>>>>
>>>>>>>>>  String clustersOut = output + "/clusters-" + iteration;
>>>>>>>>>   converged = runIteration(input, clustersIn, clustersOut,
>>>>>>>>> measureClass,
>>>>>>>>>       delta, numReduceTasks, iteration);
>>>>>>>>>
>>>>>>>>> Consequently, assuming it's the first iteration and the output
>>>>>>>>> folder
>>>>>>>>> has been named "output" by the user, the SequenceFile.Reader
>>>>>>>>> receives
>>>>>>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent.
>>>>>>>>> I
>>>>>>>>> believe the path should end in "part-00000" and the  + "/*" should
>>>>>>>>> be
>>>>>>>>> removed... although someone, evidently, thought otherwise.
>>>>>>>>>
>>>>>>>>> Any feedback?
>>>>>>>>>
>>>>>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I was using Canopy to create input clusters, but the error
>>>>>>>>>> appeared
>>>>>>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>>>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>>>>>>> clusters, it still fails). I noticed no other problems. I was
>>>>>>>>>> using
>>>>>>>>>> revision 790979 before updating.  Strangely, there were no changes
>>>>>>>>>> in
>>>>>>>>>> the job and drivers class from that revision. svn diff shows that
>>>>>>>>>> the
>>>>>>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>>>>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
>>>>>>>>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hum, no, it's looking for the output of the first iteration. Were
>>>>>>>>>>> there
>>>>>>>>>>> other errors? What was the last revision you were running? It
>>>>>>>>>>> does
>>>>>>>>>>> look like
>>>>>>>>>>> something got horked, as it should be looking for
>>>>>>>>>>> output/clusters-0/*.
>>>>>>>>>>> Can
>>>>>>>>>>> you diff the job and driver class to see what changed?
>>>>>>>>>>>
>>>>>>>>>>> Jeff
>>>>>>>>>>>
>>>>>>>>>>> nfantone wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Fellows, today I updated to revision 791558 and while running
>>>>>>>>>>>> kMeans
>>>>>>>>>>>> I
>>>>>>>>>>>> got the following exception:
>>>>>>>>>>>>
>>>>>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>>>>>> java.io.FileNotFoundException: File
>>>>>>>>>>>> output/clusters-0/part-00000/*
>>>>>>>>>>>> does not exist.
>>>>>>>>>>>>
>>>>>>>>>>>> The algorithm isn't interrupted, though. But this exception
>>>>>>>>>>>> wasn't
>>>>>>>>>>>> thrown before the update and, to me, its message is not quite
>>>>>>>>>>>> clear.
>>>>>>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>>>>>>> directory,
>>>>>>>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are
>>>>>>>>>>>> default
>>>>>>>>>>>> names for output files.
>>>>>>>>>>>>
>>>>>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> sequence
>>>>>>>>>>>>>> file format, but the Key is never used. To my knowledge, there
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> no
>>>>>>>>>>>>>> requirement to assign identifiers to the input points*. Users
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>> free
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> associate an arbitrary name field with each vector - also
>>>>>>>>>>>>>> label
>>>>>>>>>>>>>> mappings
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>> be assigned - but these are not manipulated by KMeans or any
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> other
>>>>>>>>>>>>>> clustering applications. The name field is now used as a
>>>>>>>>>>>>>> vector
>>>>>>>>>>>>>> identifier
>>>>>>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output
>>>>>>>>>>>>>> step
>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The key may not be used internally, but externally they can
>>>>>>>>>>>>> prove
>>>>>>>>>>>>> to
>>>>>>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
>>>>>>>>>>>>> represents
>>>>>>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows
>>>>>>>>>>>>> me
>>>>>>>>>>>>> to,
>>>>>>>>>>>>> for instance, retrieve user information using data directly
>>>>>>>>>>>>> from
>>>>>>>>>>>>> a
>>>>>>>>>>>>> HDFS file's field.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com/
>>>>>>>>
>>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>>>> using
>>>>>>>> Solr/Lucene:
>>>>>>>> http://www.lucidimagination.com/search
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
nfantone wrote:
> Hi there, Jeff.
>
> I'm currently running KMeans via its Driver as:
>
> KMeansDriver.runJob("input/user.data", "init", "output",
> EuclideanDistanceMeasure.class.getName(), 0.001, 40, 1,
> SparseVector.class);
>   
You are only specifying a single reducer. Try increasing that as below.
> "user.data" is my 60Gb input file. I'll try changing the number of
> reducers from 1 to 200, which is my K, as you mentioned. Nevertheless,
> it seems as if the troublesome (aka, the one taking so long) part is the
> mapping part, as the log output prints things of this sort
> INFO:  map 67% reduce 0%,
> and it just resets the counter when 100% is reached (reduce percentage
> is always displayed as 0%).
>
> As a side note, Hadoop's official wiki explains:
>
> "The right number of reduces seems to be 0.95 or 1.75 multiplied by
> (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum)."
>
> I assume that by "no. of nodes" it's referring to the number of
> clusters trying to be produced (though, I'm not quite certain), so the
> appropriate number of Reducers should be
> 0.95*K*mapred.tasktracker.reduce.tasks.maximum.
>   
No, number of nodes is the number of nodes (computers) in your cluster. 
You did not say how many nodes you are running on.
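As a worked example: with a single machine and Hadoop's shipped default
mapred.tasktracker.reduce.tasks.maximum of 2, that rule of thumb gives
0.95 * 1 * 2, i.e. about 2 reducers - nowhere near 200.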
> Regarding the number of mappers, the wiki suggests that its number is
> automatically computed by the Hadoop engine, although a hint can be
> provided by calling setNumMapTasks() (which, by the way, seems to be
> deprecated now). I'm not so sure how to increment this number now.
> I'll keep on investigating the documentation.
>   
Yes, Hadoop allocates this automatically. How many map tasks are being 
spawned?
> For now, the clustering is STILL running in the background, ha.
>
> On Wed, Jul 15, 2009 at 12:30 PM, Jeff
> Eastman<jd...@windwardsolutions.com> wrote:
>   
>> Glad to hear KMeans is working reliably now. Your performance problems will
>> require some additional tuning. Here are some suggestions:
>> - You did not mention how many mappers are running in your job. With 60gb in
>> a single input file, I would think Hadoop would allocate multiple mapper
>> tasks automatically, since there are thousands of potential splits. If this
>> is not happening (is the file compressed?), then breaking it into multiple
>> parts in a preprocessing step would allow you to get more concurrency in the
>> map phase.
>> - Same with the reducers; how many are you running and what is your K? The
>> default number of reducers is 2, but you can increase this up to the number
>> of clusters to increase parallelism. Unlike Canopy and Mean Shift, KMeans
>> can use multiple reducers up to that limit.
>> - Finally, what is the size of your cluster? Adding machines would be
>> another way to increase concurrency, since map and reduce tasks are spread
>> across the entire cluster.
>>
>> 60 gb is a small dataset for Hadoop. I don't think it should be taking that
>> long.
>> Jeff
>>
>> nfantone wrote:
>>     
>>> After updating to the latest revision, everything seems to be working
>>> just fine. However, the task I set up to do, user clustering by
>>> KMeans, is taking forever to complete: I initiated the job yesterday
>>> morning and it's still running today (an elapsed time of nearly 18hs
>>> and counting...). Of course, the main reason behind it is the huge
>>> size of the data set I'm trying to process (a ~60Gb HDFS file), but
>>> I'm looking for ways to improve the performance. Would splitting the
>>> input file into smaller parts make any difference? Is it even possible
>>> to set the Driver in order to use more than one input (right now, I'm
>>> specifying a full path to a single file, including its filename)? What
>>> about setting a higher number of reducers? Are there any drawbacks to
>>> that? Running multiple KMeans jobs in several threads?
>>>
>>> Or perhaps, I'm just doing something wrong and should not be taking
>>> this long. Surely, I'm not the first one to encounter this running
>>> time issue with large datasets. Ideas, anyone?
>>>
>>>
>>> On Mon, Jul 13, 2009 at 2:39 PM, nfantone<nf...@gmail.com> wrote:
>>>
>>>       
>>>> Great work. It works like a charm now. Thank you very much.
>>>>
>>>> On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<jd...@windwardsolutions.com>
>>>> wrote:
>>>>
>>>>         
>>>>> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
>>>>> cluster part files. Unit test now runs without error and the synthetic
>>>>> control job completes too.
>>>>>
>>>>>
>>>>> Jeff Eastman wrote:
>>>>>
>>>>>           
>>>>>> In this case, the code should be reading all of the clusters into
>>>>>> memory
>>>>>> to see if they have all converged. These may be split into multiple
>>>>>> part
>>>>>> files if more than one reducer is specified. So /* is the correct file
>>>>>> pattern and it is the calling site that should remove the /part-0000
>>>>>> reference. The code in isConverged should loop through all the parts,
>>>>>> returning if they have all converged or not.
>>>>>>
>>>>>> I'll take a detailed look tomorrow.
>>>>>>
>>>>>>
>>>>>> Grant Ingersoll wrote:
>>>>>>
>>>>>>             
>>>>>>> Hmm, that might be a mistake on my part when trying to resolve how
>>>>>>> Hadoop
>>>>>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where
>>>>>>> needed, but
>>>>>>> I think it is likely worth revisiting here where a specific file is
>>>>>>> needed?
>>>>>>>
>>>>>>> -Grant
>>>>>>>
>>>>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> This error is still bugging me. The exception:
>>>>>>>>
>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>>> does not exist.
>>>>>>>>
>>>>>>>> occurs first at:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>>>>>
>>>>>>>> which corresponds to:
>>>>>>>>
>>>>>>>>  private static boolean isConverged(String filePath, JobConf conf,
>>>>>>>> FileSystem fs)
>>>>>>>>    throws IOException {
>>>>>>>>  Path outPart = new Path(filePath + "/*");
>>>>>>>>  SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>>>>>> conf);  <-- THIS
>>>>>>>>  ...
>>>>>>>>  }
>>>>>>>>
>>>>>>>> where isConverged() is called in this fashion:
>>>>>>>>
>>>>>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>>>>>
>>>>>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>>>>>
>>>>>>>>   String clustersOut = output + "/clusters-" + iteration;
>>>>>>>>    converged = runIteration(input, clustersIn, clustersOut,
>>>>>>>> measureClass,
>>>>>>>>        delta, numReduceTasks, iteration);
>>>>>>>>
>>>>>>>> Consequently, assuming it's the first iteration and the output folder
>>>>>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>>>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>>>>>> believe the path should end in "part-00000" and the  + "/*" should be
>>>>>>>> removed... although someone, evidently, thought otherwise.
>>>>>>>>
>>>>>>>> Any feedback?
>>>>>>>>
>>>>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>>>>>> revision 790979 before updating.  Strangely, there were no changes
>>>>>>>>> in
>>>>>>>>> the job and drivers class from that revision. svn diff shows that
>>>>>>>>> the
>>>>>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>>>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>>>>>>
>>>>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
>>>>>>>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>>>> Hum, no, it's looking for the output of the first iteration. Were
>>>>>>>>>> there
>>>>>>>>>> other errors? What was the last revision you were running? It does
>>>>>>>>>> look like
>>>>>>>>>> something got horked, as it should be looking for
>>>>>>>>>> output/clusters-0/*.
>>>>>>>>>> Can
>>>>>>>>>> you diff the job and driver class to see what changed?
>>>>>>>>>>
>>>>>>>>>> Jeff
>>>>>>>>>>
>>>>>>>>>> nfantone wrote:
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> Fellows, today I updated to revision 791558 and while running
>>>>>>>>>>> kMeans
>>>>>>>>>>> I
>>>>>>>>>>> got the following exception:
>>>>>>>>>>>
>>>>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>>>>>> does not exist.
>>>>>>>>>>>
>>>>>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>>>>>> thrown before the update and, to me, its message is not quite
>>>>>>>>>>> clear.
>>>>>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>>>>>> directory,
>>>>>>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are
>>>>>>>>>>> default
>>>>>>>>>>> names for output files.
>>>>>>>>>>>
>>>>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                         
>>>>>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is
>>>>>>>>>>>>> in
>>>>>>>>>>>>> sequence
>>>>>>>>>>>>> file format, but the Key is never used. To my knowledge, there
>>>>>>>>>>>>> is
>>>>>>>>>>>>> no
>>>>>>>>>>>>> requirement to assign identifiers to the input points*. Users
>>>>>>>>>>>>> are
>>>>>>>>>>>>> free
>>>>>>>>>>>>> to
>>>>>>>>>>>>> associate an arbitrary name field with each vector - also label
>>>>>>>>>>>>> mappings
>>>>>>>>>>>>> may
>>>>>>>>>>>>> be assigned - but these are not manipulated by KMeans or any of
>>>>>>>>>>>>> the
>>>>>>>>>>>>> other
>>>>>>>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>>>>>>>> identifier
>>>>>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output
>>>>>>>>>>>>> step
>>>>>>>>>>>>> only.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>                           
>>>>>>>>>>>> The key may not be used internally, but externally they can prove
>>>>>>>>>>>> to
>>>>>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
>>>>>>>>>>>> represents
>>>>>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me
>>>>>>>>>>>> to,
>>>>>>>>>>>> for instance, retrieve user information using data directly from
>>>>>>>>>>>> a
>>>>>>>>>>>> HDFS file's field.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                         
>>>>>>>>>>>                       
>>>>>>>>>>                     
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com/
>>>>>>>
>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>>> using
>>>>>>> Solr/Lucene:
>>>>>>> http://www.lucidimagination.com/search
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>             
>>>>>           
>>>
>>>       
>>     
>
>
>   


Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Hi there, Jeff.

I'm currently running KMeans via its Driver as:

KMeansDriver.runJob("input/user.data", "init", "output",
EuclideanDistanceMeasure.class.getName(), 0.001, 40, 1,
SparseVector.class);

"user.data" is my 60Gb input file. I'll try changing the number of
reducers from 1 to 200, which is my K, as you mentioned. Nevertheless,
it seems as the troublesome (aka, the one taking so long) part is the
mapping part, as the log output prints things of this sort
INFO:  map 67% reduce 0%,
and it just resets the counter when 100% is reached (reduce percentage
is always displayed as 0%).

As a side note, Hadoop's official wiki explains:

"The right number of reduces seems to be 0.95 or 1.75 multiplied by
(<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum)."

I assume that by "no. of nodes" it's referring to the number of
clusters trying to be produced (though I'm not quite certain), so the
appropriate number of reducers should be
0.95*K*mapred.tasktracker.reduce.tasks.maximum.
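
Plugging hypothetical numbers into that formula (whether "no. of
nodes" means machines or clusters is exactly what I'm unsure about;
the variable names below are mine):

  // Hypothetical worked example of the wiki's reducer formula.
  int nodes = 10;                // assumed <no. of nodes>
  int maxPerNode = 2;            // mapred.tasktracker.reduce.tasks.maximum
  int reducers = (int) (0.95 * nodes * maxPerNode);  // -> 19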

Regarding the number of mappers, the wiki suggests that their number
is computed automatically by the Hadoop engine, although a hint can be
provided by calling setNumMapTasks() (which, by the way, seems to be
deprecated now). I'm not so sure how to increase this number now.
I'll keep on investigating the documentation.
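
In case it's useful, a minimal sketch of both knobs on the old mapred
API (assuming Hadoop 0.20-era classes; the map count is only a hint,
since the real number of map tasks follows from the input splits):

  import org.apache.hadoop.mapred.JobConf;

  JobConf conf = new JobConf();
  conf.setNumMapTasks(64);      // a hint only; one task per split wins
  conf.setNumReduceTasks(200);  // honored exactly as configured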

For now, the clustering is STILL running in the background, ha.

On Wed, Jul 15, 2009 at 12:30 PM, Jeff
Eastman<jd...@windwardsolutions.com> wrote:
> Glad to hear KMeans is working reliably now. Your performance problems will
> require some additional tuning. Here are some suggestions:
> - You did not mention how many mappers are running in your job. With 60GB in
> a single input file, I would think Hadoop would allocate multiple mapper
> tasks automatically, since there are thousands of potential splits. If this
> is not happening (is the file compressed?), then breaking it into multiple
> parts in a preprocessing step would allow you to get more concurrency in the
> map phase.
> - Same with the reducers; how many are you running and what is your K? The
> default number of reducers is 2, but you can increase this up to the number
> of clusters to increase parallelism. Unlike Canopy and Mean Shift, KMeans
> can use multiple reducers up to that limit.
> - Finally, what is the size of your cluster? Adding machines would be
> another way to increase concurrency, since map and reduce tasks are spread
> across the entire cluster.
>
> 60 GB is a small dataset for Hadoop. I don't think it should be taking that
> long.
> Jeff
>
> nfantone wrote:
>>
>> After updating to the latest revision, everything seems to be working
>> just fine. However, the task I set up to do, user clustering by
>> KMeans, is taking forever to complete: I initiated the job yesterday
>> morning and it's still running today (an elapsed time of nearly 18
>> hours and counting...). Of course, the main reason behind it is the
>> huge size of the data set I'm trying to process (a ~60GB HDFS file),
>> but I'm looking for ways to improve the performance. Would splitting
>> the input file into smaller parts make any difference? Is it even
>> possible to set up the Driver to use more than one input (right now,
>> I'm specifying a full path to a single file, including its filename)?
>> What about setting a higher number of reducers? Are there any
>> drawbacks to that? Running multiple KMeans jobs in several threads?
>>
>> Or perhaps I'm just doing something wrong and it should not be taking
>> this long. Surely, I'm not the first one to encounter this running
>> time issue with large datasets. Ideas, anyone?
>>
>>
>> On Mon, Jul 13, 2009 at 2:39 PM, nfantone<nf...@gmail.com> wrote:
>>
>>>
>>> Great work. It works like a charm now. Thank you very much.
>>>
>>> On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<jd...@windwardsolutions.com>
>>> wrote:
>>>
>>>>
>>>> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
>>>> cluster part files. Unit test now runs without error and the synthetic
>>>> control job completes too.
>>>>
>>>>
>>>> Jeff Eastman wrote:
>>>>
>>>>>
>>>>> In this case, the code should be reading all of the clusters into
>>>>> memory
>>>>> to see if they have all converged. These may be split into multiple
>>>>> part
>>>>> files if more than one reducer is specified. So /* is the correct file
>>>>> pattern and it is the calling site that should remove the /part-00000
>>>>> reference. The code in isConverged should loop through all the parts,
>>>>> returning if they have all converged or not.
>>>>>
>>>>> I'll take a detailed look tomorrow.
>>>>>
>>>>>
>>>>> Grant Ingersoll wrote:
>>>>>
>>>>>>
>>>>>> Hmm, that might be a mistake on my part when trying to resolve how
>>>>>> Hadoop
>>>>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where
>>>>>> needed, but
>>>>>> I think it is likely worth revisiting here where a specific file is
>>>>>> needed?
>>>>>>
>>>>>> -Grant
>>>>>>
>>>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> This error is still bugging me. The exception:
>>>>>>>
>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>> does not exist.
>>>>>>>
>>>>>>> occurs first at:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>>>>
>>>>>>> which corresponds to:
>>>>>>>
>>>>>>>  private static boolean isConverged(String filePath, JobConf conf,
>>>>>>> FileSystem fs)
>>>>>>>    throws IOException {
>>>>>>>  Path outPart = new Path(filePath + "/*");
>>>>>>>  SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>>>>> conf);  <-- THIS
>>>>>>>  ...
>>>>>>>  }
>>>>>>>
>>>>>>> where isConverged() is called in this fashion:
>>>>>>>
>>>>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>>>>
>>>>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>>>>
>>>>>>>   String clustersOut = output + "/clusters-" + iteration;
>>>>>>>    converged = runIteration(input, clustersIn, clustersOut,
>>>>>>> measureClass,
>>>>>>>        delta, numReduceTasks, iteration);
>>>>>>>
>>>>>>> Consequently, assuming it's the first iteration and the output folder
>>>>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>>>>> believe the path should end in "part-00000" and the "/*" suffix should be
>>>>>>> removed... although someone, evidently, thought otherwise.
>>>>>>>
>>>>>>> Any feedback?
>>>>>>>
>>>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>>>>> revision 790979 before updating.  Strangely, there were no changes
>>>>>>>> in
>>>>>>>> the job and drivers class from that revision. svn diff shows that
>>>>>>>> the
>>>>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>>>>>
>>>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
>>>>>>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hum, no, it's looking for the output of the first iteration. Were
>>>>>>>>> there
>>>>>>>>> other errors? What was the last revision you were running? It does
>>>>>>>>> look like
>>>>>>>>> something got horked, as it should be looking for
>>>>>>>>> output/clusters-0/*.
>>>>>>>>> Can
>>>>>>>>> you diff the job and driver class to see what changed?
>>>>>>>>>
>>>>>>>>> Jeff
>>>>>>>>>
>>>>>>>>> nfantone wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Fellows, today I updated to revision 791558 and while running
>>>>>>>>>> kMeans
>>>>>>>>>> I
>>>>>>>>>> got the following exception:
>>>>>>>>>>
>>>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>>>>> does not exist.
>>>>>>>>>>
>>>>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>>>>> thrown before the update and, to me, its message is not quite
>>>>>>>>>> clear.
>>>>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>>>>> directory,
>>>>>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are
>>>>>>>>>> default
>>>>>>>>>> names for output files.
>>>>>>>>>>
>>>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is
>>>>>>>>>>>> in
>>>>>>>>>>>> sequence
>>>>>>>>>>>> file format, but the Key is never used. To my knowledge, there
>>>>>>>>>>>> is
>>>>>>>>>>>> no
>>>>>>>>>>>> requirement to assign identifiers to the input points*. Users
>>>>>>>>>>>> are
>>>>>>>>>>>> free
>>>>>>>>>>>> to
>>>>>>>>>>>> associate an arbitrary name field with each vector - also label
>>>>>>>>>>>> mappings
>>>>>>>>>>>> may
>>>>>>>>>>>> be assigned - but these are not manipulated by KMeans or any of
>>>>>>>>>>>> the
>>>>>>>>>>>> other
>>>>>>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>>>>>>> identifier
>>>>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output
>>>>>>>>>>>> step
>>>>>>>>>>>> only.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The key may not be used internally, but externally they can prove
>>>>>>>>>>> to
>>>>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
>>>>>>>>>>> represents
>>>>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me
>>>>>>>>>>> to,
>>>>>>>>>>> for instance, retrieve user information using data directly from
>>>>>>>>>>> a
>>>>>>>>>>> HDFS file's field.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>> using
>>>>>> Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>>
>
>

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Glad to hear KMeans is working reliably now. Your performance problems 
will require some additional tuning. Here are some suggestions:
- You did not mention how many mappers are running in your job. With 
60GB in a single input file, I would think Hadoop would allocate 
multiple mapper tasks automatically, since there are thousands of 
potential splits. If this is not happening (is the file compressed?), 
then breaking it into multiple parts in a preprocessing step would allow 
you to get more concurrency in the map phase.
- Same with the reducers; how many are you running and what is your K? 
The default number of reducers is 2, but you can increase this up to the 
number of clusters to increase parallelism (see the sketch after this 
list). Unlike Canopy and Mean Shift, KMeans can use multiple reducers up 
to that limit.
- Finally, what is the size of your cluster? Adding machines would be 
another way to increase concurrency, since map and reduce tasks are 
spread across the entire cluster.
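
As a sketch of the second suggestion (hedged: this assumes the current 
runJob() signature of input, clustersIn, output, measureClass, delta, 
maxIterations, numReduceTasks, vectorClass, and the paths are 
hypothetical), with K = 200:

  KMeansDriver.runJob("input", "clusters", "output",
      EuclideanDistanceMeasure.class.getName(), 0.001, 40, 200,
      SparseVector.class);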

60 GB is a small dataset for Hadoop. I don't think it should be taking 
that long.
Jeff

nfantone wrote:
> After updating to the latest revision, everything seems to be working
> just fine. However, the task I set up to do, user clustering by
> KMeans, is taking forever to complete: I initiated the job yesterday
> morning and it's still running today (an elapsed time of nearly 18
> hours and counting...). Of course, the main reason behind it is the
> huge size of the data set I'm trying to process (a ~60GB HDFS file),
> but I'm looking for ways to improve the performance. Would splitting
> the input file into smaller parts make any difference? Is it even
> possible to set up the Driver to use more than one input (right now,
> I'm specifying a full path to a single file, including its filename)?
> What about setting a higher number of reducers? Are there any
> drawbacks to that? Running multiple KMeans jobs in several threads?
>
> Or perhaps I'm just doing something wrong and it should not be taking
> this long. Surely, I'm not the first one to encounter this running
> time issue with large datasets. Ideas, anyone?
>
>
> On Mon, Jul 13, 2009 at 2:39 PM, nfantone<nf...@gmail.com> wrote:
>   
>> Great work. It works like a charm now. Thank you very much.
>>
>> On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>>     
>>> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
>>> cluster part files. Unit test now runs without error and the synthetic
>>> control job completes too.
>>>
>>>
>>> Jeff Eastman wrote:
>>>       
>>>> In this case, the code should be reading all of the clusters into memory
>>>> to see if they have all converged. These may be split into multiple part
>>>> files if more than one reducer is specified. So /* is the correct file
>>>> pattern and it is the calling site that should remove the /part-00000
>>>> reference. The code in isConverged should loop through all the parts,
>>>> returning if they have all converged or not.
>>>>
>>>> I'll take a detailed look tomorrow.
>>>>
>>>>
>>>> Grant Ingersoll wrote:
>>>>         
>>>>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>>>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where needed, but
>>>>> I think it is likely worth revisiting here where a specific file is needed?
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>>>
>>>>>           
>>>>>> This error is still bugging me. The exception:
>>>>>>
>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>> does not exist.
>>>>>>
>>>>>> occurs first at:
>>>>>>
>>>>>>
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>>>
>>>>>> which corresponds to:
>>>>>>
>>>>>>  private static boolean isConverged(String filePath, JobConf conf,
>>>>>> FileSystem fs)
>>>>>>     throws IOException {
>>>>>>   Path outPart = new Path(filePath + "/*");
>>>>>>   SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>>>> conf);  <-- THIS
>>>>>>   ...
>>>>>>  }
>>>>>>
>>>>>> where isConverged() is called in this fashion:
>>>>>>
>>>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>>>
>>>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>>>
>>>>>>    String clustersOut = output + "/clusters-" + iteration;
>>>>>>     converged = runIteration(input, clustersIn, clustersOut,
>>>>>> measureClass,
>>>>>>         delta, numReduceTasks, iteration);
>>>>>>
>>>>>> Consequently, assuming it's the first iteration and the output folder
>>>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>>>> believe the path should end in "part-00000" and the "/*" suffix should be
>>>>>> removed... although someone, evidently, thought otherwise.
>>>>>>
>>>>>> Any feedback?
>>>>>>
>>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>             
>>>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>>>> revision 790979 before updating.  Strangely, there were no changes in
>>>>>>> the job and drivers class from that revision. svn diff shows that the
>>>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>>>>
>>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
>>>>>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>>>>>               
>>>>>>>> Hum, no, it's looking for the output of the first iteration. Were
>>>>>>>> there
>>>>>>>> other errors? What was the last revision you were running? It does
>>>>>>>> look like
>>>>>>>> something got horked, as it should be looking for output/clusters-0/*.
>>>>>>>> Can
>>>>>>>> you diff the job and driver class to see what changed?
>>>>>>>>
>>>>>>>> Jeff
>>>>>>>>
>>>>>>>> nfantone wrote:
>>>>>>>>                 
>>>>>>>>> Fellows, today I updated to revision 791558 and while running kMeans
>>>>>>>>> I
>>>>>>>>> got the following exception:
>>>>>>>>>
>>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>>>> does not exist.
>>>>>>>>>
>>>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>>>> directory,
>>>>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>>>>>> names for output files.
>>>>>>>>>
>>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>>>>> sequence
>>>>>>>>>>> file format, but the Key is never used. To my knowledge, there is
>>>>>>>>>>> no
>>>>>>>>>>> requirement to assign identifiers to the input points*. Users are
>>>>>>>>>>> free
>>>>>>>>>>> to
>>>>>>>>>>> associate an arbitrary name field with each vector - also label
>>>>>>>>>>> mappings
>>>>>>>>>>> may
>>>>>>>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>>>>>>>> other
>>>>>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>>>>>> identifier
>>>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>>>>>>>> only.
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>>>> The key may not be used internally, but externally they can prove to
>>>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
>>>>>>>>>> represents
>>>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>>>>>> for instance, retrieve user information using data directly from a
>>>>>>>>>> HDFS file's field.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>>                 
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>>> Solr/Lucene:
>>>>> http://www.lucidimagination.com/search
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>
>>>>         
>>>       
>
>
>   


Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 15, 2009, at 8:49 AM, nfantone wrote:

> Surely, I'm not the first one to encounter this running
> time issue with large datasets. Ideas, anyone?
>

You may very well be the first to try larger sets, although 60 GB  
doesn't seem huge.  We have not, AFAIK, done much large scale testing,  
but I'm happy to be told otherwise.  That's not to say it won't work,  
just that it needs more investigation.

-Grant

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
After updating to the latest revision, everything seems to be working
just fine. However, the task I set up to do, user clustering by
KMeans, is taking forever to complete: I initiated the job yesterday
morning and it's still running today (an elapsed time of nearly 18
hours and counting...). Of course, the main reason behind it is the
huge size of the data set I'm trying to process (a ~60GB HDFS file),
but I'm looking for ways to improve the performance. Would splitting
the input file into smaller parts make any difference? Is it even
possible to set up the Driver to use more than one input (right now,
I'm specifying a full path to a single file, including its filename)?
What about setting a higher number of reducers? Are there any
drawbacks to that? Running multiple KMeans jobs in several threads?

Or perhaps I'm just doing something wrong and it should not be taking
this long. Surely, I'm not the first one to encounter this running
time issue with large datasets. Ideas, anyone?
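
PS: on the multiple-inputs question, at the plain Hadoop level this
looks possible, since an input path may be a directory and several
paths can be registered (a hedged sketch with hypothetical paths;
whether KMeansDriver exposes this is my actual question):

  FileInputFormat.setInputPaths(conf, new Path("input"));           // a whole directory
  FileInputFormat.addInputPath(conf, new Path("extra/user2.data")); // one extra file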


On Mon, Jul 13, 2009 at 2:39 PM, nfantone<nf...@gmail.com> wrote:
> Great work. It works like a charm now. Thank you very much.
>
> On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
>> cluster part files. Unit test now runs without error and the synthetic
>> control job completes too.
>>
>>
>> Jeff Eastman wrote:
>>>
>>> In this case, the code should be reading all of the clusters into memory
>>> to see if they have all converged. These may be split into multiple part
>>> files if more than one reducer is specified. So /* is the correct file
>>> pattern and it is the calling site that should remove the /part-00000
>>> reference. The code in isConverged should loop through all the parts,
>>> returning if they have all converged or not.
>>>
>>> I'll take a detailed look tomorrow.
>>>
>>>
>>> Grant Ingersoll wrote:
>>>>
>>>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where needed, but
>>>> I think it is likely worth revisiting here where a specific file is needed?
>>>>
>>>> -Grant
>>>>
>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>>
>>>>> This error is still bugging me. The exception:
>>>>>
>>>>> WARNING: java.io.FileNotFoundException: File
>>>>> output/clusters-0/part-00000/* does not exist.
>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>> does not exist.
>>>>>
>>>>> occurs first at:
>>>>>
>>>>>
>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>>
>>>>> which corresponds to:
>>>>>
>>>>>  private static boolean isConverged(String filePath, JobConf conf,
>>>>> FileSystem fs)
>>>>>     throws IOException {
>>>>>   Path outPart = new Path(filePath + "/*");
>>>>>   SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>>> conf);  <-- THIS
>>>>>   ...
>>>>>  }
>>>>>
>>>>> where isConverged() is called in this fashion:
>>>>>
>>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>>
>>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>>
>>>>>    String clustersOut = output + "/clusters-" + iteration;
>>>>>     converged = runIteration(input, clustersIn, clustersOut,
>>>>> measureClass,
>>>>>         delta, numReduceTasks, iteration);
>>>>>
>>>>> Consequently, assuming it's the first iteration and the output folder
>>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>>> believe the path should end in "part-00000" and the "/*" suffix should be
>>>>> removed... although someone, evidently, thought otherwise.
>>>>>
>>>>> Any feedback?
>>>>>
>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>
>>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>>> revision 790979 before updating.  Strangely, there were no changes in
>>>>>> the job and drivers class from that revision. svn diff shows that the
>>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>>>
>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
>>>>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>>>>>
>>>>>>> Hum, no, it's looking for the output of the first iteration. Were
>>>>>>> there
>>>>>>> other errors? What was the last revision you were running? It does
>>>>>>> look like
>>>>>>> something got horked, as it should be looking for output/clusters-0/*.
>>>>>>> Can
>>>>>>> you diff the job and driver class to see what changed?
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> nfantone wrote:
>>>>>>>>
>>>>>>>> Fellows, today I updated to revision 791558 and while running kMeans
>>>>>>>> I
>>>>>>>> got the following exception:
>>>>>>>>
>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>>> does not exist.
>>>>>>>>
>>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>>> directory,
>>>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>>>>> names for output files.
>>>>>>>>
>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>>>> sequence
>>>>>>>>>> file format, but the Key is never used. To my knowledge, there is
>>>>>>>>>> no
>>>>>>>>>> requirement to assign identifiers to the input points*. Users are
>>>>>>>>>> free
>>>>>>>>>> to
>>>>>>>>>> associate an arbitrary name field with each vector - also label
>>>>>>>>>> mappings
>>>>>>>>>> may
>>>>>>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>>>>>>> other
>>>>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>>>>> identifier
>>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>>>>>>> only.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The key may not be used internally, but externally they can prove to
>>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
>>>>>>>>> represents
>>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>>>>> for instance, retrieve user information using data directly from a
>>>>>>>>> HDFS file's field.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>> Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Yes. After taking another look into it, I tend to agree with Jeff
here. isConverged() should be receiving an absolute path to a
directory containing all the clusters, which could have been split
into several parts.

I'll also look into that tomorrow, at work.

On Sun, Jul 12, 2009 at 7:51 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> In this case, the code should be reading all of the clusters into memory to
> see if they have all converged. These may be split into multiple part files
> if more than one reducer is specified. So /* is the correct file pattern and
> it is the calling site that should remove the /part-00000 reference. The code
> in isConverged should loop through all the parts, returning if they have all
> converged or not.
>
> I'll take a detailed look tomorrow.
>
>
> Grant Ingersoll wrote:
>>
>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where needed, but
>> I think it is likely worth revisiting here where a specific file is needed?
>>
>> -Grant
>>
>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>
>>> This error is still bugging me. The exception:
>>>
>>> WARNING: java.io.FileNotFoundException: File
>>> output/clusters-0/part-00000/* does not exist.
>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>> does not exist.
>>>
>>> occurs first at:
>>>
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>
>>> which corresponds to:
>>>
>>>  private static boolean isConverged(String filePath, JobConf conf,
>>> FileSystem fs)
>>>     throws IOException {
>>>   Path outPart = new Path(filePath + "/*");
>>>   SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>> conf);  <-- THIS
>>>   ...
>>>  }
>>>
>>> where isConverged() is called in this fashion:
>>>
>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>
>>> by runIteration(), which is previously invoked by runJob() like:
>>>
>>>    String clustersOut = output + "/clusters-" + iteration;
>>>     converged = runIteration(input, clustersIn, clustersOut,
>>> measureClass,
>>>         delta, numReduceTasks, iteration);
>>>
>>> Consequently, assuming it's the first iteration and the output folder
>>> has been named "output" by the user, the SequenceFile.Reader receives
>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>> believe the path should end in "part-00000" and the "/*" suffix should be
>>> removed... although someone, evidently, thought otherwise.
>>>
>>> Any feedback?
>>>
>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>
>>>> I was using Canopy to create input clusters, but the error appeared
>>>> while running kMeans (if I run kMeans' job only with previously
>>>> created clusters from Canopy placed in output/canopies as initial
>>>> clusters, it still fails). I noticed no other problems. I was using
>>>> revision 790979 before updating.  Strangely, there were no changes in
>>>> the job and drivers class from that revision. svn diff shows that the
>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>
>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jd...@windwardsolutions.com>
>>>> wrote:
>>>>>
>>>>> Hum, no, it's looking for the output of the first iteration. Were there
>>>>> other errors? What was the last revision you were running? It does look
>>>>> like
>>>>> something got horked, as it should be looking for output/clusters-0/*.
>>>>> Can
>>>>> you diff the job and driver class to see what changed?
>>>>>
>>>>> Jeff
>>>>>
>>>>> nfantone wrote:
>>>>>>
>>>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>>>> got the following exception:
>>>>>>
>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>> does not exist.
>>>>>>
>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>> It seems as if it's looking for any file inside a "part-00000" directory,
>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>>> names for output files.
>>>>>>
>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>> sequence
>>>>>>>> file format, but the Key is never used. To my knowledge, there is no
>>>>>>>> requirement to assign identifiers to the input points*. Users are
>>>>>>>> free
>>>>>>>> to
>>>>>>>> associate an arbitrary name field with each vector - also label
>>>>>>>> mappings
>>>>>>>> may
>>>>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>>>>> other
>>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>>> identifier
>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>>>>> only.
>>>>>>>>
>>>>>>>
>>>>>>> The key may not be used internally, but externally they can prove to
>>>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>>> for instance, retrieve user information using data directly from a
>>>>>>> HDFS file's field.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Great work. It works like a charm now. Thank you very much.

On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
> cluster part files. Unit test now runs without error and the synthetic
> control job completes too.
>
>
> Jeff Eastman wrote:
>>
>> In this case, the code should be reading all of the clusters into memory
>> to see if they have all converged. These may be split into multiple part
>> files if more than one reducer is specified. So /* is the correct file
>> pattern and it is the calling site that should remove the /part-00000
>> reference. The code in isConverged should loop through all the parts,
>> returning if they have all converged or not.
>>
>> I'll take a detailed look tomorrow.
>>
>>
>> Grant Ingersoll wrote:
>>>
>>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where needed, but
>>> I think it is likely worth revisiting here where a specific file is needed?
>>>
>>> -Grant
>>>
>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>
>>>> This error is still bugging me. The exception:
>>>>
>>>> WARNING: java.io.FileNotFoundException: File
>>>> output/clusters-0/part-00000/* does not exist.
>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>> does not exist.
>>>>
>>>> occurs first at:
>>>>
>>>>
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>
>>>> which corresponds to:
>>>>
>>>>  private static boolean isConverged(String filePath, JobConf conf,
>>>> FileSystem fs)
>>>>     throws IOException {
>>>>   Path outPart = new Path(filePath + "/*");
>>>>   SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>> conf);  <-- THIS
>>>>   ...
>>>>  }
>>>>
>>>> where isConverged() is called in this fashion:
>>>>
>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>
>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>
>>>>    String clustersOut = output + "/clusters-" + iteration;
>>>>     converged = runIteration(input, clustersIn, clustersOut,
>>>> measureClass,
>>>>         delta, numReduceTasks, iteration);
>>>>
>>>> Consequently, assuming it's the first iteration and the output folder
>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>> believe the path should end in "part-00000" and the "/*" suffix should be
>>>> removed... although someone, evidently, thought otherwise.
>>>>
>>>> Any feedback?
>>>>
>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>>>
>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>> revision 790979 before updating.  Strangely, there were no changes in
>>>>> the job and drivers class from that revision. svn diff shows that the
>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>>
>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
>>>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>>>>
>>>>>> Hum, no, it's looking for the output of the first iteration. Were
>>>>>> there
>>>>>> other errors? What was the last revision you were running? It does
>>>>>> look like
>>>>>> something got horked, as it should be looking for output/clusters-0/*.
>>>>>> Can
>>>>>> you diff the job and driver class to see what changed?
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> nfantone wrote:
>>>>>>>
>>>>>>> Fellows, today I updated to revision 791558 and while running kMeans
>>>>>>> I
>>>>>>> got the following exception:
>>>>>>>
>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>> does not exist.
>>>>>>>
>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>> directory,
>>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>>>> names for output files.
>>>>>>>
>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>>> sequence
>>>>>>>>> file format, but the Key is never used. To my knowledge, there is
>>>>>>>>> no
>>>>>>>>> requirement to assign identifiers to the input points*. Users are
>>>>>>>>> free
>>>>>>>>> to
>>>>>>>>> associate an arbitrary name field with each vector - also label
>>>>>>>>> mappings
>>>>>>>>> may
>>>>>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>>>>>> other
>>>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>>>> identifier
>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>>>>>> only.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The key may not be used internally, but externally they can prove to
>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
>>>>>>>> represents
>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>>>> for instance, retrieve user information using data directly from a
>>>>>>>> HDFS file's field.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>> Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>>
>>
>>
>>
>
>

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
r793620 fixes the KMeansDriver.isConverged() method to iterate over all 
cluster part files. Unit test now runs without error and the synthetic 
control job completes too.


Jeff Eastman wrote:
> In this case, the code should be reading all of the clusters into 
> memory to see if they have all converged. These may be split into 
> multiple part files if more than one reducer is specified. So /* is 
> the correct file pattern and it is the calling site that should remove 
> the /part-0000 reference. The code in isConverged should loop through 
> all the parts, returning if they have all converged or not.
>
> I'll take a detailed look tomorrow.
>
>
> Grant Ingersoll wrote:
>> Hmm, that might be a mistake on my part when trying to resolve how 
>> Hadoop 0.20 now resolves globs.  I somewhat blindly applied "/*" 
>> where needed, but I think it is likely worth revisiting here where a 
>> specific file is needed?
>>
>> -Grant
>>
>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>
>>> This error is still bugging me. The exception:
>>>
>>> WARNING: java.io.FileNotFoundException: File
>>> output/clusters-0/part-00000/* does not exist.
>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>> does not exist.
>>>
>>> occurs first at:
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298) 
>>>
>>>
>>> which corresponds to:
>>>
>>>  private static boolean isConverged(String filePath, JobConf conf,
>>> FileSystem fs)
>>>      throws IOException {
>>>    Path outPart = new Path(filePath + "/*");
>>>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>> conf);  <-- THIS
>>>    ...
>>>  }
>>>
>>> where isConverged() is called in this fashion:
>>>
>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>
>>> by runIteration(), which is previously invoked by runJob() like:
>>>
>>>     String clustersOut = output + "/clusters-" + iteration;
>>>      converged = runIteration(input, clustersIn, clustersOut, 
>>> measureClass,
>>>          delta, numReduceTasks, iteration);
>>>
>>> Consequently, assuming it's the first iteration and the output folder
>>> has been named "output" by the user, the SequenceFile.Reader receives
>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>> believe the path should end in "part-00000" and the "/*" suffix should be
>>> removed... although someone, evidently, thought otherwise.
>>>
>>> Any feedback?
>>>
>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>>> I was using Canopy to create input clusters, but the error appeared
>>>> while running kMeans (if I run kMeans' job only with previously
>>>> created clusters from Canopy placed in output/canopies as initial
>>>> clusters, it still fails). I noticed no other problems. I was using
>>>> revision 790979 before updating.  Strangely, there were no changes in
>>>> the job and drivers class from that revision. svn diff shows that the
>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>
>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff 
>>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>>> Hum, no, it's looking for the output of the first iteration. Were 
>>>>> there
>>>>> other errors? What was the last revision you were running? It does 
>>>>> look like
>>>>> something got horked, as it should be looking for 
>>>>> output/clusters-0/*. Can
>>>>> you diff the job and driver class to see what changed?
>>>>>
>>>>> Jeff
>>>>>
>>>>> nfantone wrote:
>>>>>>
>>>>>> Fellows, today I updated to revision 791558 and while running 
>>>>>> kMeans I
>>>>>> got the following exception:
>>>>>>
>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>> does not exist.
>>>>>>
>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>> It seems as if it's looking for any file inside a "part-00000" 
>>>>>> directory,
>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>>> names for output files.
>>>>>>
>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>> sequence
>>>>>>>> file format, but the Key is never used. To my knowledge, there 
>>>>>>>> is no
>>>>>>>> requirement to assign identifiers to the input points*. Users 
>>>>>>>> are free
>>>>>>>> to
>>>>>>>> associate an arbitrary name field with each vector - also label 
>>>>>>>> mappings
>>>>>>>> may
>>>>>>>> be assigned - but these are not manipulated by KMeans or any of 
>>>>>>>> the
>>>>>>>> other
>>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>>> identifier
>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output 
>>>>>>>> step
>>>>>>>> only.
>>>>>>>>
>>>>>>>
>>>>>>> The key may not be used internally, but externally they can 
>>>>>>> prove to
>>>>>>> be pretty useful. For me, keys are userIDs and each Vector 
>>>>>>> represents
>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me 
>>>>>>> to,
>>>>>>> for instance, retrieve user information using data directly from a
>>>>>>> HDFS file's field.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
>
>


Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
In this case, the code should be reading all of the clusters into memory 
to see if they have all converged. These may be split into multiple part 
files if more than one reducer is specified. So /* is the correct file 
pattern and it is the calling site that should remove the /part-00000 
reference. The code in isConverged should loop through all the parts, 
returning if they have all converged or not.

I'll take a detailed look tomorrow.
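
Roughly, the loop I have in mind would look like this (a minimal,
untested sketch: it assumes fs.globStatus() for the part-file
expansion and that the kmeans Cluster class is the Writable value in
those files, with an isConverged() accessor):

  private static boolean isConverged(String clustersOut, JobConf conf, FileSystem fs)
      throws IOException {
    // Expand the reducer outputs: part-00000, part-00001, ...
    FileStatus[] parts = fs.globStatus(new Path(clustersOut + "/part-*"));
    for (FileStatus part : parts) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      try {
        Text key = new Text();
        Cluster value = new Cluster();
        while (reader.next(key, value)) {
          if (!value.isConverged()) {
            return false;  // one unconverged cluster forces another iteration
          }
        }
      } finally {
        reader.close();
      }
    }
    return true;  // every cluster in every part file has converged
  }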


Grant Ingersoll wrote:
> Hmm, that might be a mistake on my part when trying to resolve how 
> Hadoop 0.20 now resolves globs.  I somewhat blindly applied "/*" where 
> needed, but I think it is likely worth revisiting here where a 
> specific file is needed?
>
> -Grant
>
> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>
>> This error is still bugging me. The exception:
>>
>> WARNING: java.io.FileNotFoundException: File
>> output/clusters-0/part-00000/* does not exist.
>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>> does not exist.
>>
>> occurs first at:
>>
>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298) 
>>
>>
>> which corresponds to:
>>
>>  private static boolean isConverged(String filePath, JobConf conf,
>> FileSystem fs)
>>      throws IOException {
>>    Path outPart = new Path(filePath + "/*");
>>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>> conf);  <-- THIS
>>    ...
>>  }
>>
>> where isConverged() is called in this fashion:
>>
>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>
>> by runIteration(), which is previously invoked by runJob() like:
>>
>>     String clustersOut = output + "/clusters-" + iteration;
>>      converged = runIteration(input, clustersIn, clustersOut, 
>> measureClass,
>>          delta, numReduceTasks, iteration);
>>
>> Consequently, assuming it's the first iteration and the output folder
>> has been named "output" by the user, the SequenceFile.Reader receives
>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>> believe the path should end in "part-00000" and the "/*" suffix should be
>> removed... although someone, evidently, thought otherwise.
>>
>> Any feedback?
>>
>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>>> I was using Canopy to create input clusters, but the error appeared
>>> while running kMeans (if I run kMeans' job only with previously
>>> created clusters from Canopy placed in output/canopies as initial
>>> clusters, it still fails). I noticed no other problems. I was using
>>> revision 790979 before updating.  Strangely, there were no changes in
>>> the job and drivers class from that revision. svn diff shows that the
>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>
>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff 
>>> Eastman<jd...@windwardsolutions.com> wrote:
>>>> Hum, no, it's looking for the output of the first iteration. Were 
>>>> there
>>>> other errors? What was the last revision you were running? It does 
>>>> look like
>>>> something got horked, as it should be looking for 
>>>> output/clusters-0/*. Can
>>>> you diff the job and driver class to see what changed?
>>>>
>>>> Jeff
>>>>
>>>> nfantone wrote:
>>>>>
>>>>> Fellows, today I updated to revision 791558 and while running 
>>>>> kMeans I
>>>>> got the following exception:
>>>>>
>>>>> WARNING: java.io.FileNotFoundException: File
>>>>> output/clusters-0/part-00000/* does not exist.
>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>> does not exist.
>>>>>
>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>> directory,
>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>> names for output files.
>>>>>
>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>
>>>>>
>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Thanks for the feedback, Jeff.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>> sequence
>>>>>>> file format, but the Key is never used. To my knowledge, there 
>>>>>>> is no
>>>>>>> requirement to assign identifiers to the input points*. Users 
>>>>>>> are free
>>>>>>> to
>>>>>>> associate an arbitrary name field with each vector - also label 
>>>>>>> mappings
>>>>>>> may
>>>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>>>> other
>>>>>>> clustering applications. The name field is now used as a vector
>>>>>>> identifier
>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>>>> only.
>>>>>>>
>>>>>>
>>>>>> The key may not be used internally, but externally they can prove to
>>>>>> be pretty useful. For me, keys are userIDs and each Vector 
>>>>>> represents
>>>>>> his/her historical behavior. Being able to collect the output
>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>> for instance, retrieve user information using data directly from a
>>>>>> HDFS file's field.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>


Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
Hmm, that might be a mistake on my part when trying to resolve how  
Hadoop 0.20 now resolves globs.  I somewhat blindly applied "/*" where  
needed, but I think it is likely worth revisiting here where a  
specific file is needed?
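
Something along these lines is probably what that spot wants (a hedged
sketch: globs have to be expanded explicitly, e.g. via
FileSystem.globStatus(), while SequenceFile.Reader takes its Path
argument literally):

  FileStatus[] parts = fs.globStatus(new Path("output/clusters-0/part-*"));
  for (FileStatus part : parts) {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
    // ... read entries, then reader.close();
  }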

-Grant

On Jul 10, 2009, at 3:08 PM, nfantone wrote:

> This error is still bugging me. The exception:
>
> WARNING: java.io.FileNotFoundException: File
> output/clusters-0/part-00000/* does not exist.
> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
> does not exist.
>
> occurs first at:
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>
> which corresponds to:
>
>  private static boolean isConverged(String filePath, JobConf conf,
> FileSystem fs)
>      throws IOException {
>    Path outPart = new Path(filePath + "/*");
>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
> conf);  <-- THIS
>    ...
>  }
>
> where isConverged() is called in this fashion:
>
> return isConverged(clustersOut + "/part-00000", conf, fs);
>
> by runIteration(), which is previously invoked by runJob() like:
>
>     String clustersOut = output + "/clusters-" + iteration;
>      converged = runIteration(input, clustersIn, clustersOut,  
> measureClass,
>          delta, numReduceTasks, iteration);
>
> Consequently, assuming it's the first iteration and the output folder
> has been named "output" by the user, the SequenceFile.Reader receives
> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
> believe the path should end in "part-00000" and the "/*" suffix should be
> removed... although someone, evidently, thought otherwise.
>
> Any feedback?
>
> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
>> I was using Canopy to create input clusters, but the error appeared
>> while running kMeans (if I run kMeans' job only with previously
>> created clusters from Canopy placed in output/canopies as initial
>> clusters, it still fails). I noticed no other problems. I was using
>> revision 790979 before updating.  Strangely, there were no changes in
>> the job and drivers class from that revision. svn diff shows that the
>> only classes that changed in org.apache.mahout.clustering.kmeans
>> package were KMeansInfo.java and RandomSeedGenerator.java
>>
>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jdog@windwardsolutions.com 
>> > wrote:
>>> Hum, no, it's looking for the output of the first iteration. Were  
>>> there
>>> other errors? What was the last revision you were running? It does  
>>> look like
>>> something got horked, as it should be looking for output/ 
>>> clusters-0/*. Can
>>> you diff the job and driver class to see what changed?
>>>
>>> Jeff
>>>
>>> nfantone wrote:
>>>>
>>>> Fellows, today I updated to revision 791558 and while running  
>>>> kMeans I
>>>> got the following exception:
>>>>
>>>> WARNING: java.io.FileNotFoundException: File
>>>> output/clusters-0/part-00000/* does not exist.
>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>> does not exist.
>>>>
>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>> thrown before the update and, to me, its message is not quite  
>>>> clear.
>>>> It seems as if it's looking for any file inside a "part-00000"
>>>> directory,
>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are  
>>>> default
>>>> names for output files.
>>>>
>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>
>>>>
>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Thanks for the feedback, Jeff.
>>>>>
>>>>>
>>>>>>
>>>>>> The logical format of input to KMeans is <Key, Vector> as it is  
>>>>>> in
>>>>>> sequence
>>>>>> file format, but the Key is never used. To my knowledge, there  
>>>>>> is no
>>>>>> requirement to assign identifiers to the input points*. Users  
>>>>>> are free
>>>>>> to
>>>>>> associate an arbitrary name field with each vector - also label  
>>>>>> mappings
>>>>>> may
>>>>>> be assigned - but these are not manipulated by KMeans or any of  
>>>>>> the
>>>>>> other
>>>>>> clustering applications. The name field is now used as a vector
>>>>>> identifier
>>>>>> by the KMeansClusterMapper - if it is non-null - in the output  
>>>>>> step
>>>>>> only.
>>>>>>
>>>>>
>>>>> The key may not be used internally, but externally they can  
>>>>> prove to
>>>>> be pretty useful. For me, keys are userIDs and each Vector  
>>>>> represents
>>>>> his/her historical behavior. Being able to collect the output
>>>>> information as <UserID, ClusterID> is quite neat as it allows me  
>>>>> to,
>>>>> for instance, retrieve user information using data directly from an
>>>>> HDFS file's field.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
This error is still bugging me. The exception:

WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.

occurs first at:

org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)

which corresponds to:

  private static boolean isConverged(String filePath, JobConf conf,
FileSystem fs)
      throws IOException {
    Path outPart = new Path(filePath + "/*");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
conf);  <-- THIS
    ...
  }

where isConverged() is called in this fashion:

return isConverged(clustersOut + "/part-00000", conf, fs);

by runIteration(), which is previously invoked by runJob() like:

     String clustersOut = output + "/clusters-" + iteration;
      converged = runIteration(input, clustersIn, clustersOut, measureClass,
          delta, numReduceTasks, iteration);

Consequently, assuming it's the first iteration and the output folder
has been named "output" by the user, the SequenceFile.Reader receives
"output/clusters-0/part-00000/*" as a path, which is non-existent. I
believe the path should end in "part-00000" and the "/*" suffix should
be removed... although someone, evidently, thought otherwise.
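
In code, the smallest fix I can see looks like this - a sketch only,
untested; the Text key and the Cluster.isConverged() call are my
assumptions from reading the surrounding driver code:

private static boolean isConverged(String filePath, JobConf conf,
    FileSystem fs) throws IOException {
  // filePath already names the part file ("/part-00000"), so read it as-is
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(filePath), conf);
  try {
    Text key = new Text();
    Cluster value = new Cluster();
    // a single unconverged cluster forces another iteration
    while (reader.next(key, value)) {
      if (!value.isConverged()) {
        return false;
      }
    }
    return true;
  } finally {
    reader.close();
  }
}

If more than one reducer ever writes these files, it would probably be
safer to iterate over a "part-*" glob via fs.globStatus() instead of
hard-coding "part-00000".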

Any feedback?

On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nf...@gmail.com> wrote:
> I was using Canopy to create input clusters, but the error appeared
> while running kMeans (if I run kMeans' job only with previously
> created clusters from Canopy placed in output/canopies as initial
> clusters, it still fails). I noticed no other problems. I was using
> revision 790979 before updating. Strangely, there were no changes to
> the job and driver classes from that revision. svn diff shows that the
> only classes that changed in the org.apache.mahout.clustering.kmeans
> package were KMeansInfo.java and RandomSeedGenerator.java.
>
> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>> Hum, no, it's looking for the output of the first iteration. Were there
>> other errors? What was the last revision you were running? It does look like
>> something got horked, as it should be looking for output/clusters-0/*. Can
>> you diff the job and driver class to see what changed?
>>
>> Jeff
>>
>> nfantone wrote:
>>>
>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>> got the following exception:
>>>
>>> WARNING: java.io.FileNotFoundException: File
>>> output/clusters-0/part-00000/* does not exist.
>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>> does not exist.
>>>
>>> The algorithm isn't interrupted, though. But this exception wasn't
>>> thrown before the update and, to me, its message is not quite clear.
>>> It seems as if it's looking for any file inside a "part-00000" directory,
>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>> names for output files.
>>>
>>> I could show the entire stack trace, if needed. Any pointers?
>>>
>>>
>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>>
>>>>
>>>> Thanks for the feedback, Jeff.
>>>>
>>>>
>>>>>
>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>> sequence
>>>>> file format, but the Key is never used. To my knowledge, there is no
>>>>> requirement to assign identifiers to the input points*. Users are free
>>>>> to
>>>>> associate an arbitrary name field with each vector - also label mappings
>>>>> may
>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>> other
>>>>> clustering applications. The name field is now used as a vector
>>>>> identifier
>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>> only.
>>>>>
>>>>
>>>> The key may not be used internally, but externally they can prove to
>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>> his/her historical behavior. Being able to collect the output
>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>> for instance, retrieve user information using data directly from an
>>>> HDFS file's field.
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
I was using Canopy to create input clusters, but the error appeared
while running kMeans (if I run kMeans' job only with previously
created clusters from Canopy placed in output/canopies as initial
clusters, it still fails). I noticed no other problems. I was using
revision 790979 before updating. Strangely, there were no changes to
the job and driver classes from that revision. svn diff shows that the
only classes that changed in the org.apache.mahout.clustering.kmeans
package were KMeansInfo.java and RandomSeedGenerator.java.

On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> Hum, no, it's looking for the output of the first iteration. Were there
> other errors? What was the last revision you were running? It does look like
> something got horked, as it should be looking for output/clusters-0/*. Can
> you diff the job and driver class to see what changed?
>
> Jeff
>
> nfantone wrote:
>>
>> Fellows, today I updated to revision 791558 and while running kMeans I
>> got the following exception:
>>
>> WARNING: java.io.FileNotFoundException: File
>> output/clusters-0/part-00000/* does not exist.
>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>> does not exist.
>>
>> The algorithm isn't interrupted, though. But this exception wasn't
>> thrown before the update and, to me, its message is not quite clear.
>> It seems as if it's looking for any file inside a "part-00000" directory,
>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>> names for output files.
>>
>> I could show the entire stack trace, if needed. Any pointers?
>>
>>
>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>>
>>>
>>> Thanks for the feedback, Jeff.
>>>
>>>
>>>>
>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>> sequence
>>>> file format, but the Key is never used. To my knowledge, there is no
>>>> requirement to assign identifiers to the input points*. Users are free
>>>> to
>>>> associate an arbitrary name field with each vector - also label mappings
>>>> may
>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>> other
>>>> clustering applications. The name field is now used as a vector
>>>> identifier
>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>> only.
>>>>
>>>
>>> The key may not be used internally, but externally they can prove to
>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>> his/her historical behavior. Being able to collect the output
>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>> for instance, retrieve user information using data directly from an
>>> HDFS file's field.
>>>
>>>
>>
>>
>>
>
>

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It's looking for the initial set of k clusters, which are input to the
algorithm. Did you run Canopy to create them or do you have another
sampling technique to initialize these clusters?
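
If you need a way to produce that initial set, either of these should
do it - sketched from memory, so double-check the method signatures
against your revision before relying on them:

// Option 1: run Canopy over the input; the canopies it emits become
// the initial clusters (the t1/t2 thresholds are data-dependent)
CanopyDriver.runJob("input/thedata.dat", "output/canopies",
    "org.apache.mahout.utils.EuclideanDistanceMeasure", 80.0, 55.0);

// Option 2: sample k random input points as the initial centers
RandomSeedGenerator.buildRandom("input/thedata.dat", "output/clusters-0", 20);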


Jeff

nfantone wrote:
> Fellows, today I updated to revision 791558 and while running kMeans I
> got the following exception:
>
> WARNING: java.io.FileNotFoundException: File
> output/clusters-0/part-00000/* does not exist.
> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
> does not exist.
>
> The algorithm isn't interrupted, though. But this exception wasn't
> thrown before the update and, to me, its message is not quite clear.
> It seems as if it's looking for any file inside a "part-00000" directory,
> which doesn't exist; and, as far as I know, "part-xxxxx" are default
> names for output files.
>
> I could show the entire stack trace, if needed. Any pointers?
>
>
> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>   
>> Thanks for the feedback, Jeff.
>>
>>     
>>> The logical format of input to KMeans is <Key, Vector> as it is in sequence
>>> file format, but the Key is never used. To my knowledge, there is no
>>> requirement to assign identifiers to the input points*. Users are free to
>>> associate an arbitrary name field with each vector - also label mappings may
>>> be assigned - but these are not manipulated by KMeans or any of the other
>>> clustering applications. The name field is now used as a vector identifier
>>> by the KMeansClusterMapper - if it is non-null - in the output step only.
>>>       
>> The key may not be used internally, but externally they can prove to
>> be pretty useful. For me, keys are userIDs and each Vector represents
>> his/her historical behavior. Being able to collect the output
>> information as <UserID, ClusterID> is quite neat as it allows me to,
>> for instance, retrieve user information using data directly from an
>> HDFS file's field.
>>
>>     
>
>
>   


Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hum, no, it's looking for the output of the first iteration. Were there 
other errors? What was the last revision you were running? It does look 
like something got horked, as it should be looking for 
output/clusters-0/*. Can you diff the job and driver class to see what 
changed?

Jeff

nfantone wrote:
> Fellows, today I updated to revision 791558 and while running kMeans I
> got the following exception:
>
> WARNING: java.io.FileNotFoundException: File
> output/clusters-0/part-00000/* does not exist.
> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
> does not exist.
>
> The algorithm isn't interrupted, though. But this exception wasn't
> thrown before the update and, to me, its message is not quite clear.
> It seems as if it's looking for any file inside a "part-00000" directory,
> which doesn't exist; and, as far as I know, "part-xxxxx" are default
> names for output files.
>
> I could show the entire stack trace, if needed. Any pointers?
>
>
> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
>   
>> Thanks for the feedback, Jeff.
>>
>>     
>>> The logical format of input to KMeans is <Key, Vector> as it is in sequence
>>> file format, but the Key is never used. To my knowledge, there is no
>>> requirement to assign identifiers to the input points*. Users are free to
>>> associate an arbitrary name field with each vector - also label mappings may
>>> be assigned - but these are not manipulated by KMeans or any of the other
>>> clustering applications. The name field is now used as a vector identifier
>>> by the KMeansClusterMapper - if it is non-null - in the output step only.
>>>       
>> The key may not be used internally, but externally they can prove to
>> be pretty useful. For me, keys are userIDs and each Vector represents
>> his/her historical behavior. Being able to collect the output
>> information as <UserID, ClusterID> is quite neat as it allows me to,
>> for instance, retrieve user information using data directly from an
>> HDFS file's field.
>>
>>     
>
>
>   


Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Fellows, today I updated to revision 791558 and while running kMeans I
got the following exception:

WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.

The algorithm isn't interrupted, though. But this exception wasn't
thrown before the update and, to me, its message is not quite clear.
It seems as if it's looking for any file inside a "part-00000" directory,
which doesn't exist; and, as far as I know, "part-xxxxx" are default
names for output files.

I could show the entire stack trace, if needed. Any pointers?


On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nf...@gmail.com> wrote:
> Thanks for the feedback, Jeff.
>
>> The logical format of input to KMeans is <Key, Vector> as it is in sequence
>> file format, but the Key is never used. To my knowledge, there is no
>> requirement to assign identifiers to the input points*. Users are free to
>> associate an arbitrary name field with each vector - also label mappings may
>> be assigned - but these are not manipulated by KMeans or any of the other
>> clustering applications. The name field is now used as a vector identifier
>> by the KMeansClusterMapper - if it is non-null - in the output step only.
>
> The key may not be used internally, but externally they can prove to
> be pretty useful. For me, keys are userIDs and each Vector represents
> his/her historical behavior. Being able to collect the output
> information as <UserID, ClusterID> is quite neat as it allows me to,
> for instance, retrieve user information using data directly from an
> HDFS file's field.
>

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
Thanks for the feedback, Jeff.

> The logical format of input to KMeans is <Key, Vector> as it is in sequence
> file format, but the Key is never used. To my knowledge, there is no
> requirement to assign identifiers to the input points*. Users are free to
> associate an arbitrary name field with each vector - also label mappings may
> be assigned - but these are not manipulated by KMeans or any of the other
> clustering applications. The name field is now used as a vector identifier
> by the KMeansClusterMapper - if it is non-null - in the output step only.

The key may not be used internally, but externally they can prove to
be pretty useful. For me, keys are userIDs and each Vector represents
his/her historical behavior. Being able to collect the output
information as <UserID, ClusterID> is quite neat as it allows me to,
for instance, retrieve user information using data directly from an
HDFS file's field.
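
In case it helps anyone, this is roughly how I read those pairs back
and join them to my data. It assumes both key and value come back as
Text on the current revision; fetchUser() and the User type are
hypothetical stand-ins for whatever lookup your own DB layer provides:

// Read <UserID, ClusterID> pairs from the points output
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

SequenceFile.Reader reader = new SequenceFile.Reader(fs,
	new Path("output/points/part-00000"), conf);

Text userId = new Text();
Text clusterId = new Text();
while (reader.next(userId, clusterId)) {
	User u = fetchUser(userId.toString());  // hypothetical DB lookup
	System.out.println(u + " belongs to cluster " + clusterId);
}
reader.close();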

Re: Clustering from DB

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
See inline comments:

nfantone wrote:
> After some research and testing, I believe I can throw some light on
> the subject. The runJob() static method defined in KMeansDriver
> expects three file paths, referencing three different files with
> different logical record formats; moreover, a "points" directory,
> along with other files, is created as part of the output:
>
> 1) input
>
> Description: A file containing data to be clustered, represented by Vectors.
> Path: An absolute path to an HDFS data file.  Example: "input/thedata.dat"
> Logical format: <ID, Vector>. The ID could be anything as long as it
> extends Writable.
>   
The logical format of input to KMeans is <Key, Vector> as it is in 
sequence file format, but the Key is never used. To my knowledge, there 
is no requirement to assign identifiers to the input points*. Users are 
free to associate an arbitrary name field with each vector - also label 
mappings may be assigned - but these are not manipulated by KMeans or 
any of the other clustering applications. The name field is now used as 
a vector identifier by the KMeansClusterMapper - if it is non-null - in 
the output step only.

*MeanShift could certainly benefit from a requirement that all input 
points have unique identifiers. Using the optional name field in this 
manner seems pretty kludgy to me.
> Code example (writing an input file):
>
> // Get FileSystem through Configuration
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
>
> // Instantiate writer to input data in a .dat file
> // with a <Text, SparseVector> logical format
> String fileName = "input/thedata.dat";
> Path path = new Path(fileName);
>
> SequenceFile.Writer seqVectorWriter = new SequenceFile.Writer(fs,
> conf, path, Text.class, SparseVector.class);
> VectorWriter writer = new SequenceFileVectorWriter(seqVectorWriter);
>
> // Write Vectors to file. inputVectors could be any VectorIterable
> implementation.
> writer.write(inputVectors);
> writer.close();
>
> 2) clustersIn
>
> Description: A file containing the initial pre-computed (or randomly
> selected) clusters to be used by kMeans. The 'k' value is determined
> by the number of clusters in THIS file.
> Path: An absolute path to a DIRECTORY containing any number of files
> with a "part-xxxxx" name format, where 'x' is a one digit number. The
> name should be omitted from the path. Example: "input/initial", where
> 'initial' has a "part-00000" file stored in it.
> Logical format: <ID, ClusterBase>. The ID could be anything as long as
> it extends Writable.
>   
Again, the sequence file format requires an ID but this is not used. 
Each cluster has an internal ID in its state which is used by the 
implementation. Typically, the ID is the same as the internal ID.
> Code example (writing a clustersIn file):
>
> // Get FileSystem through Configuration
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
>
> // Instantiate writer to input clusters in a file with a <Text,
> Cluster> logical format
> String fileName = "input/initial/part-00000";
> Path path = new Path(fileName);
>
> SequenceFile.Writer seqClusterWriter = new SequenceFile.Writer(fs,
> conf, path, Text.class, Cluster.class);
>
> // We choose 'k' random Vectors as centers for the initial clusters.
> // 'inputVectors' could be any VectorIterable implementation.
> // CANT_INITIAL_CLUSTERS is a desired integer value.
> // The identifier of a Cluster is used as its ID.
> // AFAICT, you DO NOT need to add the center as an actual point in the cluster,
> // after cluster creation. This has been corrected recently.
> int k = 0;
> Iterator it = inputVectors.iterator();
> while (it.hasNext() && k++ < CANT_INITIAL_CLUSTERS) {
> 	Vector v = (Vector)it.next();
> 	Cluster c = new Cluster(v);
> 	seqClusterWriter.append(new Text(c.getIdentifier()), c);
> }
> seqClusterWriter.close();
>
> 3) output
>
> Description: The output files generated by the algorithm, in which the
> results are stored. Directories named "clusters-i" ('i' being a
> positive integer) are created. I'm not quite certain, but I believe
> its nomenclature comes from the number of map/reduce tasks involved.
> "part-00000" files are placed in those directories - they hold records
> logically structured as <Text, Cluster>, each of which represents a
> particular cluster in the dataset.
>   
Each iteration produces a new set of clusters and these are stored in a 
"clusters-i" directory. The number of parts in each file is determined 
by the number of reducers used by the clustering implementation. Only 
KMeans and Dirichlet allow more than one reducer. Dirichlet and 
MeanShift put all these iteration-generated files in a separate state 
directory in the output path. The nomenclature of these directories is 
not standard and I see an improvement is needed.
> Path: An absolute path to a parent directory for the "clusters-i"
> directories. Example: "output".
> Code example (reading and printing an output file):
>
> // Get FileSystem through Configuration
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
>
> // Create a reader for a 'part-00000' file
> Path outPath = new Path("output/clusters-0/part-00000");
> SequenceFile.Reader reader  = new SequenceFile.Reader(fs, outPath, conf);
>
> Writable key =  (Writable) reader.getKeyClass().newInstance();
> Cluster value = new Cluster();
> Vector center = null;
>
> // Read file's records and print each cluster as 'Cluster: key {center}'
> while (reader.next(key, value)) {
> 	System.out.println("Cluster: " + key + " { ");
> 	center = value.getCenter();
>
> 	for (int i = 0; i < center.size(); i++) {
> 		System.out.print(center.get(i) + " ");
> 	}
> System.out.println(" }");
>
> 4) points
>
> Description: A directory containing a "part-00000" file with a
> <VectorID, ClusterID> (both being Text type fields). It's basically an
> index (with VectorID as key) that matches every Vector described in
> the input ("thedata.dat" in our example) with the cluster it now
> belongs to.
> Logical format: <VectorID, ClusterID>. VectorID matches the ID
> specified by the first field of each record in the input file.
> ClusterID matches the ID in the first field of each "part-xxxxx"
> included in a "clusters-i" directory.
>   
The output points format has recently been changed from <ClusterID,
Vector-asFormatString> to output either:
<Vector.name, ClusterID> or <Vector.asFormatString, ClusterID>, depending
on whether the points have been named.

The "TODO: This is ugly" comment in the Cluster code used for this 
kludge is spot on.
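
If you want the <name, ClusterID> flavor of that output, assign the
name when building your input vectors. A minimal sketch - it assumes
the name accessors are still on the Vector interface in your revision:

// Name each input vector so the output step emits <name, ClusterID>
int cardinality = 100;        // dimensionality of your feature space
String userId = "user-42";    // hypothetical ID pulled from a DB row
SparseVector v = new SparseVector(cardinality);
v.setName(userId);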
Jeff

Re: Clustering from DB

Posted by nfantone <nf...@gmail.com>.
After some research and testing, I believe I can throw some light on
the subject. The runJob() static method defined in KMeansDriver
expects three file paths, referencing three different files with
different logical record formats; moreover, a "points" directory,
along with other files, is created as part of the output:

1) input

Description: A file containing data to be clustered, represented by Vectors.
Path: An absolute path to an HDFS data file.  Example: "input/thedata.dat"
Logical format: <ID, Vector>. The ID could be anything as long as it
extends Writable.
Code example (writing an input file):

// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Instantiate writer to input data in a .dat file
// with a <Text, SparseVector> logical format
String fileName = "input/thedata.dat";
Path path = new Path(fileName);

SequenceFile.Writer seqVectorWriter = new SequenceFile.Writer(fs,
conf, path, Text.class, SparseVector.class);
VectorWriter writer = new SequenceFileVectorWriter(seqVectorWriter);

// Write Vectors to file. inputVectors could be any VectorIterable
implementation.
writer.write(inputVectors);
writer.close();

2) clustersIn

Description: A file containing the initial pre-computed (or randomly
selected) clusters to be used by kMeans. The 'k' value is determined
by the number of clusters in THIS file.
Path: An absolute path to a DIRECTORY containing any number of files
with a "part-xxxxx" name format, where 'x' is a one digit number. The
name should be omitted from the path. Example: "input/initial", where
'initial' has a "part-00000" file stored in it.
Logical format: <ID, ClusterBase>. The ID could be anything as long as
it extends Writable.
Code example (writing a clustersIn file):

// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Instantiate writer to input clusters in a file with a <Text,
Cluster> logical format
String fileName = "input/initial/part-00000";
Path path = new Path(fileName);

SequenceFile.Writer seqClusterWriter = new SequenceFile.Writer(fs,
conf, path, Text.class, Cluster.class);

// We choose 'k' random Vectors as centers for the initial clusters.
// 'inputVectors' could be any VectorIterable implementation.
// CANT_INITIAL_CLUSTERS is a desired integer value.
// The identifier of a Cluster is used as its ID.
// AFAICT, you DO NOT need to add the center as an actual point in the cluster,
// after cluster creation. This has been corrected recently.
int k = 0;
Iterator it = inputVectors.iterator();
while (it.hasNext() && k++ < CANT_INITIAL_CLUSTERS) {
	Vector v = (Vector)it.next();
	Cluster c = new Cluster(v);
	seqClusterWriter.append(new Text(c.getIdentifier()), c);
}
seqClusterWriter.close();

3) output

Description: The output files generated by the algorithm, in which the
results are stored. Directories named "clusters-i" ('i' being a
positive integer) are created. I'm not quite certain, but I believe
its nomenclature comes from the number of map/reduce tasks involved.
"part-00000" files are placed in those directories - they hold records
logically structured as <Text, Cluster>, each of which represents a
particular cluster in the dataset.
Path: An absolute path to a parent directory for the "clusters-i"
directories. Example: "output".
Code example (reading and printing an output file):

// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Create a reader for a 'part-00000' file
Path outPath = new Path("output/clusters-0/part-00000");
SequenceFile.Reader reader  = new SequenceFile.Reader(fs, outPath, conf);

Writable key =  (Writable) reader.getKeyClass().newInstance();
Cluster value = new Cluster();
Vector center = null;

// Read file's records and print each cluster as 'Cluster: key {center}'
while (reader.next(key, value)) {
	System.out.println("Cluster: " + key + " { ");
	center = value.getCenter();

	for (int i = 0; i < center.size(); i++) {
		System.out.print(center.get(i) + " ");
	}
System.out.println(" }");

4) points

Description: A directory containing a "part-00000" file with a
<VectorID, ClusterID> (both being Text type fields). It's basically an
index (with VectorID as key) that matches every Vector described in
the input ("thedata.dat" in our example) with the cluster it now
belongs to.
Logical format: <VectorID, ClusterID>. VectorID matches the ID
specified by the first field of each record in the input file.
ClusterID matches the ID in the first field of each "part-xxxxx"
included in a "clusters-i" directory.
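Code example (reading a points file; this sketch assumes both fields
come back as Text):

// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Create a reader for the points index
Path pointsPath = new Path("output/points/part-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, pointsPath, conf);

// Print every record as 'VectorID -> ClusterID'
Text vectorId = new Text();
Text clusterId = new Text();
while (reader.next(vectorId, clusterId)) {
	System.out.println(vectorId + " -> " + clusterId);
}
reader.close();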

That's that, for now. Surely, this is not error-proof and should be
revised and improved, but it could very well serve as a start for a
documentation page. Try/catch blocks were omitted for the sake of
clarity. Comments, suggestions and corrections are, obviously,
welcome.
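
For completeness, this is how the pieces fit together when kicking off
the driver itself. Treat the parameter list below as a sketch; I'm
writing it from memory, so check the actual runJob() signature in
KMeansDriver on your revision:

// Run kMeans over the input, seeded with the pre-computed clusters
KMeansDriver.runJob("input/thedata.dat",  // input vectors
	"input/initial",                  // clustersIn directory
	"output",                         // output parent directory
	"org.apache.mahout.utils.EuclideanDistanceMeasure",
	0.001,                            // convergence delta
	10,                               // max iterations
	1);                               // number of reduce tasks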

On Thu, Jul 2, 2009 at 12:32 AM, Grant Ingersoll<gs...@apache.org> wrote:
>
> On Jul 1, 2009, at 9:37 AM, nfantone wrote:
>
>> Ok, so I managed to write a VectorIterable implementation to draw data
>> from my database. Now, I'm in the process of understanding the output
>> file that kMeans (with a Canopy input) produces. Someone, please,
>> correct me if I'm mistaken. At first, my thought was that there were
>> as many "cluster-i" directories as clusters detected from the dataset
>> by the algorithm(s), until I printed out the content of the
>> "part-00000" file in them. It seems as though it stores a <Writable>
>> cluster ID and then a <Writable> Cluster, each line. Are those all the
>> actual clusters detected? If so, what's the reason behind the
>> directory nomenclature and its consecutive enumeration?
>
> I was wondering the same thing myself.  I believe it has to do with the
> number of iterations or reduce tasks, but I haven't looked closely at the
> code yet.  Maybe Jeff can jump in here.
>
>
>> Does every
>> "part-00000", in different "cluster-i" directories, hold different
>> clusters? And, what about the "points" directory? I can tell it
>> follows a <VectorID, Value> register format. What's that value
>> supposed to represent? The ID from the cluster it belongs, perhaps?
>
> I believe this is the case.
>
>>
>> There really ought to be documentation about this somewhere. I don't
>> know if I need some kind of permission, but I'm offering myself to
>> write it and upload it to the Mahout wiki or wherever it should be,
>> once I finished my project.
>>
>
> +1
>
>> Thanks in advance.
>>
>> On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen<sr...@gmail.com> wrote:
>>>
>>> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of
>>> exception since it has a core that is independent of Hadoop and can
>>> use data from files, databases, etc. It also happens to have some
>>> clustering logic. So you can use, say, TreeClusteringRecommender to
>>> generate user clusters, based on data in a database. This isn't
>>> Mahout's primary clustering support, but, if it fits what you need, at
>>> least it is there.
>>>
>>> On Fri, Jun 26, 2009 at 12:21 PM, nfantone<nf...@gmail.com> wrote:
>>>>
>>>> Thanks for the fast response, Grant.
>>>>
>>>> I am aware of what you pointed out about Taste. I just mentioned it to
>>>> make a reference to something similar to what I needed to
>>>> implement/use, namely the "DataModel" interface.
>>>>
>>>> I'm going to try the solution you suggested and write an
>>>> implementation of VectorIterable. Expect me to come back here for
>>>> feedback.
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Clustering from DB

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 1, 2009, at 9:37 AM, nfantone wrote:

> Ok, so I managed to write a VectorIterable implementation to draw data
> from my database. Now, I'm in the process of understanding the output
> file that kMeans (with a Canopy input) produces. Someone, please,
> correct me if I'm mistaken. At first, my thought was that there were
> as many "cluster-i" directories as clusters detected from the dataset
> by the algorithm(s), until I printed out the content of the
> "part-00000" file in them. It seems as though it stores a <Writable>
> cluster ID and then a <Writable> Cluster, each line. Are those all the
> actual clusters detected? If so, what's the reason behind the
> directory nomenclature and its consecutive enumeration?

I was wondering the same thing myself.  I believe it has to do with  
the number of iterations or reduce tasks, but I haven't looked closely  
at the code yet.  Maybe Jeff can jump in here.


> Does every
> "part-00000", in different "cluster-i" directories, hold different
> clusters? And, what about the "points" directory? I can tell it
> follows a <VectorID, Value> register format. What's that value
> supposed to represent? The ID from the cluster it belongs, perhaps?

I believe this is the case.

>
> There really ought to be documentation about this somewhere. I don't
> know if I need some kind of permission, but I'm offering myself to
> write it and upload it to the Mahout wiki or wherever it should be,
> once I finished my project.
>

+1

> Thanks in advance.
>
> On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen<sr...@gmail.com> wrote:
>> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of
>> exception since it has a core that is independent of Hadoop and can
>> use data from files, databases, etc. It also happens to have some
>> clustering logic. So you can use, say, TreeClusteringRecommender to
>> generate user clusters, based on data in a database. This isn't
>> Mahout's primary clustering support, but, if it fits what you need,  
>> at
>> least it is there.
>>
>> On Fri, Jun 26, 2009 at 12:21 PM, nfantone<nf...@gmail.com> wrote:
>>> Thanks for the fast response, Grant.
>>>
>>> I am aware of what you pointed out about Taste. I just mentioned  
>>> it to
>>> make a reference to something similar to what I needed to
>>> implement/use, namely the "DataModel" interface.
>>>
>>> I'm going to try the solution you suggested and write an
>>> implementation of VectorIterable. Expect me to come back here for
>>> feedback.
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search