You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Lijie Xu <cs...@gmail.com> on 2011/12/16 06:10:13 UTC

Random Selection Algorithm Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator

Hi, I'm now reading the source code of 
"org.apache.mahout.clustering.kmeans.RandomSeedGenerator".
There may be a problem in function "buildRandom" which aims to select 
the random k centroid vectors from streaming records.

I'm wondering whether this algorithm is correct and I think the right 
algorithm is as follows:

To select the k elements from streaming resource, we put the first k 
elements (i=1,2,...,k) into the buffer.
When the /i/th (i > k) element comes, we want to keep it with the 
probability of k/i.
If the /i/th element is selected, then we can randomly delete an element 
from the buffer and add the /i/th element into the buffer.

So I think the code with red color is doubtable. If I'm wrong, please 
tell me. Thx!

for (Pair<Writable,VectorWritable> record
              : new 
SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(), 
true, conf)) {
           Writable key = record.getFirst();
           VectorWritable value = record.getSecond();
           Cluster newCluster = new Cluster(value.get(), 
nextClusterId++, measure);
           newCluster.observe(value.get(), 1);
           Text newText = new Text(key.toString());
           int currentSize = chosenTexts.size();
           if (currentSize < k) {
             chosenTexts.add(newText);
             chosenClusters.add(newCluster);
           } else if (random.nextInt(currentSize + 1) == 0) { // with 
chance 1/(currentSize+1) pick new element
             int indexToRemove = random.nextInt(currentSize); // evict 
one chosen randomly
             chosenTexts.remove(indexToRemove);
             chosenClusters.remove(indexToRemove);
             chosenTexts.add(newText);
             chosenClusters.add(newCluster);
           }
         }



Re: Random Selection Algorithm Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator

Posted by Ted Dunning <te...@gmail.com>.
Lijie,

You are correct.  This code is in error.  The mailing list lost your
coloring, but your point is still there.

I think that the code should be this instead.  Ironically, the comment in
the original code describes what the code does accurately.

int itemsSeenSoFar = 0;
for (Pair<Writable,VectorWritable> record
 : new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(),
true, conf)) {
    itemsSeenSoFar++;
    Writable key = record.getFirst();
    VectorWritable value = record.getSecond();
    Cluster newCluster = new Cluster(value.get(), nextClusterId++, measure);
    newCluster.observe(value.get(), 1);
    Text newText = new Text(key.toString());
    int currentSize = chosenTexts.size();
    if (currentSize < k) {
chosenTexts.add(newText);
chosenClusters.add(newCluster);
    } else {
int itemToReplace = random.nextInt(itemsSeenSoFar);
chosenTexts.set(itemToReplace, newText);
 chosenClusters.set(itemToReplace, newCluster);
    }
}

On Thu, Dec 15, 2011 at 9:10 PM, Lijie Xu <cs...@gmail.com> wrote:

> Hi, I'm now reading the source code of "org.apache.mahout.clustering.**
> kmeans.RandomSeedGenerator".
> There may be a problem in function "buildRandom" which aims to select the
> random k centroid vectors from streaming records.
>
> I'm wondering whether this algorithm is correct and I think the right
> algorithm is as follows:
>
> To select the k elements from streaming resource, we put the first k
> elements (i=1,2,...,k) into the buffer.
> When the /i/th (i > k) element comes, we want to keep it with the
> probability of k/i.
> If the /i/th element is selected, then we can randomly delete an element
> from the buffer and add the /i/th element into the buffer.
>
> So I think the code with red color is doubtable. If I'm wrong, please tell
> me. Thx!
>
> for (Pair<Writable,VectorWritable> record
>             : new SequenceFileIterable<Writable,**
> VectorWritable>(fileStatus.**getPath(), true, conf)) {
>          Writable key = record.getFirst();
>          VectorWritable value = record.getSecond();
>          Cluster newCluster = new Cluster(value.get(), nextClusterId++,
> measure);
>          newCluster.observe(value.get()**, 1);
>          Text newText = new Text(key.toString());
>          int currentSize = chosenTexts.size();
>          if (currentSize < k) {
>            chosenTexts.add(newText);
>            chosenClusters.add(newCluster)**;
>          } else if (random.nextInt(currentSize + 1) == 0) { // with chance
> 1/(currentSize+1) pick new element
>            int indexToRemove = random.nextInt(currentSize); // evict one
> chosen randomly
>            chosenTexts.remove(**indexToRemove);
>            chosenClusters.remove(**indexToRemove);
>            chosenTexts.add(newText);
>            chosenClusters.add(newCluster)**;
>          }
>        }
>
>
>