You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Lijie Xu <cs...@gmail.com> on 2011/12/16 06:10:13 UTC
Random Selection Algorithm Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator
Hi, I'm now reading the source code of
"org.apache.mahout.clustering.kmeans.RandomSeedGenerator".
There may be a problem in function "buildRandom" which aims to select
the random k centroid vectors from streaming records.
I'm wondering whether this algorithm is correct and I think the right
algorithm is as follows:
To select the k elements from streaming resource, we put the first k
elements (i=1,2,...,k) into the buffer.
When the /i/th (i > k) element comes, we want to keep it with the
probability of k/i.
If the /i/th element is selected, then we can randomly delete an element
from the buffer and add the /i/th element into the buffer.
So I think the code with red color is doubtable. If I'm wrong, please
tell me. Thx!
for (Pair<Writable,VectorWritable> record
: new
SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(),
true, conf)) {
Writable key = record.getFirst();
VectorWritable value = record.getSecond();
Cluster newCluster = new Cluster(value.get(),
nextClusterId++, measure);
newCluster.observe(value.get(), 1);
Text newText = new Text(key.toString());
int currentSize = chosenTexts.size();
if (currentSize < k) {
chosenTexts.add(newText);
chosenClusters.add(newCluster);
} else if (random.nextInt(currentSize + 1) == 0) { // with
chance 1/(currentSize+1) pick new element
int indexToRemove = random.nextInt(currentSize); // evict
one chosen randomly
chosenTexts.remove(indexToRemove);
chosenClusters.remove(indexToRemove);
chosenTexts.add(newText);
chosenClusters.add(newCluster);
}
}
Re: Random Selection Algorithm Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator
Posted by Ted Dunning <te...@gmail.com>.
Lijie,
You are correct. This code is in error. The mailing list lost your
coloring, but your point is still there.
I think that the code should be this instead. Ironically, the comment in
the original code describes what the code does accurately.
int itemsSeenSoFar = 0;
for (Pair<Writable,VectorWritable> record
: new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(),
true, conf)) {
itemsSeenSoFar++;
Writable key = record.getFirst();
VectorWritable value = record.getSecond();
Cluster newCluster = new Cluster(value.get(), nextClusterId++, measure);
newCluster.observe(value.get(), 1);
Text newText = new Text(key.toString());
int currentSize = chosenTexts.size();
if (currentSize < k) {
chosenTexts.add(newText);
chosenClusters.add(newCluster);
} else {
int itemToReplace = random.nextInt(itemsSeenSoFar);
chosenTexts.set(itemToReplace, newText);
chosenClusters.set(itemToReplace, newCluster);
}
}
On Thu, Dec 15, 2011 at 9:10 PM, Lijie Xu <cs...@gmail.com> wrote:
> Hi, I'm now reading the source code of "org.apache.mahout.clustering.**
> kmeans.RandomSeedGenerator".
> There may be a problem in function "buildRandom" which aims to select the
> random k centroid vectors from streaming records.
>
> I'm wondering whether this algorithm is correct and I think the right
> algorithm is as follows:
>
> To select the k elements from streaming resource, we put the first k
> elements (i=1,2,...,k) into the buffer.
> When the /i/th (i > k) element comes, we want to keep it with the
> probability of k/i.
> If the /i/th element is selected, then we can randomly delete an element
> from the buffer and add the /i/th element into the buffer.
>
> So I think the code with red color is doubtable. If I'm wrong, please tell
> me. Thx!
>
> for (Pair<Writable,VectorWritable> record
> : new SequenceFileIterable<Writable,**
> VectorWritable>(fileStatus.**getPath(), true, conf)) {
> Writable key = record.getFirst();
> VectorWritable value = record.getSecond();
> Cluster newCluster = new Cluster(value.get(), nextClusterId++,
> measure);
> newCluster.observe(value.get()**, 1);
> Text newText = new Text(key.toString());
> int currentSize = chosenTexts.size();
> if (currentSize < k) {
> chosenTexts.add(newText);
> chosenClusters.add(newCluster)**;
> } else if (random.nextInt(currentSize + 1) == 0) { // with chance
> 1/(currentSize+1) pick new element
> int indexToRemove = random.nextInt(currentSize); // evict one
> chosen randomly
> chosenTexts.remove(**indexToRemove);
> chosenClusters.remove(**indexToRemove);
> chosenTexts.add(newText);
> chosenClusters.add(newCluster)**;
> }
> }
>
>
>