Posted to user@mahout.apache.org by Amit Kolhe <am...@techepoch.com> on 2010/07/12 12:41:14 UTC
How To get the Documents from generated Cluster
Hi All,
I tried the Reuters clustering example using bin/build-reuters.sh.
It ran successfully. My question now is how I can get the list of documents
in each generated cluster.
The cluster dump step shows output like the one below. Please help me
understand it.
:C-14565: [0:0.021, 00:0.071, 00.03:0.022, 00.13:0.024, 00.36:0.023,
00.45:0.023, 00.49:0.047, 00.80:
Top Terms:
    vs       => 7.523020040478454
    loss     => 5.752297657262768
    oper     => 5.442479698123499
    net      => 5.263766102586645
    cts      => 5.031350276932608
    shr      => 4.535172744121599
    mln      => 3.9992122872350198
    qtr      => 3.726620754607078
    profit   => 3.653005141154945
    revs     => 3.6289278566086622
    dlrs     => 3.545725744977706
    note     => 3.507477366353763
    excludes => 2.5428448984544882
    includes => 2.4644159538019212
    avg      => 2.433302790452011
    shrs     => 2.4035651785900973
    gain     => 2.3975824745235874
    4th      => 2.3647150188609394
    mths     => 2.171140240782154
    year     => 2.14992541570207
My understanding is that clustering groups similar documents or stories into
small clusters, as Google News does. If so, how do I display the actual
documents in each cluster?
Thanks and Regards,
Amit Kolhe
Re: How To get the Documents from generated Cluster
Posted by vanjor <va...@gmail.com>.
You may have to dig into the k-means job code to accomplish this,
because after k-means has found the k center points, the clustering step runs
over the TF-IDF vector file, which looks like this:
id : vectors
/10000001
{11133:0.33407269965183145,4179:0.6294628642719677,11147:0.47122968104183194,3428:0.5197254290056249}
/10000002
{2693:0.1914665765512973,1078:0.12018772808991451,12357:0.4096048435885022,3428:0.23087411002109504,1590:0.19757134430454687,4912:0.21339950481825154,1621:0.3342897624454898,4781:0.39371810276427055,11848:0.44143150752208343,10170:0.42616584472568675}
/10000003
The Mahout job does not save the corresponding document id. It parses each
entry into the form category weight: vector, for example:
31770 1.0: [2187:0.324, 6168:0.592, 9571:0.445, 10840:0.507, 11032:0.299]
If you are running in Hadoop MapReduce mode, the relevant code is:
KMeansClusterMapper:

    @Override
    protected void map(WritableComparable<?> key, VectorWritable point,
                       Context context)
        throws IOException, InterruptedException {
      // Input: the document key and its value (the vector).
      // Only the vector is passed on; the key is not.
      clusterer.outputPointWithClusterInfo(point.get(), clusters, context);
    }

which calls:

    public void outputPointWithClusterInfo(Vector vector,
                                           Iterable<Cluster> clusters,
                                           Mapper<?,?,IntWritable,WeightedVectorWritable>.Context context)
        throws IOException, InterruptedException {
      AbstractCluster nearestCluster = null;
      double nearestDistance = Double.MAX_VALUE;
      // Find the cluster whose center is closest to this vector.
      for (AbstractCluster cluster : clusters) {
        Vector clusterCenter = cluster.getCenter();
        double distance = measure.distance(clusterCenter.getLengthSquared(),
                                           clusterCenter, vector);
        if (distance < nearestDistance || nearestCluster == null) {
          nearestCluster = cluster;
          nearestDistance = distance;
        }
      }
      // Emit (cluster id, weighted vector); no document id is written.
      context.write(new IntWritable(nearestCluster.getId()),
                    new WeightedVectorWritable(1, vector));
    }
But the output does not record the document id. It records only the category
the vector belongs to and the vector's value.
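The idea of keeping the document id next to the nearest-cluster result can be sketched in plain Java. This is only an illustration under stated assumptions, not Mahout code: the class NearestClusterSketch, its methods, the dense double[] vectors, and the sample centroids are all hypothetical, and squared Euclidean distance is used here simply because it is the default named in the kmeans help text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch in plain Java; none of these names come from Mahout.
class NearestClusterSketch {

    // Squared Euclidean distance between two dense vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, List<double[]> centroids) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double distance = squaredDistance(point, centroids.get(i));
            if (best < 0 || distance < bestDistance) {
                best = i;
                bestDistance = distance;
            }
        }
        return best;
    }

    // Maps each document id to its nearest centroid index,
    // so the id is carried through instead of being dropped.
    static Map<String, Integer> assign(Map<String, double[]> docs,
                                       List<double[]> centroids) {
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : docs.entrySet()) {
            assignment.put(e.getKey(), nearest(e.getValue(), centroids));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up two-dimensional vectors keyed by document id.
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("/10000001", new double[] {0.9, 0.1});
        docs.put("/10000002", new double[] {0.2, 0.8});
        List<double[]> centroids =
            List.of(new double[] {1.0, 0.0}, new double[] {0.0, 1.0});
        System.out.println(assign(docs, centroids));
    }
}
```

With the made-up data in main, /10000001 lands on centroid 0 and /10000002 on centroid 1. The only point of the sketch is that the key travels with the result; in a MapReduce job the equivalent would be to write the document key alongside the cluster id.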
Re: How To get the Documents from generated Cluster
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
If you ran the k-means clustering algorithm (the default), then you need
to add the -cl option to obtain the clustered documents in the
output/clusteredPoints directory. Run bin/mahout kmeans to see the
command line help:
[--input <input> --clusters <clusters> --output <output>
 --distanceMeasure <distanceMeasure> --convergenceDelta <convergenceDelta>
 --maxIter <maxIter> --maxRed <maxRed> --k <k> --overwrite --help --clustering]
Options
  --input (-i) input                          Path to job input directory.
  --clusters (-c) clusters                    The input centroids, as Vectors.
                                              Must be a SequenceFile of
                                              Writable, Cluster/Canopy. If k is
                                              also specified, then a random set
                                              of vectors will be selected and
                                              written out to this path first
  --output (-o) output                        The directory pathname for output.
  --distanceMeasure (-dm) distanceMeasure     The classname of the
                                              DistanceMeasure. Default is
                                              SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta   The convergence delta value.
                                              Default is 0.5
  --maxIter (-x) maxIter                      The maximum number of iterations.
  --maxRed (-r) maxRed                        The number of reduce tasks.
                                              Defaults to 2
  --k (-k) k                                  The k in k-Means. If specified,
                                              then a random selection of k
                                              Vectors will be chosen as the
                                              Centroid and written to the
                                              clusters input path.
  --overwrite (-ow)                           If present, overwrite the output
                                              directory before running job
  --help (-h)                                 Print out help
  --clustering (-cl)                          If present, run clustering after
                                              the iterations have taken place
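Once clustering has been run with -cl, listing the documents in each cluster is essentially a group-by on the cluster id. A minimal sketch in plain Java follows, assuming the (cluster id, document id) pairs have already been read from output/clusteredPoints. The Assignment class, the groupByCluster method, and the sample ids are hypothetical, and the step that reads Mahout's SequenceFiles is left out on purpose because it depends on the Mahout version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch in plain Java; none of these names come from Mahout.
class ClusterMembersSketch {

    // One clustered point: the cluster it was assigned to and its document id.
    static final class Assignment {
        final int clusterId;
        final String documentId;

        Assignment(int clusterId, String documentId) {
            this.clusterId = clusterId;
            this.documentId = documentId;
        }
    }

    // Groups document ids by cluster id; clusters come out sorted by id.
    static Map<Integer, List<String>> groupByCluster(List<Assignment> assignments) {
        Map<Integer, List<String>> members = new TreeMap<>();
        for (Assignment a : assignments) {
            members.computeIfAbsent(a.clusterId, k -> new ArrayList<>())
                   .add(a.documentId);
        }
        return members;
    }

    public static void main(String[] args) {
        // Made-up assignments standing in for records read from clusteredPoints.
        List<Assignment> assignments = new ArrayList<>();
        assignments.add(new Assignment(14565, "/10000001"));
        assignments.add(new Assignment(31770, "/10000002"));
        assignments.add(new Assignment(14565, "/10000003"));

        for (Map.Entry<Integer, List<String>> e : groupByCluster(assignments).entrySet()) {
            System.out.println("Cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

The grouped map can then be joined against the original documents by id to display the actual stories in each cluster.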
If you run LDA on your documents it has its own output mechanism, ldatopics.
Jeff
On 7/14/10 7:41 AM, Grant Ingersoll wrote:
> How did you run the command? Also, how did you run your clustering? I believe (I'm not looking at the output at the moment) that a points directory is created and that it contains the results. I believe it is all captured on the Wiki under the algorithms->clustering section.
>
> -Grant
Re: How To get the Documents from generated Cluster
Posted by Grant Ingersoll <gs...@apache.org>.
How did you run the command? Also, how did you run your clustering? I believe (I'm not looking at the output at the moment) that a points directory is created and that it contains the results. I believe it is all captured on the Wiki under the algorithms->clustering section.
-Grant