
Posted to user@mahout.apache.org by Amit Kolhe <am...@techepoch.com> on 2010/07/12 12:41:14 UTC

How To get the Documents from generated Cluster

Hi All,

I tried the Reuters clustering example using bin/build-reuters.sh.

It ran successfully. My question is how I can get the list of documents
from the generated clusters.

The cluster dump step shows output like the following. Please help me
understand it.

:C-14565: [0:0.021, 00:0.071, 00.03:0.022, 00.13:0.024, 00.36:0.023,
00.45:0.023, 00.49:0.047, 00.80:

        Top Terms:
                vs                                      => 7.523020040478454
                loss                                    => 5.752297657262768
                oper                                    => 5.442479698123499
                net                                     => 5.263766102586645
                cts                                     => 5.031350276932608
                shr                                     => 4.535172744121599
                mln                                     => 3.9992122872350198
                qtr                                     => 3.726620754607078
                profit                                  => 3.653005141154945
                revs                                    => 3.6289278566086622
                dlrs                                    => 3.545725744977706
                note                                    => 3.507477366353763
                excludes                                => 2.5428448984544882
                includes                                => 2.4644159538019212
                avg                                     => 2.433302790452011
                shrs                                    => 2.4035651785900973
                gain                                    => 2.3975824745235874
                4th                                     => 2.3647150188609394
                mths                                    => 2.171140240782154
                year                                    => 2.14992541570207

As I understand it, clustering should produce small clusters of similar
documents or stories, as Google News does. If so, how can I display the
actual documents in each cluster?

Thanks and Regards,

Amit Kolhe

Re: How To get the Documents from generated Cluster

Posted by vanjor <va...@gmail.com>.

You may have to look inside the k-means job code to accomplish this.
After k-means has produced the k centroids, it runs the clustering step
over the TF-IDF vector file, which looks like this:

id : vectors
/10000001
{11133:0.33407269965183145,4179:0.6294628642719677,11147:0.47122968104183194,3428:0.5197254290056249}
/10000002
{2693:0.1914665765512973,1078:0.12018772808991451,12357:0.4096048435885022,3428:0.23087411002109504,1590:0.19757134430454687,4912:0.21339950481825154,1621:0.3342897624454898,4781:0.39371810276427055,11848:0.44143150752208343,10170:0.42616584472568675}
/10000003

The Mahout job fails to save the corresponding document id and writes each
point out as:

category        weight:vector
31770	1.0: [2187:0.324, 6168:0.592, 9571:0.445, 10840:0.507, 11032:0.299]


If you are running in Hadoop MapReduce mode, the relevant code is:

KMeansClusterMapper:

  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
    throws IOException, InterruptedException {
    // Input: the document key and its value (the vector).
    // Only the vector is passed on; the key is not.
    clusterer.outputPointWithClusterInfo(point.get(), clusters, context);
  }

which calls:

  public void outputPointWithClusterInfo(Vector vector,
                                         Iterable<Cluster> clusters,
                                         Mapper<?,?,IntWritable,WeightedVectorWritable>.Context context)
    throws IOException, InterruptedException {
    AbstractCluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    for (AbstractCluster cluster : clusters) {
      Vector clusterCenter = cluster.getCenter();
      double distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, vector);
      if (distance < nearestDistance || nearestCluster == null) {
        nearestCluster = cluster;
        nearestDistance = distance;
      }
    }
    context.write(new IntWritable(nearestCluster.getId()), new WeightedVectorWritable(1, vector));
  }

But the output does not record the document id. It records only the
category (cluster) the point belongs to and the point's vector.
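
To illustrate the point above, here is a minimal plain-Java sketch of the
same nearest-centroid assignment that keeps the document key alongside the
cluster id. It is not Mahout code: the class name, the helper methods, the
use of Map<Integer, Double> as a sparse vector, and the sample weights are
all assumptions made for illustration, and squared Euclidean distance is
used only because it is the default mentioned elsewhere in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration only; none of these names are Mahout classes. */
public class KeyedAssignmentSketch {

    /** Builds a sparse vector from alternating (term index, weight) values. */
    static Map<Integer, Double> vec(double... pairs) {
        Map<Integer, Double> v = new LinkedHashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            v.put((int) pairs[i], pairs[i + 1]);
        }
        return v;
    }

    /** Squared Euclidean distance between two sparse vectors. */
    static double squaredDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            double diff = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            sum += diff * diff;
        }
        // Terms that appear only in b contribute their full squared weight.
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue() * e.getValue();
            }
        }
        return sum;
    }

    /** Returns the id of the centroid closest to the given point. */
    static int nearestCluster(Map<Integer, Double> point,
                              Map<Integer, Map<Integer, Double>> centroids) {
        Integer nearestId = null;
        double nearestDistance = Double.MAX_VALUE;
        for (Map.Entry<Integer, Map<Integer, Double>> c : centroids.entrySet()) {
            double distance = squaredDistance(c.getValue(), point);
            if (nearestId == null || distance < nearestDistance) {
                nearestId = c.getKey();
                nearestDistance = distance;
            }
        }
        if (nearestId == null) {
            throw new IllegalArgumentException("no centroids given");
        }
        return nearestId;
    }

    /** Assigns every document to a cluster, keyed by document id so the id is not lost. */
    static Map<String, Integer> assign(Map<String, Map<Integer, Double>> documents,
                                       Map<Integer, Map<Integer, Double>> centroids) {
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, Map<Integer, Double>> doc : documents.entrySet()) {
            assignment.put(doc.getKey(), nearestCluster(doc.getValue(), centroids));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up centroids and document vectors; the ids only mimic the style used in this thread.
        Map<Integer, Map<Integer, Double>> centroids = new LinkedHashMap<>();
        centroids.put(0, vec(1, 1.0));
        centroids.put(1, vec(2, 1.0));

        Map<String, Map<Integer, Double>> documents = new LinkedHashMap<>();
        documents.put("/10000001", vec(1, 0.9, 2, 0.1));
        documents.put("/10000002", vec(2, 0.8));

        for (Map.Entry<String, Integer> e : assign(documents, centroids).entrySet()) {
            System.out.println(e.getKey() + " -> cluster " + e.getValue());
        }
    }
}
```

The only difference from the mapper logic quoted above is that the result
is keyed by the document id instead of dropping it.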

--
View this message in context: http://lucene.472066.n3.nabble.com/How-To-get-the-Documents-from-generated-Cluster-tp960031p3245288.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: How To get the Documents from generated Cluster

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

If you ran the kmeans clustering algorithm (the default) then you need
to add the -cl option to obtain the clustered documents in the
output/clusteredPoints directory. Run bin/mahout kmeans --help to see the
command line help:

  [--input <input> --clusters <clusters> --output <output>
   --distanceMeasure <distanceMeasure> --convergenceDelta <convergenceDelta>
   --maxIter <maxIter> --maxRed <maxRed> --k <k> --overwrite --help --clustering]
Options
  --input (-i) input                          Path to job input directory.
  --clusters (-c) clusters                    The input centroids, as Vectors.
                                              Must be a SequenceFile of
                                              Writable, Cluster/Canopy.  If k
                                              is also specified, then a random
                                              set of vectors will be selected
                                              and written out to this path
                                              first
  --output (-o) output                        The directory pathname for output.
  --distanceMeasure (-dm) distanceMeasure     The classname of the
                                              DistanceMeasure. Default is
                                              SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta   The convergence delta value.
                                              Default is 0.5
  --maxIter (-x) maxIter                      The maximum number of iterations.
  --maxRed (-r) maxRed                        The number of reduce tasks.
                                              Defaults to 2
  --k (-k) k                                  The k in k-Means.  If specified,
                                              then a random selection of k
                                              Vectors will be chosen as the
                                              Centroid and written to the
                                              clusters input path.
  --overwrite (-ow)                           If present, overwrite the output
                                              directory before running job
  --help (-h)                                 Print out help
  --clustering (-cl)                          If present, run clustering after
                                              the iterations have taken place

If you run LDA on your documents it has its own output mechanism, ldatopics.
Jeff
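
Once the clustered points are available, producing the document list per
cluster is a grouping step. The plain-Java sketch below shows only that
step, assuming each clustered point has already been reduced to a
(cluster id, document id) pair. The ClusteredPoint class, the method name,
and the sample ids are hypothetical; reading the actual files under
output/clusteredPoints and recovering a document id for each vector are
not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration only; none of these names are Mahout classes. */
public class ClusterDocumentListSketch {

    /** Hypothetical record: one clustered point reduced to (cluster id, document id). */
    static final class ClusteredPoint {
        final int clusterId;
        final String documentId;

        ClusteredPoint(int clusterId, String documentId) {
            this.clusterId = clusterId;
            this.documentId = documentId;
        }
    }

    /** Groups document ids by cluster id; clusters come back sorted by id. */
    static Map<Integer, List<String>> documentsByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<String>> byCluster = new TreeMap<>();
        for (ClusteredPoint p : points) {
            byCluster.computeIfAbsent(p.clusterId, id -> new ArrayList<>()).add(p.documentId);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Made-up sample pairs; real ones would come from the clustering output.
        List<ClusteredPoint> points = new ArrayList<>();
        points.add(new ClusteredPoint(31770, "/10000002"));
        points.add(new ClusteredPoint(14565, "/10000001"));
        points.add(new ClusteredPoint(31770, "/10000003"));

        for (Map.Entry<Integer, List<String>> e : documentsByCluster(points).entrySet()) {
            System.out.println("Cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

A TreeMap is used so that clusters print in a stable order; any map would
do if ordering does not matter.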

On 7/14/10 7:41 AM, Grant Ingersoll wrote:
> How did you run the command?  Also, how did you run your clustering?  I'm not looking at the output at the moment, but I believe there is a points directory created that contains the results.  I believe it is all covered on the Wiki under the algorithms->clustering section.
>
> -Grant

Re: How To get the Documents from generated Cluster

Posted by Grant Ingersoll <gs...@apache.org>.

How did you run the command?  Also, how did you run your clustering?  I'm not looking at the output at the moment, but I believe there is a points directory created that contains the results.  I believe it is all covered on the Wiki under the algorithms->clustering section.

-Grant
