Posted to user@mahout.apache.org by edward choi <mp...@gmail.com> on 2011/11/03 15:25:05 UTC
Re: Dirichlet Process Clustering not working

Nope. As I said earlier, negative term weights show up in both my
document set result and the Reuters set result.
Clusters 16, 18 and 19 have the exact same negative term weights in both cases.
The Reuters set was made into vectors with the following options:
--minDF 2 --maxDFPercent 50 --weight TFIDF --norm 2 -ng 2 -nv

Regards,
Ed

2011/11/1 Jeff Eastman <je...@narus.com>

> The trick was switching to the distance measure model. The default
> Gaussian model was doing complicated math for each point and for each of
> the thousands of dimensions and for each cluster. Then it multiplied all
> the term pdfs together and underflowed! 100x improvement seems about right.
> Glad it is working so well. I figured it could be coaxed to do so. I'm
> still concerned about your negative term weights. Is this coming from your
> dataset?
>
> -----Original Message-----
> From: edward choi [mailto:mp2893@gmail.com]
> Sent: Thursday, October 27, 2011 9:57 AM
> To: user@mahout.apache.org
> Subject: Re: Dirichlet Process Clustering not working
>
> I downloaded the most recent version of Mahout from apache SVN.
> Using the new arguments, I have tested DPC on my own news documents. (not
> reuters set)
>
> Turns out it really has improved a lot. First of all, the documents are
> now somewhat distributed across the 20 clusters.
> The total number of documents was 5896.
> DC-0 had 1014 documents. DC-1 had 4305 documents.
> Nine clusters had zero documents. The rest of the clusters had from 1 to 214
> documents each.
>
> The quality of the clusters wasn't great, but I guess that has to do
> with the crude preprocessing step. (Raw news documents have links, ads,
> reader comments, etc.)
> I will know better when I test with build-reuters.sh
>
> One more thing. Unfortunately there are still some negative values in the
> cluster points.
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> DC-16 total= 0 model= DMC:16{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
>    Top Terms:
>        kodak camera                            =>  4.5009259007672835
>        player july                             =>   4.216287519075373
>        figure mix                              =>   4.139826527167421
>        department defense                      =>   4.009974576583582
>        remark wednesday                        =>  3.9945681051149564
>        counsel infection                       =>   3.886000915158471
>        jefferson county                        =>  3.8442975919513667
>        jersey say                              =>  3.7821696224124786
>        tell couple                             =>  3.7644857721992415
>        3.5 million                             =>   3.743525174300145
> DC-18 total= 0 model= DMC:18{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
>    Top Terms:
>        kodak camera                            =>  4.5009259007672835
>        player july                             =>   4.216287519075373
>        figure mix                              =>   4.139826527167421
>        department defense                      =>   4.009974576583582
>        remark wednesday                        =>  3.9945681051149564
>        counsel infection                       =>   3.886000915158471
>        jefferson county                        =>  3.8442975919513667
>        jersey say                              =>  3.7821696224124786
>        tell couple                             =>  3.7644857721992415
>        3.5 million                             =>   3.743525174300145
> DC-19 total= 0 model= DMC:19{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
>    Top Terms:
>        kodak camera                            =>  4.5009259007672835
>        player july                             =>   4.216287519075373
>        figure mix                              =>   4.139826527167421
>        department defense                      =>   4.009974576583582
>        remark wednesday                        =>  3.9945681051149564
>        counsel infection                       =>   3.886000915158471
>        jefferson county                        =>  3.8442975919513667
>        jersey say                              =>  3.7821696224124786
>        tell couple                             =>  3.7644857721992415
>        3.5 million                             =>   3.743525174300145
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Among the nine clusters that have zero members, the three above have
> negative values.
> Interestingly, all three of them have the exact same values and top terms.
> I wonder what this means.
>
> Anyway I'll post another thread when I have played around with Reuters set.
>
> Ed
>
> ps. The runtime has indeed reduced significantly!!! Possibly 100 times
> faster as you said. Loved it!!
>
> 2011/10/20 Jeff Eastman <je...@narus.com>
>
> > R1186452 commits two small changes that seem to do much better with
> > Reuters than before:
> > - fixed DistanceMeasureClusterDistribution to generate Gaussian element
> >   values in the prior clusters. Zero values in previous implementation
> >   don't work with CosineDistanceMeasure.
> > - changed Dirichlet arguments to use DMCD and CosineDM in build-reuters.sh
> > - switched -mp to DenseVector since all the prior center elements are
> >   Gaussian and generally non-zero
> > - increased -a0 to 2
> >
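To illustrate why all-zero prior centers do not work with a cosine-based measure: the textbook formula divides by the product of the two vector norms, so a zero vector yields 0/0. The following is a minimal sketch in plain Java under that assumption. It is not Mahout's CosineDistanceMeasure, which may treat this corner case differently, and the class and method names (CosineZeroSketch, cosineDistance, gaussianCenter) are hypothetical.

```java
import java.util.Random;

// Hypothetical standalone sketch; this is not Mahout's CosineDistanceMeasure.
final class CosineZeroSketch {

    /** Naive cosine distance: 1 - (a . b) / (|a| * |b|). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // If either vector is all zeros this is 0.0 / 0.0, which is NaN.
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Prior center with standard-normal elements (mean 0, sd 1 assumed). */
    static double[] gaussianCenter(int size, long seed) {
        Random random = new Random(seed);
        double[] center = new double[size];
        for (int i = 0; i < size; i++) {
            center[i] = random.nextGaussian();
        }
        return center;
    }

    public static void main(String[] args) {
        double[] point = {0.0, 1.5, 0.0, 2.0};
        // All-zero prior center: the distance is undefined (NaN).
        System.out.println(cosineDistance(new double[point.length], point));
        // Gaussian prior center: the distance is a finite value in [0, 2].
        System.out.println(cosineDistance(gaussianCenter(point.length, 42L), point));
    }
}
```

Centers drawn this way are non-zero in nearly every element, which fits the switch of -mp to DenseVector. Each element is also negative about half the time.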
> > Build-reuters now does a much better job with the wide topic vectors
> > using the DMCD/CosineDM. And it runs maybe 100x faster too. Here are the
> > new arguments:
> >
> >  $MAHOUT dirichlet \
> >    -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
> >    -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
> >    -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
> >    -mp org.apache.mahout.math.DenseVector \
> >    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> >
> >
> > -----Original Message-----
> > From: Jeff Eastman [mailto:jeastman@Narus.com]
> > Sent: Wednesday, October 19, 2011 9:53 AM
> > To: user@mahout.apache.org
> > Subject: RE: Dirichlet Process Clustering not working
> >
> > The pdf() implementation in GaussianCluster is pretty lame. It is
> > computing a running product of the element pdfs which, for wide input
> > vectors (Reuters is 41,807), always underflows and returns 0. Here's the
> > code:
> >
> >  public double pdf(VectorWritable vw) {
> >    Vector x = vw.get();
> >    // return the product of the component pdfs
> >    // TODO: is this reasonable? correct? It seems to work in some cases.
> >    double pdf = 1;
> >    for (int i = 0; i < x.size(); i++) {
> >      // small prior on stdDev to avoid numeric instability when stdDev==0
> >      pdf *= UncommonDistributions.dNorm(x.getQuick(i),
> >          getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
> >    }
> >    return pdf;
> >  }
> >
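One common way to avoid this kind of underflow is to sum log-densities instead of multiplying densities. The sketch below is plain Java with no Mahout dependency and is not the fix that was committed (the thread switched to a distance-measure model instead). The names LogPdfSketch, normalPdf, normalLogPdf, productPdf and logPdf are hypothetical, and the 1e-6 offset simply mirrors the small stdDev prior in the quoted method.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not Mahout code.
final class LogPdfSketch {

    /** Density of a normal distribution with the given mean and sd at x. */
    static double normalPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Log of the same density, computed without calling exp(). */
    static double normalLogPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    /** Running product of element pdfs, as in the quoted pdf() method. */
    static double productPdf(double[] x, double[] center, double[] radius) {
        double pdf = 1.0;
        for (int i = 0; i < x.length; i++) {
            pdf *= normalPdf(x[i], center[i], radius[i] + 1e-6);
        }
        return pdf;
    }

    /** Sum of element log-pdfs; stays finite for very wide vectors. */
    static double logPdf(double[] x, double[] center, double[] radius) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += normalLogPdf(x[i], center[i], radius[i] + 1e-6);
        }
        return sum;
    }

    public static void main(String[] args) {
        int width = 41807; // vector width mentioned for Reuters in the thread
        double[] x = new double[width];      // point sitting exactly on the center
        double[] center = new double[width];
        double[] radius = new double[width];
        Arrays.fill(radius, 1.0);
        // Each factor is below 1, so the product shrinks until it underflows.
        System.out.println("product pdf = " + productPdf(x, center, radius));
        // The log-space value is a large negative but finite number.
        System.out.println("log pdf     = " + logPdf(x, center, radius));
    }
}
```

Log-densities from different clusters can be compared directly. If normalized probabilities are needed, subtracting the largest log value before exponentiating keeps the arithmetic in range.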
> > -----Original Message-----
> > From: Jeff Eastman [mailto:jeastman@Narus.com]
> > Sent: Wednesday, October 19, 2011 9:04 AM
> > To: user@mahout.apache.org
> > Subject: RE: Dirichlet Process Clustering not working
> >
> > I agree something is amiss here, but it could be the model is just not
> > suitable for this problem. Running with the Reuters dataset, I see all
> > the points being assigned to C-0 in the very first iteration as you do. I
> > think the problem is with the pdf() calculations in the mapper for very
> > wide vectors such as we are using. For smaller dimension vectors, DPC
> > appears to be working great.
> >
> > I'm going to commit the build-reuters.sh enhancements I've added for
> > FuzzyK and DPC so we can both use the same platform. I will report more
> > progress as I dig in deeper today...
> >
> > -----Original Message-----
> > From: edward choi [mailto:mp2893@gmail.com]
> > Sent: Wednesday, October 19, 2011 8:11 AM
> > To: user@mahout.apache.org
> > Subject: Re: Dirichlet Process Clustering not working
> >
> > Okay, I've just tried DPC with the Reuters document set.
> > I let 'build-reuters.sh' create the sequence files and vectors. (Judging
> > from the dictionary generated by Mahout, the number of features
> > seemed to be less than 100,000.)
> > Then I used them to do DPC. (15 clusters, 10 iterations, 1.0 alpha,
> > clustering true, no additional options)
> > Below is the result of the clusterdump of clusters-10:
> >
> >
> > ----------------------------------------------------------------------------------------------------------------------------
> > C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
> > 0.05:0.004, 0.07:0.005, 0.07
> >    Top Terms:
> >        said                                    =>  1.6577128281476725
> >        mln                                     =>  1.2455441154347937
> >        dlrs                                    =>  1.1173752482257673
> >        3                                       =>   1.042824193090437
> >        pct                                     =>  1.0223684722334667
> >        reuter                                  =>  0.9934255143959358
> > C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
> > 0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
> >    Top Terms:....
> > C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
> > 0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
> >    Top Terms:....
> > C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
> > 0.05:-0.343, 0.07:0.286, 0.077:1.179,
> >    Top Terms:....
> >
> >
> > ----------------------------------------------------------------------------------------------------------------------------
> > I guess the same thing happened again. So the document set is not the
> > problem. Something is definitely wrong with DPC.
> > The interesting thing is that the first cluster point does not have a
> > single negative value in it.
> > The rest of the cluster points have a lot of negative values. So I guess
> > this phenomenon has something to do with the first cluster hogging all the
> > documents.
> > Any comments on this result?
> > (I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post
> > another thread when I am done with that.)
> >
> > Regards,
> > Ed
> >