Posted to dev@mahout.apache.org by "Pat Ferrel (JIRA)" <ji...@apache.org> on 2012/06/05 21:23:23 UTC

[jira] [Commented] (MAHOUT-1028) seq2sparse n-gram weighting creates malformed vectors which crashes kmeans

    [ https://issues.apache.org/jira/browse/MAHOUT-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289654#comment-13289654 ] 

Pat Ferrel commented on MAHOUT-1028:
------------------------------------

The command line that creates the malformed vector is:

mahout seq2sparse     -i b3/seqfiles/     -o b3/vectors/     -ow -chunk 2000     -x 40     -seq     -n 2     -nv     -ng 2     -ml 2000

When you run the same command without -ng 2 and -ml 2000:

mahout seq2sparse     -i b3/seqfiles/     -o b3/vectors/     -ow -chunk 2000     -x 40     -seq     -n 2     -nv

the vector looks fine and clustering doesn't die. I'm attaching the seqdumper output for the part file that contains the document for this vector.
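
For context, the failure chain described in the quoted thread below (an empty vector, then all-NaN probabilities, then a max index of -1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mahout's code: the pdf form 1/(1+distance) is inferred from the two values quoted in the thread (distance 0 gives 1, distance 1 gives 0.5), the max-index scan that starts at -1 is an assumption consistent with the reported exception, and the three cluster centers are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between sparse vectors stored as {index: weight} dicts."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    if norm == 0.0:
        # Java evaluates 0.0 / 0.0 to NaN; Python raises instead, so return NaN explicitly.
        return float("nan")
    return 1.0 - dot / norm

def pdf(distance):
    # Assumed form, chosen to match the thread: distance 0 -> 1.0, distance 1 -> 0.5.
    return 1.0 / (1.0 + distance)

def max_value_index(values):
    # Assumed scan: start at -1 and update only on a strictly greater value.
    # Every comparison against NaN is false, so an all-NaN input leaves -1.
    best_index, best_value = -1, -math.inf
    for i, v in enumerate(values):
        if v > best_value:
            best_index, best_value = i, v
    return best_index

# Hypothetical cluster centers, one per term index of the good vector.
centers = [{701: 1.0}, {1876: 1.0}, {3620: 1.0}]
# The vector produced without n-grams, copied from the issue description.
good = {701: 0.5484552974788475, 1876: 0.6020428878306935, 3620: 0.5802940184767269}
# What the truncated "{" vector appears to contain: no elements at all.
empty = {}

good_probs = [pdf(cosine_distance(good, c)) for c in centers]
empty_probs = [pdf(cosine_distance(empty, c)) for c in centers]

print(max_value_index(good_probs))   # 1
print(max_value_index(empty_probs))  # -1
```

Under these assumptions, a single empty input vector is enough to produce the "Index -1 is outside allowable range" failure, which is why checking the seq2sparse output for vectors with no elements is a reasonable first step.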
                
> seq2sparse n-gram weighting creates malformed vectors which crashes kmeans
> --------------------------------------------------------------------------
>
>                 Key: MAHOUT-1028
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1028
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: using trunk snapshot about June 1. 
>            Reporter: Pat Ferrel
>             Fix For: 0.7
>
>
> I think I found the root cause, but I'm not sure what needs fixing.
> I took out n-gram generation and the vector now looks like this:
> Key: https://farfetchers.com/category/collections/source/brice-berard:
> Value: https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}
> This works in clustering.
> It doesn't seem like a malformed vector should crash clustering (it apparently doesn't in Mahout 0.6), but something in seq2sparse's n-gram weighting does appear to produce a malformed vector.
> I'll file a JIRA.
> On 6/5/12 11:48 AM, Pat Ferrel wrote:
> > Using seqdumper on the TFIDF vectors, that vector is indeed in the list
> > Key: https://farfetchers.com/category/collections/source/brice-berard:
> > Value: https://farfetchers.com/category/collections/source/brice-berard:{
> >
> > Looking in the seqfiles we find the document in part-00005 of 10 in no particular part of the file.
> > Key: https://farfetchers.com/category/collections/source/brice-berard:
> > Value: ::Title::
> > Brice Berard | FarFetchers.com
> > Blog Posts
> >
> > On the chance that this originates in seq2sparse, I'll try changing options until the vector looks different, and then try clustering again.
> >
> > On 6/5/12 10:43 AM, Pat Ferrel wrote:
> >> I'm not completely sure what I'm looking at but...
> >>
> >> In iterateSeq, on iteration #1 of processing vectors/tfidf-vectors, it reads
> >> vector = "https://farfetchers.com/category/collections/source/brice-berard:{"
> >>
> >> It's a named vector where the URL is the name and the value is "{", which looks wrong. When that vector is classified to get a probability, the result is
> >>
> >> probabilities = "{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"
> >>
> >> That causes probabilities.maxValueIndex() to return -1, and everything dies.
> >>
> >> The vector looks wrong, doesn't it? Truncated?
> >>
> >> I went back to try the same on Mahout 0.6, but iterateSeq does not get called even though I used -xm sequential on both runs. I can't see kmeans-clusters/clusters-0 being created on Mahout 0.6 either. Is that part of the refactoring?
> >>
> >> On 6/4/12 3:07 PM, Pat Ferrel wrote:
> >>> Some things to try:
> >>> - Have you verified the contents of your input vectors actually have data in them?
> >>> * YES, from the other email you know that the data works fine in 0.6
> >>> - Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 contents?
> >>> * YES, It is attached from trunk's clusterdump after the failure of kmeans, of course. A simple data set fortunately.
> >>> - Is it possible to run the sequential version (-xm sequential)? If it is you could run it in a debugger to gain more insight.
> >>> * YES, will report back.
> >>>
> >>> On 6/4/12 2:19 PM, Jeff Eastman wrote:
> >>>> It looks like the probabilities vector returned by AbstractClusteringPolicy.classify() has no non-zero elements. In this case, AbstractClusteringPolicy.select()'s call to AbstractVector.maxValueIndex() is returning -1 and that is causing the exception.
> >>>>
> >>>> How could this happen? I'm not exactly sure, but consider that the probabilities vector is calculated in AbstractClusteringPolicy.classify() by calling DistanceMeasureCluster.pdf() on each of the prior clusters in b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't see how this could ever return zero. Certainly, some of your vectors will match the prior cluster centers exactly (they were sampled from the input) and those values would return pdf==1. Even if the cosine distance was 1 the pdf would be 0.5.
> >>>>
> >>>> Some things to try:
> >>>> - Have you verified the contents of your input vectors actually have data in them?
> >>>> - Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 contents?
> >>>> - Is it possible to run the sequential version (-xm sequential)? If it is you could run it in a debugger to gain more insight.
> >>>>
> >>>> Jeff
> >>>>
> >>>> On 6/4/12 12:05 PM, Pat Ferrel wrote:
> >>>>> Using the CLI to kmeans from several trunk versions I get an error I don't understand.  When the job died the b3/canopy-centroids/clusters-0-final contained the random-seeds file generated by the kmeans driver and the b3/kmeans-clusters/clusters-0 had several part files but b3/kmeans-clusters/clusters-1 was empty. When I look through the code from the trace it doesn't make much sense.
> >>>>>
> >>>>> Command line:
> >>>>> mahout kmeans
> >>>>>   -i b3/vectors/tfidf-vectors/
> >>>>>   -k 20
> >>>>>   -c b3/canopy-centroids/clusters-0-final
> >>>>>   -cl
> >>>>>   -o b3/kmeans-clusters
> >>>>>   -ow
> >>>>>   -cd 0.01
> >>>>>   -x 30
> >>>>>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> >>>>>
> >>>>> Error:
> >>>>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[b3/canopy-centroids/clusters-0-final], --convergenceDelta=[0.01], --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure], --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/], --maxIter=[30], --method=[mapreduce], --numClusters=[20], --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
> >>>>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info from SCDynamicStore
> >>>>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting b3/canopy-centroids/clusters-0-final
> >>>>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
> >>>>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to b3/canopy-centroids/clusters-0-final/part-randomSeed
> >>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: b3/vectors/tfidf-vectors Clusters In: b3/canopy-centroids/clusters-0-final/part-randomSeed Out: b3/kmeans-clusters Distance: org.apache.mahout.common.distance.CosineDistanceMeasure
> >>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max Iterations: 30 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
> >>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
> >>>>> Cluster Iterator running iteration 1 over priorPath: b3/kmeans-clusters/clusters-0
> >>>>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to process : 1
> >>>>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
> >>>>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
> >>>>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
> >>>>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
> >>>>> 12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
> >>>>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
> >>>>> org.apache.mahout.math.IndexException: Index -1 is outside allowable range of [0,20)
> >>>>>     at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
> >>>>>     at org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
> >>>>>     at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
> >>>>>     at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
> >>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
> >>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
> >>>>> Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
> >>>>>     at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
> >>>>>     at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
> >>>>>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
> >>>>>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
> >>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
> >>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >>>>
