Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/05 20:29:35 UTC

Re: How to find the k most similar docs

I'm using Mahout 0.6 compiled from source via 'mvn install'. I used 
Suneel's code below to get NumberOfColumns.

When I try to run the rowsimilarity job via:

    bin/mahout rowsimilarity -i wikipedia-clusters/tfidf-vectors/ -o
    /wikipedia-similarity -r 87325 -s SIMILARITY_COSINE -m 10  -ess true

I get the following error

    12/03/04 19:14:32 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --excludeSelfSimilarity=true,
    --input=wikipedia-clusters/tfidf-vectors/,
    --maxSimilaritiesPerRow=10, --numberOfColumns=87325,
    --output=/wikipedia-similarity,
    --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
    2012-03-04 19:14:32.376 java[1090:1903] Unable to load realm info
    from SCDynamicStore
    12/03/04 19:14:33 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/04 19:14:33 INFO mapred.JobClient: Running job: job_local_0001
    12/03/04 19:14:33 INFO mapred.MapTask: io.sort.mb = 100
    12/03/04 19:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
    12/03/04 19:14:33 INFO mapred.MapTask: record buffer = 262144/327680
    12/03/04 19:14:34 WARN mapred.LocalJobRunner: job_local_0001
    java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
    cast to org.apache.hadoop.io.IntWritable
         at
    org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:154)
         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
         at
    org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

The cast error (as I understand it) usually happens when an incorrect 
class is passed in. Does that seem likely here, since cooccurrence 
similarity is being used?

I've probably missed something obvious about how to pass in the 
similarity measure to use.


On 2/19/12 9:00 PM, Suneel Marthi wrote:
> Hi Pat,
>
>
> 1. Please look at the discussion thread at http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser for a description of what the RowSimilarityJob does.  The RowSimilarityJob implementation is based on the research paper  - http://www.csee.ogi.edu/~zak/cs506-pslc/docsim.pdf
>
> I'll add the details on the mahout wiki page sometime this week.
>
> 2. 'maxSimilaritiesPerRow' returns the best similarities (not the first) - by default this returns top 100 if not specified.
>
> 3. If you would like to discard the similarities per row below a certain value you can specify a threshold -tr,  which would limit the results to only those documents that have a similarity value greater than the threshold.
>
>     Depending on the similarity measures that you get as the final output, it should give you an idea of what T1 and T2 should be.  In my particular use case I was only interested in documents that had a similarity measure of 0.7 or greater, hence 0.7 would be my T2; and the top most similar document had a similarity value of 0.99999 (which was what I used as my T1).
>
> 4. 'numberOfColumns' is not optional; but I tend to agree with you that this should be inferred automatically if not specified by the size of the input vector.  This could be an enhancement to add to the RowSimilarityJob.
>
>     Code snippet below gets the number of columns in a matrix if not specified by the user.
>
>     Path inputMatrixPath = new Path(getInputPath());
>     SequenceFile.Reader sequenceFileReader = new SequenceFile.Reader(fs, inputMatrixPath, conf);
>     int NumberOfColumns = getDimensions(sequenceFileReader);
>     sequenceFileReader.close();
>
>     private int getDimensions(SequenceFile.Reader reader) throws IOException, InstantiationException, IllegalAccessException {
>       Class<?> keyClass = reader.getKeyClass();
>       Writable row = (Writable) keyClass.newInstance();
>       if (!reader.getValueClass().equals(VectorWritable.class)) {
>         throw new IllegalArgumentException("Value type of sequencefile must be a VectorWritable");
>       }
>       VectorWritable vw = new VectorWritable();
>       if (!reader.next(row, vw)) {
>         log.error("matrix must have at least one row");
>         throw new IllegalStateException();
>       }
>       Vector v = vw.get();
>       return v.size();
>     }
> 5. RowSimilarityJob also has an option to excludeSelfSimilarity (which is false by default) but you need to specify this so that you don't end up comparing a document with itself and ending up with a similarity measure of 1.0 (if using Cosine measure).
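[Editor's note] Point 5 above is easy to verify in miniature: the cosine of any nonzero vector with itself is exactly 1.0, so without excludeSelfSimilarity each row's own id would occupy one of the top-k slots. A minimal plain-Java sketch (toy data, not Mahout code):

```java
// Toy illustration (not Mahout code): cosine similarity of a vector
// with itself is 1.0, which is why self-similarity should be excluded
// when keeping only the top-k most similar rows.
public class CosineSelfSimilarity {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] doc = {0.3, 0.0, 1.7, 0.5};    // a made-up tf-idf row
        double[] other = {0.1, 0.9, 0.2, 0.0};  // another made-up row
        System.out.println(cosine(doc, doc));   // ~1.0 up to rounding
        System.out.println(cosine(doc, other)); // strictly less than 1.0
    }
}
```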
>
> Let me know if you have any more questions.
>      
>
>
>
>
> ________________________________
>   From: Sebastian Schelter<ss...@apache.org>
> To: user@mahout.apache.org
> Sent: Sunday, February 19, 2012 4:33 PM
> Subject: Re: How to find the k most similar docs
>
> Hi Pat,
>
> 'numberOfColumns' is not optional but is only used by a few
> similarity measures (such as loglikelihood ratio).
> 'maxSimilaritiesPerRow' retains the top similarities.
>
> --sebastian
>
>
> On 19.02.2012 22:11, Pat Ferrel wrote:
>> This looks perfect, thanks.
>>
>> I had planned to do the RowSimilarityJob after clustering to reduce the
>> rows from the entire corpus to only those in a cluster. You mention
>> using the distance between similar rows to get an idea of the distances
>> for canopy clustering. This seems a very good idea since I have no other
>> good way to generate T1 and T2. The downside is that I have to do
>> RowSimilarityJob on all docs in the corpus. I assume that since you have
>> done this on 10 Million docs that the benefit in getting good canopies
>> outweighs doing similarity on all docs as far as processing resources
>> needed?
>>
>> I am new to reading mapreduce code so may I ask some noob questions:
>>    * is the best documentation here?
>>   
>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/hadoop/similarity/RowSimilarityJob.html#run(java.lang.String[])
>>
>>    * the command line arguments include: numberOfColumns, shouldn't that
>>      be easily extracted from the input matrix? is this optional? How do
>>      I tell which argument is optional from the docs?
>>    * the argument maxSimilaritiesPerRow could return first or best, it is
>>      difficult to see which.
>>
>> I have the source but perhaps due to the string based binding I am
>> finding it hard to track down what code is run so any tips for reading
>> the code or docs are greatly appreciated.
>>
>>
>> On 2/18/12 1:27 PM, Suneel Marthi wrote:
>>> You might want to look at the RowSimilarityJob in Mahout to determine
>>> document similarity.
>>>
>>>
>>> Here's what you would do:-
>>>
>>> Assuming that your documents have already been vectorized, first
>>> convert the vectors into an M*N matrix by calling the RowIdJob in
>>> Mahout where M = No. of rows (or documents in your case) and N= No. of
>>> columns (or the unique terms).
>>>
>>>
>>> Then run the RowSimilarity job on the matrix generated in the previous
>>> step by specifying a cosine similarity measure, this should generate
>>> an output that gives the most similar documents for each of the
>>> documents and the similarity distance between them. RowSimilarityJob
>>> is a mapreduce job so you should be able to run this on a really large
>>> corpus (I had run this on 10 million web pages).
>>> The output of the RowSimilarity along with the similarity distances
>>> that are generated between document pairs should give an idea as to
>>> what the values of T1 and T2 should be when running canopy clustering.
>>> And the number of clusters generated by running canopy would
>>> eventually be fed into k-means as you had mentioned.
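[Editor's note] The pipeline described above (vectorized docs -> RowIdJob matrix -> RowSimilarityJob with a cosine measure) can be sketched in miniature. This is a plain-Java, in-memory analogue on toy data, not the actual mapreduce implementation: for each row, find the k rows with the highest cosine similarity, excluding the row itself.

```java
import java.util.*;

// Toy, in-memory analogue of RowSimilarityJob (illustration only):
// for each row of a small "tf-idf matrix", find the k most similar
// other rows by cosine similarity.
public class TopKSimilarRows {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Returns, for each row index, the indices of its k nearest rows. */
    static int[][] topK(double[][] rows, int k) {
        int[][] result = new int[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            final int row = i;
            Integer[] others = new Integer[rows.length - 1];
            for (int j = 0, p = 0; j < rows.length; j++)
                if (j != i) others[p++] = j;   // exclude self-similarity
            // sort candidate rows by descending similarity to this row
            Arrays.sort(others, (x, y) ->
                Double.compare(cosine(rows[row], rows[y]), cosine(rows[row], rows[x])));
            result[i] = new int[Math.min(k, others.length)];
            for (int p = 0; p < result[i].length; p++) result[i][p] = others[p];
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {1, 1, 0, 0},    // doc 0
            {1, 0.9, 0, 0},  // doc 1: nearly identical to doc 0
            {0, 0, 1, 1},    // doc 2: orthogonal to both
        };
        System.out.println(Arrays.deepToString(topK(docs, 1)));
        // doc 0's nearest neighbour is doc 1, and vice versa
    }
}
```

The per-row similarity values this produces are also exactly the numbers one would inspect to choose T1 and T2 for canopy clustering.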
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>>     From: Pat Ferrel<pa...@occamsmachete.com>
>>> To: user@mahout.apache.org
>>> Sent: Saturday, February 18, 2012 2:39 PM
>>> Subject: How to find the k most similar docs
>>> Given documents that are vectorized into Mahout vectors, have stop
>>> words removed, and a TFIDF dictionary, what is the best distributed
>>> way to get k nearest documents using a measure like cosine similarity
>>> (or the others provided in Mahout)? I will be doing this for every
>>> document in the corpus so the question is partly how best to do this
>>> given the existing mahout+hadoop framework. What is the intuition
>>> about processing resources needed?
>>>
>>> Expansion: At some point I'd like to extend this idea to find similar
>>> clusters but expect that the same method should work only with
>>> centroids instead of doc vectors. Also I expect to do canopy
>>> clustering to feed into kmeans clustering. I'll perform the similarity
>>> measure only on docs in the same cluster. I think I understand how to
>>> do this preprocessing so the question is primarily the k most similar
>>> docs and/or centroids. This sounds like k nearest neighbors, if so is
>>> this the best way to do it in mahout+hadoop?

Re: How to find the k most similar docs

Posted by Fernando Fernández <fe...@gmail.com>.
I'm surprised no one has mentioned SVD yet. You are supposed to obtain
better results using SVD factors instead of the original TF-IDF vectors when
computing similarities (this is the theory). Many text mining applications
follow these steps:

- Stopword removal.
- Tf-Idf computation.
- Svd factorization.
- Clustering or supervised classification using SVD factors.

You have distributed SVD routines in Mahout you can use
(DistributedLanczosSolver); you may want to check them out.
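[Editor's note] As a toy illustration of the idea only (plain Java, nothing like the scale or the Lanczos algorithm of DistributedLanczosSolver): the top right-singular vector of a small term-document matrix can be found by power iteration on A^T A, and documents can then be compared by their projections onto it.

```java
import java.util.Arrays;

// Toy illustration of the SVD idea (not DistributedLanczosSolver):
// power iteration on A^T A converges to the top right-singular vector
// of A; projecting each document row onto it gives a low-dimensional
// "latent" representation.
public class TopSingularVector {
    static double[] topRightSingularVector(double[][] a, int iterations) {
        int cols = a[0].length;
        double[] v = new double[cols];
        Arrays.fill(v, 1.0);                     // arbitrary start vector
        for (int it = 0; it < iterations; it++) {
            double[] av = new double[a.length];  // compute A v
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < cols; j++) av[i] += a[i][j] * v[j];
            double[] atav = new double[cols];    // compute A^T (A v)
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < cols; j++) atav[j] += a[i][j] * av[i];
            double norm = 0;                     // normalize for stability
            for (double x : atav) norm += x * x;
            norm = Math.sqrt(norm);
            for (int j = 0; j < cols; j++) v[j] = atav[j] / norm;
        }
        return v;
    }

    public static void main(String[] args) {
        // 3 "documents" over 2 "terms"; A^T A = [[2,1],[1,2]], whose top
        // eigenvector is (1,1)/sqrt(2).
        double[][] a = {{1, 0}, {0, 1}, {1, 1}};
        double[] v = topRightSingularVector(a, 50);
        System.out.printf("%.4f %.4f%n", v[0], v[1]); // both ~0.7071
    }
}
```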

Best,

Fernando.

2012/3/5 Suneel Marthi <su...@yahoo.com>

> Pat,
>
> Your input to RowSimilarity seems to be the tfidf-vectors directory which
> is <Text, vectorWritable>.
>
> Before executing the RowSimilarity job you need to run the RowIdJob which
> creates a matrix of <IntWritable, VectorWritable>.  This matrix should be
> the input to RowSimilarity.
>
> Also from your command, you seem to be missing --tempDir argument, you
> would need that too.
>
> Suneel
>
>
> ________________________________
>  From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org
> Sent: Monday, March 5, 2012 2:32 PM
> Subject: Re: How to find the k most similar docs
>
> That's the problem:
>
> org.apache.hadoop.io.Text cannot be
>    cast to org.apache.hadoop.io.IntWritable
>
> RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
> you supply <Text,VectorWritable>.
>
> --sebastian
>
> On 05.03.2012 20:29, Pat Ferrel wrote:
> > org.apache.hadoop.io.Text cannot be
> >    cast to org.apache.hadoop.io.IntWritable
>

Re: How to find the k most similar docs

Posted by Suneel Marthi <su...@yahoo.com>.
Did the RowSimilarityJob execute successfully? Your output should have been one or more part-r-* files (depending on the number of reducers you have configured in your environment).


You should be able to get a sequence dump of the  wikipedia-similarity/part-m-00000 file to see what they are.

The output format of RowSimilarityJob is <IntWritable, VectorWritable>.



________________________________
 From: Pat Ferrel <pa...@occamsmachete.com>
To: 
Cc: "user@mahout.apache.org" <us...@mahout.apache.org> 
Sent: Tuesday, March 6, 2012 8:14 PM
Subject: Re: How to find the k most similar docs
 
Ok, making progress. I created a matrix using rowid and got the following output:

   Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
   wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
   ...
   12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
   {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
   --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
   2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
   from SCDynamicStore
   12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
   native-hadoop library for your platform... using builtin-java
   classes where applicable
   12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
   12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
   12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
   rows and 87325 columns to wikipedia-matrix/matrix
   12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
   (Minutes: 0.0293)

So a doc matrix with 4838 docs and 87325 dimensions. Next I ran RowSimilarityJob

   Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
   -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
   --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp

This gives me output in wikipedia-similarity/part-m-00000 but the size is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I set no threshold so I'd expect it to pick the 10 nearest even if they are far away.

BTW what is the output format?

On 3/5/12 11:48 AM, Suneel Marthi wrote:
> Pat,
> 
> Your input to RowSimilarity seems to be the tfidf-vectors directory which is <Text, vectorWritable>.
> 
> Before executing the RowSimilarity job you need to run the RowIdJob which creates a matrix of <IntWritable, VectorWritable>.  This matrix should be the input to RowSimilarity.
> 
> Also from your command, you seem to be missing --tempDir argument, you would need that too.
> 
> Suneel
> 
> ------------------------------------------------------------------------
> *From:* Sebastian Schelter <ss...@apache.org>
> *To:* user@mahout.apache.org
> *Sent:* Monday, March 5, 2012 2:32 PM
> *Subject:* Re: How to find the k most similar docs
> 
> That's the problem:
> 
> org.apache.hadoop.io.Text cannot be
>   cast to org.apache.hadoop.io.IntWritable
> 
> RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
> you supply <Text,VectorWritable>.
> 
> --sebastian
> 
> On 05.03.2012 20:29, Pat Ferrel wrote:
> > org.apache.hadoop.io.Text cannot be
> >    cast to org.apache.hadoop.io.IntWritable
> 
> 
> 

Re: RowSimilarityJob

Posted by Suneel Marthi <su...@yahoo.com>.
I should have been more elaborate in my previous reply. 


RowId job creates a matrix which is of type <IntWritable, VectorWritable> and a docIndex <IntWritable, Text>

docIndex is a map of the rowId to the keys generated from seq2sparse.

What you would need to do is to join the output of RowSimilarity to docIndex to get the format you are looking for.
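[Editor's note] That join can be sketched as follows. This is plain Java on in-memory maps with hypothetical document keys; in reality both sides would be read from the docIndex and part-r-* sequence files.

```java
import java.util.*;

// Toy sketch of joining RowSimilarityJob output with the RowIdJob
// docIndex: replace integer row ids with the original document keys.
// The document keys used here are made up for illustration.
public class DocIndexJoin {
    static Map<String, Double> resolve(Map<Integer, Double> similarRows,
                                       Map<Integer, String> docIndex) {
        Map<String, Double> named = new LinkedHashMap<>();
        for (Map.Entry<Integer, Double> e : similarRows.entrySet())
            named.put(docIndex.get(e.getKey()), e.getValue());
        return named;
    }

    public static void main(String[] args) {
        Map<Integer, String> docIndex = new HashMap<>();  // rowId -> doc key
        docIndex.put(0, "/wikipedia/Apple");
        docIndex.put(14458, "/wikipedia/Pear");
        Map<Integer, Double> similar = new LinkedHashMap<>(); // one output row
        similar.put(14458, 0.2966);
        System.out.println(resolve(similar, docIndex));
        // {/wikipedia/Pear=0.2966}
    }
}
```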


Hope that helps.


Suneel


________________________________
 From: Suneel Marthi <su...@yahoo.com>
To: "user@mahout.apache.org" <us...@mahout.apache.org> 
Sent: Tuesday, March 20, 2012 1:41 PM
Subject: Re: RowSimilarityJob
 
Docindex is your answer

Sent from my iPhone

On Mar 20, 2012, at 12:28 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> How do you map the output of RowSimilarity to documents? What I really need is to create an association of
> 
>   doc1 --> docn, docm, doci, etc.
> 
> The output of rowsimilarity looks like
> 
>   rowid --> vector of rowids : distances
> 
> for example:
> 
>   Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
>   12793:0.22009858979452146,3275:0.1871791030103281,
>   14613:0.3534278632679437,4411:0.2516380602790199,
>   17520:0.3139731583634198,13611:0.18968888212315968,
>   14354:0.17673965754661425,0:1.0000000000000004}
> 
> It would be nice to use the same keys as they are output by seq2sparse, in my case named vectors so file names would appear in the output as rowids. Creating my association would be trivial.
> 
> Have I missed a dictionary containing rowid to docid(name) mapping?
> 

Re: RowSimilarityJob

Posted by Suneel Marthi <su...@yahoo.com>.
Docindex is your answer

Sent from my iPhone

On Mar 20, 2012, at 12:28 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> How do you map the output of RowSimilarity to documents? What I really need is to create an association of
> 
>   doc1 --> docn, docm, doci, etc.
> 
> The output of rowsimilarity looks like
> 
>   rowid --> vector of rowids : distances
> 
> for example:
> 
>   Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
>   12793:0.22009858979452146,3275:0.1871791030103281,
>   14613:0.3534278632679437,4411:0.2516380602790199,
>   17520:0.3139731583634198,13611:0.18968888212315968,
>   14354:0.17673965754661425,0:1.0000000000000004}
> 
> It would be nice to use the same keys as they are output by seq2sparse, in my case named vectors so file names would appear in the output as rowids. Creating my association would be trivial.
> 
> Have I missed a dictionary containing rowid to docid(name) mapping?
> 

RowSimilarityJob

Posted by Pat Ferrel <pa...@occamsmachete.com>.
How do you map the output of RowSimilarity to documents? What I really 
need is to create an association of

    doc1 --> docn, docm, doci, etc.

The output of rowsimilarity looks like

    rowid --> vector of rowids : distances

for example:

    Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
    12793:0.22009858979452146,3275:0.1871791030103281,
    14613:0.3534278632679437,4411:0.2516380602790199,
    17520:0.3139731583634198,13611:0.18968888212315968,
    14354:0.17673965754661425,0:1.0000000000000004}

It would be nice to use the same keys as they are output by seq2sparse, 
in my case named vectors so file names would appear in the output as 
rowids. Creating my association would be trivial.

Have I missed a dictionary containing rowid to docid(name) mapping?


Re: How to find the k most similar docs

Posted by Lance Norskog <go...@gmail.com>.
No, the matrix multiplication operations all (probably) take
<int,vector> where int is the row number. There has to be a
universally unique row number. If there is no row number associated
with a row in a distributed matrix op, how can the reducers know which
rows they have?

Rows do not necessarily have to be in order; some sequential programs
might depend on this (but they should not).

On Fri, Mar 9, 2012 at 9:50 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> I assume that the other matrix operations will consume and produce <Text,
> MatrixWritable>? If so how do you create <Text, MatrixWritable> from the
> output of rowid <IntWritable, VectorWritable>?
>
> Also while we are at it how do you use vectordump? If you do "bin/mahout
> vectordump --help" you get some crazy output that is unreadable. I would
> have guessed that vectordump would work on either <IntWritable,
> VectorWritable> so the output of rowid OR <Text, VectorWritable> the
> contents of tfidf-vectors/part-r-00000 but it doesn't seem to work on either
> using "bin/mahout vectordump -s path-to-file"
>
> Thanks
> Pat
>
>
> On 3/9/12 4:26 AM, Suneel Marthi wrote:
>>
>> Pat,
>>
>> MatrixDump expects an input file of <Text, MatrixWritable>.  The matrix
>> that gets created from RowIdJob is <IntWritable, VectorWritable> and you
>> cannot run MatrixDump to see the contents of the matrix.  You need to use
>> seqdumper as you had done.
>>
>>
>>
>> ________________________________
>>  From: Pat Ferrel<pa...@occamsmachete.com>
>> To: user@mahout.apache.org
>> Sent: Thursday, March 8, 2012 7:14 PM
>> Subject: Re: How to find the k most similar docs
>>
>> OK, back to the beginning. I went through the entire sequence again with
>> the notable exception that I did not create named vectors. I also tweaked
>> some of the seq2sparse parameters.
>>
>>    bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
>>    org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
>>    -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
>>
>> after doing a rowid on the tfidf vectors I still get an error doing
>> matrixdump on wp-matrix/matrix. Am I using it wrong? Taking on faith that a
>> matrix was created I perform the rowsimilarity job and now get a far bigger
>> file created that looks OK
>>
>>    bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o
>>    wp-similarity -ess -s SIMILARITY_COSINE -m 10
>>    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
>>    HADOOP_CONF_DIR=/usr/local/hadoop/conf
>>    MAHOUT-JOB:
>>    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
>>    12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments:
>>    {--endPhase=2147483647, --excludeSelfSimilarity=false,
>>    --input=wp-matrix/matrix, --maxSimilaritiesPerRow=10,
>>    --numberOfColumns=311433, --output=wp-similarity,
>>    --similarityClassname=SIMILARITY_COSINE, --startPhase=0,
>> --tempDir=temp}
>>    12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to
>>    process : 1
>>    12/03/08 15:48:36 INFO mapred.JobClient: Running job:
>>    job_201203071745_0040
>>    12/03/08 15:48:37 INFO mapred.JobClient:  map 0% reduce 0%
>>    12/03/08 15:48:58 INFO mapred.JobClient:  map 17% reduce 0%
>>    12/03/08 15:49:01 INFO mapred.JobClient:  map 27% reduce 0%
>>    12/03/08 15:49:04 INFO mapred.JobClient:  map 40% reduce 0%
>>    12/03/08 15:49:07 INFO mapred.JobClient:  map 47% reduce 0%
>>    12/03/08 15:49:10 INFO mapred.JobClient:  map 60% reduce 0%
>>    12/03/08 15:49:13 INFO mapred.JobClient:  map 70% reduce 0%
>>    12/03/08 15:49:16 INFO mapred.JobClient:  map 80% reduce 0%
>>    12/03/08 15:49:19 INFO mapred.JobClient:  map 92% reduce 0%
>>    12/03/08 15:49:22 INFO mapred.JobClient:  map 100% reduce 0%
>>    12/03/08 15:49:46 INFO mapred.JobClient:  map 100% reduce 33%
>>    12/03/08 15:49:52 INFO mapred.JobClient:  map 100% reduce 100%
>>    12/03/08 15:49:57 INFO mapred.JobClient: Job complete:
>>    job_201203071745_0040
>>    12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
>>    12/03/08 15:49:57 INFO mapred.JobClient:   Job Counters
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Launched reduce tasks=1
>>    12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=55564
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
>>    reduces waiting after reserving slots (ms)=0
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
>>    maps waiting after reserving slots (ms)=0
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Rack-local map tasks=1
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Launched map tasks=1
>>    12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13565
>>    12/03/08 15:49:57 INFO mapred.JobClient:   File Output Format Counters
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Written=45587186
>>    12/03/08 15:49:57 INFO mapred.JobClient:   FileSystemCounters
>>    12/03/08 15:49:57 INFO mapred.JobClient:     FILE_BYTES_READ=99732287
>>    12/03/08 15:49:57 INFO mapred.JobClient:     HDFS_BYTES_READ=17156393
>>    12/03/08 15:49:57 INFO mapred.JobClient:
>> FILE_BYTES_WRITTEN=138104586
>>    12/03/08 15:49:57 INFO mapred.JobClient:
>> HDFS_BYTES_WRITTEN=45587207
>>    12/03/08 15:49:57 INFO mapred.JobClient:   File Input Format Counters
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Read=17156283
>>    12/03/08 15:49:57 INFO mapred.JobClient:
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>    12/03/08 15:49:57 INFO mapred.JobClient:     ROWS=4838
>>    12/03/08 15:49:57 INFO mapred.JobClient:   Map-Reduce Framework
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input groups=294936
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Map output materialized
>>    bytes=38326948
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Combine output
>>    records=2242965
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Map input records=4838
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce shuffle
>>    bytes=38326948
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce output
>>    records=294933
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Spilled Records=3432447
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Map output bytes=83168813
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Combine input
>>    records=5912090
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Map output records=3964061
>>    12/03/08 15:49:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
>>    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input
>> records=294936
>>    12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to
>>    process : 1
>>    12/03/08 15:49:58 INFO mapred.JobClient: Running job:
>>    job_201203071745_0041
>>    12/03/08 15:49:59 INFO mapred.JobClient:  map 0% reduce 0%
>>    12/03/08 15:50:19 INFO mapred.JobClient:  map 8% reduce 0%
>>    12/03/08 15:50:22 INFO mapred.JobClient:  map 12% reduce 0%
>>    12/03/08 15:50:25 INFO mapred.JobClient:  map 15% reduce 0%
>>    12/03/08 15:50:28 INFO mapred.JobClient:  map 21% reduce 0%
>>    12/03/08 15:50:31 INFO mapred.JobClient:  map 23% reduce 0%
>>    12/03/08 15:50:34 INFO mapred.JobClient:  map 28% reduce 0%
>>    12/03/08 15:50:37 INFO mapred.JobClient:  map 32% reduce 0%
>>    12/03/08 15:50:40 INFO mapred.JobClient:  map 33% reduce 0%
>>    12/03/08 15:50:43 INFO mapred.JobClient:  map 35% reduce 0%
>>    12/03/08 15:50:46 INFO mapred.JobClient:  map 40% reduce 0%
>>    12/03/08 15:50:49 INFO mapred.JobClient:  map 42% reduce 0%
>>    12/03/08 15:50:52 INFO mapred.JobClient:  map 47% reduce 0%
>>    12/03/08 15:50:55 INFO mapred.JobClient:  map 48% reduce 0%
>>    12/03/08 15:50:58 INFO mapred.JobClient:  map 55% reduce 0%
>>    12/03/08 15:51:01 INFO mapred.JobClient:  map 57% reduce 0%
>>    12/03/08 15:51:04 INFO mapred.JobClient:  map 62% reduce 0%
>>    12/03/08 15:51:07 INFO mapred.JobClient:  map 67% reduce 0%
>>    12/03/08 15:51:10 INFO mapred.JobClient:  map 69% reduce 0%
>>    12/03/08 15:51:13 INFO mapred.JobClient:  map 75% reduce 0%
>>    12/03/08 15:51:20 INFO mapred.JobClient:  map 80% reduce 0%
>>    12/03/08 15:51:23 INFO mapred.JobClient:  map 81% reduce 0%
>>    12/03/08 15:51:26 INFO mapred.JobClient:  map 86% reduce 0%
>>    12/03/08 15:51:29 INFO mapred.JobClient:  map 88% reduce 0%
>>    12/03/08 15:51:31 INFO mapred.JobClient:  map 92% reduce 0%
>>    12/03/08 15:51:34 INFO mapred.JobClient:  map 94% reduce 0%
>>    12/03/08 15:51:37 INFO mapred.JobClient:  map 98% reduce 0%
>>    12/03/08 15:51:40 INFO mapred.JobClient:  map 100% reduce 0%
>>    12/03/08 15:52:19 INFO mapred.JobClient:  map 100% reduce 70%
>>    12/03/08 15:52:26 INFO mapred.JobClient:  map 100% reduce 100%
>>    12/03/08 15:52:31 INFO mapred.JobClient: Job complete:
>>    job_201203071745_0041
>>    12/03/08 15:52:31 INFO mapred.JobClient: Counters: 27
>>    12/03/08 15:52:31 INFO mapred.JobClient:   Job Counters
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>    12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=124769
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
>>    reduces waiting after reserving slots (ms)=0
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
>>    maps waiting after reserving slots (ms)=0
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Rack-local map tasks=1
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Launched map tasks=1
>>    12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16543
>>    12/03/08 15:52:31 INFO mapred.JobClient:   File Output Format Counters
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Written=73395270
>>    12/03/08 15:52:31 INFO mapred.JobClient:   FileSystemCounters
>>    12/03/08 15:52:31 INFO mapred.JobClient:     FILE_BYTES_READ=509127834
>>    12/03/08 15:52:31 INFO mapred.JobClient:     HDFS_BYTES_READ=45587326
>>    12/03/08 15:52:31 INFO mapred.JobClient:
>> FILE_BYTES_WRITTEN=577589760
>>    12/03/08 15:52:31 INFO mapred.JobClient:
>> HDFS_BYTES_WRITTEN=73395270
>>    12/03/08 15:52:31 INFO mapred.JobClient:   File Input Format Counters
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Read=45587186
>>    12/03/08 15:52:31 INFO mapred.JobClient:
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>    12/03/08 15:52:31 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>>    12/03/08 15:52:31 INFO mapred.JobClient:     COOCCURRENCES=65114863
>>    12/03/08 15:52:31 INFO mapred.JobClient:   Map-Reduce Framework
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input groups=4837
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Map output materialized
>>    bytes=68416023
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Combine output
>>    records=79108
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Map input records=294933
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce shuffle
>>    bytes=68416023
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce output records=4837
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Spilled Records=117235
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Map output bytes=694645784
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Combine input
>>    records=4038329
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Map output records=3964058
>>    12/03/08 15:52:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
>>    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input records=4837
>>    12/03/08 15:52:32 INFO input.FileInputFormat: Total input paths to
>>    process : 1
>>    12/03/08 15:52:32 INFO mapred.JobClient: Running job:
>>    job_201203071745_0042
>>    12/03/08 15:52:33 INFO mapred.JobClient:  map 0% reduce 0%
>>    12/03/08 15:52:52 INFO mapred.JobClient:  map 3% reduce 0%
>>    12/03/08 15:52:55 INFO mapred.JobClient:  map 5% reduce 0%
>>    12/03/08 15:52:58 INFO mapred.JobClient:  map 7% reduce 0%
>>    12/03/08 15:53:01 INFO mapred.JobClient:  map 9% reduce 0%
>>    [... map progress lines (10% - 99%) trimmed ...]
>>    12/03/08 15:54:49 INFO mapred.JobClient:  map 100% reduce 0%
>>    12/03/08 15:55:01 INFO mapred.JobClient:  map 100% reduce 100%
>>    12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
>>    job_201203071745_0042
>>    12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
>>    12/03/08 15:55:06 INFO mapred.JobClient:   Job Counters
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Launched reduce tasks=1
>>    12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=133985
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
>>    reduces waiting after reserving slots (ms)=0
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
>>    maps waiting after reserving slots (ms)=0
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Launched map tasks=1
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Data-local map tasks=1
>>    12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10311
>>    12/03/08 15:55:06 INFO mapred.JobClient:   File Output Format Counters
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Written=580158
>>    12/03/08 15:55:06 INFO mapred.JobClient:   FileSystemCounters
>>    12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_READ=14921344
>>    12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_READ=73395400
>>    12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=15396906
>>    12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=580158
>>    12/03/08 15:55:06 INFO mapred.JobClient:   File Input Format Counters
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Read=73395270
>>    12/03/08 15:55:06 INFO mapred.JobClient:   Map-Reduce Framework
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input groups=4837
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Map output materialized
>>    bytes=431573
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Combine output
>>    records=96955
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Map input records=4837
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce output records=4837
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Spilled Records=166369
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Map output bytes=153928302
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Combine input
>>    records=7418380
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Map output records=7326262
>>    12/03/08 15:55:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
>>    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input records=4837
>>    12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
>>    (Minutes: 6.522983333333333)
>>
>> performing seqdumper on the output looks reasonable.
>>
>> Maybe named vectors is a problem?
>>
>>
>> On 3/7/12 8:50 AM, Sebastian Schelter wrote:
>>>
>>> Hi Pat,
>>>
>>> Something is going completely wrong. The first pass over the data of
>>> RowSimilarityJob transposes the input matrix. From the output of the
>>> first jobs, it seems as if your input is a 4838 x 3 matrix only:
>>>
>>> Map input records=4838
>>> Map output records=3
>>> Combine input records=3
>>> Combine output records=3
>>> Reduce input records=3
>>>
>>> Could you have a detailed look at the input to RowSimilarityJob?
>>>
>>> --sebastian
>>>
>>>
>>> On 07.03.2012 17:38, Pat Ferrel wrote:
>>>>
>>>>      12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0



-- 
Lance Norskog
goksron@gmail.com

Re: How to find the k most similar docs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I assume that the other matrix operations will consume and produce 
<Text, MatrixWritable>? If so, how do you create <Text, MatrixWritable> 
from the output of rowid, which is <IntWritable, VectorWritable>?

Also, while we are at it, how do you use vectordump? If you do "bin/mahout 
vectordump --help" you get garbled output that is unreadable. I would 
have guessed that vectordump would work on either <IntWritable, 
VectorWritable> (the output of rowid) or <Text, VectorWritable> (the 
contents of tfidf-vectors/part-r-00000), but it doesn't seem to work on 
either using "bin/mahout vectordump -s path-to-file".

Thanks
Pat

On 3/9/12 4:26 AM, Suneel Marthi wrote:
> Pat,
>
> MatrixDump expects an input file of <Text, MatrixWritable>. The matrix that gets created from RowIdJob is <IntWritable, VectorWritable>, and you cannot run MatrixDump to see the contents of the matrix. You need to use seqdumper as you had done.
>
>
>
> ________________________________
>   From: Pat Ferrel<pa...@occamsmachete.com>
> To: user@mahout.apache.org
> Sent: Thursday, March 8, 2012 7:14 PM
> Subject: Re: How to find the k most similar docs
>
> OK, back to the beginning. I went through the entire sequence again with the notable exception that I did not create named vectors. I also tweaked some of the seq2sparse parameters.
>
>     bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
>     org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
>     -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
>
> after doing a rowid on the tfidf vectors I still get an error doing matrixdump on wp-matrix/matrix. Am I using it wrong? Taking on faith that a matrix was created, I ran the rowsimilarity job and now get a far bigger output file that looks OK
>
>     bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o
>     wp-similarity -ess -s SIMILARITY_COSINE -m 10
>     MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>     Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
>     HADOOP_CONF_DIR=/usr/local/hadoop/conf
>     MAHOUT-JOB:
>     /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
>     12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments:
>     {--endPhase=2147483647, --excludeSelfSimilarity=false,
>     --input=wp-matrix/matrix, --maxSimilaritiesPerRow=10,
>     --numberOfColumns=311433, --output=wp-similarity,
>     --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
>     12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to
>     process : 1
>     12/03/08 15:48:36 INFO mapred.JobClient: Running job:
>     job_201203071745_0040
>     12/03/08 15:48:37 INFO mapred.JobClient:  map 0% reduce 0%
>     12/03/08 15:48:58 INFO mapred.JobClient:  map 17% reduce 0%
>     12/03/08 15:49:01 INFO mapred.JobClient:  map 27% reduce 0%
>     12/03/08 15:49:04 INFO mapred.JobClient:  map 40% reduce 0%
>     12/03/08 15:49:07 INFO mapred.JobClient:  map 47% reduce 0%
>     12/03/08 15:49:10 INFO mapred.JobClient:  map 60% reduce 0%
>     12/03/08 15:49:13 INFO mapred.JobClient:  map 70% reduce 0%
>     12/03/08 15:49:16 INFO mapred.JobClient:  map 80% reduce 0%
>     12/03/08 15:49:19 INFO mapred.JobClient:  map 92% reduce 0%
>     12/03/08 15:49:22 INFO mapred.JobClient:  map 100% reduce 0%
>     12/03/08 15:49:46 INFO mapred.JobClient:  map 100% reduce 33%
>     12/03/08 15:49:52 INFO mapred.JobClient:  map 100% reduce 100%
>     12/03/08 15:49:57 INFO mapred.JobClient: Job complete:
>     job_201203071745_0040
>     12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
>     12/03/08 15:49:57 INFO mapred.JobClient:   Job Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     Launched reduce tasks=1
>     12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=55564
>     12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
>     reduces waiting after reserving slots (ms)=0
>     12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
>     maps waiting after reserving slots (ms)=0
>     12/03/08 15:49:57 INFO mapred.JobClient:     Rack-local map tasks=1
>     12/03/08 15:49:57 INFO mapred.JobClient:     Launched map tasks=1
>     12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13565
>     12/03/08 15:49:57 INFO mapred.JobClient:   File Output Format Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Written=45587186
>     12/03/08 15:49:57 INFO mapred.JobClient:   FileSystemCounters
>     12/03/08 15:49:57 INFO mapred.JobClient:     FILE_BYTES_READ=99732287
>     12/03/08 15:49:57 INFO mapred.JobClient:     HDFS_BYTES_READ=17156393
>     12/03/08 15:49:57 INFO mapred.JobClient:       FILE_BYTES_WRITTEN=138104586
>     12/03/08 15:49:57 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=45587207
>     12/03/08 15:49:57 INFO mapred.JobClient:   File Input Format Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Read=17156283
>     12/03/08 15:49:57 INFO mapred.JobClient:     org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     ROWS=4838
>     12/03/08 15:49:57 INFO mapred.JobClient:   Map-Reduce Framework
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input groups=294936
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map output materialized
>     bytes=38326948
>     12/03/08 15:49:57 INFO mapred.JobClient:     Combine output
>     records=2242965
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map input records=4838
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce shuffle
>     bytes=38326948
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce output
>     records=294933
>     12/03/08 15:49:57 INFO mapred.JobClient:     Spilled Records=3432447
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map output bytes=83168813
>     12/03/08 15:49:57 INFO mapred.JobClient:     Combine input
>     records=5912090
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map output records=3964061
>     12/03/08 15:49:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input records=294936
>     12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to
>     process : 1
>     12/03/08 15:49:58 INFO mapred.JobClient: Running job:
>     job_201203071745_0041
>     12/03/08 15:49:59 INFO mapred.JobClient:  map 0% reduce 0%
>     [... map progress lines (8% - 98%) trimmed ...]
>     12/03/08 15:51:40 INFO mapred.JobClient:  map 100% reduce 0%
>     12/03/08 15:52:19 INFO mapred.JobClient:  map 100% reduce 70%
>     12/03/08 15:52:26 INFO mapred.JobClient:  map 100% reduce 100%
>     12/03/08 15:52:31 INFO mapred.JobClient: Job complete:
>     job_201203071745_0041
>     12/03/08 15:52:31 INFO mapred.JobClient: Counters: 27
>     12/03/08 15:52:31 INFO mapred.JobClient:   Job Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     Launched reduce tasks=1
>     12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=124769
>     12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
>     reduces waiting after reserving slots (ms)=0
>     12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
>     maps waiting after reserving slots (ms)=0
>     12/03/08 15:52:31 INFO mapred.JobClient:     Rack-local map tasks=1
>     12/03/08 15:52:31 INFO mapred.JobClient:     Launched map tasks=1
>     12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16543
>     12/03/08 15:52:31 INFO mapred.JobClient:   File Output Format Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Written=73395270
>     12/03/08 15:52:31 INFO mapred.JobClient:   FileSystemCounters
>     12/03/08 15:52:31 INFO mapred.JobClient:     FILE_BYTES_READ=509127834
>     12/03/08 15:52:31 INFO mapred.JobClient:     HDFS_BYTES_READ=45587326
>     12/03/08 15:52:31 INFO mapred.JobClient:       FILE_BYTES_WRITTEN=577589760
>     12/03/08 15:52:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=73395270
>     12/03/08 15:52:31 INFO mapred.JobClient:   File Input Format Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Read=45587186
>     12/03/08 15:52:31 INFO mapred.JobClient:     org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>     12/03/08 15:52:31 INFO mapred.JobClient:     COOCCURRENCES=65114863
>     12/03/08 15:52:31 INFO mapred.JobClient:   Map-Reduce Framework
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input groups=4837
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map output materialized
>     bytes=68416023
>     12/03/08 15:52:31 INFO mapred.JobClient:     Combine output
>     records=79108
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map input records=294933
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce shuffle
>     bytes=68416023
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce output records=4837
>     12/03/08 15:52:31 INFO mapred.JobClient:     Spilled Records=117235
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map output bytes=694645784
>     12/03/08 15:52:31 INFO mapred.JobClient:     Combine input
>     records=4038329
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map output records=3964058
>     12/03/08 15:52:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input records=4837
>     12/03/08 15:52:32 INFO input.FileInputFormat: Total input paths to
>     process : 1
>     12/03/08 15:52:32 INFO mapred.JobClient: Running job:
>     job_201203071745_0042
>     12/03/08 15:52:33 INFO mapred.JobClient:  map 0% reduce 0%
>     [... map progress lines (3% - 99%) trimmed ...]
>     12/03/08 15:54:49 INFO mapred.JobClient:  map 100% reduce 0%
>     12/03/08 15:55:01 INFO mapred.JobClient:  map 100% reduce 100%
>     12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
>     job_201203071745_0042
>     12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
>     12/03/08 15:55:06 INFO mapred.JobClient:   Job Counters
>     12/03/08 15:55:06 INFO mapred.JobClient:     Launched reduce tasks=1
>     12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=133985
>     12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
>     reduces waiting after reserving slots (ms)=0
>     12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
>     maps waiting after reserving slots (ms)=0
>     12/03/08 15:55:06 INFO mapred.JobClient:     Launched map tasks=1
>     12/03/08 15:55:06 INFO mapred.JobClient:     Data-local map tasks=1
>     12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10311
>     12/03/08 15:55:06 INFO mapred.JobClient:   File Output Format Counters
>     12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Written=580158
>     12/03/08 15:55:06 INFO mapred.JobClient:   FileSystemCounters
>     12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_READ=14921344
>     12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_READ=73395400
>     12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=15396906
>     12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=580158
>     12/03/08 15:55:06 INFO mapred.JobClient:   File Input Format Counters
>     12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Read=73395270
>     12/03/08 15:55:06 INFO mapred.JobClient:   Map-Reduce Framework
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input groups=4837
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map output materialized
>     bytes=431573
>     12/03/08 15:55:06 INFO mapred.JobClient:     Combine output
>     records=96955
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map input records=4837
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce shuffle bytes=0
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce output records=4837
>     12/03/08 15:55:06 INFO mapred.JobClient:     Spilled Records=166369
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map output bytes=153928302
>     12/03/08 15:55:06 INFO mapred.JobClient:     Combine input
>     records=7418380
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map output records=7326262
>     12/03/08 15:55:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input records=4837
>     12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
>     (Minutes: 6.522983333333333)
>
> performing seqdumper on the output looks reasonable.
>
> Maybe named vectors is a problem?
>
>
> On 3/7/12 8:50 AM, Sebastian Schelter wrote:
>> Hi Pat,
>>
>> Something is going completely wrong. The first pass over the data of
>> RowSimilarityJob transposes the input matrix. From the output of the
>> first jobs, it seems as if your input is a 4838 x 3 matrix only:
>>
>> Map input records=4838
>> Map output records=3
>> Combine input records=3
>> Combine output records=3
>> Reduce input records=3
>>
>> Could you have a detailed look at the input to RowSimilarityJob?
>>
>> --sebastian
>>
>>
>> On 07.03.2012 17:38, Pat Ferrel wrote:
>>>       12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0
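Sebastian's reading of the counters quoted above follows from what the transpose pass emits: if the first job writes one record per nonzero matrix entry, keyed by column, then "Map output records=3" against 4838 input rows means the input vectors are essentially empty. A rough illustration of that bookkeeping in plain Python (just the counting logic, not Mahout's actual mapper):

```python
# Sketch of a transpose pass over sparse rows: one output record per
# nonzero entry, keyed by column index. Illustrates the counter
# semantics only; this is not Mahout's RowSimilarityJob code.
def transpose(rows):
    emitted = 0
    columns = {}
    for row_id, vector in rows.items():        # "Map input records"
        for col, value in vector.items():      # one record per nonzero entry
            columns.setdefault(col, {})[row_id] = value
            emitted += 1                       # "Map output records"
    return columns, emitted

# Four input rows but only three nonzero entries overall, so output
# records = 3 -- mirroring the suspicious counters in the thread.
rows = {0: {7: 1.0}, 1: {}, 2: {7: 2.0, 9: 0.5}, 3: {}}
cols, emitted = transpose(rows)
print(emitted)        # 3
print(sorted(cols))   # [7, 9]
```

With healthy tf-idf vectors the output-record count should be on the order of (rows x average nonzero terms per row), which is why 3 records from 4838 documents signals a broken input.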

Re: How to find the k most similar docs

Posted by Suneel Marthi <su...@yahoo.com>.
Pat,

MatrixDump expects an input file of <Text, MatrixWritable>. The matrix that gets created from RowIdJob is <IntWritable, VectorWritable>, and you cannot run MatrixDump to see the contents of the matrix. You need to use seqdumper as you had done.
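For reference, the quantity the rowsimilarity job in this thread is asked to compute (cosine similarity between tf-idf rows, keeping the top -m per row and excluding self-similarity) can be sketched in a few lines of plain Python. This is an illustration of the math on toy vectors, not Mahout's implementation:

```python
import math

# Cosine similarity between sparse vectors (dicts of index -> weight).
def cosine(a, b):
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Top-k most similar rows for each row, excluding self-similarity
# (what -m 10 and --excludeSelfSimilarity request on a real corpus).
def top_k_similar(rows, k):
    result = {}
    for rid, vec in rows.items():
        scored = [(cosine(vec, other), oid)
                  for oid, other in rows.items() if oid != rid]
        scored.sort(reverse=True)
        result[rid] = [(oid, s) for s, oid in scored[:k]]
    return result

rows = {0: {1: 1.0, 2: 1.0},
        1: {1: 1.0, 2: 0.9},
        2: {5: 1.0}}
sims = top_k_similar(rows, k=1)
print(sims[0])   # row 1 is the nearest neighbour of row 0
```

On a corpus of N documents this brute-force loop is O(N^2) pairs, which is exactly the part RowSimilarityJob distributes via the transpose/co-occurrence passes.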



________________________________
 From: Pat Ferrel <pa...@occamsmachete.com>
To: user@mahout.apache.org 
Sent: Thursday, March 8, 2012 7:14 PM
Subject: Re: How to find the k most similar docs
 
[Pat's original message trimmed here; it is quoted in full in the reply above.]
   12/03/08 15:54:43 INFO mapred.JobClient:  map 93% reduce 0%
   12/03/08 15:54:46 INFO mapred.JobClient:  map 99% reduce 0%
   12/03/08 15:54:49 INFO mapred.JobClient:  map 100% reduce 0%
   12/03/08 15:55:01 INFO mapred.JobClient:  map 100% reduce 100%
   12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
   job_201203071745_0042
   12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
   12/03/08 15:55:06 INFO mapred.JobClient:   Job Counters
   12/03/08 15:55:06 INFO mapred.JobClient:     Launched reduce tasks=1
   12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=133985
   12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
   reduces waiting after reserving slots (ms)=0
   12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
   maps waiting after reserving slots (ms)=0
   12/03/08 15:55:06 INFO mapred.JobClient:     Launched map tasks=1
   12/03/08 15:55:06 INFO mapred.JobClient:     Data-local map tasks=1
   12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10311
   12/03/08 15:55:06 INFO mapred.JobClient:   File Output Format Counters
   12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Written=580158
   12/03/08 15:55:06 INFO mapred.JobClient:   FileSystemCounters
   12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_READ=14921344
   12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_READ=73395400
   12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=15396906
   12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=580158
   12/03/08 15:55:06 INFO mapred.JobClient:   File Input Format Counters
   12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Read=73395270
   12/03/08 15:55:06 INFO mapred.JobClient:   Map-Reduce Framework
   12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input groups=4837
   12/03/08 15:55:06 INFO mapred.JobClient:     Map output materialized
   bytes=431573
   12/03/08 15:55:06 INFO mapred.JobClient:     Combine output
   records=96955
   12/03/08 15:55:06 INFO mapred.JobClient:     Map input records=4837
   12/03/08 15:55:06 INFO mapred.JobClient:     Reduce shuffle bytes=0
   12/03/08 15:55:06 INFO mapred.JobClient:     Reduce output records=4837
   12/03/08 15:55:06 INFO mapred.JobClient:     Spilled Records=166369
   12/03/08 15:55:06 INFO mapred.JobClient:     Map output bytes=153928302
   12/03/08 15:55:06 INFO mapred.JobClient:     Combine input
   records=7418380
   12/03/08 15:55:06 INFO mapred.JobClient:     Map output records=7326262
   12/03/08 15:55:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
   12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input records=4837
   12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
   (Minutes: 6.522983333333333)

performing seqdumper on the output looks reasonable.

Maybe named vectors is a problem?


On 3/7/12 8:50 AM, Sebastian Schelter wrote:
> Hi Pat,
> 
> Something is going completely wrong. The first pass over the data of
> RowSimilarityJob transposes the input matrix. From the output of the
> first jobs, it seems as if your input is a 4838 x 3 matrix only:
> 
> Map input records=4838
> Map output records=3
> Combine input records=3
> Combine output records=3
> Reduce input records=3
> 
> Could you have a detailed look at the input to RowSimilarityJob?
> 
> --sebastian
> 
> 
> On 07.03.2012 17:38, Pat Ferrel wrote:
>>     12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0
> 

Re: How to find the k most similar docs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
OK, back to the beginning. I went through the entire sequence again with 
the notable exception that I did not create named vectors. I also 
tweaked some of the seq2sparse parameters.

    bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
    org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
    -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2

After running rowid on the tfidf vectors I still get an error doing 
matrixdump on wp-matrix/matrix. Am I using it wrong? Taking it on faith 
that a matrix was created, I ran the rowsimilarity job, which now 
produces a far bigger output file that looks OK:

    bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o
    wp-similarity -ess -s SIMILARITY_COSINE -m 10
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    HADOOP_CONF_DIR=/usr/local/hadoop/conf
    MAHOUT-JOB:
    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
    12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --excludeSelfSimilarity=false,
    --input=wp-matrix/matrix, --maxSimilaritiesPerRow=10,
    --numberOfColumns=311433, --output=wp-similarity,
    --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
    12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/08 15:48:36 INFO mapred.JobClient: Running job:
    job_201203071745_0040
    12/03/08 15:48:37 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/08 15:48:58 INFO mapred.JobClient:  map 17% reduce 0%
    12/03/08 15:49:01 INFO mapred.JobClient:  map 27% reduce 0%
    12/03/08 15:49:04 INFO mapred.JobClient:  map 40% reduce 0%
    12/03/08 15:49:07 INFO mapred.JobClient:  map 47% reduce 0%
    12/03/08 15:49:10 INFO mapred.JobClient:  map 60% reduce 0%
    12/03/08 15:49:13 INFO mapred.JobClient:  map 70% reduce 0%
    12/03/08 15:49:16 INFO mapred.JobClient:  map 80% reduce 0%
    12/03/08 15:49:19 INFO mapred.JobClient:  map 92% reduce 0%
    12/03/08 15:49:22 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/08 15:49:46 INFO mapred.JobClient:  map 100% reduce 33%
    12/03/08 15:49:52 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/08 15:49:57 INFO mapred.JobClient: Job complete:
    job_201203071745_0040
    12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
    12/03/08 15:49:57 INFO mapred.JobClient:   Job Counters
    12/03/08 15:49:57 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=55564
    12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/08 15:49:57 INFO mapred.JobClient:     Rack-local map tasks=1
    12/03/08 15:49:57 INFO mapred.JobClient:     Launched map tasks=1
    12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13565
    12/03/08 15:49:57 INFO mapred.JobClient:   File Output Format Counters
    12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Written=45587186
    12/03/08 15:49:57 INFO mapred.JobClient:   FileSystemCounters
    12/03/08 15:49:57 INFO mapred.JobClient:     FILE_BYTES_READ=99732287
    12/03/08 15:49:57 INFO mapred.JobClient:     HDFS_BYTES_READ=17156393
    12/03/08 15:49:57 INFO mapred.JobClient:    
    FILE_BYTES_WRITTEN=138104586
    12/03/08 15:49:57 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=45587207
    12/03/08 15:49:57 INFO mapred.JobClient:   File Input Format Counters
    12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Read=17156283
    12/03/08 15:49:57 INFO mapred.JobClient:  
    org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
    12/03/08 15:49:57 INFO mapred.JobClient:     ROWS=4838
    12/03/08 15:49:57 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input groups=294936
    12/03/08 15:49:57 INFO mapred.JobClient:     Map output materialized
    bytes=38326948
    12/03/08 15:49:57 INFO mapred.JobClient:     Combine output
    records=2242965
    12/03/08 15:49:57 INFO mapred.JobClient:     Map input records=4838
    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce shuffle
    bytes=38326948
    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce output
    records=294933
    12/03/08 15:49:57 INFO mapred.JobClient:     Spilled Records=3432447
    12/03/08 15:49:57 INFO mapred.JobClient:     Map output bytes=83168813
    12/03/08 15:49:57 INFO mapred.JobClient:     Combine input
    records=5912090
    12/03/08 15:49:57 INFO mapred.JobClient:     Map output records=3964061
    12/03/08 15:49:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
    12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input records=294936
    12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/08 15:49:58 INFO mapred.JobClient: Running job:
    job_201203071745_0041
    12/03/08 15:49:59 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/08 15:50:19 INFO mapred.JobClient:  map 8% reduce 0%
    12/03/08 15:50:22 INFO mapred.JobClient:  map 12% reduce 0%
    12/03/08 15:50:25 INFO mapred.JobClient:  map 15% reduce 0%
    12/03/08 15:50:28 INFO mapred.JobClient:  map 21% reduce 0%
    12/03/08 15:50:31 INFO mapred.JobClient:  map 23% reduce 0%
    12/03/08 15:50:34 INFO mapred.JobClient:  map 28% reduce 0%
    12/03/08 15:50:37 INFO mapred.JobClient:  map 32% reduce 0%
    12/03/08 15:50:40 INFO mapred.JobClient:  map 33% reduce 0%
    12/03/08 15:50:43 INFO mapred.JobClient:  map 35% reduce 0%
    12/03/08 15:50:46 INFO mapred.JobClient:  map 40% reduce 0%
    12/03/08 15:50:49 INFO mapred.JobClient:  map 42% reduce 0%
    12/03/08 15:50:52 INFO mapred.JobClient:  map 47% reduce 0%
    12/03/08 15:50:55 INFO mapred.JobClient:  map 48% reduce 0%
    12/03/08 15:50:58 INFO mapred.JobClient:  map 55% reduce 0%
    12/03/08 15:51:01 INFO mapred.JobClient:  map 57% reduce 0%
    12/03/08 15:51:04 INFO mapred.JobClient:  map 62% reduce 0%
    12/03/08 15:51:07 INFO mapred.JobClient:  map 67% reduce 0%
    12/03/08 15:51:10 INFO mapred.JobClient:  map 69% reduce 0%
    12/03/08 15:51:13 INFO mapred.JobClient:  map 75% reduce 0%
    12/03/08 15:51:20 INFO mapred.JobClient:  map 80% reduce 0%
    12/03/08 15:51:23 INFO mapred.JobClient:  map 81% reduce 0%
    12/03/08 15:51:26 INFO mapred.JobClient:  map 86% reduce 0%
    12/03/08 15:51:29 INFO mapred.JobClient:  map 88% reduce 0%
    12/03/08 15:51:31 INFO mapred.JobClient:  map 92% reduce 0%
    12/03/08 15:51:34 INFO mapred.JobClient:  map 94% reduce 0%
    12/03/08 15:51:37 INFO mapred.JobClient:  map 98% reduce 0%
    12/03/08 15:51:40 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/08 15:52:19 INFO mapred.JobClient:  map 100% reduce 70%
    12/03/08 15:52:26 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/08 15:52:31 INFO mapred.JobClient: Job complete:
    job_201203071745_0041
    12/03/08 15:52:31 INFO mapred.JobClient: Counters: 27
    12/03/08 15:52:31 INFO mapred.JobClient:   Job Counters
    12/03/08 15:52:31 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=124769
    12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/08 15:52:31 INFO mapred.JobClient:     Rack-local map tasks=1
    12/03/08 15:52:31 INFO mapred.JobClient:     Launched map tasks=1
    12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16543
    12/03/08 15:52:31 INFO mapred.JobClient:   File Output Format Counters
    12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Written=73395270
    12/03/08 15:52:31 INFO mapred.JobClient:   FileSystemCounters
    12/03/08 15:52:31 INFO mapred.JobClient:     FILE_BYTES_READ=509127834
    12/03/08 15:52:31 INFO mapred.JobClient:     HDFS_BYTES_READ=45587326
    12/03/08 15:52:31 INFO mapred.JobClient:    
    FILE_BYTES_WRITTEN=577589760
    12/03/08 15:52:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=73395270
    12/03/08 15:52:31 INFO mapred.JobClient:   File Input Format Counters
    12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Read=45587186
    12/03/08 15:52:31 INFO mapred.JobClient:  
    org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
    12/03/08 15:52:31 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
    12/03/08 15:52:31 INFO mapred.JobClient:     COOCCURRENCES=65114863
    12/03/08 15:52:31 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input groups=4837
    12/03/08 15:52:31 INFO mapred.JobClient:     Map output materialized
    bytes=68416023
    12/03/08 15:52:31 INFO mapred.JobClient:     Combine output
    records=79108
    12/03/08 15:52:31 INFO mapred.JobClient:     Map input records=294933
    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce shuffle
    bytes=68416023
    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce output records=4837
    12/03/08 15:52:31 INFO mapred.JobClient:     Spilled Records=117235
    12/03/08 15:52:31 INFO mapred.JobClient:     Map output bytes=694645784
    12/03/08 15:52:31 INFO mapred.JobClient:     Combine input
    records=4038329
    12/03/08 15:52:31 INFO mapred.JobClient:     Map output records=3964058
    12/03/08 15:52:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
    12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input records=4837
    12/03/08 15:52:32 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/08 15:52:32 INFO mapred.JobClient: Running job:
    job_201203071745_0042
    12/03/08 15:52:33 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/08 15:52:52 INFO mapred.JobClient:  map 3% reduce 0%
    12/03/08 15:52:55 INFO mapred.JobClient:  map 5% reduce 0%
    12/03/08 15:52:58 INFO mapred.JobClient:  map 7% reduce 0%
    12/03/08 15:53:01 INFO mapred.JobClient:  map 9% reduce 0%
    12/03/08 15:53:04 INFO mapred.JobClient:  map 10% reduce 0%
    12/03/08 15:53:07 INFO mapred.JobClient:  map 12% reduce 0%
    12/03/08 15:53:10 INFO mapred.JobClient:  map 14% reduce 0%
    12/03/08 15:53:13 INFO mapred.JobClient:  map 17% reduce 0%
    12/03/08 15:53:16 INFO mapred.JobClient:  map 18% reduce 0%
    12/03/08 15:53:19 INFO mapred.JobClient:  map 21% reduce 0%
    12/03/08 15:53:22 INFO mapred.JobClient:  map 23% reduce 0%
    12/03/08 15:53:25 INFO mapred.JobClient:  map 25% reduce 0%
    12/03/08 15:53:28 INFO mapred.JobClient:  map 27% reduce 0%
    12/03/08 15:53:31 INFO mapred.JobClient:  map 29% reduce 0%
    12/03/08 15:53:34 INFO mapred.JobClient:  map 31% reduce 0%
    12/03/08 15:53:37 INFO mapred.JobClient:  map 33% reduce 0%
    12/03/08 15:53:40 INFO mapred.JobClient:  map 35% reduce 0%
    12/03/08 15:53:43 INFO mapred.JobClient:  map 37% reduce 0%
    12/03/08 15:53:46 INFO mapred.JobClient:  map 39% reduce 0%
    12/03/08 15:53:49 INFO mapred.JobClient:  map 41% reduce 0%
    12/03/08 15:53:52 INFO mapred.JobClient:  map 43% reduce 0%
    12/03/08 15:53:55 INFO mapred.JobClient:  map 46% reduce 0%
    12/03/08 15:53:58 INFO mapred.JobClient:  map 48% reduce 0%
    12/03/08 15:54:01 INFO mapred.JobClient:  map 50% reduce 0%
    12/03/08 15:54:04 INFO mapred.JobClient:  map 53% reduce 0%
    12/03/08 15:54:07 INFO mapred.JobClient:  map 55% reduce 0%
    12/03/08 15:54:10 INFO mapred.JobClient:  map 57% reduce 0%
    12/03/08 15:54:13 INFO mapred.JobClient:  map 60% reduce 0%
    12/03/08 15:54:16 INFO mapred.JobClient:  map 63% reduce 0%
    12/03/08 15:54:19 INFO mapred.JobClient:  map 65% reduce 0%
    12/03/08 15:54:22 INFO mapred.JobClient:  map 68% reduce 0%
    12/03/08 15:54:25 INFO mapred.JobClient:  map 71% reduce 0%
    12/03/08 15:54:28 INFO mapred.JobClient:  map 74% reduce 0%
    12/03/08 15:54:31 INFO mapred.JobClient:  map 77% reduce 0%
    12/03/08 15:54:34 INFO mapred.JobClient:  map 81% reduce 0%
    12/03/08 15:54:37 INFO mapred.JobClient:  map 84% reduce 0%
    12/03/08 15:54:40 INFO mapred.JobClient:  map 88% reduce 0%
    12/03/08 15:54:43 INFO mapred.JobClient:  map 93% reduce 0%
    12/03/08 15:54:46 INFO mapred.JobClient:  map 99% reduce 0%
    12/03/08 15:54:49 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/08 15:55:01 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
    job_201203071745_0042
    12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
    12/03/08 15:55:06 INFO mapred.JobClient:   Job Counters
    12/03/08 15:55:06 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=133985
    12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/08 15:55:06 INFO mapred.JobClient:     Launched map tasks=1
    12/03/08 15:55:06 INFO mapred.JobClient:     Data-local map tasks=1
    12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10311
    12/03/08 15:55:06 INFO mapred.JobClient:   File Output Format Counters
    12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Written=580158
    12/03/08 15:55:06 INFO mapred.JobClient:   FileSystemCounters
    12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_READ=14921344
    12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_READ=73395400
    12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=15396906
    12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=580158
    12/03/08 15:55:06 INFO mapred.JobClient:   File Input Format Counters
    12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Read=73395270
    12/03/08 15:55:06 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input groups=4837
    12/03/08 15:55:06 INFO mapred.JobClient:     Map output materialized
    bytes=431573
    12/03/08 15:55:06 INFO mapred.JobClient:     Combine output
    records=96955
    12/03/08 15:55:06 INFO mapred.JobClient:     Map input records=4837
    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce shuffle bytes=0
    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce output records=4837
    12/03/08 15:55:06 INFO mapred.JobClient:     Spilled Records=166369
    12/03/08 15:55:06 INFO mapred.JobClient:     Map output bytes=153928302
    12/03/08 15:55:06 INFO mapred.JobClient:     Combine input
    records=7418380
    12/03/08 15:55:06 INFO mapred.JobClient:     Map output records=7326262
    12/03/08 15:55:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
    12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input records=4837
    12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
    (Minutes: 6.522983333333333)

Performing seqdumper on the output looks reasonable.

Maybe named vectors were the problem?


On 3/7/12 8:50 AM, Sebastian Schelter wrote:
> Hi Pat,
>
> Something is going completely wrong. The first pass over the data of
> RowSimilarityJob transposes the input matrix. From the output of the
> first jobs, it seems as if your input is a 4838 x 3 matrix only:
>
> Map input records=4838
> Map output records=3
> Combine input records=3
> Combine output records=3
> Reduce input records=3
>
> Could you have a detailed look at the input to RowSimilarityJob?
>
> --sebastian
>
>
> On 07.03.2012 17:38, Pat Ferrel wrote:
>>     12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0
>

Re: How to find the k most similar docs

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat,

Something is going completely wrong. The first pass over the data of
RowSimilarityJob transposes the input matrix. From the output of the
first jobs, it seems as if your input is a 4838 x 3 matrix only:

Map input records=4838
Map output records=3
Combine input records=3
Combine output records=3
Reduce input records=3
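
Sebastian's point about the transpose pass can be illustrated with a
plain-Python sketch (illustration only, not Mahout code): the first pass
regroups each row's {column: value} entries by column, so a healthy
tf-idf input should yield roughly vocabulary-size distinct columns, not
three.

```python
# Illustration only (plain Python, not Mahout code): transposing a sparse
# row matrix, as RowSimilarityJob's first pass does. Rows are dicts of
# {column_id: value}; the result regroups the entries by column.
def transpose(rows):
    cols = {}
    for row_id, vector in rows.items():
        for col_id, value in vector.items():
            cols.setdefault(col_id, {})[row_id] = value
    return cols

# Two rows with non-zero entries in columns 5 and 7 yield two distinct
# columns after the transpose; an input that transposes to only 3
# columns effectively had 3 non-zero dimensions.
rows = {0: {5: 1.0, 7: 2.0}, 1: {5: 0.5}}
print(len(transpose(rows)))  # 2
```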

Could you have a detailed look at the input to RowSimilarityJob?

--sebastian


On 07.03.2012 17:38, Pat Ferrel wrote:
>    12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0


Re: How to find the k most similar docs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I have been experimenting with different analyzers and n-grams to clean 
up the vectors. Here is a run on a high-dimensionality set of vectors 
with a loose analyzer (I think it was the default). The output of the 
rowid job was:

    pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowid -i
    wikipedia-tfidf-custom-analyzer/tfidf-vectors/ -o wikipedia-matrix
    --tempDir temp
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    HADOOP_CONF_DIR=/usr/local/hadoop/conf
    MAHOUT-JOB:
    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
    12/03/06 16:53:29 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647,
    --input=wikipedia-tfidf-custom-analyzer/tfidf-vectors/,
    --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
    12/03/06 16:53:30 INFO util.NativeCodeLoader: Loaded the
    native-hadoop library
    12/03/06 16:53:30 INFO zlib.ZlibFactory: Successfully loaded &
    initialized native-zlib library
    12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
    12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
    12/03/06 16:53:30 INFO vectors.RowIdJob: Wrote out matrix with 4838
    rows and 286907 columns to wikipedia-matrix/matrix
    12/03/06 16:53:30 INFO driver.MahoutDriver: Program took 1248 ms
    (Minutes: 0.0208)
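
For context, a hedged sketch (plain Python, not the actual Mahout
implementation; the helper name is made up) of what rowid does
conceptually: it replaces each vector's original document key with a
sequential integer row index, keeping the key-to-index mapping so
results can be traced back to documents.

```python
# Hypothetical sketch of the rowid job's idea: assign sequential integer
# row ids to vectors keyed by arbitrary document keys, and record the
# key -> index mapping alongside the re-keyed matrix.
def assign_row_ids(keyed_vectors):
    index = {}   # doc key -> integer row id
    matrix = []  # row id -> sparse vector
    for row_id, (key, vec) in enumerate(keyed_vectors):
        index[key] = row_id
        matrix.append(vec)
    return index, matrix

index, matrix = assign_row_ids([("docA", {0: 1.0}), ("docB", {1: 2.0})])
print(index)  # {'docA': 0, 'docB': 1}
```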

Then I removed temp (shouldn't the jobs do that?) and ran the 
rowsimilarity job:

    pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowsimilarity -i
    wikipedia-matrix/matrix -o wikipedia-similarity -r 286907
    --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    HADOOP_CONF_DIR=/usr/local/hadoop/conf
    MAHOUT-JOB:
    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
    12/03/06 17:00:55 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --excludeSelfSimilarity=true,
    --input=wikipedia-matrix/matrix, --maxSimilaritiesPerRow=10,
    --numberOfColumns=286907, --output=wikipedia-similarity,
    --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
    12/03/06 17:00:56 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/06 17:00:56 INFO mapred.JobClient: Running job:
    job_201203061645_0006
    12/03/06 17:00:57 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/06 17:01:13 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/06 17:01:25 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/06 17:01:30 INFO mapred.JobClient: Job complete:
    job_201203061645_0006
    12/03/06 17:01:30 INFO mapred.JobClient: Counters: 26
    12/03/06 17:01:30 INFO mapred.JobClient:   Job Counters
    12/03/06 17:01:30 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/06 17:01:30 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=13502
    12/03/06 17:01:30 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/06 17:01:30 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/06 17:01:30 INFO mapred.JobClient:     Rack-local map tasks=1
    12/03/06 17:01:30 INFO mapred.JobClient:     Launched map tasks=1
    12/03/06 17:01:30 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10496
    12/03/06 17:01:30 INFO mapred.JobClient:   File Output Format Counters
    12/03/06 17:01:30 INFO mapred.JobClient:     Bytes Written=97
    12/03/06 17:01:30 INFO mapred.JobClient:   FileSystemCounters
    12/03/06 17:01:30 INFO mapred.JobClient:     FILE_BYTES_READ=40
    12/03/06 17:01:30 INFO mapred.JobClient:     HDFS_BYTES_READ=122407
    12/03/06 17:01:30 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45437
    12/03/06 17:01:30 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=118
    12/03/06 17:01:30 INFO mapred.JobClient:   File Input Format Counters
    12/03/06 17:01:30 INFO mapred.JobClient:     Bytes Read=122290
    12/03/06 17:01:30 INFO mapred.JobClient:
    org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
    12/03/06 17:01:30 INFO mapred.JobClient: ROWS=4838
    12/03/06 17:01:30 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce input groups=3
    12/03/06 17:01:30 INFO mapred.JobClient:     Map output materialized
    bytes=32
    12/03/06 17:01:30 INFO mapred.JobClient:     Combine output records=3
    12/03/06 17:01:30 INFO mapred.JobClient:     Map input records=4838
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce shuffle bytes=32
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce output records=0
    12/03/06 17:01:30 INFO mapred.JobClient:     Spilled Records=6
    12/03/06 17:01:30 INFO mapred.JobClient:     Map output bytes=33
    12/03/06 17:01:30 INFO mapred.JobClient:     Combine input records=3
    12/03/06 17:01:30 INFO mapred.JobClient:     Map output records=3
    12/03/06 17:01:30 INFO mapred.JobClient:     SPLIT_RAW_BYTES=117
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce input records=3
    12/03/06 17:01:30 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/06 17:01:31 INFO mapred.JobClient: Running job:
    job_201203061645_0007
    12/03/06 17:01:32 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/06 17:01:49 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/06 17:02:01 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/06 17:02:06 INFO mapred.JobClient: Job complete:
    job_201203061645_0007
    12/03/06 17:02:06 INFO mapred.JobClient: Counters: 25
    12/03/06 17:02:06 INFO mapred.JobClient:   Job Counters
    12/03/06 17:02:06 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/06 17:02:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12989
    12/03/06 17:02:06 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Launched map tasks=1
    12/03/06 17:02:06 INFO mapred.JobClient:     Data-local map tasks=1
    12/03/06 17:02:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10341
    12/03/06 17:02:06 INFO mapred.JobClient:   File Output Format Counters
    12/03/06 17:02:06 INFO mapred.JobClient:     Bytes Written=97
    12/03/06 17:02:06 INFO mapred.JobClient:   FileSystemCounters
    12/03/06 17:02:06 INFO mapred.JobClient:     FILE_BYTES_READ=22
    12/03/06 17:02:06 INFO mapred.JobClient:     HDFS_BYTES_READ=237
    12/03/06 17:02:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45937
    12/03/06 17:02:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=97
    12/03/06 17:02:06 INFO mapred.JobClient:   File Input Format Counters
    12/03/06 17:02:06 INFO mapred.JobClient:     Bytes Read=97
    12/03/06 17:02:06 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce input groups=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map output materialized
    bytes=14
    12/03/06 17:02:06 INFO mapred.JobClient:     Combine output records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map input records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce shuffle bytes=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce output records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Spilled Records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map output bytes=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Combine input records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map output records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce input records=0
    12/03/06 17:02:07 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/06 17:02:07 INFO mapred.JobClient: Running job:
    job_201203061645_0008
    12/03/06 17:02:08 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/06 17:02:25 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/06 17:02:37 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/06 17:02:42 INFO mapred.JobClient: Job complete:
    job_201203061645_0008
    12/03/06 17:02:42 INFO mapred.JobClient: Counters: 25
    12/03/06 17:02:42 INFO mapred.JobClient:   Job Counters
    12/03/06 17:02:42 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/06 17:02:42 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12971
    12/03/06 17:02:42 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Launched map tasks=1
    12/03/06 17:02:42 INFO mapred.JobClient:     Data-local map tasks=1
    12/03/06 17:02:42 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10322
    12/03/06 17:02:42 INFO mapred.JobClient:   File Output Format Counters
    12/03/06 17:02:42 INFO mapred.JobClient:     Bytes Written=97
    12/03/06 17:02:42 INFO mapred.JobClient:   FileSystemCounters
    12/03/06 17:02:42 INFO mapred.JobClient:     FILE_BYTES_READ=22
    12/03/06 17:02:42 INFO mapred.JobClient:     HDFS_BYTES_READ=227
    12/03/06 17:02:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44039
    12/03/06 17:02:42 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=97
    12/03/06 17:02:42 INFO mapred.JobClient:   File Input Format Counters
    12/03/06 17:02:42 INFO mapred.JobClient:     Bytes Read=97
    12/03/06 17:02:42 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce input groups=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map output materialized
    bytes=14
    12/03/06 17:02:42 INFO mapred.JobClient:     Combine output records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce shuffle bytes=14
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce output records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Spilled Records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map output bytes=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Combine input records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map output records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce input records=0
    12/03/06 17:02:42 INFO driver.MahoutDriver: Program took 107225 ms
    (Minutes: 1.7870833333333334)

It seems to have executed correctly. I ran it on a small cluster, but it 
was awfully fast even so. The ROWS counter is there, but not the others.

How is the output stored, and what does it represent? I would expect a 
sequence of row ids as keys, with ten row ids each as values. I used named 
vectors, if that matters.

The output is of the correct type but empty. Here is the seqdumper output; 
notice Count: 0, and the file is only 97 bytes.

    pat@occam2:~/mahout-distribution-0.6$ bin/mahout seqdumper -s
    wikipedia-similarity/part-r-00000
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    HADOOP_CONF_DIR=/usr/local/hadoop/conf
    MAHOUT-JOB:
    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
    12/03/07 08:31:59 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --seqFile=wikipedia-similarity/part-r-00000,
    --startPhase=0, --tempDir=temp}
    Input Path: wikipedia-similarity/part-r-00000
    Key class: class org.apache.hadoop.io.IntWritable Value Class: class
    org.apache.mahout.math.VectorWritable
    Count: 0
    12/03/07 08:31:59 INFO driver.MahoutDriver: Program took 603 ms
    (Minutes: 0.01005)
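
For what it's worth, the seqdumper output above confirms the key/value classes: the result is a SequenceFile of <IntWritable, VectorWritable>. My reading (an assumption, not confirmed by the docs) is that each key is a row id and each value is a sparse vector whose indices are the ids of similar rows and whose values are the similarity scores. A plain-Java sketch of that layout, with no Hadoop types and made-up scores, just to illustrate the shape:

```java
import java.util.*;

// Sketch of the presumed RowSimilarityJob output layout: key = row id,
// value = {similar row id -> cosine score}. Pure Java, illustration only;
// the real output uses IntWritable keys and VectorWritable values.
public class SimilarityOutputSketch {
    // Build one "VectorWritable"-like value: similar row id -> score.
    public static Map<Integer, Double> row(Object... pairs) {
        Map<Integer, Double> v = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2)
            v.put((Integer) pairs[i], (Double) pairs[i + 1]);
        return v;
    }

    public static void main(String[] args) {
        // key: row id of a document; value: its most similar rows with scores.
        Map<Integer, Map<Integer, Double>> output = new LinkedHashMap<>();
        output.put(0, row(17, 0.91, 42, 0.80));
        output.put(1, row(3, 0.75));
        for (Map.Entry<Integer, Map<Integer, Double>> e : output.entrySet())
            System.out.println("Key: " + e.getKey() + " Value: " + e.getValue());
    }
}
```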


On 3/6/12 11:09 PM, Sebastian Schelter wrote:
> Hi Pat,
>
> You are right, these results look strange. RowSimilarityJob has 3 custom
> counters (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES), can you give us
> the numbers for these?
>
> --sebastian
>
> On 07.03.2012 02:14, Pat Ferrel wrote:
>> Ok, making progress. I created a matrix using rowid and got the
>> following output:
>>
>>     Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
>>     wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
>>     ...
>>     12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
>>     {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
>>     --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
>>     2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
>>     from SCDynamicStore
>>     12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
>>     native-hadoop library for your platform... using builtin-java
>>     classes where applicable
>>     12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>>     12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>>     12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
>>     rows and 87325 columns to wikipedia-matrix/matrix
>>     12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
>>     (Minutes: 0.0293)
>>
>> So a doc matrix with 4838 docs and 87325 dimensions. Next I ran
>> RowSimilarityJob
>>
>>     Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
>>     -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
>>     --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
>>
>> This gives me output in wikipedia-similarity/part-m-00000 but the size
>> is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I
>> set no threshold so I'd expect it to pick the 10 nearest even if they
>> are far away.
>>
>> BTW what is the output format?
>>
>> On 3/5/12 11:48 AM, Suneel Marthi wrote:
>>> Pat,
>>>
>>> Your input to RowSimilarity seems to be the tfidf-vectors directory
>>> which is <Text, VectorWritable>.
>>>
>>> Before executing the RowSimilarity job you need to run the RowIdJob,
>>> which creates a matrix of <IntWritable, VectorWritable>.  This matrix
>>> should be the input to RowSimilarity.
>>>
>>> Also, your command seems to be missing the --tempDir argument; you
>>> would need that too.
>>>
>>> Suneel
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Sebastian Schelter<ss...@apache.org>
>>> *To:* user@mahout.apache.org
>>> *Sent:* Monday, March 5, 2012 2:32 PM
>>> *Subject:* Re: How to find the k most similar docs
>>>
>>> That's the problem:
>>>
>>> org.apache.hadoop.io.Text cannot be
>>>    cast to org.apache.hadoop.io.IntWritable
>>>
>>> RowSimilarityJob expects <IntWritable, VectorWritable> as input, it seems
>>> you supply <Text, VectorWritable>.
>>>
>>> --sebastian
>>>
>>> On 05.03.2012 20:29, Pat Ferrel wrote:
>>>> org.apache.hadoop.io.Text cannot be
>>>>     cast to org.apache.hadoop.io.IntWritable
>>>
>>>
>

Re: How to find the k most similar docs

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat,

You are right, these results look strange. RowSimilarityJob has 3 custom
counters (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES), can you give us
the numbers for these?

--sebastian

On 07.03.2012 02:14, Pat Ferrel wrote:
> Ok, making progress. I created a matrix using rowid and got the
> following output:
> 
>    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
>    wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
>    ...
>    12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
>    {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
>    --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
>    2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
>    from SCDynamicStore
>    12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
>    native-hadoop library for your platform... using builtin-java
>    classes where applicable
>    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>    12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
>    rows and 87325 columns to wikipedia-matrix/matrix
>    12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
>    (Minutes: 0.0293)
> 
> So a doc matrix with 4838 docs and 87325 dimensions. Next I ran
> RowSimilarityJob
> 
>    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
>    -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
>    --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
> 
> This gives me output in wikipedia-similarity/part-m-00000 but the size
> is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I
> set no threshold so I'd expect it to pick the 10 nearest even if they
> are far away.
> 
> BTW what is the output format?
> 
> On 3/5/12 11:48 AM, Suneel Marthi wrote:
>> Pat,
>>
>> Your input to RowSimilarity seems to be the tfidf-vectors directory
>> which is <Text, VectorWritable>.
>>
>> Before executing the RowSimilarity job you need to run the RowIdJob,
>> which creates a matrix of <IntWritable, VectorWritable>.  This matrix
>> should be the input to RowSimilarity.
>>
>> Also, your command seems to be missing the --tempDir argument; you
>> would need that too.
>>
>> Suneel
>>
>> ------------------------------------------------------------------------
>> *From:* Sebastian Schelter <ss...@apache.org>
>> *To:* user@mahout.apache.org
>> *Sent:* Monday, March 5, 2012 2:32 PM
>> *Subject:* Re: How to find the k most similar docs
>>
>> That's the problem:
>>
>> org.apache.hadoop.io.Text cannot be
>>   cast to org.apache.hadoop.io.IntWritable
>>
>> RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
>> you supply <Text,VectorWritable>.
>>
>> --sebastian
>>
>> On 05.03.2012 20:29, Pat Ferrel wrote:
>> > org.apache.hadoop.io.Text cannot be
>> >    cast to org.apache.hadoop.io.IntWritable
>>
>>
>>
> 


Re: How to find the k most similar docs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok, making progress. I created a matrix using rowid and got the 
following output:

    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
    wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
    ...
    12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
    --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
    2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
    from SCDynamicStore
    12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
    native-hadoop library for your platform... using builtin-java
    classes where applicable
    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
    12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
    rows and 87325 columns to wikipedia-matrix/matrix
    12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
    (Minutes: 0.0293)

So a doc matrix with 4838 docs and 87325 dimensions. Next I ran 
RowSimilarityJob

    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
    -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
    --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp

This gives me output in wikipedia-similarity/part-m-00000 but the size 
is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I 
set no threshold so I'd expect it to pick the 10 nearest even if they 
are far away.

BTW what is the output format?
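
For intuition about what the job should compute per row, here is a minimal pure-Java sketch of top-k cosine similarity over dense rows. This is my own illustration of the idea, not Mahout's implementation (which runs as a distributed co-occurrence job over sparse vectors):

```java
import java.util.*;

// Sketch: for one query row, rank all other rows by cosine similarity
// and keep the k nearest. Illustration only, not Mahout code.
public class TopKCosine {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Indices of the k rows most similar to rows[query], excluding itself.
    public static List<Integer> topK(double[][] rows, int query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) if (i != query) ids.add(i);
        // Sort by descending similarity to the query row.
        ids.sort((x, y) -> Double.compare(cosine(rows[query], rows[y]),
                                          cosine(rows[query], rows[x])));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        double[][] docs = {{1, 0, 0}, {0.9, 0.1, 0}, {0, 1, 0}, {0, 0, 1}};
        System.out.println(topK(docs, 0, 2)); // nearest two rows to row 0
    }
}
```

With no threshold set, this always returns k neighbors per row even when the similarities are small, which is why an empty result looks wrong here.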

On 3/5/12 11:48 AM, Suneel Marthi wrote:
> Pat,
>
> Your input to RowSimilarity seems to be the tfidf-vectors directory 
> which is <Text, vectorWritable>.
>
> Before executing the RowSimilarity job u need to run the RowIdJob 
> which creates a matrix of <IntWritable, VectorWritable>.  This matrix 
> should be the input to RowSimilarity.
>
> Also from your command, you seem to be missing --tempDir argument, you 
> would need that too.
>
> Suneel
>
> ------------------------------------------------------------------------
> *From:* Sebastian Schelter <ss...@apache.org>
> *To:* user@mahout.apache.org
> *Sent:* Monday, March 5, 2012 2:32 PM
> *Subject:* Re: How to find the k most similar docs
>
> That's the problem:
>
> org.apache.hadoop.io.Text cannot be
>   cast to org.apache.hadoop.io.IntWritable
>
> RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
> you supply <Text,VectorWritable>.
>
> --sebastian
>
> On 05.03.2012 20:29, Pat Ferrel wrote:
> > org.apache.hadoop.io.Text cannot be
> >    cast to org.apache.hadoop.io.IntWritable
>
>
>

Re: How to find the k most similar docs

Posted by Suneel Marthi <su...@yahoo.com>.
Pat,

Your input to RowSimilarity seems to be the tfidf-vectors directory, which is <Text, VectorWritable>.

Before executing the RowSimilarity job you need to run the RowIdJob, which creates a matrix of <IntWritable, VectorWritable>.  This matrix should be the input to RowSimilarity.

Also, your command seems to be missing the --tempDir argument; you would need that too.

Suneel
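
The conversion RowIdJob performs can be pictured in plain Java: assign each Text key a sequential int id, and keep an index mapping the ids back to the original names. This is a sketch under that assumption, without Hadoop types; the real job reads and writes SequenceFiles:

```java
import java.util.*;

// Conceptual sketch of RowIdJob: replace String ("Text") keys with
// sequential int ("IntWritable") keys, keeping a docIndex so the int ids
// can be mapped back to document names. Illustration only.
public class RowIdSketch {
    public static Map<Integer, String> reKey(Map<String, double[]> textKeyed,
                                             Map<Integer, double[]> intKeyed) {
        Map<Integer, String> docIndex = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, double[]> e : textKeyed.entrySet()) {
            intKeyed.put(next, e.getValue());  // <IntWritable, VectorWritable>-like
            docIndex.put(next, e.getKey());    // id -> original document name
            next++;
        }
        return docIndex;
    }

    public static void main(String[] args) {
        Map<String, double[]> tfidf = new LinkedHashMap<>();
        tfidf.put("/wiki/DocA", new double[]{0.1, 0.0});
        tfidf.put("/wiki/DocB", new double[]{0.0, 0.2});
        Map<Integer, double[]> matrix = new LinkedHashMap<>();
        System.out.println(reKey(tfidf, matrix)); // {0=/wiki/DocA, 1=/wiki/DocB}
    }
}
```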


________________________________
 From: Sebastian Schelter <ss...@apache.org>
To: user@mahout.apache.org 
Sent: Monday, March 5, 2012 2:32 PM
Subject: Re: How to find the k most similar docs
 
That's the problem:

org.apache.hadoop.io.Text cannot be
   cast to org.apache.hadoop.io.IntWritable

RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
you supply <Text,VectorWritable>.

--sebastian

On 05.03.2012 20:29, Pat Ferrel wrote:
> org.apache.hadoop.io.Text cannot be
>    cast to org.apache.hadoop.io.IntWritable

Re: How to find the k most similar docs

Posted by Sebastian Schelter <ss...@apache.org>.
That's the problem:

org.apache.hadoop.io.Text cannot be
   cast to org.apache.hadoop.io.IntWritable

RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
you supply <Text,VectorWritable>.
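
The failure mode is ordinary Java behavior: the file holds one runtime class, the mapper's generic signature promises another, and the cast fails at runtime. A plain-Java analogue of my own (String standing in for Text, Integer for IntWritable):

```java
// Plain-Java analogue of the Hadoop error: a String key cast to Integer
// fails at runtime exactly like a Text key cast to IntWritable.
public class CastSketch {
    public static boolean castFails(Object keyFromFile) {
        try {
            Integer id = (Integer) keyFromFile; // like (IntWritable) textKey
            return id == null;                  // unreachable for a String key
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castFails("wikipedia-doc-42")); // a Text-like key
    }
}
```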

--sebastian

On 05.03.2012 20:29, Pat Ferrel wrote:
> org.apache.hadoop.io.Text cannot be
>    cast to org.apache.hadoop.io.IntWritable