Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/05 20:29:35 UTC
Re: How to find the k most similar docs
I'm using Mahout 0.6 compiled from source via 'mvn install'. I used
Suneel's code below to get numberOfColumns.
When I try to run the rowsimilarity job via:
bin/mahout rowsimilarity -i wikipedia-clusters/tfidf-vectors/ -o
/wikipedia-similarity -r 87325 -s SIMILARITY_COSINE -m 10 -ess true
I get the following error
12/03/04 19:14:32 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647, --excludeSelfSimilarity=true,
--input=wikipedia-clusters/tfidf-vectors/,
--maxSimilaritiesPerRow=10, --numberOfColumns=87325,
--output=/wikipedia-similarity,
--similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
2012-03-04 19:14:32.376 java[1090:1903] Unable to load realm info
from SCDynamicStore
12/03/04 19:14:33 INFO input.FileInputFormat: Total input paths to
process : 1
12/03/04 19:14:33 INFO mapred.JobClient: Running job: job_local_0001
12/03/04 19:14:33 INFO mapred.MapTask: io.sort.mb = 100
12/03/04 19:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/04 19:14:33 INFO mapred.MapTask: record buffer = 262144/327680
12/03/04 19:14:34 WARN mapred.LocalJobRunner: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
cast to org.apache.hadoop.io.IntWritable
at
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:154)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
The cast error (as I understand it) usually happens when a classname is
passed in incorrectly. That seems likely here, since the cooccurrence-based
similarity is being used. Have I missed something obvious about how to
pass in the similarity measure to use?
On 2/19/12 9:00 PM, Suneel Marthi wrote:
> Hi Pat,
>
>
> 1. Please look at the discussion thread at http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser for a description of what the RowSimilarityJob does. The RowSimilarityJob implementation is based on the research paper - http://www.csee.ogi.edu/~zak/cs506-pslc/docsim.pdf
>
> I'll add the details on the mahout wiki page sometime this week.
>
> 2. 'maxSimilaritiesPerRow' returns the best similarities (not the first); by default this returns the top 100 if not specified.
>
> 3. If you would like to discard the similarities per row below a certain value you can specify a threshold -tr, which would limit the results to only those documents that have a similarity value greater than the threshold.
>
> Depending on the similarity measures that you get as the final output, it should give you an idea of what T1 and T2 should be. In my particular use case I was only interested in documents that had a similarity measure of 0.7 or greater, hence 0.7 was my T2; and the most similar document had a similarity value of 0.99999 (which is what I used as my T1).
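As an illustrative sketch (plain Python, not Mahout code) of what the -tr threshold does to one row's similarity list; the document ids and scores below are made up:

```python
# Hypothetical similarity scores for one row; ids and values are made up.
similarities = {14458: 0.30, 11399: 0.55, 12793: 0.72, 3275: 0.91}

def apply_threshold(sims, threshold):
    """Keep only the documents whose similarity exceeds the threshold,
    mirroring what the -tr option does per row."""
    return {doc: s for doc, s in sims.items() if s > threshold}

# With a threshold of 0.7 (the T2 from the use case above), only the
# two strongest matches survive.
kept = apply_threshold(similarities, 0.7)
```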
>
> 4. 'numberOfColumns' is not optional, but I tend to agree with you that, if not specified, it should be inferred automatically from the size of the input vectors. This could be an enhancement to add to the RowSimilarityJob.
>
> Code snippet below gets the number of columns in a matrix if not specified by the user.
>
> Path inputMatrixPath = new Path(getInputPath());
> SequenceFile.Reader sequenceFileReader = new SequenceFile.Reader(fs, inputMatrixPath, conf);
> int numberOfColumns = getDimensions(sequenceFileReader);
> sequenceFileReader.close();
>
> private int getDimensions(Reader reader) throws IOException, InstantiationException, IllegalAccessException {
>   // The key may be any Writable type; instantiate one so we can read a record.
>   Class keyClass = reader.getKeyClass();
>   Writable row = (Writable) keyClass.newInstance();
>   if (!reader.getValueClass().equals(VectorWritable.class)) {
>     throw new IllegalArgumentException("Value type of sequencefile must be a VectorWritable");
>   }
>   VectorWritable vw = new VectorWritable();
>   if (!reader.next(row, vw)) {
>     log.error("matrix must have at least one row");
>     throw new IllegalStateException();
>   }
>   // The column count is the size of the first row vector.
>   Vector v = vw.get();
>   return v.size();
> }
> 5. RowSimilarityJob also has an option to excludeSelfSimilarity (which is false by default) but you need to specify this so that you don't end up comparing a document with itself and ending up with a similarity measure of 1.0 (if using Cosine measure).
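To see why self-similarity comes out as 1.0 under the cosine measure, here is a minimal sketch in plain Python (not Mahout code): the cosine of any nonzero vector with itself is 1, up to floating-point rounding.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Any nonzero vector compared with itself scores ~1.0, which is why
# --excludeSelfSimilarity is worth setting: otherwise every document's
# best match is itself.
v = [0.5, 1.2, 0.0, 3.1]
self_sim = cosine(v, v)  # ~1.0 (rounding can give e.g. 1.0000000000000004)
```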
>
> Let me know if you have any more questions.
>
>
>
>
>
> ________________________________
> From: Sebastian Schelter<ss...@apache.org>
> To: user@mahout.apache.org
> Sent: Sunday, February 19, 2012 4:33 PM
> Subject: Re: How to find the k most similar docs
>
> Hi Pat,
>
> 'numberOfColumns' is not optional but is only used by a few
> similarityMeasures (such as loglikelihood ratio).
> 'maxSimilaritiesPerRow' retains the top similarities.
>
> --sebastian
>
>
> On 19.02.2012 22:11, Pat Ferrel wrote:
>> This looks perfect, thanks.
>>
>> I had planned to do the RowSimilarityJob after clustering to reduce the
>> rows from the entire corpus to only those in a cluster. You mention
>> using the distance between similar rows to get an idea of the distances
>> for canopy clustering. This seems a very good idea since I have no other
>> good way to generate T1 and T2. The downside is that I have to do
>> RowSimilarityJob on all docs in the corpus. I assume that since you have
>> done this on 10 Million docs that the benefit in getting good canopies
>> outweighs doing similarity on all docs as far as processing resources
>> needed?
>>
>> I am new to reading mapreduce code, so may I ask some noob questions:
>> * is the best documentation here?
>>
>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/hadoop/similarity/RowSimilarityJob.html#run(java.lang.String[])
>>
>> * the command line arguments include: numberOfColumns, shouldn't that
>> be easily extracted from the input matrix? is this optional? How do
>> I tell which argument is optional from the docs?
>> * the argument maxSimilaritiesPerRow could return first or best, it is
>> difficult to see which.
>>
>> I have the source, but perhaps due to the string-based binding I am
>> finding it hard to track down what code is run, so any tips for reading
>> the code or docs are greatly appreciated.
>>
>>
>> On 2/18/12 1:27 PM, Suneel Marthi wrote:
>>> You might want to look at the RowSimilarityJob in Mahout to determine
>>> document similarity.
>>>
>>>
>>> Here's what you would do:-
>>>
>>> Assuming that your documents have already been vectorized, first
>>> convert the vectors into an M*N matrix by calling the RowIdJob in
>>> Mahout where M = No. of rows (or documents in your case) and N= No. of
>>> columns (or the unique terms).
>>>
>>>
>>> Then run the RowSimilarity job on the matrix generated in the previous
>>> step by specifying a cosine similarity measure; this should generate
>>> an output that gives the most similar documents for each of the
>>> documents and the similarity distance between them. RowSimilarityJob
>>> is a mapreduce job, so you should be able to run this on a really large
>>> corpus (I had run this on 10 million web pages).
>>> The output of the RowSimilarity along with the similarity distances
>>> that are generated between document pairs should give an idea as to
>>> what the values of T1 and T2 should be when running canopy clustering.
>>> And the number of clusters generated by running canopy would
>>> eventually be fed into k-means as you had mentioned.
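Conceptually, the RowSimilarityJob step described above computes something like the following sketch, written here as plain single-machine Python rather than mapreduce, over a tiny made-up term matrix:

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors (assumed nonzero)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_similar(rows, k):
    """For every row, return its k most similar *other* rows as
    (row id, similarity) pairs, sorted by descending similarity.
    This mirrors maxSimilaritiesPerRow with self-similarity excluded."""
    result = {}
    for i, row in enumerate(rows):
        sims = [(j, cosine(row, other)) for j, other in enumerate(rows) if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        result[i] = sims[:k]
    return result

# Three tiny "documents" over a two-term vocabulary; docs 0 and 1 are identical.
matrix = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = top_k_similar(matrix, k=1)
```

The real job distributes this O(n^2) comparison across mappers and reducers, which is why it scales to millions of documents.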
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Pat Ferrel<pa...@occamsmachete.com>
>>> To: user@mahout.apache.org
>>> Sent: Saturday, February 18, 2012 2:39 PM
>>> Subject: How to find the k most similar docs
>>> Given documents that are vectorized into Mahout vectors, have stop
>>> words removed, and a TFIDF dictionary, what is the best distributed
>>> way to get k nearest documents using a measure like cosine similarity
>>> (or the others provided in Mahout)? I will be doing this for every
>>> document in the corpus so the question is partly how best to do this
>>> given the existing mahout+hadoop framework. What is the intuition
>>> about processing resources needed?
>>>
>>> Expansion: At some point I'd like to extend this idea to find similar
>>> clusters but expect that the same method should work only with
>>> centroids instead of doc vectors. Also I expect to do canopy
>>> clustering to feed into kmeans clustering. I'll perform the similarity
>>> measure only on docs in the same cluster. I think I understand how to
>>> do this preprocessing, so the question is primarily the k most similar
>>> docs and/or centroids. This sounds like k nearest neighbors; if so, is
>>> this the best way to do it in mahout+hadoop?
Re: How to find the k most similar docs
Posted by Fernando Fernández <fe...@gmail.com>.
I'm surprised no one has mentioned SVD yet. You are supposed to obtain
better results using SVD factors instead of the original TF-IDF vectors when
computing similarities (this is the theory). Many text mining applications
follow these steps:
- Stopword removal.
- Tf-Idf computation.
- Svd factorization.
- Clustering or supervised classification using SVD factors.
Mahout has distributed SVD routines you can use
(DistributedLanczosSolver); you may want to check them out.
Best,
Fernando.
2012/3/5 Suneel Marthi <su...@yahoo.com>
> Pat,
>
> Your input to RowSimilarity seems to be the tfidf-vectors directory, which
> is <Text, VectorWritable>.
>
> Before executing the RowSimilarity job you need to run the RowIdJob, which
> creates a matrix of <IntWritable, VectorWritable>. This matrix should be
> the input to RowSimilarity.
>
> Also from your command, you seem to be missing --tempDir argument, you
> would need that too.
>
> Suneel
>
>
> ________________________________
> From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org
> Sent: Monday, March 5, 2012 2:32 PM
> Subject: Re: How to find the k most similar docs
>
> That's the problem:
>
> org.apache.hadoop.io.Text cannot be
> cast to org.apache.hadoop.io.IntWritable
>
> RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
> you supply <Text,VectorWritable>.
>
> --sebastian
>
> On 05.03.2012 20:29, Pat Ferrel wrote:
> > org.apache.hadoop.io.Text cannot be
> > cast to org.apache.hadoop.io.IntWritable
>
Re: How to find the k most similar docs
Posted by Suneel Marthi <su...@yahoo.com>.
Did the RowSimilarityJob execute successfully? Your output should have been one or more part-r-* files (depending on the number of reducers you have configured in your environment).
You should be able to get a sequence dump of the wikipedia-similarity/part-m-00000 file to see what it contains.
The output format of RowSimilarityJob is <IntWritable, VectorWritable>.
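As a hedged sketch (plain Python, not Mahout code), a seqdumper text line for that output, in the `Key: N: Value: {id:score,...}` layout shown later in this thread, could be parsed back into ids and scores like this:

```python
def parse_row(line):
    """Split one seqdumper text line of RowSimilarityJob output into
    (row id, {other row id: similarity}). Assumes the
    'Key: N: Value: {...}' layout shown elsewhere in this thread."""
    key_part, value_part = line.split(": Value: ")
    row_id = int(key_part.replace("Key:", "").strip())
    entries = value_part.strip().strip("{}").split(",")
    sims = {}
    for entry in entries:
        other, score = entry.split(":")
        sims[int(other)] = float(score)
    return row_id, sims

# A shortened, made-up line in the format seqdumper prints.
row_id, sims = parse_row("Key: 0: Value: {14458:0.2966,11399:0.3029}")
```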
________________________________
From: Pat Ferrel <pa...@occamsmachete.com>
To:
Cc: "user@mahout.apache.org" <us...@mahout.apache.org>
Sent: Tuesday, March 6, 2012 8:14 PM
Subject: Re: How to find the k most similar docs
Ok, making progress. I created a matrix using rowid and got the following output:
Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
...
12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
--output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
from SCDynamicStore
12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
rows and 87325 columns to wikipedia-matrix/matrix
12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
(Minutes: 0.0293)
So a doc matrix with 4838 docs and 87325 dimensions. Next I ran RowSimilarityJob
Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
-i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
--similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
This gives me output in wikipedia-similarity/part-m-00000 but the size is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I set no threshold so I'd expect it to pick the 10 nearest even if they are far away.
BTW what is the output format?
On 3/5/12 11:48 AM, Suneel Marthi wrote:
> Pat,
>
> Your input to RowSimilarity seems to be the tfidf-vectors directory, which is <Text, VectorWritable>.
>
> Before executing the RowSimilarity job you need to run the RowIdJob, which creates a matrix of <IntWritable, VectorWritable>. This matrix should be the input to RowSimilarity.
>
> Also from your command, you seem to be missing --tempDir argument, you would need that too.
>
> Suneel
>
> ------------------------------------------------------------------------
> *From:* Sebastian Schelter <ss...@apache.org>
> *To:* user@mahout.apache.org
> *Sent:* Monday, March 5, 2012 2:32 PM
> *Subject:* Re: How to find the k most similar docs
>
> That's the problem:
>
> org.apache.hadoop.io.Text cannot be
> cast to org.apache.hadoop.io.IntWritable
>
> RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
> you supply <Text,VectorWritable>.
>
> --sebastian
>
> On 05.03.2012 20:29, Pat Ferrel wrote:
> > org.apache.hadoop.io.Text cannot be
> > cast to org.apache.hadoop.io.IntWritable
>
>
>
Re: RowSimilarityJob
Posted by Suneel Marthi <su...@yahoo.com>.
I should have been more elaborate in my previous reply.
RowId job creates a matrix which is of type <IntWritable, VectorWritable> and a docIndex <IntWritable, Text>
docIndex is a map of the rowId to the keys generated from seq2sparse.
What you would need to do is join the output of RowSimilarity to docIndex to get the format you are looking for.
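A sketch of that join in plain Python (the document names and scores below are hypothetical; in practice docIndex comes from the RowIdJob output and the similarity vectors from RowSimilarityJob):

```python
# Hypothetical docIndex: rowId -> original key generated by seq2sparse.
doc_index = {0: "doc_alpha.txt", 1: "doc_beta.txt", 2: "doc_gamma.txt"}

# Hypothetical RowSimilarityJob output: rowId -> {other rowId: similarity}.
row_sims = {0: {1: 0.83, 2: 0.41}}

def join_names(sims, index):
    """Replace numeric row ids on both sides with the original document keys,
    yielding the doc1 -> {docN: score, ...} association."""
    return {index[row]: {index[other]: s for other, s in others.items()}
            for row, others in sims.items()}

named = join_names(row_sims, doc_index)
# e.g. {'doc_alpha.txt': {'doc_beta.txt': 0.83, 'doc_gamma.txt': 0.41}}
```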
Hope that helps.
Suneel
________________________________
From: Suneel Marthi <su...@yahoo.com>
To: "user@mahout.apache.org" <us...@mahout.apache.org>
Sent: Tuesday, March 20, 2012 1:41 PM
Subject: Re: RowSimilarityJob
Docindex is ur answer
Sent from my iPhone
On Mar 20, 2012, at 12:28 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> How do you map the output of RowSimilarity to documents? What I really need is to create an association of
>
> doc1 --> docn, docm, doci, etc.
>
> The output of rowsimilarity looks like
>
> rowid --> vector of rowids : distances
>
> for example:
>
> Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
> 12793:0.22009858979452146,3275:0.1871791030103281,
> 14613:0.3534278632679437,4411:0.2516380602790199,
> 17520:0.3139731583634198,13611:0.18968888212315968,
> 14354:0.17673965754661425,0:1.0000000000000004}
>
> It would be nice to use the same keys as they are output by seq2sparse (in my case named vectors), so file names would appear in the output as rowids. Creating my association would be trivial.
>
> Have I missed a dictionary containing rowid to docid(name) mapping?
>
Re: RowSimilarityJob
Posted by Suneel Marthi <su...@yahoo.com>.
Docindex is ur answer
Sent from my iPhone
On Mar 20, 2012, at 12:28 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> How do you map the output of RowSimilarity to documents? What I really need is to create an association of
>
> doc1 --> docn, docm, doci, etc.
>
> The output of rowsimilarity looks like
>
> rowid --> vector of rowids : distances
>
> for example:
>
> Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
> 12793:0.22009858979452146,3275:0.1871791030103281,
> 14613:0.3534278632679437,4411:0.2516380602790199,
> 17520:0.3139731583634198,13611:0.18968888212315968,
> 14354:0.17673965754661425,0:1.0000000000000004}
>
> It would be nice to use the same keys as they are output by seq2sparse (in my case named vectors), so file names would appear in the output as rowids. Creating my association would be trivial.
>
> Have I missed a dictionary containing rowid to docid(name) mapping?
>
RowSimilarityJob
Posted by Pat Ferrel <pa...@occamsmachete.com>.
How do you map the output of RowSimilarity to documents? What I really
need is to create an association of
doc1 --> docn, docm, doci, etc.
The output of rowsimilarity looks like
rowid --> vector of rowids : distances
for example:
Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
12793:0.22009858979452146,3275:0.1871791030103281,
14613:0.3534278632679437,4411:0.2516380602790199,
17520:0.3139731583634198,13611:0.18968888212315968,
14354:0.17673965754661425,0:1.0000000000000004}
It would be nice to use the same keys as they are output by seq2sparse,
in my case named vectors, so file names would appear in the output as
rowids. Creating my association would be trivial.
Have I missed a dictionary containing rowid to docid(name) mapping?
Re: How to find the k most similar docs
Posted by Lance Norskog <go...@gmail.com>.
No, the matrix multiplication operations all (probably) take
<int,vector> where int is the row number. There has to be a
universally unique row number. If there is no row number associated
with a row in a distributed matrix op, how can the reducers know which
rows they have?
Rows do not necessarily have to be in order; some sequential programs
might depend on this (but they should not).
On Fri, Mar 9, 2012 at 9:50 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> I assume that the other matrix operations will consume and produce <Text,
> MatrixWritable>? If so how do you create <Text, MatrixWritable> from the
> output of rowid <IntWritable, VectorWritable>?
>
> Also, while we are at it, how do you use vectordump? If you do "bin/mahout
> vectordump --help" you get some crazy output that is unreadable. I would
> have guessed that vectordump would work on either <IntWritable,
> VectorWritable> (the output of rowid) or <Text, VectorWritable> (the
> contents of tfidf-vectors/part-r-00000), but it doesn't seem to work on
> either using "bin/mahout vectordump -s path-to-file".
>
> Thanks
> Pat
>
>
> On 3/9/12 4:26 AM, Suneel Marthi wrote:
>>
>> Pat,
>>
>> MatrixDump expects an input file of <Text, MatrixWritable>. The matrix
>> that gets created from RowIdJob is <IntWritable, VectorWritable>, and you
>> cannot run MatrixDump to see the contents of the matrix. You need to use
>> seqdumper as you had done.
>>
>>
>>
>> ________________________________
>> From: Pat Ferrel<pa...@occamsmachete.com>
>> To: user@mahout.apache.org
>> Sent: Thursday, March 8, 2012 7:14 PM
>> Subject: Re: How to find the k most similar docs
>>
>> OK, back to the beginning. I went through the entire sequence again with
>> the notable exception that I did not create named vectors. I also tweaked
>> some of the seq2sparse parameters.
>>
>> bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
>> org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
>> -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
>>
>> After doing a rowid on the tfidf vectors I still get an error doing
>> matrixdump on wp-matrix/matrix. Am I using it wrong? Taking on faith that a
>> matrix was created, I perform the rowsimilarity job and now get a far bigger
>> output file that looks OK:
>>
>> bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o
>> wp-similarity -ess -s SIMILARITY_COSINE -m 10
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
>> HADOOP_CONF_DIR=/usr/local/hadoop/conf
>> MAHOUT-JOB:
>> /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
>> 12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=2147483647, --excludeSelfSimilarity=false,
>> --input=wp-matrix/matrix, --maxSimilaritiesPerRow=10,
>> --numberOfColumns=311433, --output=wp-similarity,
>> --similarityClassname=SIMILARITY_COSINE, --startPhase=0,
>> --tempDir=temp}
>> 12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to
>> process : 1
>> 12/03/08 15:48:36 INFO mapred.JobClient: Running job:
>> job_201203071745_0040
>> 12/03/08 15:48:37 INFO mapred.JobClient: map 0% reduce 0%
>> 12/03/08 15:48:58 INFO mapred.JobClient: map 17% reduce 0%
>> 12/03/08 15:49:01 INFO mapred.JobClient: map 27% reduce 0%
>> 12/03/08 15:49:04 INFO mapred.JobClient: map 40% reduce 0%
>> 12/03/08 15:49:07 INFO mapred.JobClient: map 47% reduce 0%
>> 12/03/08 15:49:10 INFO mapred.JobClient: map 60% reduce 0%
>> 12/03/08 15:49:13 INFO mapred.JobClient: map 70% reduce 0%
>> 12/03/08 15:49:16 INFO mapred.JobClient: map 80% reduce 0%
>> 12/03/08 15:49:19 INFO mapred.JobClient: map 92% reduce 0%
>> 12/03/08 15:49:22 INFO mapred.JobClient: map 100% reduce 0%
>> 12/03/08 15:49:46 INFO mapred.JobClient: map 100% reduce 33%
>> 12/03/08 15:49:52 INFO mapred.JobClient: map 100% reduce 100%
>> 12/03/08 15:49:57 INFO mapred.JobClient: Job complete:
>> job_201203071745_0040
>> 12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
>> 12/03/08 15:49:57 INFO mapred.JobClient: Job Counters
>> 12/03/08 15:49:57 INFO mapred.JobClient: Launched reduce tasks=1
>> 12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=55564
>> 12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all
>> reduces waiting after reserving slots (ms)=0
>> 12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all
>> maps waiting after reserving slots (ms)=0
>> 12/03/08 15:49:57 INFO mapred.JobClient: Rack-local map tasks=1
>> 12/03/08 15:49:57 INFO mapred.JobClient: Launched map tasks=1
>> 12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13565
>> 12/03/08 15:49:57 INFO mapred.JobClient: File Output Format Counters
>> 12/03/08 15:49:57 INFO mapred.JobClient: Bytes Written=45587186
>> 12/03/08 15:49:57 INFO mapred.JobClient: FileSystemCounters
>> 12/03/08 15:49:57 INFO mapred.JobClient: FILE_BYTES_READ=99732287
>> 12/03/08 15:49:57 INFO mapred.JobClient: HDFS_BYTES_READ=17156393
>> 12/03/08 15:49:57 INFO mapred.JobClient:
>> FILE_BYTES_WRITTEN=138104586
>> 12/03/08 15:49:57 INFO mapred.JobClient:
>> HDFS_BYTES_WRITTEN=45587207
>> 12/03/08 15:49:57 INFO mapred.JobClient: File Input Format Counters
>> 12/03/08 15:49:57 INFO mapred.JobClient: Bytes Read=17156283
>> 12/03/08 15:49:57 INFO mapred.JobClient:
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> 12/03/08 15:49:57 INFO mapred.JobClient: ROWS=4838
>> 12/03/08 15:49:57 INFO mapred.JobClient: Map-Reduce Framework
>> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce input groups=294936
>> 12/03/08 15:49:57 INFO mapred.JobClient: Map output materialized
>> bytes=38326948
>> 12/03/08 15:49:57 INFO mapred.JobClient: Combine output
>> records=2242965
>> 12/03/08 15:49:57 INFO mapred.JobClient: Map input records=4838
>> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce shuffle
>> bytes=38326948
>> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce output
>> records=294933
>> 12/03/08 15:49:57 INFO mapred.JobClient: Spilled Records=3432447
>> 12/03/08 15:49:57 INFO mapred.JobClient: Map output bytes=83168813
>> 12/03/08 15:49:57 INFO mapred.JobClient: Combine input
>> records=5912090
>> 12/03/08 15:49:57 INFO mapred.JobClient: Map output records=3964061
>> 12/03/08 15:49:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=110
>> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce input
>> records=294936
>> 12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to
>> process : 1
>> 12/03/08 15:49:58 INFO mapred.JobClient: Running job:
>> job_201203071745_0041
>> 12/03/08 15:49:59 INFO mapred.JobClient: map 0% reduce 0%
>> 12/03/08 15:50:19 INFO mapred.JobClient: map 8% reduce 0%
>> 12/03/08 15:50:22 INFO mapred.JobClient: map 12% reduce 0%
>> 12/03/08 15:50:25 INFO mapred.JobClient: map 15% reduce 0%
>> 12/03/08 15:50:28 INFO mapred.JobClient: map 21% reduce 0%
>> 12/03/08 15:50:31 INFO mapred.JobClient: map 23% reduce 0%
>> 12/03/08 15:50:34 INFO mapred.JobClient: map 28% reduce 0%
>> 12/03/08 15:50:37 INFO mapred.JobClient: map 32% reduce 0%
>> 12/03/08 15:50:40 INFO mapred.JobClient: map 33% reduce 0%
>> 12/03/08 15:50:43 INFO mapred.JobClient: map 35% reduce 0%
>> 12/03/08 15:50:46 INFO mapred.JobClient: map 40% reduce 0%
>> 12/03/08 15:50:49 INFO mapred.JobClient: map 42% reduce 0%
>> 12/03/08 15:50:52 INFO mapred.JobClient: map 47% reduce 0%
>> 12/03/08 15:50:55 INFO mapred.JobClient: map 48% reduce 0%
>> 12/03/08 15:50:58 INFO mapred.JobClient: map 55% reduce 0%
>> 12/03/08 15:51:01 INFO mapred.JobClient: map 57% reduce 0%
>> 12/03/08 15:51:04 INFO mapred.JobClient: map 62% reduce 0%
>> 12/03/08 15:51:07 INFO mapred.JobClient: map 67% reduce 0%
>> 12/03/08 15:51:10 INFO mapred.JobClient: map 69% reduce 0%
>> 12/03/08 15:51:13 INFO mapred.JobClient: map 75% reduce 0%
>> 12/03/08 15:51:20 INFO mapred.JobClient: map 80% reduce 0%
>> 12/03/08 15:51:23 INFO mapred.JobClient: map 81% reduce 0%
>> 12/03/08 15:51:26 INFO mapred.JobClient: map 86% reduce 0%
>> 12/03/08 15:51:29 INFO mapred.JobClient: map 88% reduce 0%
>> 12/03/08 15:51:31 INFO mapred.JobClient: map 92% reduce 0%
>> 12/03/08 15:51:34 INFO mapred.JobClient: map 94% reduce 0%
>> 12/03/08 15:51:37 INFO mapred.JobClient: map 98% reduce 0%
>> 12/03/08 15:51:40 INFO mapred.JobClient: map 100% reduce 0%
>> 12/03/08 15:52:19 INFO mapred.JobClient: map 100% reduce 70%
>> 12/03/08 15:52:26 INFO mapred.JobClient: map 100% reduce 100%
>> 12/03/08 15:52:31 INFO mapred.JobClient: Job complete:
>> job_201203071745_0041
>> 12/03/08 15:52:31 INFO mapred.JobClient: Counters: 27
>> 12/03/08 15:52:31 INFO mapred.JobClient: Job Counters
>> 12/03/08 15:52:31 INFO mapred.JobClient: Launched reduce tasks=1
>> 12/03/08 15:52:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=124769
>> 12/03/08 15:52:31 INFO mapred.JobClient: Total time spent by all
>> reduces waiting after reserving slots (ms)=0
>> 12/03/08 15:52:31 INFO mapred.JobClient: Total time spent by all
>> maps waiting after reserving slots (ms)=0
>> 12/03/08 15:52:31 INFO mapred.JobClient: Rack-local map tasks=1
>> 12/03/08 15:52:31 INFO mapred.JobClient: Launched map tasks=1
>> 12/03/08 15:52:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16543
>> 12/03/08 15:52:31 INFO mapred.JobClient: File Output Format Counters
>> 12/03/08 15:52:31 INFO mapred.JobClient: Bytes Written=73395270
>> 12/03/08 15:52:31 INFO mapred.JobClient: FileSystemCounters
>> 12/03/08 15:52:31 INFO mapred.JobClient: FILE_BYTES_READ=509127834
>> 12/03/08 15:52:31 INFO mapred.JobClient: HDFS_BYTES_READ=45587326
>> 12/03/08 15:52:31 INFO mapred.JobClient:
>> FILE_BYTES_WRITTEN=577589760
>> 12/03/08 15:52:31 INFO mapred.JobClient:
>> HDFS_BYTES_WRITTEN=73395270
>> 12/03/08 15:52:31 INFO mapred.JobClient: File Input Format Counters
>> 12/03/08 15:52:31 INFO mapred.JobClient: Bytes Read=45587186
>> 12/03/08 15:52:31 INFO mapred.JobClient:
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> 12/03/08 15:52:31 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
>> 12/03/08 15:52:31 INFO mapred.JobClient: COOCCURRENCES=65114863
>> 12/03/08 15:52:31 INFO mapred.JobClient: Map-Reduce Framework
>> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce input groups=4837
>> 12/03/08 15:52:31 INFO mapred.JobClient: Map output materialized
>> bytes=68416023
>> 12/03/08 15:52:31 INFO mapred.JobClient: Combine output
>> records=79108
>> 12/03/08 15:52:31 INFO mapred.JobClient: Map input records=294933
>> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce shuffle
>> bytes=68416023
>> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce output records=4837
>> 12/03/08 15:52:31 INFO mapred.JobClient: Spilled Records=117235
>> 12/03/08 15:52:31 INFO mapred.JobClient: Map output bytes=694645784
>> 12/03/08 15:52:31 INFO mapred.JobClient: Combine input
>> records=4038329
>> 12/03/08 15:52:31 INFO mapred.JobClient: Map output records=3964058
>> 12/03/08 15:52:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=119
>> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce input records=4837
>> 12/03/08 15:52:32 INFO input.FileInputFormat: Total input paths to
>> process : 1
>> 12/03/08 15:52:32 INFO mapred.JobClient: Running job:
>> job_201203071745_0042
>> 12/03/08 15:52:33 INFO mapred.JobClient: map 0% reduce 0%
>> 12/03/08 15:52:52 INFO mapred.JobClient: map 3% reduce 0%
>> 12/03/08 15:52:55 INFO mapred.JobClient: map 5% reduce 0%
>> 12/03/08 15:52:58 INFO mapred.JobClient: map 7% reduce 0%
>> 12/03/08 15:53:01 INFO mapred.JobClient: map 9% reduce 0%
>> 12/03/08 15:53:04 INFO mapred.JobClient: map 10% reduce 0%
>> 12/03/08 15:53:07 INFO mapred.JobClient: map 12% reduce 0%
>> 12/03/08 15:53:10 INFO mapred.JobClient: map 14% reduce 0%
>> 12/03/08 15:53:13 INFO mapred.JobClient: map 17% reduce 0%
>> 12/03/08 15:53:16 INFO mapred.JobClient: map 18% reduce 0%
>> 12/03/08 15:53:19 INFO mapred.JobClient: map 21% reduce 0%
>> 12/03/08 15:53:22 INFO mapred.JobClient: map 23% reduce 0%
>> 12/03/08 15:53:25 INFO mapred.JobClient: map 25% reduce 0%
>> 12/03/08 15:53:28 INFO mapred.JobClient: map 27% reduce 0%
>> 12/03/08 15:53:31 INFO mapred.JobClient: map 29% reduce 0%
>> 12/03/08 15:53:34 INFO mapred.JobClient: map 31% reduce 0%
>> 12/03/08 15:53:37 INFO mapred.JobClient: map 33% reduce 0%
>> 12/03/08 15:53:40 INFO mapred.JobClient: map 35% reduce 0%
>> 12/03/08 15:53:43 INFO mapred.JobClient: map 37% reduce 0%
>> 12/03/08 15:53:46 INFO mapred.JobClient: map 39% reduce 0%
>> 12/03/08 15:53:49 INFO mapred.JobClient: map 41% reduce 0%
>> 12/03/08 15:53:52 INFO mapred.JobClient: map 43% reduce 0%
>> 12/03/08 15:53:55 INFO mapred.JobClient: map 46% reduce 0%
>> 12/03/08 15:53:58 INFO mapred.JobClient: map 48% reduce 0%
>> 12/03/08 15:54:01 INFO mapred.JobClient: map 50% reduce 0%
>> 12/03/08 15:54:04 INFO mapred.JobClient: map 53% reduce 0%
>> 12/03/08 15:54:07 INFO mapred.JobClient: map 55% reduce 0%
>> 12/03/08 15:54:10 INFO mapred.JobClient: map 57% reduce 0%
>> 12/03/08 15:54:13 INFO mapred.JobClient: map 60% reduce 0%
>> 12/03/08 15:54:16 INFO mapred.JobClient: map 63% reduce 0%
>> 12/03/08 15:54:19 INFO mapred.JobClient: map 65% reduce 0%
>> 12/03/08 15:54:22 INFO mapred.JobClient: map 68% reduce 0%
>> 12/03/08 15:54:25 INFO mapred.JobClient: map 71% reduce 0%
>> 12/03/08 15:54:28 INFO mapred.JobClient: map 74% reduce 0%
>> 12/03/08 15:54:31 INFO mapred.JobClient: map 77% reduce 0%
>> 12/03/08 15:54:34 INFO mapred.JobClient: map 81% reduce 0%
>> 12/03/08 15:54:37 INFO mapred.JobClient: map 84% reduce 0%
>> 12/03/08 15:54:40 INFO mapred.JobClient: map 88% reduce 0%
>> 12/03/08 15:54:43 INFO mapred.JobClient: map 93% reduce 0%
>> 12/03/08 15:54:46 INFO mapred.JobClient: map 99% reduce 0%
>> 12/03/08 15:54:49 INFO mapred.JobClient: map 100% reduce 0%
>> 12/03/08 15:55:01 INFO mapred.JobClient: map 100% reduce 100%
>> 12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
>> job_201203071745_0042
>> 12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
>> 12/03/08 15:55:06 INFO mapred.JobClient: Job Counters
>> 12/03/08 15:55:06 INFO mapred.JobClient: Launched reduce tasks=1
>> 12/03/08 15:55:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=133985
>> 12/03/08 15:55:06 INFO mapred.JobClient: Total time spent by all
>> reduces waiting after reserving slots (ms)=0
>> 12/03/08 15:55:06 INFO mapred.JobClient: Total time spent by all
>> maps waiting after reserving slots (ms)=0
>> 12/03/08 15:55:06 INFO mapred.JobClient: Launched map tasks=1
>> 12/03/08 15:55:06 INFO mapred.JobClient: Data-local map tasks=1
>> 12/03/08 15:55:06 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10311
>> 12/03/08 15:55:06 INFO mapred.JobClient: File Output Format Counters
>> 12/03/08 15:55:06 INFO mapred.JobClient: Bytes Written=580158
>> 12/03/08 15:55:06 INFO mapred.JobClient: FileSystemCounters
>> 12/03/08 15:55:06 INFO mapred.JobClient: FILE_BYTES_READ=14921344
>> 12/03/08 15:55:06 INFO mapred.JobClient: HDFS_BYTES_READ=73395400
>> 12/03/08 15:55:06 INFO mapred.JobClient:
>> FILE_BYTES_WRITTEN=15396906
>> 12/03/08 15:55:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=580158
>> 12/03/08 15:55:06 INFO mapred.JobClient: File Input Format Counters
>> 12/03/08 15:55:06 INFO mapred.JobClient: Bytes Read=73395270
>> 12/03/08 15:55:06 INFO mapred.JobClient: Map-Reduce Framework
>> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce input groups=4837
>> 12/03/08 15:55:06 INFO mapred.JobClient: Map output materialized
>> bytes=431573
>> 12/03/08 15:55:06 INFO mapred.JobClient: Combine output
>> records=96955
>> 12/03/08 15:55:06 INFO mapred.JobClient: Map input records=4837
>> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce shuffle bytes=0
>> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce output records=4837
>> 12/03/08 15:55:06 INFO mapred.JobClient: Spilled Records=166369
>> 12/03/08 15:55:06 INFO mapred.JobClient: Map output bytes=153928302
>> 12/03/08 15:55:06 INFO mapred.JobClient: Combine input
>> records=7418380
>> 12/03/08 15:55:06 INFO mapred.JobClient: Map output records=7326262
>> 12/03/08 15:55:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=130
>> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce input records=4837
>> 12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
>> (Minutes: 6.522983333333333)
>>
>> performing seqdumper on the output looks reasonable.
>>
>> Maybe named vectors is a problem?
>>
>>
>> On 3/7/12 8:50 AM, Sebastian Schelter wrote:
>>>
>>> Hi Pat,
>>>
>>> Something is going completely wrong. The first pass over the data of
>>> RowSimilarityJob transposes the input matrix. From the output of the
>>> first jobs, it seems as if your input is a 4838 x 3 matrix only:
>>>
>>> Map input records=4838
>>> Map output records=3
>>> Combine input records=3
>>> Combine output records=3
>>> Reduce input records=3
>>>
>>> Could you have a detailed look at the input to RowSimilarityJob?
>>>
>>> --sebastian
>>>
>>>
>>> On 07.03.2012 17:38, Pat Ferrel wrote:
>>>>
>>>> 12/03/06 17:02:42 INFO mapred.JobClient: Map input records=0
--
Lance Norskog
goksron@gmail.com
Re: How to find the k most similar docs
Posted by Pat Ferrel <pa...@occamsmachete.com>.
I assume that the other matrix operations will consume and produce
<Text, MatrixWritable>? If so, how do you create <Text, MatrixWritable>
from the output of rowid, which is <IntWritable, VectorWritable>?
Also, while we are at it, how do you use vectordump? Running "bin/mahout
vectordump --help" produces output that is unreadable. I would have
guessed that vectordump would work on either <IntWritable,
VectorWritable> (the output of rowid) or <Text, VectorWritable> (the
contents of tfidf-vectors/part-r-00000), but it doesn't seem to work on
either using "bin/mahout vectordump -s path-to-file"
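As an aside, the computation itself is easy to sketch. The following is a toy plain-Python sketch (made-up data and function names, not Mahout code) of roughly what rowsimilarity with -s SIMILARITY_COSINE, -m k, and self-similarity excluded produces for each row:

```python
import math

# Toy sparse tf-idf vectors: doc id -> {term id: weight}.
# Illustrative data only -- this is not Mahout's implementation.
docs = {
    0: {0: 1.0, 1: 2.0},
    1: {0: 1.0, 1: 2.0, 2: 0.5},
    2: {3: 4.0},
}

def cosine(a, b):
    # Cosine similarity of two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norms = math.sqrt(sum(w * w for w in a.values())) \
            * math.sqrt(sum(w * w for w in b.values()))
    return dot / norms if norms else 0.0

def top_k_similar(doc_id, k):
    # Score every other row and keep the k best, excluding the row
    # itself (like --excludeSelfSimilarity true).
    scores = [(other, cosine(docs[doc_id], docs[other]))
              for other in docs if other != doc_id]
    return sorted(scores, key=lambda s: -s[1])[:k]
```

Here top_k_similar(0, 2) ranks doc 1 (nearly identical weights) far above doc 2 (no shared terms).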
Thanks
Pat
On 3/9/12 4:26 AM, Suneel Marthi wrote:
> Pat,
>
> MatrixDump expects an input file of <Text, MatrixWritable>. The matrix that gets created from RowIdJob is <IntWritable, VectorWritable>, and you cannot run MatrixDump to see the contents of the matrix. You need to use seqdumper as you had done.
>
>
>
> ________________________________
> From: Pat Ferrel<pa...@occamsmachete.com>
> To: user@mahout.apache.org
> Sent: Thursday, March 8, 2012 7:14 PM
> Subject: Re: How to find the k most similar docs
>
> OK, back to the beginning. I went through the entire sequence again with the notable exception that I did not create named vectors. I also tweaked some of the seq2sparse parameters.
>
> bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
> org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
> -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
>
> After doing a rowid on the tfidf vectors I still get an error doing matrixdump on wp-matrix/matrix. Am I using it wrong? Taking it on faith that a matrix was created, I ran the rowsimilarity job and now get a far bigger output file that looks OK:
>
> bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o
> wp-similarity -ess -s SIMILARITY_COSINE -m 10
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
> HADOOP_CONF_DIR=/usr/local/hadoop/conf
> MAHOUT-JOB:
> /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
> 12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments:
> {--endPhase=2147483647, --excludeSelfSimilarity=false,
> --input=wp-matrix/matrix, --maxSimilaritiesPerRow=10,
> --numberOfColumns=311433, --output=wp-similarity,
> --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
> 12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to
> process : 1
> 12/03/08 15:48:36 INFO mapred.JobClient: Running job:
> job_201203071745_0040
> 12/03/08 15:48:37 INFO mapred.JobClient: map 0% reduce 0%
> 12/03/08 15:48:58 INFO mapred.JobClient: map 17% reduce 0%
> 12/03/08 15:49:01 INFO mapred.JobClient: map 27% reduce 0%
> 12/03/08 15:49:04 INFO mapred.JobClient: map 40% reduce 0%
> 12/03/08 15:49:07 INFO mapred.JobClient: map 47% reduce 0%
> 12/03/08 15:49:10 INFO mapred.JobClient: map 60% reduce 0%
> 12/03/08 15:49:13 INFO mapred.JobClient: map 70% reduce 0%
> 12/03/08 15:49:16 INFO mapred.JobClient: map 80% reduce 0%
> 12/03/08 15:49:19 INFO mapred.JobClient: map 92% reduce 0%
> 12/03/08 15:49:22 INFO mapred.JobClient: map 100% reduce 0%
> 12/03/08 15:49:46 INFO mapred.JobClient: map 100% reduce 33%
> 12/03/08 15:49:52 INFO mapred.JobClient: map 100% reduce 100%
> 12/03/08 15:49:57 INFO mapred.JobClient: Job complete:
> job_201203071745_0040
> 12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
> 12/03/08 15:49:57 INFO mapred.JobClient: Job Counters
> 12/03/08 15:49:57 INFO mapred.JobClient: Launched reduce tasks=1
> 12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=55564
> 12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all
> maps waiting after reserving slots (ms)=0
> 12/03/08 15:49:57 INFO mapred.JobClient: Rack-local map tasks=1
> 12/03/08 15:49:57 INFO mapred.JobClient: Launched map tasks=1
> 12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13565
> 12/03/08 15:49:57 INFO mapred.JobClient: File Output Format Counters
> 12/03/08 15:49:57 INFO mapred.JobClient: Bytes Written=45587186
> 12/03/08 15:49:57 INFO mapred.JobClient: FileSystemCounters
> 12/03/08 15:49:57 INFO mapred.JobClient: FILE_BYTES_READ=99732287
> 12/03/08 15:49:57 INFO mapred.JobClient: HDFS_BYTES_READ=17156393
> 12/03/08 15:49:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=138104586
> 12/03/08 15:49:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=45587207
> 12/03/08 15:49:57 INFO mapred.JobClient: File Input Format Counters
> 12/03/08 15:49:57 INFO mapred.JobClient: Bytes Read=17156283
> 12/03/08 15:49:57 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> 12/03/08 15:49:57 INFO mapred.JobClient: ROWS=4838
> 12/03/08 15:49:57 INFO mapred.JobClient: Map-Reduce Framework
> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce input groups=294936
> 12/03/08 15:49:57 INFO mapred.JobClient: Map output materialized
> bytes=38326948
> 12/03/08 15:49:57 INFO mapred.JobClient: Combine output
> records=2242965
> 12/03/08 15:49:57 INFO mapred.JobClient: Map input records=4838
> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce shuffle
> bytes=38326948
> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce output
> records=294933
> 12/03/08 15:49:57 INFO mapred.JobClient: Spilled Records=3432447
> 12/03/08 15:49:57 INFO mapred.JobClient: Map output bytes=83168813
> 12/03/08 15:49:57 INFO mapred.JobClient: Combine input
> records=5912090
> 12/03/08 15:49:57 INFO mapred.JobClient: Map output records=3964061
> 12/03/08 15:49:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=110
> 12/03/08 15:49:57 INFO mapred.JobClient: Reduce input records=294936
> 12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to
> process : 1
> 12/03/08 15:49:58 INFO mapred.JobClient: Running job:
> job_201203071745_0041
> 12/03/08 15:49:59 INFO mapred.JobClient: map 0% reduce 0%
> 12/03/08 15:50:19 INFO mapred.JobClient: map 8% reduce 0%
> 12/03/08 15:50:22 INFO mapred.JobClient: map 12% reduce 0%
> 12/03/08 15:50:25 INFO mapred.JobClient: map 15% reduce 0%
> 12/03/08 15:50:28 INFO mapred.JobClient: map 21% reduce 0%
> 12/03/08 15:50:31 INFO mapred.JobClient: map 23% reduce 0%
> 12/03/08 15:50:34 INFO mapred.JobClient: map 28% reduce 0%
> 12/03/08 15:50:37 INFO mapred.JobClient: map 32% reduce 0%
> 12/03/08 15:50:40 INFO mapred.JobClient: map 33% reduce 0%
> 12/03/08 15:50:43 INFO mapred.JobClient: map 35% reduce 0%
> 12/03/08 15:50:46 INFO mapred.JobClient: map 40% reduce 0%
> 12/03/08 15:50:49 INFO mapred.JobClient: map 42% reduce 0%
> 12/03/08 15:50:52 INFO mapred.JobClient: map 47% reduce 0%
> 12/03/08 15:50:55 INFO mapred.JobClient: map 48% reduce 0%
> 12/03/08 15:50:58 INFO mapred.JobClient: map 55% reduce 0%
> 12/03/08 15:51:01 INFO mapred.JobClient: map 57% reduce 0%
> 12/03/08 15:51:04 INFO mapred.JobClient: map 62% reduce 0%
> 12/03/08 15:51:07 INFO mapred.JobClient: map 67% reduce 0%
> 12/03/08 15:51:10 INFO mapred.JobClient: map 69% reduce 0%
> 12/03/08 15:51:13 INFO mapred.JobClient: map 75% reduce 0%
> 12/03/08 15:51:20 INFO mapred.JobClient: map 80% reduce 0%
> 12/03/08 15:51:23 INFO mapred.JobClient: map 81% reduce 0%
> 12/03/08 15:51:26 INFO mapred.JobClient: map 86% reduce 0%
> 12/03/08 15:51:29 INFO mapred.JobClient: map 88% reduce 0%
> 12/03/08 15:51:31 INFO mapred.JobClient: map 92% reduce 0%
> 12/03/08 15:51:34 INFO mapred.JobClient: map 94% reduce 0%
> 12/03/08 15:51:37 INFO mapred.JobClient: map 98% reduce 0%
> 12/03/08 15:51:40 INFO mapred.JobClient: map 100% reduce 0%
> 12/03/08 15:52:19 INFO mapred.JobClient: map 100% reduce 70%
> 12/03/08 15:52:26 INFO mapred.JobClient: map 100% reduce 100%
> 12/03/08 15:52:31 INFO mapred.JobClient: Job complete:
> job_201203071745_0041
> 12/03/08 15:52:31 INFO mapred.JobClient: Counters: 27
> 12/03/08 15:52:31 INFO mapred.JobClient: Job Counters
> 12/03/08 15:52:31 INFO mapred.JobClient: Launched reduce tasks=1
> 12/03/08 15:52:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=124769
> 12/03/08 15:52:31 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 12/03/08 15:52:31 INFO mapred.JobClient: Total time spent by all
> maps waiting after reserving slots (ms)=0
> 12/03/08 15:52:31 INFO mapred.JobClient: Rack-local map tasks=1
> 12/03/08 15:52:31 INFO mapred.JobClient: Launched map tasks=1
> 12/03/08 15:52:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16543
> 12/03/08 15:52:31 INFO mapred.JobClient: File Output Format Counters
> 12/03/08 15:52:31 INFO mapred.JobClient: Bytes Written=73395270
> 12/03/08 15:52:31 INFO mapred.JobClient: FileSystemCounters
> 12/03/08 15:52:31 INFO mapred.JobClient: FILE_BYTES_READ=509127834
> 12/03/08 15:52:31 INFO mapred.JobClient: HDFS_BYTES_READ=45587326
> 12/03/08 15:52:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=577589760
> 12/03/08 15:52:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=73395270
> 12/03/08 15:52:31 INFO mapred.JobClient: File Input Format Counters
> 12/03/08 15:52:31 INFO mapred.JobClient: Bytes Read=45587186
> 12/03/08 15:52:31 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> 12/03/08 15:52:31 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
> 12/03/08 15:52:31 INFO mapred.JobClient: COOCCURRENCES=65114863
> 12/03/08 15:52:31 INFO mapred.JobClient: Map-Reduce Framework
> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce input groups=4837
> 12/03/08 15:52:31 INFO mapred.JobClient: Map output materialized
> bytes=68416023
> 12/03/08 15:52:31 INFO mapred.JobClient: Combine output
> records=79108
> 12/03/08 15:52:31 INFO mapred.JobClient: Map input records=294933
> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce shuffle
> bytes=68416023
> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce output records=4837
> 12/03/08 15:52:31 INFO mapred.JobClient: Spilled Records=117235
> 12/03/08 15:52:31 INFO mapred.JobClient: Map output bytes=694645784
> 12/03/08 15:52:31 INFO mapred.JobClient: Combine input
> records=4038329
> 12/03/08 15:52:31 INFO mapred.JobClient: Map output records=3964058
> 12/03/08 15:52:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=119
> 12/03/08 15:52:31 INFO mapred.JobClient: Reduce input records=4837
> 12/03/08 15:52:32 INFO input.FileInputFormat: Total input paths to
> process : 1
> 12/03/08 15:52:32 INFO mapred.JobClient: Running job:
> job_201203071745_0042
> 12/03/08 15:52:33 INFO mapred.JobClient: map 0% reduce 0%
> 12/03/08 15:52:52 INFO mapred.JobClient: map 3% reduce 0%
> 12/03/08 15:52:55 INFO mapred.JobClient: map 5% reduce 0%
> 12/03/08 15:52:58 INFO mapred.JobClient: map 7% reduce 0%
> 12/03/08 15:53:01 INFO mapred.JobClient: map 9% reduce 0%
> 12/03/08 15:53:04 INFO mapred.JobClient: map 10% reduce 0%
> 12/03/08 15:53:07 INFO mapred.JobClient: map 12% reduce 0%
> 12/03/08 15:53:10 INFO mapred.JobClient: map 14% reduce 0%
> 12/03/08 15:53:13 INFO mapred.JobClient: map 17% reduce 0%
> 12/03/08 15:53:16 INFO mapred.JobClient: map 18% reduce 0%
> 12/03/08 15:53:19 INFO mapred.JobClient: map 21% reduce 0%
> 12/03/08 15:53:22 INFO mapred.JobClient: map 23% reduce 0%
> 12/03/08 15:53:25 INFO mapred.JobClient: map 25% reduce 0%
> 12/03/08 15:53:28 INFO mapred.JobClient: map 27% reduce 0%
> 12/03/08 15:53:31 INFO mapred.JobClient: map 29% reduce 0%
> 12/03/08 15:53:34 INFO mapred.JobClient: map 31% reduce 0%
> 12/03/08 15:53:37 INFO mapred.JobClient: map 33% reduce 0%
> 12/03/08 15:53:40 INFO mapred.JobClient: map 35% reduce 0%
> 12/03/08 15:53:43 INFO mapred.JobClient: map 37% reduce 0%
> 12/03/08 15:53:46 INFO mapred.JobClient: map 39% reduce 0%
> 12/03/08 15:53:49 INFO mapred.JobClient: map 41% reduce 0%
> 12/03/08 15:53:52 INFO mapred.JobClient: map 43% reduce 0%
> 12/03/08 15:53:55 INFO mapred.JobClient: map 46% reduce 0%
> 12/03/08 15:53:58 INFO mapred.JobClient: map 48% reduce 0%
> 12/03/08 15:54:01 INFO mapred.JobClient: map 50% reduce 0%
> 12/03/08 15:54:04 INFO mapred.JobClient: map 53% reduce 0%
> 12/03/08 15:54:07 INFO mapred.JobClient: map 55% reduce 0%
> 12/03/08 15:54:10 INFO mapred.JobClient: map 57% reduce 0%
> 12/03/08 15:54:13 INFO mapred.JobClient: map 60% reduce 0%
> 12/03/08 15:54:16 INFO mapred.JobClient: map 63% reduce 0%
> 12/03/08 15:54:19 INFO mapred.JobClient: map 65% reduce 0%
> 12/03/08 15:54:22 INFO mapred.JobClient: map 68% reduce 0%
> 12/03/08 15:54:25 INFO mapred.JobClient: map 71% reduce 0%
> 12/03/08 15:54:28 INFO mapred.JobClient: map 74% reduce 0%
> 12/03/08 15:54:31 INFO mapred.JobClient: map 77% reduce 0%
> 12/03/08 15:54:34 INFO mapred.JobClient: map 81% reduce 0%
> 12/03/08 15:54:37 INFO mapred.JobClient: map 84% reduce 0%
> 12/03/08 15:54:40 INFO mapred.JobClient: map 88% reduce 0%
> 12/03/08 15:54:43 INFO mapred.JobClient: map 93% reduce 0%
> 12/03/08 15:54:46 INFO mapred.JobClient: map 99% reduce 0%
> 12/03/08 15:54:49 INFO mapred.JobClient: map 100% reduce 0%
> 12/03/08 15:55:01 INFO mapred.JobClient: map 100% reduce 100%
> 12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
> job_201203071745_0042
> 12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
> 12/03/08 15:55:06 INFO mapred.JobClient: Job Counters
> 12/03/08 15:55:06 INFO mapred.JobClient: Launched reduce tasks=1
> 12/03/08 15:55:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=133985
> 12/03/08 15:55:06 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 12/03/08 15:55:06 INFO mapred.JobClient: Total time spent by all
> maps waiting after reserving slots (ms)=0
> 12/03/08 15:55:06 INFO mapred.JobClient: Launched map tasks=1
> 12/03/08 15:55:06 INFO mapred.JobClient: Data-local map tasks=1
> 12/03/08 15:55:06 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10311
> 12/03/08 15:55:06 INFO mapred.JobClient: File Output Format Counters
> 12/03/08 15:55:06 INFO mapred.JobClient: Bytes Written=580158
> 12/03/08 15:55:06 INFO mapred.JobClient: FileSystemCounters
> 12/03/08 15:55:06 INFO mapred.JobClient: FILE_BYTES_READ=14921344
> 12/03/08 15:55:06 INFO mapred.JobClient: HDFS_BYTES_READ=73395400
> 12/03/08 15:55:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=15396906
> 12/03/08 15:55:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=580158
> 12/03/08 15:55:06 INFO mapred.JobClient: File Input Format Counters
> 12/03/08 15:55:06 INFO mapred.JobClient: Bytes Read=73395270
> 12/03/08 15:55:06 INFO mapred.JobClient: Map-Reduce Framework
> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce input groups=4837
> 12/03/08 15:55:06 INFO mapred.JobClient: Map output materialized
> bytes=431573
> 12/03/08 15:55:06 INFO mapred.JobClient: Combine output
> records=96955
> 12/03/08 15:55:06 INFO mapred.JobClient: Map input records=4837
> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce shuffle bytes=0
> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce output records=4837
> 12/03/08 15:55:06 INFO mapred.JobClient: Spilled Records=166369
> 12/03/08 15:55:06 INFO mapred.JobClient: Map output bytes=153928302
> 12/03/08 15:55:06 INFO mapred.JobClient: Combine input
> records=7418380
> 12/03/08 15:55:06 INFO mapred.JobClient: Map output records=7326262
> 12/03/08 15:55:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=130
> 12/03/08 15:55:06 INFO mapred.JobClient: Reduce input records=4837
> 12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
> (Minutes: 6.522983333333333)
>
> performing seqdumper on the output looks reasonable.
>
> Maybe named vectors is a problem?
>
>
> On 3/7/12 8:50 AM, Sebastian Schelter wrote:
>> Hi Pat,
>>
>> Something is going completely wrong. The first pass over the data of
>> RowSimilarityJob transposes the input matrix. From the output of the
>> first jobs, it seems as if your input is a 4838 x 3 matrix only:
>>
>> Map input records=4838
>> Map output records=3
>> Combine input records=3
>> Combine output records=3
>> Reduce input records=3
>>
>> Could you have a detailed look at the input to RowSimilarityJob?
>>
>> --sebastian
>>
>>
>> On 07.03.2012 17:38, Pat Ferrel wrote:
>>> 12/03/06 17:02:42 INFO mapred.JobClient: Map input records=0
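For reference, the transpose pass Sebastian describes can be sketched in plain Python (toy data, not Mahout's implementation). Each nonzero entry is re-keyed by its term, so a healthy tf-idf input should yield one output row per distinct term, not 3:

```python
# Toy sparse row matrix: doc id -> {term id: weight}.
# Illustrative only -- this mimics the transpose pass, not Mahout code.
rows = {
    0: {0: 1.0, 2: 3.0},   # doc 0: nonzero weights for terms 0 and 2
    1: {2: 5.0},           # doc 1: nonzero weight for term 2
}

def transpose(rows):
    # Re-key every nonzero (row, term, weight) by term, so that rows of
    # the output correspond to the distinct terms of the input.
    cols = {}
    for r, vec in rows.items():
        for c, w in vec.items():          # one record per nonzero entry
            cols.setdefault(c, {})[r] = w
    return cols
```

With this data, transpose(rows) has two output rows (terms 0 and 2); a count as low as 3 on a large corpus is the sign that something upstream went wrong.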
Re: How to find the k most similar docs
Posted by Suneel Marthi <su...@yahoo.com>.
Pat,
MatrixDump expects an input file of <Text, MatrixWritable>. The matrix that gets created from RowIdJob is <IntWritable, VectorWritable>, and you cannot run MatrixDump to see the contents of the matrix. You need to use seqdumper as you had done.
Re: How to find the k most similar docs
Posted by Pat Ferrel <pa...@occamsmachete.com>.
OK, back to the beginning. I went through the entire sequence again with
the notable exception that I did not create named vectors. I also
tweaked some of the seq2sparse parameters.
bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
-s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
After running rowid on the tfidf vectors I still get an error running
matrixdump on wp-matrix/matrix. Am I using it wrong? Taking it on faith
that a matrix was created, I ran the rowsimilarity job, which now produces
a far larger output file that looks OK:
bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o
wp-similarity -ess -s SIMILARITY_COSINE -m 10
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
MAHOUT-JOB:
/home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647, --excludeSelfSimilarity=false,
--input=wp-matrix/matrix, --maxSimilaritiesPerRow=10,
--numberOfColumns=311433, --output=wp-similarity,
--similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to
process : 1
12/03/08 15:48:36 INFO mapred.JobClient: Running job:
job_201203071745_0040
12/03/08 15:48:37 INFO mapred.JobClient: map 0% reduce 0%
12/03/08 15:48:58 INFO mapred.JobClient: map 17% reduce 0%
12/03/08 15:49:01 INFO mapred.JobClient: map 27% reduce 0%
12/03/08 15:49:04 INFO mapred.JobClient: map 40% reduce 0%
12/03/08 15:49:07 INFO mapred.JobClient: map 47% reduce 0%
12/03/08 15:49:10 INFO mapred.JobClient: map 60% reduce 0%
12/03/08 15:49:13 INFO mapred.JobClient: map 70% reduce 0%
12/03/08 15:49:16 INFO mapred.JobClient: map 80% reduce 0%
12/03/08 15:49:19 INFO mapred.JobClient: map 92% reduce 0%
12/03/08 15:49:22 INFO mapred.JobClient: map 100% reduce 0%
12/03/08 15:49:46 INFO mapred.JobClient: map 100% reduce 33%
12/03/08 15:49:52 INFO mapred.JobClient: map 100% reduce 100%
12/03/08 15:49:57 INFO mapred.JobClient: Job complete:
job_201203071745_0040
12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
12/03/08 15:49:57 INFO mapred.JobClient: Job Counters
12/03/08 15:49:57 INFO mapred.JobClient: Launched reduce tasks=1
12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=55564
12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
12/03/08 15:49:57 INFO mapred.JobClient: Rack-local map tasks=1
12/03/08 15:49:57 INFO mapred.JobClient: Launched map tasks=1
12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13565
12/03/08 15:49:57 INFO mapred.JobClient: File Output Format Counters
12/03/08 15:49:57 INFO mapred.JobClient: Bytes Written=45587186
12/03/08 15:49:57 INFO mapred.JobClient: FileSystemCounters
12/03/08 15:49:57 INFO mapred.JobClient: FILE_BYTES_READ=99732287
12/03/08 15:49:57 INFO mapred.JobClient: HDFS_BYTES_READ=17156393
12/03/08 15:49:57 INFO mapred.JobClient:
FILE_BYTES_WRITTEN=138104586
12/03/08 15:49:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=45587207
12/03/08 15:49:57 INFO mapred.JobClient: File Input Format Counters
12/03/08 15:49:57 INFO mapred.JobClient: Bytes Read=17156283
12/03/08 15:49:57 INFO mapred.JobClient:
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
12/03/08 15:49:57 INFO mapred.JobClient: ROWS=4838
12/03/08 15:49:57 INFO mapred.JobClient: Map-Reduce Framework
12/03/08 15:49:57 INFO mapred.JobClient: Reduce input groups=294936
12/03/08 15:49:57 INFO mapred.JobClient: Map output materialized
bytes=38326948
12/03/08 15:49:57 INFO mapred.JobClient: Combine output
records=2242965
12/03/08 15:49:57 INFO mapred.JobClient: Map input records=4838
12/03/08 15:49:57 INFO mapred.JobClient: Reduce shuffle
bytes=38326948
12/03/08 15:49:57 INFO mapred.JobClient: Reduce output
records=294933
12/03/08 15:49:57 INFO mapred.JobClient: Spilled Records=3432447
12/03/08 15:49:57 INFO mapred.JobClient: Map output bytes=83168813
12/03/08 15:49:57 INFO mapred.JobClient: Combine input
records=5912090
12/03/08 15:49:57 INFO mapred.JobClient: Map output records=3964061
12/03/08 15:49:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=110
12/03/08 15:49:57 INFO mapred.JobClient: Reduce input records=294936
12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to
process : 1
12/03/08 15:49:58 INFO mapred.JobClient: Running job:
job_201203071745_0041
12/03/08 15:49:59 INFO mapred.JobClient: map 0% reduce 0%
12/03/08 15:50:19 INFO mapred.JobClient: map 8% reduce 0%
12/03/08 15:50:22 INFO mapred.JobClient: map 12% reduce 0%
12/03/08 15:50:25 INFO mapred.JobClient: map 15% reduce 0%
12/03/08 15:50:28 INFO mapred.JobClient: map 21% reduce 0%
12/03/08 15:50:31 INFO mapred.JobClient: map 23% reduce 0%
12/03/08 15:50:34 INFO mapred.JobClient: map 28% reduce 0%
12/03/08 15:50:37 INFO mapred.JobClient: map 32% reduce 0%
12/03/08 15:50:40 INFO mapred.JobClient: map 33% reduce 0%
12/03/08 15:50:43 INFO mapred.JobClient: map 35% reduce 0%
12/03/08 15:50:46 INFO mapred.JobClient: map 40% reduce 0%
12/03/08 15:50:49 INFO mapred.JobClient: map 42% reduce 0%
12/03/08 15:50:52 INFO mapred.JobClient: map 47% reduce 0%
12/03/08 15:50:55 INFO mapred.JobClient: map 48% reduce 0%
12/03/08 15:50:58 INFO mapred.JobClient: map 55% reduce 0%
12/03/08 15:51:01 INFO mapred.JobClient: map 57% reduce 0%
12/03/08 15:51:04 INFO mapred.JobClient: map 62% reduce 0%
12/03/08 15:51:07 INFO mapred.JobClient: map 67% reduce 0%
12/03/08 15:51:10 INFO mapred.JobClient: map 69% reduce 0%
12/03/08 15:51:13 INFO mapred.JobClient: map 75% reduce 0%
12/03/08 15:51:20 INFO mapred.JobClient: map 80% reduce 0%
12/03/08 15:51:23 INFO mapred.JobClient: map 81% reduce 0%
12/03/08 15:51:26 INFO mapred.JobClient: map 86% reduce 0%
12/03/08 15:51:29 INFO mapred.JobClient: map 88% reduce 0%
12/03/08 15:51:31 INFO mapred.JobClient: map 92% reduce 0%
12/03/08 15:51:34 INFO mapred.JobClient: map 94% reduce 0%
12/03/08 15:51:37 INFO mapred.JobClient: map 98% reduce 0%
12/03/08 15:51:40 INFO mapred.JobClient: map 100% reduce 0%
12/03/08 15:52:19 INFO mapred.JobClient: map 100% reduce 70%
12/03/08 15:52:26 INFO mapred.JobClient: map 100% reduce 100%
12/03/08 15:52:31 INFO mapred.JobClient: Job complete:
job_201203071745_0041
12/03/08 15:52:31 INFO mapred.JobClient: Counters: 27
12/03/08 15:52:31 INFO mapred.JobClient: Job Counters
12/03/08 15:52:31 INFO mapred.JobClient: Launched reduce tasks=1
12/03/08 15:52:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=124769
12/03/08 15:52:31 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
12/03/08 15:52:31 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
12/03/08 15:52:31 INFO mapred.JobClient: Rack-local map tasks=1
12/03/08 15:52:31 INFO mapred.JobClient: Launched map tasks=1
12/03/08 15:52:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16543
12/03/08 15:52:31 INFO mapred.JobClient: File Output Format Counters
12/03/08 15:52:31 INFO mapred.JobClient: Bytes Written=73395270
12/03/08 15:52:31 INFO mapred.JobClient: FileSystemCounters
12/03/08 15:52:31 INFO mapred.JobClient: FILE_BYTES_READ=509127834
12/03/08 15:52:31 INFO mapred.JobClient: HDFS_BYTES_READ=45587326
12/03/08 15:52:31 INFO mapred.JobClient:
FILE_BYTES_WRITTEN=577589760
12/03/08 15:52:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=73395270
12/03/08 15:52:31 INFO mapred.JobClient: File Input Format Counters
12/03/08 15:52:31 INFO mapred.JobClient: Bytes Read=45587186
12/03/08 15:52:31 INFO mapred.JobClient:
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
12/03/08 15:52:31 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
12/03/08 15:52:31 INFO mapred.JobClient: COOCCURRENCES=65114863
12/03/08 15:52:31 INFO mapred.JobClient: Map-Reduce Framework
12/03/08 15:52:31 INFO mapred.JobClient: Reduce input groups=4837
12/03/08 15:52:31 INFO mapred.JobClient: Map output materialized
bytes=68416023
12/03/08 15:52:31 INFO mapred.JobClient: Combine output
records=79108
12/03/08 15:52:31 INFO mapred.JobClient: Map input records=294933
12/03/08 15:52:31 INFO mapred.JobClient: Reduce shuffle
bytes=68416023
12/03/08 15:52:31 INFO mapred.JobClient: Reduce output records=4837
12/03/08 15:52:31 INFO mapred.JobClient: Spilled Records=117235
12/03/08 15:52:31 INFO mapred.JobClient: Map output bytes=694645784
12/03/08 15:52:31 INFO mapred.JobClient: Combine input
records=4038329
12/03/08 15:52:31 INFO mapred.JobClient: Map output records=3964058
12/03/08 15:52:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=119
12/03/08 15:52:31 INFO mapred.JobClient: Reduce input records=4837
12/03/08 15:52:32 INFO input.FileInputFormat: Total input paths to
process : 1
12/03/08 15:52:32 INFO mapred.JobClient: Running job:
job_201203071745_0042
12/03/08 15:52:33 INFO mapred.JobClient: map 0% reduce 0%
12/03/08 15:52:52 INFO mapred.JobClient: map 3% reduce 0%
12/03/08 15:52:55 INFO mapred.JobClient: map 5% reduce 0%
12/03/08 15:52:58 INFO mapred.JobClient: map 7% reduce 0%
12/03/08 15:53:01 INFO mapred.JobClient: map 9% reduce 0%
12/03/08 15:53:04 INFO mapred.JobClient: map 10% reduce 0%
12/03/08 15:53:07 INFO mapred.JobClient: map 12% reduce 0%
12/03/08 15:53:10 INFO mapred.JobClient: map 14% reduce 0%
12/03/08 15:53:13 INFO mapred.JobClient: map 17% reduce 0%
12/03/08 15:53:16 INFO mapred.JobClient: map 18% reduce 0%
12/03/08 15:53:19 INFO mapred.JobClient: map 21% reduce 0%
12/03/08 15:53:22 INFO mapred.JobClient: map 23% reduce 0%
12/03/08 15:53:25 INFO mapred.JobClient: map 25% reduce 0%
12/03/08 15:53:28 INFO mapred.JobClient: map 27% reduce 0%
12/03/08 15:53:31 INFO mapred.JobClient: map 29% reduce 0%
12/03/08 15:53:34 INFO mapred.JobClient: map 31% reduce 0%
12/03/08 15:53:37 INFO mapred.JobClient: map 33% reduce 0%
12/03/08 15:53:40 INFO mapred.JobClient: map 35% reduce 0%
12/03/08 15:53:43 INFO mapred.JobClient: map 37% reduce 0%
12/03/08 15:53:46 INFO mapred.JobClient: map 39% reduce 0%
12/03/08 15:53:49 INFO mapred.JobClient: map 41% reduce 0%
12/03/08 15:53:52 INFO mapred.JobClient: map 43% reduce 0%
12/03/08 15:53:55 INFO mapred.JobClient: map 46% reduce 0%
12/03/08 15:53:58 INFO mapred.JobClient: map 48% reduce 0%
12/03/08 15:54:01 INFO mapred.JobClient: map 50% reduce 0%
12/03/08 15:54:04 INFO mapred.JobClient: map 53% reduce 0%
12/03/08 15:54:07 INFO mapred.JobClient: map 55% reduce 0%
12/03/08 15:54:10 INFO mapred.JobClient: map 57% reduce 0%
12/03/08 15:54:13 INFO mapred.JobClient: map 60% reduce 0%
12/03/08 15:54:16 INFO mapred.JobClient: map 63% reduce 0%
12/03/08 15:54:19 INFO mapred.JobClient: map 65% reduce 0%
12/03/08 15:54:22 INFO mapred.JobClient: map 68% reduce 0%
12/03/08 15:54:25 INFO mapred.JobClient: map 71% reduce 0%
12/03/08 15:54:28 INFO mapred.JobClient: map 74% reduce 0%
12/03/08 15:54:31 INFO mapred.JobClient: map 77% reduce 0%
12/03/08 15:54:34 INFO mapred.JobClient: map 81% reduce 0%
12/03/08 15:54:37 INFO mapred.JobClient: map 84% reduce 0%
12/03/08 15:54:40 INFO mapred.JobClient: map 88% reduce 0%
12/03/08 15:54:43 INFO mapred.JobClient: map 93% reduce 0%
12/03/08 15:54:46 INFO mapred.JobClient: map 99% reduce 0%
12/03/08 15:54:49 INFO mapred.JobClient: map 100% reduce 0%
12/03/08 15:55:01 INFO mapred.JobClient: map 100% reduce 100%
12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
job_201203071745_0042
12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
12/03/08 15:55:06 INFO mapred.JobClient: Job Counters
12/03/08 15:55:06 INFO mapred.JobClient: Launched reduce tasks=1
12/03/08 15:55:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=133985
12/03/08 15:55:06 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
12/03/08 15:55:06 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
12/03/08 15:55:06 INFO mapred.JobClient: Launched map tasks=1
12/03/08 15:55:06 INFO mapred.JobClient: Data-local map tasks=1
12/03/08 15:55:06 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10311
12/03/08 15:55:06 INFO mapred.JobClient: File Output Format Counters
12/03/08 15:55:06 INFO mapred.JobClient: Bytes Written=580158
12/03/08 15:55:06 INFO mapred.JobClient: FileSystemCounters
12/03/08 15:55:06 INFO mapred.JobClient: FILE_BYTES_READ=14921344
12/03/08 15:55:06 INFO mapred.JobClient: HDFS_BYTES_READ=73395400
12/03/08 15:55:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=15396906
12/03/08 15:55:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=580158
12/03/08 15:55:06 INFO mapred.JobClient: File Input Format Counters
12/03/08 15:55:06 INFO mapred.JobClient: Bytes Read=73395270
12/03/08 15:55:06 INFO mapred.JobClient: Map-Reduce Framework
12/03/08 15:55:06 INFO mapred.JobClient: Reduce input groups=4837
12/03/08 15:55:06 INFO mapred.JobClient: Map output materialized
bytes=431573
12/03/08 15:55:06 INFO mapred.JobClient: Combine output
records=96955
12/03/08 15:55:06 INFO mapred.JobClient: Map input records=4837
12/03/08 15:55:06 INFO mapred.JobClient: Reduce shuffle bytes=0
12/03/08 15:55:06 INFO mapred.JobClient: Reduce output records=4837
12/03/08 15:55:06 INFO mapred.JobClient: Spilled Records=166369
12/03/08 15:55:06 INFO mapred.JobClient: Map output bytes=153928302
12/03/08 15:55:06 INFO mapred.JobClient: Combine input
records=7418380
12/03/08 15:55:06 INFO mapred.JobClient: Map output records=7326262
12/03/08 15:55:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=130
12/03/08 15:55:06 INFO mapred.JobClient: Reduce input records=4837
12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
(Minutes: 6.522983333333333)
Running seqdumper on the output looks reasonable.
Maybe the named vectors were the problem?
On 3/7/12 8:50 AM, Sebastian Schelter wrote:
> Hi Pat,
>
> Something is going completely wrong. The first pass over the data of
> RowSimilarityJob transposes the input matrix. From the output of the
> first jobs, it seems as if your input is a 4838 x 3 matrix only:
>
> Map input records=4838
> Map output records=3
> Combine input records=3
> Combine output records=3
> Reduce input records=3
>
> Could you have a detailed look at the input to RowSimilarityJob?
>
> --sebastian
>
>
> On 07.03.2012 17:38, Pat Ferrel wrote:
>> 12/03/06 17:02:42 INFO mapred.JobClient: Map input records=0
>
Re: How to find the k most similar docs
Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat,
Something is going completely wrong. The first pass over the data of
RowSimilarityJob transposes the input matrix. From the output of the
first jobs, it seems as if your input is a 4838 x 3 matrix only:
Map input records=4838
Map output records=3
Combine input records=3
Combine output records=3
Reduce input records=3
Could you have a detailed look at the input to RowSimilarityJob?
--sebastian
On 07.03.2012 17:38, Pat Ferrel wrote:
> 12/03/06 17:02:42 INFO mapred.JobClient: Map input records=0
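[Editor's note: Sebastian's inference above can be illustrated with a small sketch. The first pass of RowSimilarityJob groups every (column, value) pair of the input rows by column id, i.e. it transposes the matrix, so the number of reduce input records equals the number of distinct column indices. This is a minimal in-memory Python illustration, not Mahout's actual MapReduce code:]

```python
# Sketch of the transpose pass: group (column, value) pairs by column id.
# If all 4838 rows only ever touch 3 distinct columns, the grouped output
# has exactly 3 records -- which is how "Reduce input records=3" reveals
# an effectively 4838 x 3 input matrix.
from collections import defaultdict

def transpose_sparse(rows):
    """rows: dict row_id -> {col_id: value}. Returns the transposed dict."""
    cols = defaultdict(dict)
    for row_id, vector in rows.items():
        for col_id, value in vector.items():
            cols[col_id][row_id] = value
    return dict(cols)

# 4838 rows that all happen to use only column indices 0, 1 and 2:
rows = {r: {0: 1.0, 1: 2.0, 2: 3.0} for r in range(4838)}
transposed = transpose_sparse(rows)
print(len(rows), len(transposed))  # 4838 3
```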
Re: How to find the k most similar docs
Posted by Pat Ferrel <pa...@occamsmachete.com>.
I have been experimenting with different analyzers and n-grams to clean
up the vectors. Here is a run on a high-dimensionality set of vectors
with a loose analyzer (I think it was the default). The output of the
rowid job was:
pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowid -i
wikipedia-tfidf-custom-analyzer/tfidf-vectors/ -o wikipedia-matrix
--tempDir temp
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
MAHOUT-JOB:
/home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/03/06 16:53:29 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647,
--input=wikipedia-tfidf-custom-analyzer/tfidf-vectors/,
--output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
12/03/06 16:53:30 INFO util.NativeCodeLoader: Loaded the
native-hadoop library
12/03/06 16:53:30 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
12/03/06 16:53:30 INFO vectors.RowIdJob: Wrote out matrix with 4838
rows and 286907 columns to wikipedia-matrix/matrix
12/03/06 16:53:30 INFO driver.MahoutDriver: Program took 1248 ms
(Minutes: 0.0208)
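[Editor's note: for readers unfamiliar with the rowid step, it replaces the Text keys of the tfidf vectors with sequential IntWritable ids, writing the re-keyed matrix plus a docIndex that maps ids back to the original keys. A rough in-memory sketch, with illustrative names and a deterministic ordering chosen here for the example (Mahout simply enumerates in file order):]

```python
def assign_row_ids(named_vectors):
    """named_vectors: dict text_key -> vector.
    Returns (matrix: int_id -> vector, doc_index: int_id -> text_key)."""
    matrix, doc_index = {}, {}
    # Sorting is only for a reproducible example; the real job keeps file order.
    for int_id, (text_key, vector) in enumerate(sorted(named_vectors.items())):
        matrix[int_id] = vector
        doc_index[int_id] = text_key
    return matrix, doc_index

vectors = {"/wiki/doc-a": [0.1, 0.0], "/wiki/doc-b": [0.0, 0.2]}
matrix, doc_index = assign_row_ids(vectors)
print(doc_index[0])  # /wiki/doc-a
```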
Then I removed temp (shouldn't the jobs do that themselves?) and ran the
rowsimilarity job:
pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowsimilarity -i
wikipedia-matrix/matrix -o wikipedia-similarity -r 286907
--similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
MAHOUT-JOB:
/home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/03/06 17:00:55 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647, --excludeSelfSimilarity=true,
--input=wikipedia-matrix/matrix, --maxSimilaritiesPerRow=10,
--numberOfColumns=286907, --output=wikipedia-similarity,
--similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
12/03/06 17:00:56 INFO input.FileInputFormat: Total input paths to
process : 1
12/03/06 17:00:56 INFO mapred.JobClient: Running job:
job_201203061645_0006
12/03/06 17:00:57 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 17:01:13 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 17:01:25 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 17:01:30 INFO mapred.JobClient: Job complete:
job_201203061645_0006
12/03/06 17:01:30 INFO mapred.JobClient: Counters: 26
12/03/06 17:01:30 INFO mapred.JobClient: Job Counters
12/03/06 17:01:30 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 17:01:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13502
12/03/06 17:01:30 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
12/03/06 17:01:30 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
12/03/06 17:01:30 INFO mapred.JobClient: Rack-local map tasks=1
12/03/06 17:01:30 INFO mapred.JobClient: Launched map tasks=1
12/03/06 17:01:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10496
12/03/06 17:01:30 INFO mapred.JobClient: File Output Format Counters
12/03/06 17:01:30 INFO mapred.JobClient: Bytes Written=97
12/03/06 17:01:30 INFO mapred.JobClient: FileSystemCounters
12/03/06 17:01:30 INFO mapred.JobClient: FILE_BYTES_READ=40
12/03/06 17:01:30 INFO mapred.JobClient: HDFS_BYTES_READ=122407
12/03/06 17:01:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=45437
12/03/06 17:01:30 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=118
12/03/06 17:01:30 INFO mapred.JobClient: File Input Format Counters
12/03/06 17:01:30 INFO mapred.JobClient: Bytes Read=122290
12/03/06 17:01:30 INFO mapred.JobClient:
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
12/03/06 17:01:30 INFO mapred.JobClient: ROWS=4838
12/03/06 17:01:30 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 17:01:30 INFO mapred.JobClient: Reduce input groups=3
12/03/06 17:01:30 INFO mapred.JobClient: Map output materialized
bytes=32
12/03/06 17:01:30 INFO mapred.JobClient: Combine output records=3
12/03/06 17:01:30 INFO mapred.JobClient: Map input records=4838
12/03/06 17:01:30 INFO mapred.JobClient: Reduce shuffle bytes=32
12/03/06 17:01:30 INFO mapred.JobClient: Reduce output records=0
12/03/06 17:01:30 INFO mapred.JobClient: Spilled Records=6
12/03/06 17:01:30 INFO mapred.JobClient: Map output bytes=33
12/03/06 17:01:30 INFO mapred.JobClient: Combine input records=3
12/03/06 17:01:30 INFO mapred.JobClient: Map output records=3
12/03/06 17:01:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=117
12/03/06 17:01:30 INFO mapred.JobClient: Reduce input records=3
12/03/06 17:01:30 INFO input.FileInputFormat: Total input paths to
process : 1
12/03/06 17:01:31 INFO mapred.JobClient: Running job:
job_201203061645_0007
12/03/06 17:01:32 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 17:01:49 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 17:02:01 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 17:02:06 INFO mapred.JobClient: Job complete:
job_201203061645_0007
12/03/06 17:02:06 INFO mapred.JobClient: Counters: 25
12/03/06 17:02:06 INFO mapred.JobClient: Job Counters
12/03/06 17:02:06 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 17:02:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12989
12/03/06 17:02:06 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
12/03/06 17:02:06 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
12/03/06 17:02:06 INFO mapred.JobClient: Launched map tasks=1
12/03/06 17:02:06 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 17:02:06 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10341
12/03/06 17:02:06 INFO mapred.JobClient: File Output Format Counters
12/03/06 17:02:06 INFO mapred.JobClient: Bytes Written=97
12/03/06 17:02:06 INFO mapred.JobClient: FileSystemCounters
12/03/06 17:02:06 INFO mapred.JobClient: FILE_BYTES_READ=22
12/03/06 17:02:06 INFO mapred.JobClient: HDFS_BYTES_READ=237
12/03/06 17:02:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=45937
12/03/06 17:02:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=97
12/03/06 17:02:06 INFO mapred.JobClient: File Input Format Counters
12/03/06 17:02:06 INFO mapred.JobClient: Bytes Read=97
12/03/06 17:02:06 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 17:02:06 INFO mapred.JobClient: Reduce input groups=0
12/03/06 17:02:06 INFO mapred.JobClient: Map output materialized
bytes=14
12/03/06 17:02:06 INFO mapred.JobClient: Combine output records=0
12/03/06 17:02:06 INFO mapred.JobClient: Map input records=0
12/03/06 17:02:06 INFO mapred.JobClient: Reduce shuffle bytes=0
12/03/06 17:02:06 INFO mapred.JobClient: Reduce output records=0
12/03/06 17:02:06 INFO mapred.JobClient: Spilled Records=0
12/03/06 17:02:06 INFO mapred.JobClient: Map output bytes=0
12/03/06 17:02:06 INFO mapred.JobClient: Combine input records=0
12/03/06 17:02:06 INFO mapred.JobClient: Map output records=0
12/03/06 17:02:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=119
12/03/06 17:02:06 INFO mapred.JobClient: Reduce input records=0
12/03/06 17:02:07 INFO input.FileInputFormat: Total input paths to
process : 1
12/03/06 17:02:07 INFO mapred.JobClient: Running job:
job_201203061645_0008
12/03/06 17:02:08 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 17:02:25 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 17:02:37 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 17:02:42 INFO mapred.JobClient: Job complete:
job_201203061645_0008
12/03/06 17:02:42 INFO mapred.JobClient: Counters: 25
12/03/06 17:02:42 INFO mapred.JobClient: Job Counters
12/03/06 17:02:42 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 17:02:42 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12971
12/03/06 17:02:42 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
12/03/06 17:02:42 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
12/03/06 17:02:42 INFO mapred.JobClient: Launched map tasks=1
12/03/06 17:02:42 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 17:02:42 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10322
12/03/06 17:02:42 INFO mapred.JobClient: File Output Format Counters
12/03/06 17:02:42 INFO mapred.JobClient: Bytes Written=97
12/03/06 17:02:42 INFO mapred.JobClient: FileSystemCounters
12/03/06 17:02:42 INFO mapred.JobClient: FILE_BYTES_READ=22
12/03/06 17:02:42 INFO mapred.JobClient: HDFS_BYTES_READ=227
12/03/06 17:02:42 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44039
12/03/06 17:02:42 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=97
12/03/06 17:02:42 INFO mapred.JobClient: File Input Format Counters
12/03/06 17:02:42 INFO mapred.JobClient: Bytes Read=97
12/03/06 17:02:42 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 17:02:42 INFO mapred.JobClient: Reduce input groups=0
12/03/06 17:02:42 INFO mapred.JobClient: Map output materialized
bytes=14
12/03/06 17:02:42 INFO mapred.JobClient: Combine output records=0
12/03/06 17:02:42 INFO mapred.JobClient: Map input records=0
12/03/06 17:02:42 INFO mapred.JobClient: Reduce shuffle bytes=14
12/03/06 17:02:42 INFO mapred.JobClient: Reduce output records=0
12/03/06 17:02:42 INFO mapred.JobClient: Spilled Records=0
12/03/06 17:02:42 INFO mapred.JobClient: Map output bytes=0
12/03/06 17:02:42 INFO mapred.JobClient: Combine input records=0
12/03/06 17:02:42 INFO mapred.JobClient: Map output records=0
12/03/06 17:02:42 INFO mapred.JobClient: SPLIT_RAW_BYTES=130
12/03/06 17:02:42 INFO mapred.JobClient: Reduce input records=0
12/03/06 17:02:42 INFO driver.MahoutDriver: Program took 107225 ms
(Minutes: 1.7870833333333334)
It seems to have executed correctly. I ran it on a small cluster, but even
so it was awfully fast. The ROWS counter is there but not the others.
How is the output stored? What does it represent? I would expect a
sequence of row ids as keys, each with ten row ids as values. I used named
vectors, if that matters.
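[Editor's note: on the output format question, each key is a row id and the value is a vector whose nonzero entries are the ids of similar rows paired with their cosine similarity scores, at most -m of them, with -ess excluding the row itself. A minimal in-memory sketch of that computation, assuming sparse rows as dicts; this is an illustration, not Mahout's implementation:]

```python
import math

def top_k_cosine(matrix, k):
    """matrix: dict row_id -> {col_id: value} (sparse rows).
    Returns row_id -> list of (other_row_id, similarity), best first."""
    norms = {r: math.sqrt(sum(x * x for x in v.values()))
             for r, v in matrix.items()}
    result = {}
    for r, v in matrix.items():
        sims = []
        for s, w in matrix.items():
            if s == r:
                continue  # like -ess true: exclude self-similarity
            dot = sum(x * w.get(c, 0.0) for c, x in v.items())
            if dot > 0.0:
                sims.append((s, dot / (norms[r] * norms[s])))
        sims.sort(key=lambda p: -p[1])   # most similar first
        result[r] = sims[:k]             # keep at most k per row
    return result

m = {0: {0: 1.0, 1: 1.0}, 1: {0: 1.0}, 2: {2: 1.0}}
print(top_k_cosine(m, 10))
```

Note that a row with no term overlap with any other row (row 2 above) ends up with an empty similarity list rather than far-away neighbors, since cosine similarity of disjoint sparse vectors is zero.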
The output is of the correct type but empty. Here is the seqdumper output;
notice Count: 0, and the file is only 97 bytes.
pat@occam2:~/mahout-distribution-0.6$ bin/mahout seqdumper -s
wikipedia-similarity/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
MAHOUT-JOB:
/home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/03/07 08:31:59 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647, --seqFile=wikipedia-similarity/part-r-00000,
--startPhase=0, --tempDir=temp}
Input Path: wikipedia-similarity/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0
12/03/07 08:31:59 INFO driver.MahoutDriver: Program took 603 ms
(Minutes: 0.01005)
On 3/6/12 11:09 PM, Sebastian Schelter wrote:
> Hi Pat,
>
> You are right, these results look strange. RowSimilarityJob has 3 custom
> counters (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES); can you give us
> the numbers for these?
>
> --sebastian
>
> On 07.03.2012 02:14, Pat Ferrel wrote:
>> Ok, making progress. I created a matrix using rowid and got the
>> following output:
>>
>> Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
>> wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
>> ...
>> 12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
>> --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
>> 2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
>> from SCDynamicStore
>> 12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java
>> classes where applicable
>> 12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>> 12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>> 12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
>> rows and 87325 columns to wikipedia-matrix/matrix
>> 12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
>> (Minutes: 0.0293)
>>
>> So a doc matrix with 4838 docs and 87325 dimensions. Next I ran
>> RowSimilarityJob
>>
>> Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
>> -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
>> --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
>>
>> This gives me output in wikipedia-similarity/part-m-00000 but the size
>> is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I
>> set no threshold so I'd expect it to pick the 10 nearest even if they
>> are far away.
>>
>> BTW what is the output format?
>>
>> On 3/5/12 11:48 AM, Suneel Marthi wrote:
>>> Pat,
>>>
>>> Your input to RowSimilarity seems to be the tfidf-vectors directory
>>> which is <Text, VectorWritable>.
>>>
>>> Before executing the RowSimilarity job u need to run the RowIdJob
>>> which creates a matrix of <IntWritable, VectorWritable>. This matrix
>>> should be the input to RowSimilarity.
>>>
>>> Also from your command, you seem to be missing --tempDir argument, you
>>> would need that too.
>>>
>>> Suneel
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Sebastian Schelter<ss...@apache.org>
>>> *To:* user@mahout.apache.org
>>> *Sent:* Monday, March 5, 2012 2:32 PM
>>> *Subject:* Re: How to find the k most similar docs
>>>
>>> That's the problem:
>>>
>>> org.apache.hadoop.io.Text cannot be
>>> cast to org.apache.hadoop.io.IntWritable
>>>
>>> RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
>>> you supply <Text,VectorWritable>.
>>>
>>> --sebastian
>>>
>>> On 05.03.2012 20:29, Pat Ferrel wrote:
>>>> org.apache.hadoop.io.Text cannot be
>>>> cast to org.apache.hadoop.io.IntWritable
>>>
>>>
>
Re: How to find the k most similar docs
Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat,
You are right, these results look strange. RowSimilarityJob has 3 custom
counters (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES); can you give us
the numbers for these?
--sebastian
On 07.03.2012 02:14, Pat Ferrel wrote:
> Ok, making progress. I created a matrix using rowid and got the
> following output:
>
> Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
> wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
> ...
> 12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
> {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
> --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
> 2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
> from SCDynamicStore
> 12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java
> classes where applicable
> 12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
> 12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
> 12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
> rows and 87325 columns to wikipedia-matrix/matrix
> 12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
> (Minutes: 0.0293)
>
> So a doc matrix with 4838 docs and 87325 dimensions. Next I ran
> RowSimilarityJob
>
> Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
> -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
> --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
>
> This gives me output in wikipedia-similarity/part-m-00000, but the file
> is only 97 bytes. Shouldn't it have created 4838 * 10 results, i.e. ten
> per row? I set no threshold, so I'd expect it to pick the 10 nearest
> even if they are far away.
>
> BTW what is the output format?
>
> On 3/5/12 11:48 AM, Suneel Marthi wrote:
>> Pat,
>>
>> Your input to RowSimilarityJob seems to be the tfidf-vectors directory,
>> which is <Text, VectorWritable>.
>>
>> Before executing the RowSimilarityJob you need to run the RowIdJob,
>> which creates a matrix of <IntWritable, VectorWritable>. This matrix
>> should be the input to RowSimilarityJob.
>>
>> Also, your command seems to be missing the --tempDir argument; you
>> would need that too.
>>
>> Suneel
>>
>
Re: How to find the k most similar docs
Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok, making progress. I created a matrix using rowid and got the
following output:
Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
...
12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
--output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
from SCDynamicStore
12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
rows and 87325 columns to wikipedia-matrix/matrix
12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
(Minutes: 0.0293)
So, a doc matrix with 4838 docs and 87325 dimensions. Next I ran
RowSimilarityJob:
Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
-i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
--similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
This gives me output in wikipedia-similarity/part-m-00000, but the file
is only 97 bytes. Shouldn't it have created 4838 * 10 results, i.e. ten
per row? I set no threshold, so I'd expect it to pick the 10 nearest
even if they are far away.
BTW what is the output format?
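
For reference, the computation being asked for here -- the top-k
cosine-similar rows per row, self excluded, no threshold -- can be
sketched in plain Python (the function names and toy vectors are
illustrative, not Mahout's):

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_similar(rows, k):
    # For each row, the k most similar *other* rows (self excluded),
    # with no similarity threshold -- mirroring '-m 10 -ess true'.
    result = {}
    for i, a in enumerate(rows):
        sims = sorted(((cosine(a, b), j) for j, b in enumerate(rows) if j != i),
                      reverse=True)
        result[i] = [(j, s) for s, j in sims[:k]]
    return result

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbors = top_k_similar(docs, 2)  # row 0's nearest neighbour is row 1
```

With 4838 rows and -m 10 that works out to up to 10 entries per row,
hence the expectation of roughly 4838 * 10 similarity pairs in the output.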
On 3/5/12 11:48 AM, Suneel Marthi wrote:
> Pat,
>
> Your input to RowSimilarityJob seems to be the tfidf-vectors directory,
> which is <Text, VectorWritable>.
>
> Before executing the RowSimilarityJob you need to run the RowIdJob,
> which creates a matrix of <IntWritable, VectorWritable>. This matrix
> should be the input to RowSimilarityJob.
>
> Also, your command seems to be missing the --tempDir argument; you
> would need that too.
>
> Suneel
>
Re: How to find the k most similar docs
Posted by Suneel Marthi <su...@yahoo.com>.
Pat,
Your input to RowSimilarityJob seems to be the tfidf-vectors directory, which is <Text, VectorWritable>.
Before executing the RowSimilarityJob you need to run the RowIdJob, which creates a matrix of <IntWritable, VectorWritable>. This matrix should be the input to RowSimilarityJob.
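In other words, the rowid step just re-keys the sequence file: the Text
document keys become sequential integer row ids, with the id-to-key
mapping kept on the side (Mahout writes it out as a docIndex next to the
matrix, if I remember right). A rough Python analogue of that re-keying:

```python
def rowid_job(tfidf_vectors):
    # tfidf_vectors: iterable of (text_key, vector) pairs, as in the
    # tfidf-vectors directory (<Text, VectorWritable>).
    # Returns (matrix, doc_index): the matrix keyed by int row ids
    # (<IntWritable, VectorWritable>) plus the id -> original-key map.
    matrix, doc_index = [], {}
    for row_id, (text_key, vector) in enumerate(tfidf_vectors):
        matrix.append((row_id, vector))
        doc_index[row_id] = text_key
    return matrix, doc_index

matrix, doc_index = rowid_job([("/wiki/A", [1, 0]), ("/wiki/B", [0, 1])])
```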
Also, your command seems to be missing the --tempDir argument; you would need that too.
Suneel
Re: How to find the k most similar docs
Posted by Sebastian Schelter <ss...@apache.org>.
That's the problem:
org.apache.hadoop.io.Text cannot be
cast to org.apache.hadoop.io.IntWritable
RowSimilarityJob expects <IntWritable,VectorWritable> as input, it seems
you supply <Text,VectorWritable>.
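
Put another way: the job's first mapper declares IntWritable keys, so
Text keys from tfidf-vectors fail at the very first map() call. A toy
Python analogue of that mismatch (the names are mine, not Mahout's):

```python
def vector_norm_mapper(key, vector):
    # Stand-in for RowSimilarityJob's VectorNormMapper: it expects an
    # integer row id. A string key here plays the role of the
    # "Text cannot be cast to IntWritable" error in the Hadoop job.
    if not isinstance(key, int):
        raise TypeError("expected int row id, got %s" % type(key).__name__)
    return key, vector

vector_norm_mapper(42, [1.0, 0.0])       # int key from rowid: fine
# vector_norm_mapper("doc-42", [1.0])    # Text-style key: raises TypeError
```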
--sebastian
On 05.03.2012 20:29, Pat Ferrel wrote:
> org.apache.hadoop.io.Text cannot be
> cast to org.apache.hadoop.io.IntWritable