Posted to user@mahout.apache.org by Divya <di...@k2associates.com.sg> on 2010/10/28 10:11:44 UTC

generate similar documents

Hi,

I have a directory of documents from which I generated a sequence file
using SequenceFilesFromDirectory, and then converted it into vectors using
SparseVectorsFromSequenceFiles.

I am now following the thread linked below to generate a list of the most
similar documents:

http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED.6070703@googlemail.com%3E

How can I use RowSimilarityJob to generate a list of similar documents?

 * <ol>
 * <li>-Dmapred.input.dir=(path): Directory containing a {@link DistributedRowMatrix} as a
 *   SequenceFile<IntWritable,VectorWritable></li>
 * <li>-Dmapred.output.dir=(path): output path where the computations output should go
 *   (a {@link DistributedRowMatrix} stored as a SequenceFile<IntWritable,VectorWritable>)</li>
 * <li>--numberOfColumns: the number of columns in the input matrix</li>
 * <li>--similarityClassname (classname): an implementation of {@link DistributedVectorSimilarity}
 *   used to compute the similarity</li>
 * <li>--maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this number (100)</li>
 * </ol>

 

Which values should I pass for numberOfColumns and similarityClassname?

Regards,
Divya

Re: generate similar documents

Posted by Sebastian Schelter <ss...@apache.org>.

You have to supply that number. However, if the similarity measure you
choose does not use it in its computation (only SIMILARITY_LOGLIKELIHOOD
does), you can safely ignore it and pass in any number.
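
To illustrate why a log-likelihood measure would need the column count while other measures do not, here is a minimal sketch of the standard log-likelihood ratio over a 2x2 contingency table. It is not Mahout's code: the class LlrSketch and its method names are made up for this example, and it assumes binary row vectors, so that the fourth cell of the table counts the columns in which neither row has a value. That fourth cell is the only place where the total number of columns enters.

```java
public class LlrSketch {

    // x * ln(x), with the usual convention that 0 * ln(0) = 0
    public static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: xLogX(sum of counts) minus the sum of xLogX(count)
    public static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // both:            columns where row A and row B are both non-zero
    // nonZerosA/B:     non-zero columns in row A and in row B
    // numberOfColumns: total columns of the matrix (hypothetical parameter name,
    //                  chosen to mirror the --numberOfColumns option)
    public static double logLikelihoodRatio(long both, long nonZerosA,
                                            long nonZerosB, long numberOfColumns) {
        long k11 = both;
        long k12 = nonZerosA - both;
        long k21 = nonZerosB - both;
        // Columns where neither row has a value: depends on the total column count
        long k22 = numberOfColumns - nonZerosA - nonZerosB + both;

        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Same overlap, different column counts: the scores differ
        System.out.println(logLikelihoodRatio(10, 20, 30, 1000));
        System.out.println(logLikelihoodRatio(10, 20, 30, 100000));
    }
}
```

A measure such as cosine similarity only looks at the non-zero entries of the two rows, which is consistent with the advice above that the value can be arbitrary when the log-likelihood measure is not selected.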

--sebastian

On 28.10.2010 12:02, Divya wrote:
> Hi Sebastian,
> Where can I get numberOfColumns from?
> How can I work out how many columns my matrix has, given that
> SparseVectorsFromSequenceFiles generates the vectors in a binary format?
>
> Regards,
> Divya

RE: generate similar documents

Posted by Divya <di...@k2associates.com.sg>.

Hi Sebastian,

Where can I get numberOfColumns from?
How can I work out how many columns my matrix has, given that
SparseVectorsFromSequenceFiles generates the vectors in a binary format?

Regards,
Divya 

Re: generate similar documents

Posted by Sebastian Schelter <ss...@apache.org>.

Hi Divya,

--similarityClassname should point to an implementation of
org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity.
You can use any value from
org.apache.mahout.math.hadoop.similarity.SimilarityType to select a
predefined similarity measure, or you can point to an implementation of
your own.

--numberOfColumns is the number of columns of the input matrix, which
would be the number of unique terms, as I suppose your matrix is
documents x terms.
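
As a rough illustration of "number of unique terms" in a documents x terms matrix, the following sketch counts the distinct terms across a few tokenized documents. It does not read Mahout's vector files: the class ColumnCountSketch, the method countUniqueTerms and the sample documents are all invented for this example. In a real pipeline the vectorization step may prune or weight terms, so the count should be taken from whatever term dictionary that step produced rather than recomputed like this.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ColumnCountSketch {

    // Counts the distinct terms across already-tokenized documents.
    // Each distinct term corresponds to one column of a documents x terms matrix.
    public static int countUniqueTerms(List<List<String>> tokenizedDocs) {
        Set<String> vocabulary = new HashSet<>();
        for (List<String> doc : tokenizedDocs) {
            vocabulary.addAll(doc);
        }
        return vocabulary.size();
    }

    public static void main(String[] args) {
        // Three toy documents (rows); made-up data for illustration only
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("apache", "mahout", "similarity"),
            Arrays.asList("mahout", "vectors"),
            Arrays.asList("similarity", "job", "mahout"));

        // Distinct terms: apache, mahout, similarity, vectors, job
        System.out.println(countUniqueTerms(docs)); // prints 5
    }
}
```

Here the matrix would have 3 rows (documents) and 5 columns (terms), so 5 is the value that would be passed as the column count.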

--sebastian

Re: generate similar documents

Posted by Grant Ingersoll <gs...@apache.org>.

Hi,

Please ask these questions on user@mahout.apache.org.  The dev@ mailing list is geared towards the development of the Mahout code, while the user list is geared towards questions on how to use Mahout.

Thanks,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search