Posted to user@mahout.apache.org by Sören Brunk <so...@deri.org> on 2011/11/08 17:33:00 UTC

RowSimilarityJob input

Hi,

I'm trying to use RowSimilarityJob (current trunk) to calculate pairwise 
similarities between feature vectors but I'm struggling a bit with the 
correct input format.

I used SparseVectorsFromSequenceFiles to create a bunch of vectors from 
documents. But using the TF-IDF vectors directly as input doesn't work: 
they are keyed by Text (string) keys, while RowSimilarityJob seems to 
expect IntWritable keys.
I've also seen something about DistributedRowMatrix as input in some 
older docs.

Any hints? Is RowSimilarityJob a good choice for that task at all?

Thanks for your help,
Sören

Re: RowSimilarityJob input

Posted by Sören Brunk <so...@deri.org>.

OK, after simply converting the vector keys from Text to IntWritable, it 
worked fine for me.
It took a while, but it ran only on my local machine with default 
vectorization settings and almost no preprocessing, so there's much room 
for improvement.
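
For anyone following along, the key conversion can be sketched roughly as 
below. This is plain Java with no Hadoop or Mahout dependencies, so it only 
shows the renumbering logic, not the SequenceFile I/O. The class and method 
names (KeyRenumbering, buildIndex, rekey) are hypothetical, and double[] 
stands in for the vector type; in a real job the input keys would be Text, 
the output keys IntWritable and the values VectorWritable.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps string document keys to consecutive int row ids.
public class KeyRenumbering {

    /** Assigns ids 0, 1, 2, ... to keys in first-seen order; repeated keys keep their first id. */
    static Map<String, Integer> buildIndex(List<String> keys) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String key : keys) {
            // index.size() is evaluated before the insert, so the first key gets id 0.
            index.putIfAbsent(key, index.size());
        }
        return index;
    }

    /** Re-keys each (string key, vector) row as (int id, vector). */
    static Map<Integer, double[]> rekey(Map<String, double[]> rows,
                                        Map<String, Integer> index) {
        Map<Integer, double[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> row : rows.entrySet()) {
            out.put(index.get(row.getKey()), row.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up document names and vectors, for illustration only.
        Map<String, double[]> rows = new LinkedHashMap<>();
        rows.put("/doc-a", new double[] {0.3, 0.0, 1.2});
        rows.put("/doc-b", new double[] {0.0, 0.7, 0.4});

        Map<String, Integer> index = buildIndex(Arrays.asList("/doc-a", "/doc-b"));
        Map<Integer, double[]> rekeyed = rekey(rows, index);

        for (Map.Entry<Integer, double[]> row : rekeyed.entrySet()) {
            System.out.println(row.getKey() + " -> " + Arrays.toString(row.getValue()));
        }
    }
}
```

Keeping the index around is worthwhile, since it is the only way to map the 
integer row ids in the similarity output back to the original document names.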

Thanks for your help!
Sören

On 08/11/11 16:45, Sebastian Schelter wrote:
> Hi Sören,
>
> RowSimilarityJob expects (IntWritable, VectorWritable) pairs as input. It should
> be a reasonable choice for comparing the pairwise similarities between
> text documents. I suggest you throw away the 1% most frequent terms as
> described in http://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf. I
> think SparseVectorsFromSequenceFiles already does that by default.
>
> It would be great if you let the mailing list know how it worked out
> for you.
>
> Greetings to Galway!
> Sebastian

Re: RowSimilarityJob input

Posted by Sebastian Schelter <ss...@apache.org>.

Hi Sören,

RowSimilarityJob expects (IntWritable, VectorWritable) pairs as input. It should
be a reasonable choice for comparing the pairwise similarities between
text documents. I suggest you throw away the 1% most frequent terms as
described in http://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf. I
think SparseVectorsFromSequenceFiles already does that by default.
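
To illustrate the pruning idea, here is a minimal sketch in plain Java, 
independent of Mahout. It assumes that "most frequent" means highest 
document frequency (the number of documents a term occurs in) and that the 
cut is a fraction of the vocabulary; the paper and Mahout's own options may 
define the threshold differently. FrequentTermPruning and its methods are 
hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: finds the top fraction of terms by document frequency.
public class FrequentTermPruning {

    /** Counts, for each term, how many documents contain it. */
    static Map<String, Integer> documentFrequencies(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns the terms to drop: the top `fraction` of the vocabulary by document frequency. */
    static Set<String> mostFrequentTerms(Map<String, Integer> df, double fraction) {
        List<String> terms = new ArrayList<>(df.keySet());
        // Highest document frequency first; ties broken alphabetically for determinism.
        terms.sort((a, b) -> {
            int byDf = Integer.compare(df.get(b), df.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        int cut = (int) Math.ceil(fraction * terms.size());
        return new HashSet<>(terms.subList(0, cut));
    }

    /** Returns copies of the documents with the given terms removed. */
    static List<Set<String>> prune(List<Set<String>> docs, Set<String> drop) {
        List<Set<String>> pruned = new ArrayList<>();
        for (Set<String> doc : docs) {
            Set<String> kept = new HashSet<>(doc);
            kept.removeAll(drop);
            pruned.add(kept);
        }
        return pruned;
    }
}
```

Calling mostFrequentTerms(df, 0.01) would select the 1% of the vocabulary 
that appears in the most documents. Dropping those terms matters for 
pairwise similarity because very common terms make almost every pair of 
documents co-occur, which inflates the number of pairs to compare.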

It would be great if you let the mailing list know how it worked out for you.

Greetings to Galway!
Sebastian
