Posted to user@mahout.apache.org by Stefan Wienert <st...@wienert.cc> on 2011/06/14 19:15:09 UTC

tf-idf + svd + cosine similarity

Hey Guys,

I have some strange results in my LSA-Pipeline.

First, let me explain the steps my data goes through:
1) Extract the Term-Document-Matrix (TDM) from a Lucene datastore, using TF-IDF as the weighting
2) Transpose the TDM
3a) Run Mahout SVD (Lanczos) on the transposed TDM
3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
3c) Use no dimension reduction (for testing purposes)
4) Transpose the result (ONLY for none / svd)
5) Calculate Cosine Similarity (from Mahout); see the sketch just below
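
For reference, step 5 boils down to the uncentered cosine,
dot(a, b) / (||a|| * ||b||). A minimal in-memory sketch of what gets
computed, assuming mahout-math's Vector API and made-up weights (an
illustration only, not the actual distributed job):

  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Vector;

  // uncentered cosine: no mean-centering of the vectors
  Vector a = new DenseVector(new double[] {0.5, 0.0, 1.2}); // made-up tf-idf weights
  Vector b = new DenseVector(new double[] {0.4, 0.3, 0.9});
  double cosine = a.dot(b) / (a.norm(2) * b.norm(2));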

Now... some strange things happen:
First of all: The demo data shows the similarity from document 1 to
all other documents.

the results using only cosine similarity (without dimension reduction):
http://the-lord.de/img/none.png

the result using svd, rank 10:
http://the-lord.de/img/svd-10.png
some points fall down to the bottom.

the results using ssvd, rank 10:
http://the-lord.de/img/ssvd-10.png

the result using svd, rank 100:
http://the-lord.de/img/svd-100.png
more points fall down to the bottom.

the results using ssvd, rank 100:
http://the-lord.de/img/ssvd-100.png

the results using svd, rank 200:
http://the-lord.de/img/svd-200.png
even more points fall down to the bottom.

the results using svd, rank 1000:
http://the-lord.de/img/svd-1000.png
most points are at the bottom.

please note the scale:
- the avg for none: 0.8712
- the avg for svd rank 10: 0.2648
- the avg for svd rank 100: 0.0628
- the avg for svd rank 200: 0.0238
- the avg for svd rank 1000: 0.0116

so my question is:
Can you explain this behavior? Why do the documents become less and less
similar as the SVD rank grows? I thought it would be the opposite.

Cheers
Stefan

Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
Hi Sebastian,

the bug does not affect my results:
NONE > bugcheck.pdf
SVD > bugcheck2.pdf
(although the bug was active)

Cheers,
Stefan


2011/6/14 Sebastian Schelter <ss...@apache.org>:
> Hi Stefan,
>
> I checked the implementation of RowSimilarityJob and we might still have a
> bug in the 0.5 release... (f**k). I don't know if your problem is caused by
> that, but the similarity scores might not be correct...
>
> We had this issue in 0.4 already, when someone realized that cooccurrences
> were mapped out inconsistently, so for 0.5 we made sure that we always map
> the smaller row as first value. But apparently I did not adjust the value
> setting for the Cooccurrence object...
>
> In 0.5 the code is:
>
>  if (rowA <= rowB) {
>   rowPair.set(rowA, rowB, weightA, weightB);
>  } else {
>   rowPair.set(rowB, rowA, weightB, weightA);
>  }
>  coocurrence.set(column.get(), valueA, valueB);
>
> But it should be (already fixed in current trunk a few days ago):
>
>  if (rowA <= rowB) {
>   rowPair.set(rowA, rowB, weightA, weightB);
>   coocurrence.set(column.get(), valueA, valueB);
>  } else {
>   rowPair.set(rowB, rowA, weightB, weightA);
>   coocurrence.set(column.get(), valueB, valueA);
>  }
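>
> (Concretely, with hypothetical values: say rowA=7, rowB=3, valueA=0.9,
> valueB=0.2. The 0.5 code emits the pair as (3,7) with the weights
> swapped, but still emits the values as (0.9, 0.2), so each value ends
> up attached to the wrong row. The fix swaps the values together with
> the rows.)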
>
> Maybe you could rerun your test with the current trunk?
>
> --sebastian
>
> On 14.06.2011 20:54, Sean Owen wrote:
>>
>> It is a similarity, not a distance. Higher values mean more
>> similarity, not less.
>>
>> I agree that similarity ought to decrease with more dimensions. That
>> is what you observe -- except that you see quite high average
>> similarity with no dimension reduction!
>>
>> An average cosine similarity of 0.87 sounds "high" to me for anything
>> but a few dimensions. What's the dimensionality of the input without
>> dimension reduction?
>>
>> Something is amiss in this pipeline. It is an interesting question!
>>
>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<st...@wienert.cc>  wrote:
>>>
>>> Actually I'm using RowSimilarityJob() with
>>> --input input
>>> --output output
>>> --numberOfColumns documentCount
>>> --maxSimilaritiesPerRow documentCount
>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>
>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>> calculates...
>>> the source says: "distributed implementation of cosine similarity that
>>> does not center its data"
>>>
>>> So... this seems to be the similarity and not the distance?
>>>
>>> Cheers,
>>> Stefan
>>>
>>>
>>>
>>> 2011/6/14 Stefan Wienert<st...@wienert.cc>:
>>>>
>>>> but... why do I get such different results with cosine similarity with
>>>> no dimension reduction (100,000 dimensions)?
>>>>
>>>> 2011/6/14 Fernando Fernández<fe...@gmail.com>:
>>>>>
>>>>> Actually that's what your results are showing, aren't they? With rank
>>>>> 1000
>>>>> the similarity avg is the lowest...
>>>>>
>>>>>
>>>>> 2011/6/14 Jake Mannix<ja...@gmail.com>
>>>>>
>>>>>> actually, wait - are your graphs showing *similarity*, or *distance*?
>>>>>> In higher dimensions, *distance* (and the angle between vectors)
>>>>>> should grow, but on the other hand, *similarity* (cos(angle)) should
>>>>>> go toward 0.



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
one last question: for cosine similarity, sometimes the results are
negative (which means the angle between the vectors is greater than 90°).
But what does this mean for the similarity?

Cheers,
Stefan

2011/6/14 Stefan Wienert <st...@wienert.cc>:
> So... let's check the dimensions:
>
> First step: Lucene output:
> 227 rows (=docs) and 107909 cols (=terms)
>
> transposed to:
> 107909 rows and 227 cols
>
> reduced with svd (rank 100) to:
> 99 rows and 227 cols
>
> transposed to (actually there was a bug here, with no effect on the SVD
> result but on the NONE result):
> 227 rows and 99 cols
>
> So... now the cosine results are very similar to SVD 200.
>
> Results are added.
>
> @Sebastian: I will check if the bug affects my results.
>
> 2011/6/14 Fernando Fernández <fe...@gmail.com>:
>> Hi Stefan,
>>
>> Are you sure you need to transpose the input matrix? I thought that what
>> you get from the Lucene index was already a document(rows)-term(columns)
>> matrix, but you say that you obtain a term-document matrix and transpose
>> it. Is this correct? What are you using to obtain this matrix from
>> Lucene? Is it possible that you are calculating similarities with the
>> wrong matrix in one of the two cases? (With/without dimension reduction.)
>>
>> Best,
>> Fernando.



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: tf-idf + svd + cosine similarity

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
That actually looks more like it. Not so many documents are similar to a
randomly picked one.


On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <st...@wienert.cc> wrote:
> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see
> http://the-lord.de/img/beispielwerte.pdf
> for better results.
>
> First... U or V are the singular values not the eigenvectors ;)
>
> Lanczos-SVD in mahout is computing the eigenvectors of M*M (it
> multiplies the input matrix with the transposed one)
>
> As a fact, I don't need U, just V, so I need to transpose M (because
> the eigenvectors of MM* = V).
>
> So... normalizing the eigenvectors: Is the cosine similarity not doing
> this? or ignoring the length of the vectors?
> http://en.wikipedia.org/wiki/Cosine_similarity
>
> my parameters for ssvd:
> --rank 100
> --oversampling 10
> --blockHeight 227
> --computeU false
> --input
> --output
>
> the rest should be on default.
>
> actually I do not really know what this oversampling parameter means...

Re: tf-idf + svd + cosine similarity

Posted by Ted Dunning <te...@gmail.com>.
I have been following this thread intermittently.

Some folks have said that higher-dimensional SVDs should change the
distribution of distances.

Actually, that isn't quite true.  SVD preserves dot products as much as
possible.  With lower-dimensional projections you lose some information,
but as the singular values decline, you lose less and less information.

It *is* however true that *random* unit vectors in higher dimensions have
a dot product that is more and more tightly clustered around zero (a
quick empirical check follows below).  This is entirely different from
the case we are talking about, where you have real data projected down
into a lower-dimensional space.
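
A self-contained sketch of that last point, with made-up trial counts:

  import java.util.Random;

  public class RandomCosines {
    public static void main(String[] args) {
      Random rnd = new Random(42);
      for (int d : new int[] {10, 100, 1000, 10000}) {
        int trials = 1000;
        double sumAbsCos = 0;
        for (int t = 0; t < trials; t++) {
          // two random Gaussian vectors; dividing by the norms makes
          // this the cosine of two random directions in dimension d
          double dot = 0, na = 0, nb = 0;
          for (int i = 0; i < d; i++) {
            double x = rnd.nextGaussian(), y = rnd.nextGaussian();
            dot += x * y; na += x * x; nb += y * y;
          }
          sumAbsCos += Math.abs(dot / Math.sqrt(na * nb));
        }
        System.out.println("d=" + d + "  mean |cos| ~= " + sumAbsCos / trials);
      }
    }
  }

The mean |cos| shrinks roughly like 1/sqrt(d): that is the clustering
around zero for random vectors, which real projected data need not show.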


Re: tf-idf + svd + cosine similarity

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Jun 15, 2011 at 10:06 AM, Stefan Wienert <st...@wienert.cc> wrote:

> Hmm. It seems I have plenty of negative results (nearly half of the
> similarities). I could add +0.3 so that the largest negative results are
> near 0, but this is not optimal...
> I could also project the results to [0..1].
>

Looking for *dissimilar* results seems odd.  What are you trying to do?

What people normally do is look for clusters of similar documents, or
just the top-N most similar documents to each document.  In both of these
cases, you don't care about the documents whose similarity to anyone is
zero, or less than zero.

  -jake


> Any other suggestions or comments?
>
> Cheers
> Stefan
>

Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
Hmm. It seems I have plenty of negative results (nearly half of the
similarities). I could add +0.3 so that the largest negative results are
near 0, but this is not optimal...
I could also project the results to [0..1].
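(For the projection: s' = (1 + s) / 2 would map cosine values from
[-1..1] linearly onto [0..1].)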
Any other suggestions or comments?

Cheers
Stefan

2011/6/15 Jake Mannix <ja...@gmail.com>:
> While your original vectors never had similarity less than zero, after
> projection onto the SVD space, you may "project away" similarities
> between two vectors, and they are now negatively correlated in this
> space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector
> space spanned by (1,-1,0): they go from having similarity +1/2
> to similarity -1).
>
> I always interpret all similarities <= 0 as "maximally dissimilar",
> even if technically -1 is where this is exactly true.
>
>  -jake



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: tf-idf + svd + cosine similarity

Posted by Jake Mannix <ja...@gmail.com>.
While your original vectors never had similarity less than zero, after
projection onto the SVD space, you may "project away" similarities
between two vectors, and they are now negatively correlated in this
space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector
space spanned by (1,-1,0): they go from having similarity +1/2
to similarity -1).

I always interpret all similarities <= 0 as "maximally dissimilar",
even if technically -1 is where this is exactly true.
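
Checking that example numerically, as a standalone sketch:

  // a = (1,0,1), b = (0,1,1), u = (1,-1,0)
  double cosBefore = 1.0 / (Math.sqrt(2) * Math.sqrt(2));       // dot(a,b) = 1, so cos = +1/2
  double ca = (1*1 + 0*(-1) + 1*0) / Math.sqrt(2);              // coordinate of a along u: +1/sqrt(2)
  double cb = (0*1 + 1*(-1) + 1*0) / Math.sqrt(2);              // coordinate of b along u: -1/sqrt(2)
  double cosAfter = (ca * cb) / (Math.abs(ca) * Math.abs(cb));  // opposite signs in 1-d, so cos = -1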

  -jake

On Wed, Jun 15, 2011 at 2:10 AM, Stefan Wienert <st...@wienert.cc> wrote:

> Ignoring them is not an option... so I have to interpret these values.
> Can one say that documents with similarity = -1 are the least similar
> documents? I don't think this is right.
> Any other interpretations?

Re: tf-idf + svd + cosine similarity

Posted by Fernando Fernández <fe...@gmail.com>.
I think that LanczosSolver produces negative values as well; I don't know
about SSVD.

I guess that if the similarity has a high negative value, you can say
that the documents talk about things that almost never appear together in
the same text (if term A appears, then term B won't appear), but I think
this is almost impossible in practice (at least the most extreme case,
with similarity=-1), as there are always common expressions that appear
in many documents. I think that's why avg(similarity) is always above 0
in your case.

2011/6/15 Sean Owen <sr...@gmail.com>

> The features all take on non-negative values here, right?
> Then the cosine can't be negative.
>
> In another context, where features could be negative, cosine could
> indeed be negative. -1 means most dissimilar of all -- the feature
> vectors are exactly opposed.
>
> On Wed, Jun 15, 2011 at 10:10 AM, Stefan Wienert <st...@wienert.cc>
> wrote:
> > Ignoring is no option... so I have to interpret these values.
> > Can one say that documents with similarity = -1 are the less similar
> > documents? I don't think this is right.
> > Any other assumptions?
>

Re: tf-idf + svd + cosine similarity

Posted by Sean Owen <sr...@gmail.com>.
The features all take on non-negative values here, right?
Then the cosine can't be negative.

In another context, where features could be negative, cosine could
indeed be negative. -1 means most dissimilar of all -- the feature
vectors are exactly opposed.

On Wed, Jun 15, 2011 at 10:10 AM, Stefan Wienert <st...@wienert.cc> wrote:
> Ignoring them is not an option... so I have to interpret these values.
> Can one say that documents with similarity = -1 are the least similar
> documents? I don't think this is right.
> Any other interpretations?

Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
Ignoring them is not an option... so I have to interpret these values.
Can one say that documents with similarity = -1 are the least similar
documents? I don't think this is right.
Any other interpretations?

2011/6/15 Fernando Fernández <fe...@gmail.com>:
> One question that I think has not been answered yet is that of the
> negative similarities. In the literature you can find that similarity=-1
> means that "documents talk about opposite topics", but I think this is a
> quite abstract idea... I just ignore them; when I'm trying to find top-k
> similar documents these surely won't be useful. I read recently that
> this has to do with the assumptions in SVD, which is designed for normal
> distributions (this implies the possibility of negative values). There
> are other techniques (non-negative factorization) that try to solve
> this. I don't know if there's something in Mahout about this.
>
> Best,
>
> Fernando.



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: tf-idf + svd + cosine similarity

Posted by Fernando Fernández <fe...@gmail.com>.
One question that I think has not been answered yet is that of the
negative similarities. In the literature you can find that similarity=-1
means that "documents talk about opposite topics", but I think this is a
quite abstract idea... I just ignore them; when I'm trying to find top-k
similar documents these surely won't be useful. I read recently that this
has to do with the assumptions in SVD, which is designed for normal
distributions (this implies the possibility of negative values). There
are other techniques (non-negative factorization) that try to solve this.
I don't know if there's something in Mahout about this.

Best,

Fernando.

2011/6/15 Ted Dunning <te...@gmail.com>

> The normal terminology is to name U and V in SVD as "singular vectors" as
> opposed to eigenvectors.  The term eigenvectors is normally reserved for
> the
> symmetric case of U S U'  (more generally, the Hermitian case, but we only
> support real values).
>
> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
>
> > I beg to differ... U and V are left and right eigenvectors, and
> > singular values is denoted as Sigma (which is a square root of eigen
> > values of the AA' as you correctly pointed out) .
> >
>

Re: tf-idf + svd + cosine similarity

Posted by Ted Dunning <te...@gmail.com>.
The normal terminology is to name U and V in SVD as "singular vectors" as
opposed to eigenvectors.  The term eigenvectors is normally reserved for the
symmetric case of U S U'  (more generally, the Hermitian case, but we only
support real values).
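
(In symbols: if A = U Sigma V', then A A' = U Sigma^2 U' and
A' A = V Sigma^2 V', so the columns of U and V are eigenvectors of A A'
and A' A respectively, and the singular values in Sigma are the square
roots of those eigenvalues. That is why both usages show up in this
thread.)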

On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> I beg to differ... U and V are left and right eigenvectors, and
> singular values is denoted as Sigma (which is a square root of eigen
> values of the AA' as you correctly pointed out) .
>

Re: tf-idf + svd + cosine similarity

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
thanks, Jake.

On Tue, Jun 14, 2011 at 4:09 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 3:35 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> Normalization means that the second norm of the columns of the
>> eigenvector matrix (i.e. all columns) is 1. In classic SVD A=U*Sigma*V',
>> even if it is a thin one, U and V are orthonormal. I might be wrong, but
>> I was under the impression that I saw some discussion saying the Lanczos
>> singular vector matrix is not necessarily orthonormal (although the
>> columns do form an orthogonal basis).
>>
>
> LanczosSolver normalizes the singular vectors (LanczosSolver.java, line
> 162), and yes, it returns V, not U: U is documents x latent factors (so
> it gives the projection of each input document onto the reduced basis),
> and V is latent factors x terms (its rows show which terms each latent
> factor is made up of). The Lanczos solver doesn't keep track of documents
> (partly for scalability: documents can be thought of as "training" your
> latent factor model), but instead returns the latent-factor-by-term
> "model": V.
>

Re: tf-idf + svd + cosine similarity

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jun 14, 2011 at 3:35 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> Normalization means that the second norm of the columns of the
> eigenvector matrix (i.e. all columns) is 1. In classic SVD A=U*Sigma*V',
> even if it is a thin one, U and V are orthonormal. I might be wrong, but
> I was under the impression that I saw some discussion saying the Lanczos
> singular vector matrix is not necessarily orthonormal (although the
> columns do form an orthogonal basis).
>

LanczosSolver normalizes the singular vectors (LanczosSolver.java, line
162), and yes, it returns V, not U: U is documents x latent factors (so
it gives the projection of each input document onto the reduced basis),
and V is latent factors x terms (its rows show which terms each latent
factor is made up of). The Lanczos solver doesn't keep track of documents
(partly for scalability: documents can be thought of as "training" your
latent factor model), but instead returns the latent-factor-by-term
"model": V.
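
So to place documents in the reduced space given only V, you can multiply
each document's tf-idf vector through V. A hypothetical in-memory sketch
with made-up numbers, assuming mahout-math and V stored as latent factors
x terms:

  import org.apache.mahout.math.DenseMatrix;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.Vector;

  Matrix v = new DenseMatrix(new double[][] {   // 2 latent factors x 3 terms
      {0.1, 0.7, 0.2},
      {0.6, 0.0, 0.4}});
  Vector doc = new DenseVector(new double[] {1.5, 0.0, 2.0});  // 3-term tf-idf vector
  Vector reduced = v.times(doc);  // the document in the 2-d latent space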

  -jake

Re: tf-idf + svd + cosine similarity

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I beg to differ... U and V are the left and right eigenvectors, and the
singular values are denoted Sigma (which is the square root of the
eigenvalues of AA', as you correctly pointed out).

Yes, so I figured Lanczos must be doing V (otherwise your dimensions
wouldn't match). Also, I guess "eigenvector" implies the right ones, not
the left ones, by default.

Normalization means that the second norm of the columns of the
eigenvector matrix (i.e. all columns) is 1. In classic SVD A=U*Sigma*V',
even if it is a thin one, U and V are orthonormal. I might be wrong, but
I was under the impression that I saw some discussion saying the Lanczos
singular vector matrix is not necessarily orthonormal (although the
columns do form an orthogonal basis).

Anyway, I know for sure that SSVD gives the option to rotate into both
the eigenspace and the space scaled by the square roots of the
eigenvalues. The latter allows a single space for row items and column
items and enables similarity measures among them.

The oversampling parameter is the -p parameter you give to SSVD (didn't
you give it?). What was your command line for SSVD?

Basically it means that for a rank-10 thin SVD you give something like
k=10, p=90, which means the algorithm actually computes a 100-dimensional
random projection, computes an SVD on it (or rather, an eigendecomposition
of BB'), and then throws away the extra 90 singular values and 90 latent
factors from the result; a sketch of the idea follows below.
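
A rough in-memory sketch of that rank-k, oversampling-p idea, assuming
mahout-math's JAMA-style classes and made-up sizes (an illustration only,
not the actual distributed SSVD code):

  import java.util.Random;
  import org.apache.mahout.math.DenseMatrix;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.QRDecomposition;
  import org.apache.mahout.math.SingularValueDecomposition;

  Random rnd = new Random(1234);
  int m = 500, n = 300, k = 10, p = 90;       // made-up sizes for illustration
  Matrix a = new DenseMatrix(m, n);           // stand-in for the real input
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      a.set(i, j, rnd.nextDouble());

  Matrix omega = new DenseMatrix(n, k + p);   // k+p random projection directions
  for (int i = 0; i < n; i++)
    for (int j = 0; j < k + p; j++)
      omega.set(i, j, rnd.nextGaussian());

  Matrix y = a.times(omega);                  // m x (k+p) random projection of A
  Matrix q = new QRDecomposition(y).getQ();   // orthonormal basis for it
  Matrix b = q.transpose().times(a);          // the (k+p) x n "B"
  SingularValueDecomposition svd = new SingularValueDecomposition(b);
  // keep only the first k singular values / vectors of B; the extra p are
  // there just to make those first k accurate, and are thrown away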


On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <st...@wienert.cc> wrote:
> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see
> http://the-lord.de/img/beispielwerte.pdf
> for better results.
>
> First... U or V are the singular values not the eigenvectors ;)
>
> Lanczos-SVD in mahout is computing the eigenvectors of M*M (it
> multiplies the input matrix with the transposed one)
>
> As a fact, I don't need U, just V, so I need to transpose M (because
> the eigenvectors of MM* = V).
>
> So... normalizing the eigenvectors: Is the cosine similarity not doing
> this? or ignoring the length of the vectors?
> http://en.wikipedia.org/wiki/Cosine_similarity
>
> my parameters for ssvd:
> --rank 100
> --oversampling 10
> --blockHeight 227
> --computeU false
> --input
> --output
>
> the rest should be on default.
>
> actually I do not really know what this oversampling parameter means...

Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see
http://the-lord.de/img/beispielwerte.pdf
for better results.

First... U and V are the singular vectors, not the eigenvectors ;)

Lanczos SVD in Mahout is computing the eigenvectors of M*M (it
multiplies the input matrix with the transposed one).

In fact, I don't need U, just V, so I need to transpose M first
(because the eigenvectors of M*M = V).
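
(To keep the bookkeeping straight: with M = U S V^T, the standard
identities are

  M*M = V S^2 V^T    and    M M* = U S^2 U^T

so the eigenvectors of M*M are the right singular vectors V, and the
eigenvectors of M M* are the left singular vectors U -- transposing
the input flips which side you get.)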

So... normalizing the eigenvectors: isn't the cosine similarity doing
this anyway, i.e. ignoring the length of the vectors?
http://en.wikipedia.org/wiki/Cosine_similarity
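
(To convince myself that cosine already ignores vector length, a quick
toy check -- plain Java, my own helper, not the Mahout implementation:

  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];      // numerator: a . b
      normA += a[i] * a[i];    // |a|^2
      normB += b[i] * b[i];    // |b|^2
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  double[] a = {1, 2, 3};
  double[] b = {2, 4, 6};  // a scaled by 2
  cosine(a, a);            // 1.0
  cosine(a, b);            // also 1.0 -- the scaling cancels in the ratio

so normalizing each vector should not change the cosine scores; only a
per-dimension re-weighting, e.g. by the singular values, can.)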

my parameters for ssvd:
--rank 100
--oversampling 10
--blockHeight 227
--computeU false
--input
--output

the rest should be on default.

Actually, I do not really know what this oversampling parameter means...


Re: tf-idf + svd + cosine similarity

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Still, indeed, I am perplexed by the number of documents in the SVD
results that are >0.9 in cosine terms. That basically means they should
all have almost identical sets of infrequent words, which from your
graph looks like almost 20-30% of the documents.


Re: tf-idf + svd + cosine similarity

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Interesting.

(I have one confusion RE: Lanczos -- is it computing the U
eigenvectors or V? The doc says "eigenvectors" but doesn't say left or
right. If it's V (the right eigenvectors), this sequence should be fine.)

With SSVD I don't do a transpose, I just do the computation of U, which
will produce the document singular vectors directly.

Also, I am not sure that Lanczos actually normalizes the eigenvectors,
but SSVD does (or multiplies the normalized version by the square root
of the singular value, whichever is requested). So depending on which
space you rotate the results into, cosine similarities may be different.
I assume you used the normalized (true) eigenvectors from SSVD.
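
(A toy illustration of why that scaling matters -- made-up numbers,
plain Java arrays, not the SSVD code: scale each column j of U by
sqrt(sigma_j) and the cosine between two rows changes:

  double[] u1 = {0.6, 0.8};              // row 1 of U, unit length
  double[] u2 = {0.8, 0.6};              // row 2 of U, unit length
  // cosine(u1, u2) = 0.48 + 0.48 = 0.96
  // now with made-up singular values sigma = {9, 1}, i.e. sqrt = {3, 1}:
  double[] s1 = {0.6 * 3, 0.8 * 1};      // {1.8, 0.8}
  double[] s2 = {0.8 * 3, 0.6 * 1};      // {2.4, 0.6}
  // cosine(s1, s2) = 4.8 / (1.970 * 2.474) ~= 0.985

same two documents, different cosine, depending on which output you take.)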

It would also be interesting to know what oversampling parameter (p) you used.

Thanks.
-d



Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
So... let's check the dimensions:

First step: Lucene Output:
227 rows (=docs) and 107909 cols (=terms)

transposed to:
107909 rows and 227 cols

reduced with svd (rank 100) to:
99 rows and 227 cols

transposed to (actually there was a bug here, with no effect on the SVD
result but with an effect on the NONE result):
227 rows and 99 cols

So... now the cosine results are very similar to SVD 200.

Results are added.

@Sebastian: I will check if the bug affects my results.


Re: tf-idf + svd + cosine similarity

Posted by Fernando Fernández <fe...@gmail.com>.
Hi Stefan,

Are you sure you need to transpose the input matrix? I thought that what
you get from the Lucene index was already a document(rows)-term(columns)
matrix, but you say that you obtain a term-document matrix and transpose
it. Is this correct? What are you using to obtain this matrix from
Lucene? Is it possible that you are calculating similarities with the
wrong matrix in one of the two cases (with/without dimension reduction)?

Best,
Fernando.


Re: tf-idf + svd + cosine similarity

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Stefan,

I checked the implementation of RowSimilarityJob and we might still have 
a bug in the 0.5 release... (f**k). I don't know if your problem is 
caused by that, but the similarity scores might not be correct...

We had this issue in 0.4 already, when someone realized that 
cooccurrences were mapped out inconsistently, so for 0.5 we made sure 
that we always map the smaller row as first value. But apparently I did 
not adjust the value setting for the Cooccurrence object...

In 0.5 the code is:

  if (rowA <= rowB) {
    rowPair.set(rowA, rowB, weightA, weightB);
  } else {
    rowPair.set(rowB, rowA, weightB, weightA);
  }
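  // bug: valueA/valueB are not swapped along with the rows above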
  coocurrence.set(column.get(), valueA, valueB);

But it should be (already fixed in current trunk some days ago):

  if (rowA <= rowB) {
    rowPair.set(rowA, rowB, weightA, weightB);
    coocurrence.set(column.get(), valueA, valueB);
  } else {
    rowPair.set(rowB, rowA, weightB, weightA);
    coocurrence.set(column.get(), valueB, valueA);
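    // the values are now swapped together with the rows, so they stay aligned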
  }

Maybe you could rerun your test with the current trunk?

--sebastian


Re: tf-idf + svd + cosine similarity

Posted by Sean Owen <sr...@gmail.com>.
It is a similarity, not a distance. Higher values mean more
similarity, not less.

I agree that similarity ought to decrease with more dimensions. That
is what you observe -- except that you see quite high average
similarity with no dimension reduction!

An average cosine similarity of 0.87 sounds "high" to me for anything
but a few dimensions. What's the dimensionality of the input without
dimension reduction?

Something is amiss in this pipeline. It is an interesting question!


Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
Actually I'm using RowSimilarityJob() with
--input input
--output output
--numberOfColumns documentCount
--maxSimilaritiesPerRow documentCount
--similarityClassname SIMILARITY_UNCENTERED_COSINE

Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
calculates...
the source says: "distributed implementation of cosine similarity that
does not center its data"

So... this seems to be the similarity and not the distance?
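
(My reading of "does not center its data", as a toy contrast -- plain
Java values, not the actual Mahout code; centering would subtract each
vector's mean before taking the cosine, Pearson-style:

  double[] a = {1, 2, 3};
  double[] b = {2, 3, 4};
  // uncentered cosine: (2 + 6 + 12) / (sqrt(14) * sqrt(29)) ~= 0.993
  // centered: a - mean(a) = {-1, 0, 1}, b - mean(b) = {-1, 0, 1}
  //           => cosine = 1.0

so SIMILARITY_UNCENTERED_COSINE should just be the plain cosine of the
raw vectors.)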

Cheers,
Stefan




Re: tf-idf + svd + cosine similarity

Posted by Stefan Wienert <st...@wienert.cc>.
but... why do I get different results with cosine similarity with
no dimension reduction (with 100,000 dimensions)?


Re: tf-idf + svd + cosine similarity

Posted by Fernando Fernández <fe...@gmail.com>.
Actually that's what your results are showing, aren't they? With rank 1000
the similarity avg is the lowest...



Re: tf-idf + svd + cosine similarity

Posted by Jake Mannix <ja...@gmail.com>.
actually, wait - are your graphs showing *similarity*, or *distance*?
In higher dimensions, *distance* (and the angle itself) should grow,
but on the other hand, *similarity* (cos(angle)) should go toward 0.


Re: tf-idf + svd + cosine similarity

Posted by Jake Mannix <ja...@gmail.com>.
You are running into "the curse of dimensionality".  The higher the
dimension you are in, the further apart (random) vectors are.
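
A quick, self-contained simulation illustrates the effect (plain Java;
Gaussian random vectors stand in for the reduced document vectors, and the
dimensions and pair count are arbitrary choices):

  import java.util.Random;

  // The average |cosine similarity| of random vector pairs shrinks as the
  // dimension grows: random high-dimensional vectors are nearly orthogonal.
  public class CurseOfDimensionality {
    static double cosine(double[] a, double[] b) {
      double dot = 0.0, na = 0.0, nb = 0.0;
      for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / Math.sqrt(na * nb);
    }

    static double[] randomVector(Random rnd, int dim) {
      double[] v = new double[dim];
      for (int i = 0; i < dim; i++) {
        v[i] = rnd.nextGaussian();
      }
      return v;
    }

    public static void main(String[] args) {
      Random rnd = new Random(42);
      int pairs = 1000;
      for (int dim : new int[] {10, 100, 200, 1000}) {
        double sum = 0.0;
        for (int p = 0; p < pairs; p++) {
          sum += Math.abs(cosine(randomVector(rnd, dim), randomVector(rnd, dim)));
        }
        System.out.println("dim " + dim + ": avg |cos| = " + (sum / pairs));
      }
    }
  }

The average falls roughly like 1/sqrt(dim), a similar downward trend to the
averages you posted.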

What you should do to compare quality is to find the documents that you can
manually label as being "very similar" to document #1, and then see what
rank they get in a list of "most similar to document 1" under each of the
various similarity metrics you've produced.  The metric that ranks the
"known similar" documents highest *relative to the rest of the
documents* will be the one you think is best.
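
A minimal sketch of that evaluation (plain Java; the scores and labeled
indices below are hypothetical placeholders for your own data):

  // Rank-based check: a document's rank is 1 plus the number of documents
  // that score higher under the given metric. A lower average rank over the
  // hand-labeled "known similar" documents means a better metric.
  public class RankEval {
    static double averageRank(double[] simToDoc1, int[] knownSimilarIds) {
      double total = 0.0;
      for (int target : knownSimilarIds) {
        int rank = 1;
        for (double score : simToDoc1) {
          if (score > simToDoc1[target]) {
            rank++; // another document outranks the known-similar one
          }
        }
        total += rank;
      }
      return total / knownSimilarIds.length;
    }

    public static void main(String[] args) {
      // Made-up similarity-to-document-1 scores for four other documents,
      // and the indices we hand-labeled as truly similar.
      double[] scores = {0.9, 0.1, 0.8, 0.05};
      int[] knownSimilar = {0, 2};
      System.out.println("avg rank = " + averageRank(scores, knownSimilar));
    }
  }

Run it once per metric (no reduction, SVD rank 10, 100, ...) and compare the
average ranks rather than the raw similarity values.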

  -jake
