Posted to user@mahout.apache.org by djn <de...@gmail.com> on 2011/06/04 23:21:53 UTC

ItemSimilarityJob Cooccurrence Question

Regarding ItemSimilarityJob, it is my understanding that if there are two
input lines of the form <user1, product1> and <user1, product2>,
then that would constitute a co-occurrence between product1 and product2.
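
To make that assumption concrete, here is a minimal Python sketch of the
co-occurrence definition described above. It is not Mahout's implementation,
only an illustration: the function name cooccurrence_counts and the
comma-separated "user,item" line format are assumptions made for the example,
and IDs are treated as plain strings.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(lines):
    """Count, for each unordered item pair, how many users have both items.

    Assumes each line looks like "user,item" (hypothetical format for this
    sketch); any extra fields such as a preference value are ignored.
    """
    items_by_user = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item = line.split(",")[:2]
        items_by_user[user.strip()].add(item.strip())

    counts = defaultdict(int)
    for items in items_by_user.values():
        # Every unordered pair of distinct items seen for one user
        # counts as one co-occurrence.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)


sample = [
    "user1,product1",
    "user1,product2",
    "user2,product1",
    "user2,product2",
    "user3,product3",
]
counts = cooccurrence_counts(sample)
print(counts)
```

Under this definition, user3 contributes nothing because a single item
cannot form a pair, and product3 should never appear in the result.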

I've generated a large test dataset under this assumption, and it guarantees
that there will only be co-occurrences between pairs of product IDs that
I've predefined. I'm not using preference values and I'm setting
--booleanData true.

While the ItemSimilarityJob's output does include these predefined
co-occurrences, it also outputs a large number of co-occurrences (with small
co-occurrence counts) between products that are not co-occurring in the
input dataset. Can anyone provide some insight as to why this might be
happening?
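
One way to pin down the unexpected pairs is to compare the job's output
against the predefined pairs. The sketch below assumes the output has been
exported as text lines of the form "itemA itemB value" separated by
whitespace; that layout, and the helper name unexpected_pairs, are
assumptions for illustration and should be adjusted to the actual output.

```python
def unexpected_pairs(expected_pairs, output_lines):
    """Return (itemA, itemB, value) rows whose pair is not in expected_pairs.

    Pairs are compared without regard to order. The three-column,
    whitespace-separated line format is an assumption of this sketch.
    """
    expected = {tuple(sorted(pair)) for pair in expected_pairs}
    unexpected = []
    for line in output_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        a, b, value = fields[0], fields[1], float(fields[2])
        if tuple(sorted((a, b))) not in expected:
            unexpected.append((a, b, value))
    return unexpected


expected = [("101", "102")]
output = ["101\t102\t5.0", "102\t101\t5.0", "101\t205\t1.0"]
extra = unexpected_pairs(expected, output)
print(extra)
```

The rows it returns identify which item IDs to look up in the input, which
makes it easier to put together a small sample that reproduces the problem.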


Re: ItemSimilarityJob Cooccurrence Question

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Derek,

This shouldn't be happening; we have unit tests that explicitly check
for it.

Which version are you using? Please make sure you are on Mahout 0.5 or the
current trunk. Could you provide sample data where you see this happening?

--sebastian
