Posted to dev@mahout.apache.org by "Dan Filimon (JIRA)" <ji...@apache.org> on 2013/06/13 17:09:22 UTC

[jira] [Created] (MAHOUT-1261) TasteHadoopUtils.idToIndex can return an int that has size Integer.MAX_VALUE

Dan Filimon created MAHOUT-1261:
-----------------------------------

             Summary: TasteHadoopUtils.idToIndex can return an int that has size Integer.MAX_VALUE
                 Key: MAHOUT-1261
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1261
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
    Affects Versions: 0.8
            Reporter: Dan Filimon
            Assignee: Sean Owen
            Priority: Minor


I'm running ItemSimilarityJob on a very large (~600M by 4B) matrix that's very sparse (total set of associations is 630MB).

The job fails because of an IndexException in ToUserVectorsReducer.
TasteHadoopUtils.idToIndex(long id) hashes a long with:
0x7fffffff & Longs.hashCode(id) (line o.a.m.cf.taste.hadoop.TasteHadoopUtils:57).

For some id (I don't know which value), the result returned is Integer.MAX_VALUE.
That index cannot be set in the userVector: the vector's cardinality is also Integer.MAX_VALUE, so its valid indices stop at Integer.MAX_VALUE - 1, and the set call throws the exception.

So the issue is that idToIndex returns values from 0 to Integer.MAX_VALUE inclusive, but the vector only has entries 0 to Integer.MAX_VALUE - 1.
It's a nasty little off-by-one bug.
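
The off-by-one can be reproduced in isolation with a small sketch. It assumes the hashing expression quoted above and uses the JDK's Long.hashCode as a stand-in for Guava's Longs.hashCode (both compute (int) (id ^ (id >>> 32))). The class name and the example id are illustrative and are not taken from the failing job.

```java
public class IdToIndexDemo {

    // Mirrors the expression quoted in the report. Long.hashCode is used here
    // as a stand-in for Guava's Longs.hashCode (an assumption made to keep the
    // sketch free of external dependencies).
    static int idToIndex(long id) {
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    public static void main(String[] args) {
        // A vector with this cardinality accepts indices 0 .. cardinality - 1.
        int cardinality = Integer.MAX_VALUE;

        // Illustrative id: the high 32 bits are zero and the low 32 bits are
        // 0x7FFFFFFF, so the hash is 0x7FFFFFFF and the mask leaves it unchanged.
        long id = 2147483647L;
        int index = idToIndex(id);

        System.out.println(index == Integer.MAX_VALUE); // prints true
        System.out.println(index < cardinality);        // prints false: out of range
    }
}
```

Any id whose folded hash has all of the low 31 bits set behaves the same way, for example 4294967295L.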

I'm thinking of just taking the index modulo the vector size (index % size) when setting.
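
A minimal sketch of that idea, not the actual patch: the class name, the two-argument signature, and the use of the JDK's Long.hashCode in place of Guava's Longs.hashCode are all assumptions made for illustration.

```java
public class BoundedIdToIndex {

    // Hypothetical variant: hash as before, then fold the result into
    // [0, size) so it is always a valid index for a vector of that size.
    static int idToIndex(long id, int size) {
        int hashed = 0x7FFFFFFF & Long.hashCode(id); // non-negative, 0 .. Integer.MAX_VALUE
        return hashed % size;                        // 0 .. size - 1
    }

    public static void main(String[] args) {
        int size = Integer.MAX_VALUE;

        // The id that previously mapped to Integer.MAX_VALUE now wraps to 0.
        System.out.println(idToIndex(2147483647L, size)); // prints 0

        // Ids whose hash is already in range are unaffected.
        System.out.println(idToIndex(42L, size));         // prints 42
    }
}
```

One consequence of this choice is that ids hashing to Integer.MAX_VALUE now share index 0 with ids hashing to 0. The hash already maps many ids onto one index, so this only adds one more collision class.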

[~ssc] & everyone else, thoughts? :)
