Posted to commits@mahout.apache.org by co...@apache.org on 2013/03/29 22:29:00 UTC
[CONF] Apache Mahout > Collection(De-)Serialization
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collection(De-)Serialization (https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization)
Comment: https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization?focusedCommentId=30759975#comment-30759975
Comment added by Andrew McMurry:
---------------------------------------------------------------------
Many "Big Data" datasets are very sparse.
Health data: most patients don't have most diseases.
NLP data: most documents don't contain most of the 15k+ common words.
Graph data: most graphs are far from fully connected.
and so on...
Far fewer datasets are truly dense.
Images come to mind, but that domain is already well addressed by OpenCV.
Proposal: adopt a serialization/deserialization strategy that supports sparse matrix representations, which are efficient in both memory and computation. Here is an example of a sparse matrix implementation that is used very heavily for ML tasks:
http://www.mathworks.com/help/matlab/ref/sparse.html
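To make the memory argument concrete, here is a minimal sketch in Java of what sparse serialization buys you: only (index, value) pairs for non-zero entries are written, rather than every entry densely. The class and method names are illustrative only, not Mahout's actual Writable API, and the format (length, non-zero count, then index/value pairs) is a hypothetical coordinate encoding.

```java
import java.io.*;
import java.util.*;

// Illustrative sketch (not Mahout's API): serialize a vector in sparse
// coordinate form -- logical length, non-zero count, then one
// (index, value) pair per non-zero entry.
public class SparseVectorDemo {
    static byte[] writeSparse(double[] v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int nnz = 0;
        for (double x : v) if (x != 0.0) nnz++;
        out.writeInt(v.length);   // logical size of the vector
        out.writeInt(nnz);        // number of non-zero entries
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {
                out.writeInt(i);      // 4 bytes: position
                out.writeDouble(v[i]); // 8 bytes: value
            }
        }
        return bos.toByteArray();
    }

    static double[] readSparse(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        double[] v = new double[in.readInt()];
        int nnz = in.readInt();
        for (int k = 0; k < nnz; k++) {
            v[in.readInt()] = in.readDouble();
        }
        return v;
    }

    public static void main(String[] args) throws IOException {
        // A length-10000 vector with only two non-zeros, as in NLP
        // term vectors or patient/disease indicator vectors.
        double[] v = new double[10000];
        v[3] = 1.5;
        v[9999] = -2.0;
        byte[] sparse = writeSparse(v);
        System.out.println("sparse bytes: " + sparse.length);  // 4 + 4 + 2*(4+8) = 32
        System.out.println("dense bytes:  " + (8 * v.length)); // 80000
        System.out.println("round-trip ok: " + Arrays.equals(v, readSparse(sparse)));
    }
}
```

At 99.98% sparsity the coordinate encoding here is roughly 2500x smaller than a dense dump; the crossover point is around one-third non-zeros, since each stored entry costs 12 bytes instead of 8.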