Posted to commits@mahout.apache.org by co...@apache.org on 2013/03/29 22:29:00 UTC
[CONF] Apache Mahout > Collection(De-)Serialization
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collection(De-)Serialization (https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization)
Comment: https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization?focusedCommentId=30759975#comment-30759975
Comment added by Andrew McMurry:
---------------------------------------------------------------------
Many "Big Data" datasets are very sparse.
Health data: most patients don't have most diseases.
NLP data: most documents don't contain most of the 15k+ common words.
Graph data: most graphs are far from fully connected.
and so on...
Far fewer datasets are truly dense.
Images come to mind, but that domain is already well addressed by OpenCV.
Proposal: adopt a serialization/deserialization strategy that supports sparse matrix representations, which are efficient in both memory and computation. Here is an example of a sparse matrix implementation that is used very heavily for ML tasks:
http://www.mathworks.com/help/matlab/ref/sparse.html
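To make the memory argument concrete, here is a minimal sketch in Java of what sparse serialization buys you: only (index, value) pairs for non-zero entries are written, rather than every entry densely. The class and method names are illustrative only, not Mahout's actual Writable API, and the format (length, non-zero count, then index/value pairs) is a hypothetical coordinate encoding.

```java
import java.io.*;
import java.util.*;

// Illustrative sketch (not Mahout's API): serialize a vector in sparse
// coordinate form -- logical length, non-zero count, then one
// (index, value) pair per non-zero entry.
public class SparseVectorDemo {
    static byte[] writeSparse(double[] v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int nnz = 0;
        for (double x : v) if (x != 0.0) nnz++;
        out.writeInt(v.length);   // logical size of the vector
        out.writeInt(nnz);        // number of non-zero entries
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {
                out.writeInt(i);      // 4 bytes: position
                out.writeDouble(v[i]); // 8 bytes: value
            }
        }
        return bos.toByteArray();
    }

    static double[] readSparse(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        double[] v = new double[in.readInt()];
        int nnz = in.readInt();
        for (int k = 0; k < nnz; k++) {
            v[in.readInt()] = in.readDouble();
        }
        return v;
    }

    public static void main(String[] args) throws IOException {
        // A length-10000 vector with only two non-zeros, as in NLP
        // term vectors or patient/disease indicator vectors.
        double[] v = new double[10000];
        v[3] = 1.5;
        v[9999] = -2.0;
        byte[] sparse = writeSparse(v);
        System.out.println("sparse bytes: " + sparse.length);  // 4 + 4 + 2*(4+8) = 32
        System.out.println("dense bytes:  " + (8 * v.length)); // 80000
        System.out.println("round-trip ok: " + Arrays.equals(v, readSparse(sparse)));
    }
}
```

At 99.98% sparsity the coordinate encoding here is roughly 2500x smaller than a dense dump; the crossover point is around one-third non-zeros, since each stored entry costs 12 bytes instead of 8.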