Posted to user@mahout.apache.org by Pradhuman Jhala <Pr...@fox.com> on 2008/12/03 00:07:25 UTC

sparse matrix format

Hi,
 
I am looking for documentation on the input formats, particularly the sparse matrix format, supported by the various supervised and unsupervised algorithms available in Mahout. It looks like a 'sparse matrix format' is supported, but I am not able to find any details on it.
 
From the way k-means clustering uses the org.apache.mahout.matrix package, it seems to expect data in the
"[sM+2, index_1:value_1, index_2:value_2, ...., index_M:value_M, ]" format for it to be considered 'sparse'. Is this correct, and is it consistent across all clustering algorithms?
 
thanks.
Pradhuman

Re: sparse matrix format

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Hi Pradhuman,

All of the clustering algorithms use our vector implementation, and the 
actual class used (Sparse or Dense) should depend upon the encoding 
format used. If you write a preprocessor job to get your input vectors 
in the right format before running a clustering job on them, I suggest 
using the SparseVector implementation. It will serialize itself in a 
manner similar to your example (though I'd expect to see just '[sM, ' 
where M is the cardinality of the vector).

Jeff
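
As an illustration of the text layout discussed in this thread, here is a minimal Python sketch of a preprocessor helper that writes and reads lines of the form '[sM, index:value, ..., ]', where M is the cardinality. It is not Mahout code: the function names encode_sparse and decode_sparse are hypothetical, zero-based indices are an assumption, and the exact separators are copied from the example in the question, so the output should be checked against what SparseVector actually produces before relying on it.

```python
def encode_sparse(cardinality, entries):
    """Encode {index: value} as '[sM, i:v, i:v, ]'.

    The layout is assumed from the example in the thread, with M taken
    to be the cardinality of the vector. Hypothetical helper, not part
    of Mahout.
    """
    parts = [f"[s{cardinality}, "]
    for index in sorted(entries):
        # Zero-based indexing is an assumption of this sketch.
        if not 0 <= index < cardinality:
            raise ValueError(f"index {index} outside cardinality {cardinality}")
        parts.append(f"{index}:{entries[index]}, ")
    parts.append("]")
    return "".join(parts)


def decode_sparse(text):
    """Parse a string written by encode_sparse back into (cardinality, entries)."""
    body = text.strip()
    if not (body.startswith("[s") and body.endswith("]")):
        raise ValueError("not in the assumed sparse format")
    # Drop the leading '[s' and trailing ']', then split on commas,
    # ignoring the empty field left by the trailing ', '.
    fields = [f.strip() for f in body[2:-1].split(",") if f.strip()]
    cardinality = int(fields[0])
    entries = {}
    for field in fields[1:]:
        index, value = field.split(":")
        entries[int(index)] = float(value)
    return cardinality, entries


line = encode_sparse(5, {1: 2.0, 3: 0.5})
print(line)
print(decode_sparse(line))
```

Only the non-zero entries are written, so a vector with a large cardinality and few populated indices stays short on disk.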
