Posted to user@spark.apache.org by Chengi Liu <ch...@gmail.com> on 2014/07/30 08:39:40 UTC

Converting matrix format

Hi,
    I have an RDD with n rows and m columns, but most of the entries are 0,
so it is effectively a sparse matrix.

I would like to get only the non-zero entries along with their indices.

The equivalent plain Python code would be:

for i, x in enumerate(matrix):
    for j, y in enumerate(x):
        if y:
            print(i, j, y)

Now, how do I save these (i, j, y) entries in PySpark?
Thanks
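In PySpark, the loop above can be written as a distributed job: zipWithIndex attaches the row index to each element, and flatMap emits one (i, j, y) triple per non-zero cell. A minimal sketch (the RDD name, helper name, and output path are hypothetical; it assumes each RDD element is a list of that row's values):

```python
def nonzero_triples(indexed_row):
    """Given a (row_values, i) pair from zipWithIndex, return the
    (i, j, value) triples for the non-zero cells of that row."""
    row, i = indexed_row
    return [(i, j, y) for j, y in enumerate(row) if y]

# With a SparkContext `sc` and an RDD of rows (hypothetical names):
#   triples = rdd.zipWithIndex().flatMap(nonzero_triples)
#   triples.saveAsTextFile("sparse_entries")

# The helper itself needs no Spark to test:
print(nonzero_triples(([0, 5, 0, 2], 3)))  # -> [(3, 1, 5), (3, 3, 2)]
```

Because the work happens per row inside flatMap, nothing is collected to the driver, so this approach also scales to matrices that do not fit in memory.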

Re: Converting matrix format

Posted by Chengi Liu <ch...@gmail.com>.
Thanks.
What if it is a big matrix, e.g. billions of rows and millions of columns?
On Wednesday, July 30, 2014, Davies Liu <da...@databricks.com> wrote:

> It will depend on the size of your matrix. If it can fit in memory,
> then you can
>
> sparse = sparse_matrix(matrix)  # sparse_matrix is the function you had written
> sc.parallelize(sparse, NUM_OF_PARTITIONS)
>
> On Tue, Jul 29, 2014 at 11:39 PM, Chengi Liu <chengi.liu.86@gmail.com> wrote:
> > Hi,
> >     I have an RDD with n rows and m columns, but most of the entries
> > are 0, so it is effectively a sparse matrix.
> >
> > I would like to get only the non-zero entries along with their indices.
> >
> > The equivalent plain Python code would be:
> >
> > for i, x in enumerate(matrix):
> >     for j, y in enumerate(x):
> >         if y:
> >             print(i, j, y)
> >
> > Now, how do I save these (i, j, y) entries in PySpark?
> > Thanks
> >
> >
>

Re: Converting matrix format

Posted by Davies Liu <da...@databricks.com>.
It will depend on the size of your matrix. If it can fit in memory,
then you can

sparse = sparse_matrix(matrix) # sparse_matrix is the function you had written
sc.parallelize(sparse, NUM_OF_PARTITIONS)
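For completeness, a minimal local sketch of what a sparse_matrix helper like the one referenced above might look like (a hypothetical implementation, assuming the whole matrix is a list of lists that fits in driver memory):

```python
def sparse_matrix(matrix):
    """Return the (i, j, value) triples for the non-zero entries
    of a dense list-of-lists matrix."""
    return [(i, j, y)
            for i, x in enumerate(matrix)
            for j, y in enumerate(x)
            if y]

dense = [[0, 1], [2, 0]]
print(sparse_matrix(dense))  # -> [(0, 1, 1), (1, 0, 2)]
```

The resulting list of triples is what sc.parallelize then distributes across NUM_OF_PARTITIONS partitions.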

On Tue, Jul 29, 2014 at 11:39 PM, Chengi Liu <ch...@gmail.com> wrote:
> Hi,
>     I have an RDD with n rows and m columns, but most of the entries
> are 0, so it is effectively a sparse matrix.
>
> I would like to get only the non-zero entries along with their indices.
>
> The equivalent plain Python code would be:
>
> for i, x in enumerate(matrix):
>     for j, y in enumerate(x):
>         if y:
>             print(i, j, y)
>
> Now, how do I save these (i, j, y) entries in PySpark?
> Thanks
>
>