Posted to user@spark.apache.org by Chengi Liu <ch...@gmail.com> on 2014/07/30 08:39:40 UTC
Converting matrix format
Hi,
I have an RDD with n rows and m columns, but most of the entries are 0,
i.e. it is a sparse matrix.
I would like to get only the non-zero entries along with their indices.
The equivalent Python code would be:
for i, x in enumerate(matrix):
    for j, y in enumerate(x):
        if y:
            print i, j, y
Now, what I would like to do is save the (i, j, y) entries.
How do I do this in PySpark?
Thanks
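In PySpark the nested loop above maps naturally onto zipWithIndex plus flatMap. A minimal sketch, with the per-row logic written as a plain function so it can be tested without a Spark context; the RDD calls in the comment are the assumed usage, and the output path is a placeholder:

```python
def row_to_entries(row_with_index):
    """Emit (i, j, y) triples for the non-zero values of one indexed row.

    RDD.zipWithIndex() yields (element, index) pairs, hence the unpacking order.
    """
    row, i = row_with_index
    return [(i, j, y) for j, y in enumerate(row) if y]

# Assumed usage on an RDD of dense rows:
#   entries = matrix.zipWithIndex().flatMap(row_to_entries)
#   entries.saveAsTextFile("hdfs://.../sparse_entries")  # path is hypothetical
```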
Re: Converting matrix format
Posted by Chengi Liu <ch...@gmail.com>.
Thanks.
What if it's a big matrix, e.g. billions of rows and millions of columns?
On Wednesday, July 30, 2014, Davies Liu <da...@databricks.com> wrote:
> It will depend on the size of your matrix. If it fits in memory,
> then you can:
>
>     sparse = sparse_matrix(matrix)  # sparse_matrix is the function you had written
>     sc.parallelize(sparse, NUM_OF_PARTITIONS)
>
Re: Converting matrix format
Posted by Davies Liu <da...@databricks.com>.
It will depend on the size of your matrix. If it fits in memory,
then you can:

    sparse = sparse_matrix(matrix)  # sparse_matrix is the function you had written
    sc.parallelize(sparse, NUM_OF_PARTITIONS)
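For the follow-up case (billions of rows, too big to collect on the driver), the conversion can stay entirely distributed: transform row by row and write out one text line per non-zero entry, never materializing the matrix locally. A sketch, with the formatting step as a plain function; the Spark pipeline in the comment is the assumed usage, and the output path is a placeholder:

```python
def entry_to_line(entry):
    """Format one (i, j, y) triple as an 'i,j,y' text line for saveAsTextFile."""
    i, j, y = entry
    return "%d,%d,%s" % (i, j, y)

# Assumed distributed pipeline over an RDD of dense rows:
#   lines = (matrix.zipWithIndex()
#                  .flatMap(lambda rw: [(rw[1], j, y)
#                                       for j, y in enumerate(rw[0]) if y])
#                  .map(entry_to_line))
#   lines.saveAsTextFile("hdfs://.../coo_matrix")  # output path is hypothetical
```

This produces the matrix in coordinate (COO) form, which is easy to reload or feed into other tools.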