Posted to issues@spark.apache.org by "Peter Aberline (JIRA)" <ji...@apache.org> on 2015/06/21 15:44:00 UTC

[jira] [Created] (SPARK-8510) Store and read NumPy arrays and matrices as values in sequence files

Peter Aberline created SPARK-8510:
-------------------------------------

             Summary: Store and read NumPy arrays and matrices as values in sequence files
                 Key: SPARK-8510
                 URL: https://issues.apache.org/jira/browse/SPARK-8510
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
            Reporter: Peter Aberline
            Priority: Minor


I have extended the provided DoubleArrayWritable example code to store NumPy double-typed arrays and matrices as arrays of doubles and nested arrays of doubles, respectively.
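
As a rough illustration of the mapping described above (not the code from the forthcoming PR), the sketch below converts a NumPy vector or matrix to flat or nested lists of Python floats and back. The helper names to_double_lists and from_double_lists are hypothetical and exist only for this example. The Spark side, i.e. writing these values to a sequence file through a Writable converter, is not shown.

```python
import numpy as np


def to_double_lists(arr):
    """Convert a 1-D or 2-D NumPy array (or matrix) to (nested) lists of floats.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    a = np.asarray(arr, dtype=np.float64)
    if a.ndim not in (1, 2):
        raise ValueError("expected a 1-D or 2-D array, got %d dimensions" % a.ndim)
    # tolist() yields plain Python floats, nested once per dimension.
    return a.tolist()


def from_double_lists(values):
    """Rebuild a float64 NumPy array from (nested) lists of floats."""
    return np.array(values, dtype=np.float64)


vector = np.array([1.0, 2.5, 3.0])
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])

vector_payload = to_double_lists(vector)   # flat list of doubles
matrix_payload = to_double_lists(matrix)   # list of lists of doubles

restored = from_double_lists(matrix_payload)
```

The round trip preserves the values and shape of the matrix. Anything beyond the values themselves, such as a non-float dtype, is lost by design in this sketch.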

Pandas DataFrames can easily be converted to NumPy matrices, so I have also added the ability to store the schema-less data from DataFrames and Series that contain double data.

Apart from my own use, there seems to be demand for this functionality:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=Q3n-01iOQ_pkWE1g-c39XiMCo3KhqngQg@mail.gmail.com%3E

I'll be issuing a PR for this shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org