Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/07/09 21:40:00 UTC

[jira] [Resolved] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-28140.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24953
[https://github.com/apache/spark/pull/24953]

> Pyspark API to create spark.mllib RowMatrix from DataFrame
> ----------------------------------------------------------
>
>                 Key: SPARK-28140
>                 URL: https://issues.apache.org/jira/browse/SPARK-28140
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Henry Davidge
>            Assignee: Henry Davidge
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> Since many functions are implemented only in spark.mllib, it is often necessary to convert DataFrames of spark.ml vectors to spark.mllib distributed matrix formats. The first step, converting the spark.ml vectors to their spark.mllib equivalents, is straightforward. However, to the best of my knowledge it is not possible to convert the resulting DataFrame to a RowMatrix without using a Python lambda function, which can incur a significant performance penalty: in my recent use case, SVD took 3.5 minutes through the Scala API but 12 minutes through Python.
> To avoid this penalty, I propose adding a constructor to the PySpark RowMatrix class that accepts a DataFrame with a single column of spark.mllib vectors, as sketched below. I'd be happy to add an equivalent API for IndexedRowMatrix if there is demand.
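> A minimal sketch of both paths, assuming the existing spark.mllib Python API (MLUtils.convertVectorColumnsFromML, RowMatrix, and computeSVD are existing calls); the DataFrame-accepting RowMatrix constructor is the addition proposed here:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.ml.linalg import Vectors
> from pyspark.mllib.util import MLUtils
> from pyspark.mllib.linalg.distributed import RowMatrix
>
> spark = SparkSession.builder.getOrCreate()
>
> # A DataFrame with a single column of spark.ml vectors
> df = spark.createDataFrame(
>     [(Vectors.dense([1.0, 2.0, 3.0]),),
>      (Vectors.dense([4.0, 5.0, 6.0]),)],
>     ["features"])
>
> # Step 1 (straightforward): convert spark.ml vectors to spark.mllib vectors
> mllib_df = MLUtils.convertVectorColumnsFromML(df, "features")
>
> # Current path: the Python lambda forces every row through the Python
> # workers, which is the source of the slowdown described above
> mat_slow = RowMatrix(mllib_df.rdd.map(lambda row: row.features))
>
> # Proposed path: pass the single-vector-column DataFrame directly,
> # keeping the conversion on the JVM side
> mat_fast = RowMatrix(mllib_df)
>
> svd = mat_fast.computeSVD(2, computeU=True)
> {code}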



