
Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2015/08/31 17:52:46 UTC

[jira] [Commented] (SPARK-10371) Optimize sequential projections

    [ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723585#comment-14723585 ] 

Xiangrui Meng commented on SPARK-10371:
---------------------------------------

ping [~yhuai]

> Optimize sequential projections
> -------------------------------
>
>                 Key: SPARK-10371
>                 URL: https://issues.apache.org/jira/browse/SPARK-10371
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SQL
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, a pipeline might produce a DataFrame with the columns a, b, c, and d, where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, udf_b and udf_c are each triggered twice, i.e., the value of c is not re-used when computing d.
> It would be nice to detect this pattern and re-use intermediate values.
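
The pattern described in the issue can be illustrated with a small plain-Python sketch. It does not use Spark and does not show how Spark evaluates projections; it only counts function calls. The bodies of udf_b, udf_c, and udf_d, the `calls` counter, and the two `project_*` helpers are hypothetical names made up for illustration. The "naive" version treats c and d as independent expressions over a, while the "reusing" version computes each intermediate value once.

```python
from collections import Counter

# Hypothetical call counter: records how often each stand-in UDF runs.
calls = Counter()

def udf_b(a):
    calls["udf_b"] += 1
    return a + 1

def udf_c(b):
    calls["udf_c"] += 1
    return b * 2

def udf_d(c):
    calls["udf_d"] += 1
    return c - 3

def project_naive(a):
    """Evaluate c and d as independent expressions over a (no sharing)."""
    c = udf_c(udf_b(a))
    d = udf_d(udf_c(udf_b(a)))  # recomputes udf_b and udf_c
    return c, d

def project_reusing(a):
    """Evaluate the chain once and re-use the intermediate values."""
    b = udf_b(a)
    c = udf_c(b)
    d = udf_d(c)  # re-uses c instead of recomputing it
    return c, d

calls.clear()
naive_result = project_naive(5)
naive_calls = dict(calls)

calls.clear()
reuse_result = project_reusing(5)
reuse_calls = dict(calls)

print("naive:  ", naive_result, naive_calls)
print("reusing:", reuse_result, reuse_calls)
```

Both versions return the same values for c and d. In the naive version the stand-ins for udf_b and udf_c each run twice per input; in the reusing version each runs once.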


