Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2015/08/31 17:52:45 UTC

[jira] [Created] (SPARK-10371) Optimize sequential projections

Xiangrui Meng created SPARK-10371:
-------------------------------------

             Summary: Optimize sequential projections
                 Key: SPARK-10371
                 URL: https://issues.apache.org/jira/browse/SPARK-10371
             Project: Spark
          Issue Type: New Feature
          Components: ML, SQL
    Affects Versions: 1.5.0
            Reporter: Xiangrui Meng


In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, a pipeline might produce a DataFrame with columns a, b, c, and d, where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, udf_b and udf_c are each triggered twice, i.e., the value of c is not re-used when computing d.

It would be nice to detect this pattern and re-use intermediate values.
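
To make the duplicated work concrete, here is a minimal sketch in plain Python (not Spark code). It assumes the duplication arises because each materialized column is evaluated from its fully expanded expression. The UDF bodies are cheap placeholders, and the names project_inlined, project_reused, and calls are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

# Counts how many times each placeholder UDF is invoked.
calls = Counter()

def udf_b(a):
    calls["udf_b"] += 1
    return a + 1

def udf_c(b):
    calls["udf_c"] += 1
    return b * 2

def udf_d(c):
    calls["udf_d"] += 1
    return c - 3

def project_inlined(a):
    """Evaluate c and d independently from their fully expanded expressions."""
    c = udf_c(udf_b(a))
    d = udf_d(udf_c(udf_b(a)))  # recomputes b and c instead of re-using them
    return c, d

def project_reused(a):
    """Evaluate each intermediate value once and feed it to the next step."""
    b = udf_b(a)
    c = udf_c(b)
    d = udf_d(c)
    return c, d

calls.clear()
inlined_result = project_inlined(10)
inlined_calls = dict(calls)

calls.clear()
reused_result = project_reused(10)
reused_calls = dict(calls)

print(inlined_result, inlined_calls)
print(reused_result, reused_calls)
```

Both functions return the same values for c and d. In this model, the inlined version invokes udf_b and udf_c twice per row, while the re-using version invokes each UDF once.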



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
