Posted to issues@spark.apache.org by "Max Moroz (JIRA)" <ji...@apache.org> on 2016/07/01 07:52:11 UTC

[jira] [Comment Edited] (SPARK-16319) Non-linear (DAG) pipelines need better explanation

    [ https://issues.apache.org/jira/browse/SPARK-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358592#comment-15358592 ] 

Max Moroz edited comment on SPARK-16319 at 7/1/16 7:52 AM:
-----------------------------------------------------------

[~srowen] Sorry for being unclear. 

Suppose I create three pipelines.

All three are created from the same list of nodes, node_list. The nodes don't need any input or output columns; they actually ignore the parameters provided to them. But for our purposes, we will provide three different sets of input/output columns:

1) pipeline_DAG provides input and output column names that indicate a non-linear acyclic data flow graph (relative to that graph, node_list is in topological order, as required)
2) pipeline_linear provides input and output column names that indicate a simple linear data flow graph (of course, again consistent with node_list order)
3) pipeline_unknown does not provide input and output columns at all, thus precluding ML from inferring the graph structure
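
To make the three configurations concrete, here is a minimal sketch in plain Python (no Spark involved). It shows one way a data flow graph could be inferred from matching input/output column names and checked against the order of node_list. The stage tuples, the column names, and the helpers as_list, infer_edges and is_topological are all hypothetical and exist only for illustration; nothing here is taken from the Spark ML code.

```python
# Hypothetical stage description: (name, input column(s), output column).
# These tuples are illustrative stand-ins, not Spark ML objects.

def as_list(cols):
    """Normalize None, a single column name, or a sequence of names to a list."""
    if cols is None:
        return []
    if isinstance(cols, str):
        return [cols]
    return list(cols)


def infer_edges(stages):
    """Return (producer, consumer) pairs implied by matching column names."""
    producer_of = {out: name for name, _, out in stages if out is not None}
    return [
        (producer_of[col], name)
        for name, inputs, _ in stages
        for col in as_list(inputs)
        if col in producer_of
    ]


def is_topological(stages):
    """True if every producer appears before its consumers in the list."""
    position = {name: i for i, (name, _, _) in enumerate(stages)}
    return all(position[src] < position[dst] for src, dst in infer_edges(stages))


# Same four nodes, three different sets of (made-up) column names.
dag_stages = [
    ("n1", "raw", "a"),
    ("n2", "a", "b"),
    ("n3", "a", "c"),          # n2 and n3 both depend only on n1
    ("n4", ["b", "c"], "d"),   # n4 joins the two branches
]
linear_stages = [
    ("n1", "raw", "a"),
    ("n2", "a", "b"),
    ("n3", "b", "c"),
    ("n4", "c", "d"),
]
unknown_stages = [(name, None, None) for name in ("n1", "n2", "n3", "n4")]

dag_edges = infer_edges(dag_stages)
linear_edges = infer_edges(linear_stages)
unknown_edges = infer_edges(unknown_stages)
```

In this sketch the DAG configuration yields a branching graph, the linear one a chain, and the unknown one no edges at all, yet the same node order is topologically valid for all three.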

Under what circumstances will calling fit() on those pipelines result in different behavior? (Same question for transform(), but I assume the answer will be identical.) I couldn't find any code that would behave differently for these three objects (I expected, for example, something like an attempt to execute nodes in parallel when they don't depend on each other, but failed to see anything like that in the code). It would be good if the docs clarified that.
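
The behavior in question can be modeled with a toy example in plain Python. This is only a sketch of the observation reported in the issue description (stages applied in the order given, column names never consulted). ToyStage and ToyPipeline are hypothetical classes written for this illustration; they are not Spark's Pipeline and make no claim about how Spark is actually implemented.

```python
class ToyStage:
    """Hypothetical stand-in for a pipeline stage; not a Spark ML class."""

    def __init__(self, name, input_cols=None, output_col=None):
        self.name = name
        self.input_cols = input_cols
        self.output_col = output_col

    def run(self, log):
        # Like the nodes in the scenario above, this stage ignores its
        # column parameters; it only records that it was executed.
        log.append(self.name)


class ToyPipeline:
    """Models only the reported behavior: stages run in list order."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self):
        log = []
        for stage in self.stages:  # strictly sequential, no graph inference
            stage.run(log)
        return log


names = ["n1", "n2", "n3", "n4"]

# Made-up column names describing a DAG, a chain, and nothing at all.
dag_cols = [("raw", "a"), ("a", "b"), ("a", "c"), (["b", "c"], "d")]
linear_cols = [("raw", "a"), ("a", "b"), ("b", "c"), ("c", "d")]
unknown_cols = [(None, None)] * 4


def build(cols):
    return ToyPipeline(ToyStage(n, i, o) for n, (i, o) in zip(names, cols))


order_dag = build(dag_cols).fit()
order_linear = build(linear_cols).fit()
order_unknown = build(unknown_cols).fit()
```

Under this model all three pipelines execute the nodes in exactly the same order, which is the equivalence the question asks the docs to confirm or refute.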

Another minor clarification that I'd recommend is to say what exactly the terms "input column" / "output column" mean in the implicit definition of a DAG. I assume each refers to a parameter named inputCol / outputCol, which may contain a string or a list of strings (in pyspark), but it's not really obvious at all.



was (Author: mmoroz):
[~srowen] Sorry for being unclear. 

Suppose I create three pipelines.

All three are created from the same list of nodes, node_list. The nodes don't need any input or output columns; they actually ignore the parameters provided to them. But for our purposes, we will provide three different sets of input/output columns:

1) pipeline_DAG provides input and output column names that indicate a non-linear acyclic data flow graph (such that node_list is in topological order, as required)
2) pipeline_linear provides input and output column names that indicate a simple linear data flow graph (identical to the order of nodes in node_list)
3) pipeline_unknown does not provide input and output columns at all, thus precluding ML from inferring the graph structure

Under what circumstances will a fit() or a transform() called on those pipelines behave differently? I couldn't find any code that would behave differently for these three objects (I expected, for example, something like an attempt to execute nodes in parallel when they don't depend on each other, but failed to see anything like that in the code). It would be good if the docs clarified that.

Another minor clarification that I'd recommend is to say what exactly the terms "input column" / "output column" mean in the implicit definition of a DAG. I assume each refers to a parameter named inputCol / outputCol, which may contain a string or a list of strings (in pyspark), but it's not really obvious at all.


> Non-linear (DAG) pipelines need better explanation
> --------------------------------------------------
>
>                 Key: SPARK-16319
>                 URL: https://issues.apache.org/jira/browse/SPARK-16319
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> There's a [paragraph|http://spark.apache.org/docs/2.0.0-preview/ml-guide.html#details] about non-linear pipelines in the ML docs, but it's not clear how a DAG pipeline differs from a linear pipeline. In fact, it seems that a "DAG Pipeline" results in behavior identical to that of a regular linear pipeline (the stages are simply applied in the order provided when the pipeline is created). In addition, no checks of input and output columns seem to occur when pipeline.fit() or pipeline.transform() is called.
> It would be better to clarify in the docs and/or remove that paragraph.
> I'd be happy to write it up, but I have no idea what the intention of this concept is at this point.
> [Additional reference on SO|http://stackoverflow.com/questions/37541668/non-linear-dag-ml-pipelines-in-apache-spark]


