Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2019/05/08 10:32:00 UTC

[jira] [Resolved] (SPARK-23805) support vector-size validation and Inference

     [ https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng resolved SPARK-23805.
----------------------------------
    Resolution: Not A Problem

> support vector-size validation and Inference
> --------------------------------------------
>
>                 Key: SPARK-23805
>                 URL: https://issues.apache.org/jira/browse/SPARK-23805
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> I think it may be meaningful to unify the usage of {{AttributeGroup}} and to support vector-size validation and inference in the algorithms.
> My thoughts are:
>  * In {{transformSchema}}, validate the input vector-size if possible: if the input vector-size can be obtained from the schema, check it.
>  ** For a {{PCA}} estimator with k=4, {{transformSchema}} will require the input vector-size to be no less than 4, since k cannot exceed the number of input features.
>  ** For a {{PCAModel}} trained on vectors of length 10, {{transformSchema}} will require the input vector-size to be 10.
>  * In {{transformSchema}}, infer the output vector-size if possible.
>  ** For a {{PCA}} estimator with k=4, {{transformSchema}} will return a schema with output vector-size=4.
>  ** For a {{PCAModel}} trained with k=4, {{transformSchema}} will return a schema with output vector-size=4.
>  * In {{transform}}, infer the output vector-size if possible.
>  * In {{fit}}, obtain the input vector-size from the schema if possible. This can help eliminate redundant {{first}} jobs.
>  
> The current PR only modifies {{PCA}} and {{MaxAbsScaler}} to illustrate my idea. Since the validation and inference are quite algorithm-specific, we may need to separate the task into several small subtasks.
> What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]
>  
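
The validation and inference rules proposed above can be illustrated with a small, dependency-free model. This is only a sketch and not Spark code: a schema is reduced to a dict from column name to vector size, with -1 standing for "size not recorded in the metadata", and the names PCASketch, PCAModelSketch and UNKNOWN are hypothetical. It assumes that k cannot exceed the input dimension, so the estimator rejects inputs smaller than k.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical stand-in for "vector size not recorded in the column metadata".
UNKNOWN = -1

# Simplified schema: column name -> vector size (UNKNOWN if not recorded).
Schema = Dict[str, int]


@dataclass
class PCASketch:
    """Models the estimator-side rules; not the Spark PCA class."""
    input_col: str
    output_col: str
    k: int

    def transform_schema(self, schema: Schema) -> Schema:
        size = schema[self.input_col]  # KeyError if the column is missing
        # Validate only when the size is known; otherwise defer to fit time.
        if size != UNKNOWN and size < self.k:
            raise ValueError(
                f"input vector-size {size} must be no less than k={self.k}")
        out = dict(schema)
        out[self.output_col] = self.k  # output size is known from k alone
        return out


@dataclass
class PCAModelSketch:
    """Models the model-side rules; not the Spark PCAModel class."""
    input_col: str
    output_col: str
    k: int
    num_features: int  # vector size seen during training

    def transform_schema(self, schema: Schema) -> Schema:
        size = schema[self.input_col]
        # A fitted model accepts exactly the size it was trained on.
        if size != UNKNOWN and size != self.num_features:
            raise ValueError(
                f"input vector-size {size} must be {self.num_features}")
        out = dict(schema)
        out[self.output_col] = self.k
        return out
```

In this model, a schema whose input size is unknown passes validation and still gains a known output size, which is the case where a later stage could skip a job that inspects the first row only to learn the dimension.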


