Posted to user@spark.apache.org by Cesar Flores <ce...@gmail.com> on 2015/03/17 23:26:52 UTC
ML Pipeline question about caching
Hello all:
I am using the ML Pipeline, which I consider very powerful. I have the
following use case:
- I have three transformers, which I will call A, B, and C, that basically
extract features from text files and take no parameters.
- I have a final stage D, which is the logistic regression estimator.
- I am creating a pipeline with the sequence A,B,C,D.
- Finally, I am using this pipeline as the estimator parameter of the
CrossValidator class.
I have some concerns about how data persistence inside the cross validator
works. For example, if only D has multiple parameters to tune using the
cross validator, my concern is that the transformation A->B->C may be
performed multiple times. Is that the case, or is Spark smart enough to
realize that it can persist the output of C? Would it be better to leave
A, B, and C outside the cross validator pipeline?
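To make the concern concrete, here is a plain-Python sketch (not actual Spark
code) that just counts how often a stand-in for the A->B->C chain would run if
it sat inside the cross-validation loop, versus being computed once up front:

```python
# Plain-Python sketch of the caching concern. run_transformers() stands in
# for the A -> B -> C feature-extraction chain; we count its invocations.

transform_calls = 0

def run_transformers(data):
    """Stand-in for the A -> B -> C chain; counts each execution."""
    global transform_calls
    transform_calls += 1
    return [x * 2 for x in data]  # dummy "features"

def naive_cross_validate(data, param_grid, num_folds):
    """If A, B, C live inside the pipeline handed to the cross validator,
    the chain re-runs once per (parameter combination, fold) pair."""
    for _ in param_grid:
        for _ in range(num_folds):
            run_transformers(data)  # re-computed every time

data = [1, 2, 3]
param_grid = [0.01, 0.1, 1.0]  # e.g. regularization values to tune on D
num_folds = 5

naive_cross_validate(data, param_grid, num_folds)
naive_runs = transform_calls   # 3 parameter values x 5 folds = 15 runs

# Workaround: run the chain once up front (and cache/persist the result in
# real Spark), then cross-validate only the estimator D on the output.
transform_calls = 0
features = run_transformers(data)
cached_runs = transform_calls  # 1 run

print(naive_runs, cached_runs)
```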
Thanks a lot
--
Cesar Flores
Re: ML Pipeline question about caching
Posted by Peter Rudenko <pe...@gmail.com>.
Hi Cesar,
I had a similar issue. Yes, for now it's better to run A, B, and C outside
the cross validator. Take a look at my comment
<https://issues.apache.org/jira/browse/SPARK-4766?focusedCommentId=14320038&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14320038>
and this jira <https://issues.apache.org/jira/browse/SPARK-5844>. The
problem is that transformers could also have hyperparameters in the
future (like a word2vec transformer). Then the cross validator would need
to find the best parameters for both the transformers and the estimator,
which blows up the number of combinations (number of transformer
parameters x number of estimator parameters x number of folds).
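For concreteness, a back-of-the-envelope sketch of that blow-up, with made-up
illustrative counts (not taken from any real model):

```python
# Hypothetical grid sizes, chosen only to illustrate the arithmetic.
num_transformer_params = 4  # e.g. word2vec vector sizes to try
num_estimator_params = 6    # e.g. regularization values for logistic regression
num_folds = 3

# Estimator-only tuning: the cross validator fits the pipeline once per
# estimator parameter value per fold.
estimator_only_fits = num_estimator_params * num_folds

# Once transformers expose hyperparameters too, the grid becomes the product
# of both parameter lists, and every combination is fit on every fold.
joint_fits = num_transformer_params * num_estimator_params * num_folds

print(estimator_only_fits, joint_fits)
```

With these counts, 6 x 3 = 18 fits grows to 4 x 6 x 3 = 72, and every one of
those fits would re-run the transformer chain unless its output is cached.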
Thanks,
Peter Rudenko
On 2015-03-18 00:26, Cesar Flores wrote: