Posted to user@spark.apache.org by XapaJIaMnu <nh...@gmail.com> on 2016/06/12 10:08:40 UTC

Several questions about how pyspark.ml works

Hey,

I have some additional Spark ML algorithms implemented in Scala that I would
like to make available in PySpark. For reference, I am looking at the
logistic regression implementation available here:

https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/ml/classification.html

I have a couple of questions:
1) As far as I understand, the constructor for the *class LogisticRegression*
just accepts the arguments and constructs the underlying Scala object via
/py4j/, passing the arguments along. This is done via the line
*self._java_obj = self._new_java_obj(
"org.apache.spark.ml.classification.LogisticRegression", self.uid)*
Is this correct?
What does the line *super(LogisticRegression, self).__init__()* do?
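
For concreteness, here is roughly what I imagine a wrapper for one of my own
estimators would look like, modelled on the LogisticRegression code (the
com.example.ml class, MyScalaModel, and the param mix-ins are made up for
illustration):

    from pyspark import keyword_only
    from pyspark.ml.wrapper import JavaEstimator
    from pyspark.ml.param.shared import HasFeaturesCol, HasLabelCol, \
        HasPredictionCol


    class MyScalaEstimator(JavaEstimator, HasFeaturesCol, HasLabelCol,
                           HasPredictionCol):

        @keyword_only
        def __init__(self, featuresCol="features", labelCol="label",
                     predictionCol="prediction"):
            super(MyScalaEstimator, self).__init__()
            # Create the peer JVM object through py4j and keep a handle to it.
            self._java_obj = self._new_java_obj(
                "com.example.ml.MyScalaEstimator", self.uid)
            kwargs = self.__init__._input_kwargs
            self.setParams(**kwargs)

        @keyword_only
        def setParams(self, featuresCol="features", labelCol="label",
                      predictionCol="prediction"):
            kwargs = self.setParams._input_kwargs
            return self._set(**kwargs)

        def _create_model(self, java_model):
            # MyScalaModel would be the matching JavaModel wrapper
            # (see question 2).
            return MyScalaModel(java_model)

Is this the right pattern?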

Does this mean that any Python data structures used with it will be converted
to Java structures once the object is instantiated?

2) The corresponding model, *class LogisticRegressionModel(JavaModel)*, again
just wraps the Java object and nothing else? Is it enough for me to just
forward the arguments and instantiate the Scala objects?
Does this mean that when a pipeline is created, even a Python pipeline
expects stages that are backed by Scala code instantiated via /py4j/? Can one
use pure Python stages inside the pipeline (dealing with RDDs)? What would be
the performance implications?
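
By "pure Python stages" I mean something like this minimal sketch of a
Transformer with no JVM peer (class name and column logic made up):

    from pyspark.ml import Transformer


    class PurePythonTransformer(Transformer):
        """A pipeline stage implemented entirely in Python."""

        def _transform(self, dataset):
            # This runs as Python code, so rows have to cross the
            # JVM <-> Python boundary instead of staying on the JVM.
            return dataset.withColumn("label_x2", dataset["label"] * 2)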

Cheers,

Nick





Re: Several questions about how pyspark.ml works

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Nick,

Please see my inline reply.

Thanks
Yanbo

2016-06-12 3:08 GMT-07:00 XapaJIaMnu <nh...@gmail.com>:

> Hey,
>
> I have some additional Spark ML algorithms implemented in Scala that I
> would like to make available in PySpark. For reference, I am looking at
> the logistic regression implementation available here:
>
>
> https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/ml/classification.html
>
> I have a couple of questions:
> 1) As far as I understand, the constructor for the *class
> LogisticRegression* just accepts the arguments and constructs the
> underlying Scala object via /py4j/, passing the arguments along. This is
> done via the line
> *self._java_obj = self._new_java_obj(
> "org.apache.spark.ml.classification.LogisticRegression", self.uid)*
> Is this correct?
> What does the line *super(LogisticRegression, self).__init__()* do?
>

*super(LogisticRegression, self).__init__()* is used to initialize the
*Params* object on the Python side, since we store all params on the Python
side and transfer them to the Scala side when calling *fit*.
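
You can see this in *pyspark/ml/wrapper.py*. Roughly, *JavaEstimator* does
the following when *fit* is called (a simplified sketch, not the exact
source):

    class JavaEstimator(Estimator, JavaWrapper):

        def _fit_java(self, dataset):
            # Copy the params stored on the Python side over to the
            # peer Scala estimator...
            self._transfer_params_to_java()
            # ...then fit on the JVM side; dataset._jdf is the
            # underlying Java DataFrame.
            return self._java_obj.fit(dataset._jdf)

        def _fit(self, dataset):
            java_model = self._fit_java(dataset)
            # Wrap the returned Scala model in its Python wrapper class.
            return self._create_model(java_model)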


>
> Does this mean that any Python data structures used with it will be
> converted to Java structures once the object is instantiated?
>
> 2) The corresponding model, *class LogisticRegressionModel(JavaModel)*,
> again just wraps the Java object and nothing else? Is it enough for me
> to just forward the arguments and instantiate the Scala objects?
> Does this mean that when a pipeline is created, even a Python pipeline
> expects stages that are backed by Scala code instantiated via /py4j/?
> Can one use pure Python stages inside the pipeline (dealing with RDDs)?
> What would be the performance implications?
>

*class LogisticRegressionModel(JavaModel)* is only a wrapper around the peer
Scala model object.
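
A minimal model wrapper only needs to hold the Java object and forward calls
to it through /py4j/, e.g. (a hypothetical sketch; the class name and the
*weights* attribute are made up to match whatever your Scala model exposes):

    from pyspark.ml.wrapper import JavaModel


    class MyScalaModel(JavaModel):
        """All real work happens in the peer Scala model object."""

        @property
        def weights(self):
            # _call_java invokes the method of the same name on the
            # underlying Java object and converts the result back.
            return self._call_java("weights")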


>
> Cheers,
>
> Nick
>