
Posted to dev@spark.apache.org by YiZhi Liu <ja...@gmail.com> on 2015/10/07 08:47:57 UTC

What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

Hi everyone,

I'm curious about the difference between
ml.classification.LogisticRegression and
mllib.classification.LogisticRegressionWithLBFGS. Both of them are
optimized using LBFGS; the only difference I see is that LogisticRegression
takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD.

So I wonder,
1. Why not simply add a DataFrame training interface to
LogisticRegressionWithLBFGS?
2. What's the difference between the ml.classification and
mllib.classification packages?
3. Why doesn't ml.classification.LogisticRegression call
mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
it uses breeze.optimize.LBFGS and re-implements most of the procedures
in mllib.optimization.{LBFGS,OWLQN}.

Thank you.

Best,

-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

Posted by YiZhi Liu <ja...@gmail.com>.

Hi Tsai,

Thank you for pointing out the implementation details that I missed.
Yes, I saw several JIRA issues about the intercept, regularization, and
standardization; I just didn't realize they made such a big impact.
Thanks again.




-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China


Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

Posted by DB Tsai <db...@dbtsai.com>.

Hi Liu,

In ML, even after extracting the data into an RDD, the MLlib and ML
implementations are quite different. Due to legacy design, MLlib uses Updater
to handle regularization, and this layer of abstraction also does the
adaptive step size, which applies only to SGD. To get it working with
LBFGS, some hacks were added here and there, and in Updater all the
components, including the intercept, are regularized, which is not desirable
in many cases. Also, in the legacy design it's hard for us to do in-place
standardization to improve the convergence rate. As a result, at some
point we decided to ditch those abstractions and customize them for each
algorithm. (Even LiR (linear regression) and LoR (logistic regression) use
different tricks to get better performance from numerical optimization, so
it was hard to share code at that time. But I can see the point that we have
working code now, so it's time to try to refactor that code to share more.)
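
The intercept point can be illustrated with a small, Spark-free Python
sketch. This is not the MLlib or ML code: the `fit` helper, its parameters,
and the toy data are made up for illustration, and plain gradient descent
stands in for LBFGS. It only shows how penalizing the intercept changes the
fitted model.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(points, reg, penalize_intercept, steps=2000, lr=0.5):
    """Minimize mean log-loss plus an L2 penalty of strength `reg`.

    Hypothetical helper: plain gradient descent stands in for LBFGS.
    """
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in points:
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        # The weight is always penalized; the intercept only on request.
        gw = gw / n + reg * w
        gb = gb / n + (reg * b if penalize_intercept else 0.0)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy data: the feature is deliberately uninformative (75% positives at both
# x = -1 and x = +1), so the intercept alone has to carry the base rate.
points = [(-1, 1), (1, 1), (-1, 1), (1, 1), (-1, 1), (1, 1), (-1, 0), (1, 0)]

w_free, b_free = fit(points, reg=0.1, penalize_intercept=False)
w_pen, b_pen = fit(points, reg=0.1, penalize_intercept=True)

print("intercept left alone: b=%.4f, base rate=%.4f" % (b_free, sigmoid(b_free)))
print("intercept penalized:  b=%.4f, base rate=%.4f" % (b_pen, sigmoid(b_pen)))
```

With the intercept left alone, the fitted base rate matches the 75% of
positives in the data. With the intercept penalized, it is pulled toward
50%, which is the kind of bias the paragraph above calls undesirable.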


Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>


Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

Posted by YiZhi Liu <ja...@gmail.com>.

Hi Joseph,

Thank you for clarifying the motivation for setting up a separate API
for ML pipelines; it sounds great. But I still think we could extract
some common parts of the training & inference procedures for ml and
mllib. In ml.classification.LogisticRegression, you simply transform
the DataFrame into an RDD and follow the same procedures as in
mllib.optimization.{LBFGS,OWLQN}, right?

My suggestion, if I may, is that the ml package should focus on the public
API and leave the underlying implementations, e.g. numerical optimization,
to the mllib package.

Please let me know if I have misunderstood anything. Thank you!




-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China


Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

Posted by Joseph Bradley <jo...@databricks.com>.

Hi YiZhi Liu,

The spark.ml classes are part of the higher-level "Pipelines" API, which
works with DataFrames.  When creating this API, we decided to separate it
from the old API to avoid confusion.  You can read more about it here:
http://spark.apache.org/docs/latest/ml-guide.html

For (3): We use Breeze, but we have to modify it in order to do distributed
optimization based on Spark.
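
One way to picture point (3), as a rough sketch and not the actual Spark or
Breeze code, is to keep the optimizer loop on the driver and evaluate the
loss and gradient as a sum over data partitions. Everything named below
(`partition_cost`, `distributed_cost`, `minimize`, the toy data) is
hypothetical, the partitions are ordinary Python lists, and plain gradient
descent stands in for LBFGS.

```python
import math

def partition_cost(part, w, b):
    """Sum of log-loss and of its gradient over one partition.

    Stands in for the work a single executor would do on its slice of data.
    """
    loss = gw = gb = 0.0
    for x, y in part:
        z = w * x + b
        # log(1 + exp(z)) - y*z, written so that exp cannot overflow.
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
        # sigmoid(z) - y, with sigmoid expressed through tanh for stability.
        err = 0.5 * (1.0 + math.tanh(0.5 * z)) - y
        gw += err * x
        gb += err
    return loss, gw, gb, len(part)

def distributed_cost(partitions, w, b):
    """Combine per-partition sums into the mean loss and mean gradient.

    The list comprehension is where a cluster would fan out; the sums are
    the combine step on the driver.
    """
    results = [partition_cost(p, w, b) for p in partitions]
    n = sum(r[3] for r in results)
    loss = sum(r[0] for r in results) / n
    gw = sum(r[1] for r in results) / n
    gb = sum(r[2] for r in results) / n
    return loss, (gw, gb)

def minimize(cost, steps=500, lr=0.5):
    """Driver-side loop. Plain gradient descent stands in for LBFGS."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        _, (gw, gb) = cost(w, b)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy (feature, label) pairs split into three "partitions".
data = [(-2, 0), (-1, 0), (0, 1), (1, 0), (2, 1), (3, 1)]
partitions = [data[:2], data[2:4], data[4:]]

def cost(w, b):
    return distributed_cost(partitions, w, b)

initial_loss, _ = cost(0.0, 0.0)
w, b = minimize(cost)
final_loss, _ = cost(w, b)
print("loss: %.4f -> %.4f at w=%.4f, b=%.4f" % (initial_loss, final_loss, w, b))
```

The optimizer only ever sees a function from parameters to (loss, gradient),
so how the data is split across partitions does not change the result beyond
floating-point rounding.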

Joseph

