Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2017/05/15 09:10:11 UTC
[GitHub] incubator-hivemall pull request #79: [WIP][HIVEMALL-101] Separate optimizer ...
GitHub user takuti opened a pull request:
https://github.com/apache/incubator-hivemall/pull/79
[WIP][HIVEMALL-101] Separate optimizer implementation
## What changes were proposed in this pull request?
Finalize #14
## What type of PR is it?
Improvement, Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-101
## How was this patch tested?
Unit test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/takuti/incubator-hivemall HIVEMALL-101
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/79.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #79
----
commit b1535414146ce6711d8abc4892fd1d66b8dde342
Author: Takeshi YAMAMURO <li...@gmail.com>
Date: 2016-05-02T14:43:42Z
Add optimizer implementations
commit c2ca02aecf84cd678dc4f8729e1e01b1386826ea
Author: Takeshi YAMAMURO <li...@gmail.com>
Date: 2016-09-20T16:52:22Z
Revert some modifications
commit d36ea05a3fc22011c0932edd7b8b3c214b4bcf65
Author: myui <my...@apache.org>
Date: 2017-01-16T11:20:42Z
Updated license headers
commit 06404280b05ded0d947070ec847136ab898f3966
Author: myui <my...@apache.org>
Date: 2017-01-16T11:35:00Z
Fixed imports
commit 547cda4880269b28af4c60a869409d33599b748c
Author: myui <my...@apache.org>
Date: 2017-01-30T06:55:48Z
Add annotations
commit 2a523a72b571a694004eaa3b07355a3710427954
Author: myui <my...@apache.org>
Date: 2017-01-30T08:50:21Z
Refactored to support Optimizer
commit d14451cfeffd11b6a342d3a5eb878adbf812410f
Author: myui <my...@apache.org>
Date: 2017-02-08T05:44:57Z
Applied refactoring
commit fa1e8e5678f93a85c8d5fcaae01c4b3f5cf81f88
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-05-11T08:09:48Z
Fix build errors
commit 2d69bf5e64b374fe4d130da1c0726e0c661e1664
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-05-11T08:44:25Z
Remove unused import
commit c73695a70aa66afa1fd4d04115e943cf6c1c0b32
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-05-15T08:39:23Z
Fix OptimizerOptions
* Order of short/long option names
* Parsed option handling
commit 3e13b36c783f825590d0247c6c572cccfec4a9b2
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-05-15T08:48:27Z
Make loss function configureable in generic classifier/regressor
commit 9b26a22719a6d187cb4352e270ef26812de3f128
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-05-15T08:49:21Z
Add some messages to the LossFunction classes
commit 00af6a6ae66aeff32bfc275eef54a0f9f6c191d0
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-05-15T08:50:15Z
Add generic classifier/regressor UDTF test
commit 791764c9f09c77835b383ad85eaf6131a72b7ac0
Author: Takuya Kitazawa <k....@gmail.com>
Date: 2017-05-15T09:07:39Z
Wrap IllegalArgumentException in generic classifier/regressor UDTFs
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11579905/badge)](https://coveralls.io/builds/11579905)
Changes Unknown when pulling **c57d09ee89128f20406ad34482fa7d1a4c8ffc3f on takuti:HIVEMALL-101** into ** on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@takuti `-iter` support should be another ticket. `-minibatch` support can be within this ticket.
Functional tests that compare the accuracy of `-loss logistic` against the existing `logress` are required.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
^^^ Since the generic regressor, like sklearn, does not accept classification losses (e.g. logloss), I am leaving `checkTargetValue()` removed from the `GeneralRegression` class.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11625112/badge)](https://coveralls.io/builds/11625112)
Coverage increased (+0.7%) to 39.424% when pulling **c3b89f8a671a1ccf7a0c19e9f061d61c6e0c2807 on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall pull request #79: [WIP][HIVEMALL-101] Separate optimizer ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/79#discussion_r116510491
--- Diff: core/src/main/java/hivemall/regression/GeneralRegressionUDTF.java ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.regression;
+
+import hivemall.annotations.Since;
+import hivemall.model.FeatureValue;
+import hivemall.optimizer.LossFunctions;
+import hivemall.optimizer.LossFunctions.LossFunction;
+import hivemall.optimizer.Optimizer;
+import hivemall.optimizer.OptimizerOptions;
+
+import java.util.Map;
+
+import javax.annotation.Nonnull;
+
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+
+/**
+ * A general regression class with replaceable optimization functions.
+ */
+@Description(name = "train_regression",
+ value = "_FUNC_(list<string|int|bigint> features, double label [, const string options])"
+ + " - Returns a relation consists of <string|int|bigint feature, float weight>",
+ extended = "Build a prediction model by a generic regressor")
+@Since(version = "0.5-rc.1")
+public final class GeneralRegressionUDTF extends RegressionBaseUDTF {
+
+ @Nonnull
+ private final Map<String, String> optimizerOptions;
+ private Optimizer optimizer;
+ private LossFunction lossFunction;
+
+ public GeneralRegressionUDTF() {
+ super(true); // This enables new model interfaces
+ this.optimizerOptions = OptimizerOptions.create();
+ }
+
+ @Override
+ public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
+ if (argOIs.length != 2 && argOIs.length != 3) {
+ throw new UDFArgumentException(this.getClass().getSimpleName()
+ + " takes 2 or 3 arguments: List<Text|Int|BitInt> features, float target "
+ + "[, constant string options]");
+ }
+
+ StructObjectInspector outputOI = super.initialize(argOIs);
+
+ if (lossFunction.forBinaryClassification()) {
+ throw new UDFArgumentException("The loss function `" + lossFunction + "` is not for regression");
+ }
+ if (is_mini_batch) {
+ throw new UDFArgumentException("_FUNC_ does not currently support `-mini_batch` option");
+ }
+
+ try {
+ this.optimizer = createOptimizer(optimizerOptions);
+ } catch (Throwable e) {
+ throw new UDFArgumentException(e.getMessage());
+ }
+
+ return outputOI;
+ }
+
+ @Override
+ protected Options getOptions() {
+ Options opts = super.getOptions();
+ opts.addOption("loss", "loss_function", true,
+ "Loss function [default: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss]");
+ OptimizerOptions.setup(opts);
+ return opts;
+ }
+
+ @Override
+ protected CommandLine processOptions(ObjectInspector[] argOIs) throws UDFArgumentException {
+ CommandLine cl = super.processOptions(argOIs);
+ try {
+ if (cl.hasOption("loss_function")) {
+ this.lossFunction = LossFunctions.getLossFunction(cl.getOptionValue("loss_function"));
+ } else {
+ this.lossFunction = LossFunctions.getLossFunction("SquaredLoss");
+ }
+ } catch (Throwable e) {
+ throw new UDFArgumentException(e.getMessage());
+ }
+ OptimizerOptions.propcessOptions(cl, optimizerOptions);
+ return cl;
+ }
+
+ @Override
+ protected final void checkTargetValue(final float target) throws UDFArgumentException {
--- End diff --
@takuti Maybe it was for logistic regression, which is actually a classifier taking 0/1 values. @maropu is not an expert in machine learning algorithms.
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@maropu @takuti Finally merged this huge patch. Thank you for your contributions!
---
[GitHub] incubator-hivemall pull request #79: [WIP][HIVEMALL-101] Separate optimizer ...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/79#discussion_r116588229
--- Diff: core/src/main/java/hivemall/regression/GeneralRegressionUDTF.java ---
@@ -0,0 +1,131 @@
+ @Override
+ protected final void checkTargetValue(final float target) throws UDFArgumentException {
--- End diff --
@myui Ah, it makes sense since originally the generic regressor used `LossFunctions.logisticLoss(target, predicted);`. Thanks!
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11896899/badge)](https://coveralls.io/builds/11896899)
Coverage increased (+0.8%) to 40.283% when pulling **5439bd80face5ef2f69650244ea8c9f0f13bed1b on takuti:HIVEMALL-101** into **1db5358767bb30a8c433e4530c39d8591bc28a36 on apache:master**.
---
[GitHub] incubator-hivemall pull request #79: [HIVEMALL-101] Separate optimizer imple...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/incubator-hivemall/pull/79
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11662166/badge)](https://coveralls.io/builds/11662166)
Coverage increased (+0.5%) to 39.184% when pulling **689bdbf77c985117c2064d4a042d7d45f2971165 on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
I tested the generic classifier and regressor on EMR using the a9a dataset.
### Classifier
```
set hivevar:n_samples=16281;
set hivevar:total_steps=32562;
```
#### `logress`
```sql
drop table if exists logress_model;
create table logress_model as
select
feature,
avg(weight) as weight
from
(
select
logress(add_bias(features), label, '-total_steps ${total_steps}') as (feature, weight)
-- logress(add_bias(features), label, '-total_steps ${total_steps} -mini_batch 10') as (feature, weight)
from
train_x3
) t
group by feature;
```
```sql
WITH test_exploded as (
select
rowid,
label,
extract_feature(feature) as feature,
extract_weight(feature) as value
from
test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
select
t.rowid,
sigmoid(sum(m.weight * t.value)) as prob,
CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
from
test_exploded t LEFT OUTER JOIN
logress_model m ON (t.feature = m.feature)
group by
t.rowid
),
submit as (
select
t.label as actual,
pd.label as predicted,
pd.prob as probability
from
test t JOIN predict pd
on (t.rowid = pd.rowid)
)
select count(1) / ${n_samples} from submit
where actual = predicted;
```
#### `train_classifier`
```sql
train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}') as (feature, weight)
-- train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps} -mini_batch 10') as (feature, weight)
```
The results were exactly the same:
| | online | mini-batch |
|:--|:--:|:--:|
|`logress`| 0.8414716540753026 | 0.848965051286776 |
|`train_classifier`| 0.8414716540753026 | 0.848965051286776 |
### Regression
I solved the a9a label prediction as a regression problem.
Note: since the non-generic Adagrad was designed for logistic loss (i.e. classification), we cannot compare it with the generic regressor under exactly the same conditions.
#### `train_adagrad_regr` (internally uses logistic loss)
```sql
drop table if exists adagrad_model;
create table adagrad_model as
select
feature,
avg(weight) as weight
from
(
select
train_adagrad_regr(features, label) as (feature, weight)
from
train_x3
) t
group by feature;
```
```sql
WITH test_exploded as (
select
rowid,
label,
extract_feature(feature) as feature,
extract_weight(feature) as value
from
test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
select
t.rowid,
sigmoid(sum(m.weight * t.value)) as prob
from
test_exploded t LEFT OUTER JOIN
adagrad_model m ON (t.feature = m.feature)
group by
t.rowid
),
submit as (
select
t.label as actual,
pd.prob as probability
from
test t JOIN predict pd
on (t.rowid = pd.rowid)
)
select rmse(probability, actual) from submit;
```
#### `train_regression`
```sql
train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no') as (feature, weight)
-- train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no -mini_batch 10') as (feature, weight)
```
| | online | mini-batch |
|:--|:--:|:--:|
|`train_adagrad_regr` (logistic loss) | 0.3254586866367811 | -- |
|`train_regression` (squared loss) | 0.3356422627079689 | 0.3348889704327727 |
As I mentioned in the last comment, I was unsure whether the `-mini_batch` option works correctly for Adagrad. Fortunately, in this example the option slightly improved prediction accuracy in terms of RMSE (0.3356 online vs. 0.3349 with mini-batch).
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
I added support for the `-mini_batch` option to the [regressor](https://github.com/takuti/incubator-hivemall/blob/c3b89f8a671a1ccf7a0c19e9f061d61c6e0c2807/core/src/main/java/hivemall/regression/GeneralRegressionUDTF.java#L121-L181) and [classifier](https://github.com/takuti/incubator-hivemall/blob/c3b89f8a671a1ccf7a0c19e9f061d61c6e0c2807/core/src/main/java/hivemall/classifier/GeneralClassifierUDTF.java#L122-L182) (same code).
The idea is simply to accumulate the `new_weight` obtained from `optimizer.update()`. Once `miniBatchSize` samples have been observed, the mean of the accumulated `new_weight` values is set on the model via `model.setWeight`.
For SGD, this is clearly equivalent to [what RegressionBaseUDTF does](https://github.com/takuti/incubator-hivemall/blob/5dc6f4eb5a8d8532201f6706673e2381d47d7e70/core/src/main/java/hivemall/regression/RegressionBaseUDTF.java#L247-L251). However, I'm not sure whether the same approach is valid for Adagrad, Adam, Adadelta and AdagradRDA. (Currently, it is allowed for Adagrad, Adam and Adadelta. By contrast, AdagradRDA with the `-mini_batch` option is not supported.)
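For readers following the thread, the accumulate-then-average idea described above might look roughly like the sketch below. This is not the code from the PR: `MiniBatchAverager`, `accumulate`, and `endOfSample` are hypothetical names, the model is stood in for by a plain `Map`, and averaging each feature over the number of times it was seen in the batch is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Hivemall): averages the weights proposed within one mini-batch. */
final class MiniBatchAverager {
    private final int miniBatchSize;
    // feature -> {sum of proposed weights, number of proposals}
    private final Map<String, float[]> sums = new HashMap<>();
    private int observed = 0;

    MiniBatchAverager(int miniBatchSize) {
        if (miniBatchSize < 1) {
            throw new IllegalArgumentException("miniBatchSize must be >= 1");
        }
        this.miniBatchSize = miniBatchSize;
    }

    /** Records the weight an optimizer proposed for one feature of the current sample. */
    void accumulate(String feature, float newWeight) {
        float[] acc = sums.computeIfAbsent(feature, k -> new float[2]);
        acc[0] += newWeight;
        acc[1] += 1f;
    }

    /**
     * Call once per training sample. When the batch is full, writes the mean
     * proposed weight of every touched feature into the model and resets.
     *
     * @return true if the model was updated by this call
     */
    boolean endOfSample(Map<String, Float> model) {
        if (++observed < miniBatchSize) {
            return false; // the model stays frozen until the batch is complete
        }
        for (Map.Entry<String, float[]> e : sums.entrySet()) {
            float[] acc = e.getValue();
            model.put(e.getKey(), acc[0] / acc[1]);
        }
        sums.clear();
        observed = 0;
        return true;
    }
}
```

Because the model is left untouched while a batch is being collected, every proposal in the batch starts from the same weight `w`. For plain SGD each proposal is `w - eta * g_i`, so their mean is `w - eta * mean(g_i)`, which is the usual averaged-gradient mini-batch step. Optimizers that keep per-feature state (Adagrad, Adam, Adadelta) update that state on every call, which is why the equivalence is less obvious for them.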
BTW, practically, I observed that the naive Adagrad + `-mini_batch` implementation seems to work correctly as shown in the next comment:
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@takuti It would be better to have this kind of documentation:
http://spark.apache.org/docs/latest/mllib-optimization.html
http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation
BTW, FYI: see [1,2] for how Spark and scikit-learn incorporate regularized updates.
[1] https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/optimization/Updater.scala
[2] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/sgd_fast.pyx#L632
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@takuti It's preferred to have an abstract class. Please create it.
- hivemall.LearnerBase
- hivemall.GeneralLearnerBase
- hivemall.classifier.GeneralClassifier
- hivemall.regression.GeneralRegression
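As a rough illustration of what such an abstract class could buy, the sketch below pulls the shared option handling into a base class and leaves only the loss default and the target check to the subclasses. All class and method names here are hypothetical and do not mirror the merged Hivemall code; `SquaredLoss` as the regression default comes from the diff quoted in this thread, while `LogLoss` as the classifier default and the 0/1 label check are assumptions chosen for the example.

```java
/** Hypothetical shared base: logic that both general learners would otherwise duplicate. */
abstract class GeneralLearnerSketch {
    /** Loss used when no `-loss` option is given. */
    abstract String defaultLossName();

    /** Validates the label domain; throws IllegalArgumentException on a bad target. */
    abstract void checkTargetValue(float target);

    /** Shared option handling: fall back to the subclass default. */
    final String resolveLoss(String requested) {
        return (requested == null) ? defaultLossName() : requested;
    }
}

/** Classifier side: labels are restricted to 0/1 (assumption for this sketch). */
final class ClassifierSketch extends GeneralLearnerSketch {
    @Override
    String defaultLossName() {
        return "LogLoss"; // assumed default, for illustration only
    }

    @Override
    void checkTargetValue(float target) {
        if (target != 0f && target != 1f) {
            throw new IllegalArgumentException("label must be 0 or 1: " + target);
        }
    }
}

/** Regressor side: any real-valued target is accepted. */
final class RegressorSketch extends GeneralLearnerSketch {
    @Override
    String defaultLossName() {
        return "SquaredLoss"; // default used in the diff quoted in this thread
    }

    @Override
    void checkTargetValue(float target) {
        // no restriction: a regressor simply predicts real values
    }
}
```

With this split, the disagreement about `checkTargetValue()` elsewhere in the thread becomes a per-subclass decision rather than something inherited from a base designed around logistic loss.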
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@takuti I guess there are no mix-server-related issues in this PR. I will review for that, though.
---
[GitHub] incubator-hivemall pull request #79: [WIP][HIVEMALL-101] Separate optimizer ...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/79#discussion_r116445570
--- Diff: core/src/main/java/hivemall/regression/GeneralRegressionUDTF.java ---
@@ -0,0 +1,131 @@
+ @Override
+ protected final void checkTargetValue(final float target) throws UDFArgumentException {
--- End diff --
@maropu This is a regressor which simply predicts real values. Why did you create this method? Values only in [0,1] are allowed...?
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@myui Finished~
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@takuti `train()` can return the current loss, and the cumulative loss should be tracked so that iterations can be supported in the future, e.g., using
https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/common/ConversionState.java
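A minimal sketch of the idea in this comment, assuming each `train()` call returns the loss for one example and a separate object accumulates it per iteration. The class `CumulativeLossTracker`, its methods, and the relative-change convergence test are hypothetical illustrations; they are not the API of the linked `ConversionState` class.

```java
// Hypothetical sketch: names are illustrative and are not taken from
// Hivemall's ConversionState.
final class CumulativeLossTracker {
    private final double convergenceRate;
    private double currentLoss = 0.0;   // loss accumulated in the running iteration
    private double previousLoss = -1.0; // total loss of the last finished iteration (-1: none yet)
    private int iterations = 0;

    CumulativeLossTracker(double convergenceRate) {
        this.convergenceRate = convergenceRate;
    }

    /** Squared loss for one example, standing in for the value train() would return. */
    static double squaredLoss(double predicted, double target) {
        double d = predicted - target;
        return 0.5 * d * d;
    }

    /** Adds the loss of one training example to the running iteration. */
    void incrLoss(double loss) {
        currentLoss += loss;
    }

    /**
     * Closes the running iteration. Returns true when the relative change in
     * cumulative loss against the previous iteration is below the configured rate.
     */
    boolean endIteration() {
        iterations++;
        boolean converged = false;
        if (previousLoss > 0.0) {
            double change = Math.abs(previousLoss - currentLoss) / previousLoss;
            converged = change < convergenceRate;
        }
        previousLoss = currentLoss;
        currentLoss = 0.0;
        return converged;
    }

    double getPreviousLoss() {
        return previousLoss;
    }

    int getIterations() {
        return iterations;
    }
}
```

In this sketch the caller would invoke `incrLoss(...)` once per example and `endIteration()` once per pass over the data, stopping early when it returns true.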
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@myui This is basically done. Could you review it when you get a chance?
One thing I'd like to discuss here is that `GeneralClassifierUDTF` and `GeneralRegressionUDTF` currently have a lot of duplicated code. Thus, the current class structure
- Learner Base
  - Binary Online Classifier
    - General Classifier
  - Regression Base
    - General Regression
can be modified to
- Learner Base
  - General Predictor Base
    - General Classifier
    - General Regression
for example.
If that sounds good to you, @myui, I will do so. Of course, it's not mandatory, so keeping the current duplicated code is fine too.
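The proposed structure could look roughly like the sketch below, where a shared base class owns the weight update and the two subclasses only supply what differs between them. All class names, the dense `float[]` model, and the plain SGD update are hypothetical simplifications for illustration, not the actual Hivemall classes.

```java
// Hypothetical sketch of a shared "General Predictor Base". Names and the
// plain SGD update are illustrative only.
abstract class GeneralPredictorBaseSketch {
    private final float[] weights;
    private final float eta; // learning rate

    GeneralPredictorBaseSketch(int dims, float eta) {
        this.weights = new float[dims];
        this.eta = eta;
    }

    /** Subclass hook: rejects targets that the learner cannot handle. */
    abstract void checkTargetValue(float target);

    /** Subclass hook: derivative of the loss with respect to the prediction. */
    abstract float dloss(float predicted, float target);

    final float predict(float[] x) {
        float sum = 0.f;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    /** Shared training step, written once instead of once per subclass. */
    final void train(float[] x, float target) {
        checkTargetValue(target);
        float gradient = dloss(predict(x), target);
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= eta * gradient * x[i];
        }
    }
}

final class GeneralClassifierSketch extends GeneralPredictorBaseSketch {
    GeneralClassifierSketch(int dims, float eta) {
        super(dims, eta);
    }

    @Override
    void checkTargetValue(float target) {
        // Assumption for this sketch: binary labels are encoded as -1 or 1.
        if (target != 1.f && target != -1.f) {
            throw new IllegalArgumentException("label must be -1 or 1: " + target);
        }
    }

    @Override
    float dloss(float predicted, float target) {
        // Derivative of log(1 + exp(-y * p)) with respect to p.
        return (float) (-target / (1.0 + Math.exp(target * predicted)));
    }
}

final class GeneralRegressionSketch extends GeneralPredictorBaseSketch {
    GeneralRegressionSketch(int dims, float eta) {
        super(dims, eta);
    }

    @Override
    void checkTargetValue(float target) {
        // Squared loss accepts any real-valued target, so nothing to check.
    }

    @Override
    float dloss(float predicted, float target) {
        // Derivative of 0.5 * (p - y)^2 with respect to p.
        return predicted - target;
    }
}
```

The design choice illustrated here is a template method: option handling and the update loop would live in the base class, while the classifier and the regressor differ only in target validation and loss.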
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11627493/badge)](https://coveralls.io/builds/11627493)
Coverage increased (+0.7%) to 39.422% when pulling **f98bc73c89610f4b1a489c6b810752d843f9d7cc on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11661000/badge)](https://coveralls.io/builds/11661000)
Coverage increased (+0.7%) to 39.422% when pulling **2724dbcc97218ae6237f5ff675027ad24f9501bb on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11594767/badge)](https://coveralls.io/builds/11594767)
Coverage increased (+0.3%) to 38.968% when pulling **0f268943082be62e56c2acc86a02232b901081dd on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11624842/badge)](https://coveralls.io/builds/11624842)
Coverage increased (+0.7%) to 39.438% when pulling **c3b89f8a671a1ccf7a0c19e9f061d61c6e0c2807 on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11596234/badge)](https://coveralls.io/builds/11596234)
Coverage increased (+0.6%) to 39.251% when pulling **34cf8a1a7f2daa86fe3f9116a28b4497a74c2c3b on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [HIVEMALL-101] Separate optimizer implementati...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@takuti well done :+1: will review.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11645966/badge)](https://coveralls.io/builds/11645966)
Coverage increased (+1.07%) to 39.767% when pulling **0d573a0cbdb66d2b3d2cf49abd0ab61eb2bda76a on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
@takuti `checkTargetValue()` is needed for some loss functions, e.g., for logistic loss.
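One way to read this point is that the valid target range depends on the chosen loss, so the check can be delegated to the loss function itself. The sketch below assumes that logistic loss in a regression setting expects targets in [0,1] (the range asked about in the review question earlier in this thread) and that squared loss accepts any real value. The enum `LossSketch` and its methods are hypothetical and are not Hivemall's `LossFunctions` API.

```java
// Hypothetical sketch: each loss declares which targets it accepts, and
// checkTargetValue() simply delegates to the selected loss.
enum LossSketch {
    SQUARED {
        @Override
        boolean acceptsTarget(float target) {
            return !Float.isNaN(target); // any real value is a valid target
        }

        @Override
        double loss(double predicted, double target) {
            double d = predicted - target;
            return 0.5 * d * d;
        }
    },
    LOGISTIC {
        @Override
        boolean acceptsTarget(float target) {
            return target >= 0.f && target <= 1.f; // assumed range: [0,1]
        }

        @Override
        double loss(double predicted, double target) {
            // Cross-entropy between the target and sigmoid(predicted).
            double s = 1.0 / (1.0 + Math.exp(-predicted));
            return -(target * Math.log(s) + (1.0 - target) * Math.log(1.0 - s));
        }
    };

    abstract boolean acceptsTarget(float target);

    abstract double loss(double predicted, double target);

    /** Throws if the target is outside the range this loss can handle. */
    void checkTargetValue(float target) {
        if (!acceptsTarget(target)) {
            throw new IllegalArgumentException(
                "target " + target + " is not valid for " + name());
        }
    }
}
```

With this arrangement a regressor that only offers squared loss never rejects a target, while enabling a logistic option later would bring the range check back without changing the regressor.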
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11597582/badge)](https://coveralls.io/builds/11597582)
Coverage increased (+0.4%) to 39.132% when pulling **5dc6f4eb5a8d8532201f6706673e2381d47d7e70 on takuti:HIVEMALL-101** into **10e7d450fa8257efc5d614957fda514b2b91fdee on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
[![Coverage Status](https://coveralls.io/builds/11540991/badge)](https://coveralls.io/builds/11540991)
Coverage increased (+0.6%) to 39.27% when pulling **2b965fc1d1ef01a704690b920b59f71dc4d6a3d5 on takuti:HIVEMALL-101** into **68f6b465248117d085a9cdb7b532837b14e054c5 on apache:master**.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
I listed the TODOs in the top comment. If there is anything else I need to take care of, please let me know.
---
[GitHub] incubator-hivemall issue #79: [WIP][HIVEMALL-101] Separate optimizer impleme...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/79
Yep, that's why logistic loss is not selectable for now. `checkTargetValue()` will come back later.
---