Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/08/02 02:01:01 UTC

[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1727

    [SPARK-2478] [mllib] DecisionTree Python API

    Added experimental Python API for Decision Trees.
    
    API:
    * class DecisionTreeModel
    ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
    ** numNodes()
    ** depth()
    ** __str__()
    * class DecisionTree
    ** trainClassifier()
    ** trainRegressor()
    ** train()
    
    Examples and testing:
    * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
    * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
    
    Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
    
    CC @mengxr @manishamde
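
The `_linear_predictor_typecheck` fix mentioned in the description comes down to how `type()` and `isinstance()` treat subclasses. The sketch below illustrates the difference with stand-in classes so that it runs without Spark; `RDD` and `PipelinedRDD` here are hypothetical placeholders, not the real pyspark classes, and the two helper functions are not part of `_common.py`.

```python
class RDD(object):
    """Stand-in for pyspark.RDD so the sketch runs without Spark."""


class PipelinedRDD(RDD):
    """Stand-in for an RDD subclass, e.g. the result of a transformation."""


def is_rdd_by_type(x):
    # Exact type comparison: an instance of a subclass is rejected.
    return type(x) == RDD


def is_rdd_by_isinstance(x):
    # isinstance() accepts RDD and every subclass of RDD.
    return isinstance(x, RDD)


mapped = PipelinedRDD()
print(is_rdd_by_type(mapped))        # False
print(is_rdd_by_isinstance(mapped))  # True
```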

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark decisiontree-python-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1727
    
----
commit f8253520045d90c75b143d810edbb746f86cad8c
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-30T21:48:41Z

    Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.

commit 5f920a10b6114baa0744f55843969843b1f2babc
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-30T22:24:55Z

    Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.

commit 73fbea2b2a921111cf22f4d9c76ea23c6a4f7afe
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-30T22:52:22Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit 2283df878178d3b8c86ecde1d4220076af25b72f
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-30T22:53:14Z

    2 bug fixes.
    
    Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).
    
    * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
    ** featureValue was from arr (so it was a feature value)
    ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
    * The rest of the code indexed agg as (node, feature, binIndex, label).
    * Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.
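
A minimal sketch of why the two indexing patterns described above must agree, assuming the aggregates live in a flat array laid out in row-major order. The function name, the sizes, and the layout are illustrative assumptions and are not taken from the Spark source.

```python
def agg_index(node, feature, bin_index, label,
              num_features, num_bins, num_classes):
    """Row-major offset for a flat array indexed as (node, feature, binIndex, label)."""
    return ((node * num_features + feature) * num_bins + bin_index) * num_classes + label


# Illustrative sizes only.
NUM_FEATURES, NUM_BINS, NUM_CLASSES = 2, 4, 3

# Writer and reader agree on (node, feature, binIndex, label).
consistent = agg_index(0, 1, 2, 1, NUM_FEATURES, NUM_BINS, NUM_CLASSES)

# A writer that passes (node, feature, featureValue, binIndex) instead
# puts the feature value in the bin slot and the bin index in the label slot,
# so it lands in a different cell than the reader expects.
feature_value = 1
mismatched = agg_index(0, 1, feature_value, 2, NUM_FEATURES, NUM_BINS, NUM_CLASSES)

print(consistent, mismatched)  # 19 17
```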
    
    Unit tests in DecisionTreeSuite
    * Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
    * Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.
    
    Bug fix: calculateGainForSplit (for classification):
    * It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.
    
    Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).
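
As an illustration of an impurity that tolerates count = 0, here is a sketch of Gini impurity over per-class counts. Returning 0 for an empty node is an assumption made for this example; the commit message does not say which value the updated impurities return.

```python
def gini(class_counts):
    """Gini impurity 1 - sum(p_i^2) computed from per-class example counts."""
    total = sum(class_counts)
    if total == 0:
        # Assumed convention: a child that received no examples has impurity 0.
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


print(gini([5, 5]))     # 0.5
print(gini([10, 0]))    # 0.0
print(gini([0, 0, 0]))  # 0.0, with no division by zero
```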
    
    Small updates to documentation and coding style.

commit 5fe44ed10450a3fbe407f5326da7391569003a78
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-30T23:07:46Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 8a758dbb18edf6efe8521598ab8da41736908841
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-30T23:08:48Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 8ea8750cd5eeefa87d937ca4214a5f548dd2e6a4
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T00:05:49Z

    Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
    
    * Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
    
    * Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.  This can cause problems for small datasets, when the number of training examples equals numBins.
    
    * Fix: The threshold is set to be the average of 2 consecutive (sorted) examples’ feature values.  E.g.: If the old code set the threshold using example i, the new code sets the threshold using examples i and i+1.
    
    * Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.
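
The midpoint rule from the fix above can be sketched as follows. This is a simplified illustration that returns a midpoint for every consecutive pair of sorted values; it does not reproduce how DecisionTree.findSplitsBins samples examples or limits the number of thresholds to numBins, and the function name is hypothetical.

```python
def midpoint_thresholds(values):
    """Candidate thresholds: averages of consecutive sorted feature values."""
    ordered = sorted(values)
    return [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]


print(midpoint_thresholds([3.0, 1.0, 2.0]))  # [1.5, 2.5]
```

When the feature values are distinct, every midpoint has at least one example on each side, whereas a threshold placed exactly on the largest value would leave one side empty under a `value <= threshold` rule (that rule is assumed here for illustration).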

commit cd1d933a3d686107a7a8272b7138b701a820a877
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T00:06:39Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 8e227ea826d6b38dc47e9a90ccf6683348c6dab0
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T00:18:55Z

    Changed Strategy so it only requires numClassesForClassification >= 2 for classification

commit da50db749f54a63565440d6c42f78373f1f2a2ac
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T00:32:10Z

    Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.

commit f5a036c4eff3499f5456c441572ffb11514385c9
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T00:33:28Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 52e17c5b249afa10eb151e73ca36a72b4e6adbe8
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T16:24:21Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit 59750f87c974299720ec556908c7e29b131d3476
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T18:08:46Z

    * Updated Strategy to check numClassesForClassification only if algo=Classification.
    * Updates based on comments:
    ** DecisionTreeRunner
    *** Made dataFormat arg default to libsvm
    ** Small cleanups
    ** tree.Node: Made recursive helper methods private, and renamed them.

commit bab3f190c51a8feced2bdb7d146072fcfa8cab72
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T18:10:55Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit e06e423d7b046ae7e38001325ad7330a15179472
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T18:11:27Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 376dca2c848739b1536e6ee8ddbc55043d1eef7a
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T18:27:18Z

    Updated meaning of maxDepth by 1 to fit scikit-learn and rpart.
    * In code, replaced usages of maxDepth <-- maxDepth + 1
    * In params, replace settings of maxDepth <-- maxDepth - 1
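
A small sketch of the shifted maxDepth meaning, assuming the post-change convention is that depth 0 denotes a single leaf node and depth 1 denotes one internal node with two leaves. Both helper names are hypothetical; the conversion simply mirrors the "maxDepth - 1" rule stated in the commit message.

```python
def old_to_new_max_depth(old_max_depth):
    """Translate a maxDepth setting from the old meaning to the new one."""
    return old_max_depth - 1


def max_nodes(max_depth):
    """Upper bound on nodes in a binary tree when depth 0 is a single leaf."""
    return 2 ** (max_depth + 1) - 1


print(old_to_new_max_depth(1))  # 0
print(max_nodes(0))             # 1 (a single leaf)
print(max_nodes(1))             # 3 (one internal node, two leaves)
```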

commit 6eed4822759377b241c8dd0adadf32102e01d472
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T18:39:00Z

    In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.

commit 978cfcf84cb0259c7f65738fd3ed70f78928951e
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T18:40:43Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit 8bb8aa06a4033277ddd117445783678af4ff3dfd
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T20:02:10Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit dab0b674b93c7ada8e9d8ac1fc364c0c9438785b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T20:08:46Z

    Added documentation for DecisionTree internals

commit 584449a23f4ce5705fad6d0e5e2bc9f55034bbe5
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T20:09:53Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 1b29c13d829aae78812b03835f309ae37e8d4084
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T20:10:02Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 2b20c6151bab8a2ee218b851f40d54133f9807a2
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-31T20:39:43Z

    Small doc and style updates

commit b8fac571dc4baa58b4c4c1473bb2969553270865
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T01:56:37Z

    Finished Python DecisionTree API and example but need to test a bit more.

commit 66222477e4f9cb8c3ce1877312efa501c11bcf84
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T01:56:45Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 188cb0d05f5002ddacf3363b3ca79c41584e69d2
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T01:56:53Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 665ba7822bde3cb8105efb31d22e0084265c92da
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T16:42:22Z

    Small updates towards Python DecisionTree API

commit 4562c08b5f08382f2e382d81f84c161966dc8315
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T16:42:57Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    
    Conflicts:
    	mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
    (no real conflict; merged by concatenating)

commit 6df89a9f1130430367b6c7f0daa23e1cdfdc9839
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T20:18:20Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 93953f16e16e4605cbfe8a9e3a26b372e69707ae
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T21:34:54Z

    Likely done with Python API.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50949260
  
    QA results for PR 1727:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
    class DecisionTreeModel(object):
    class DecisionTree(object):

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17721/consoleFull


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50955594
  
    QA tests have started for PR 1727. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17758/consoleFull


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725364
  
    --- Diff: python/pyspark/mllib/tree.py ---
    @@ -0,0 +1,219 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from py4j.java_collections import MapConverter
    +
    +from pyspark import SparkContext, RDD
    +from pyspark.mllib._common import \
    +    _get_unmangled_rdd, _get_unmangled_double_vector_rdd, _serialize_double_vector, \
    +    _deserialize_labeled_point, _get_unmangled_labeled_point_rdd, \
    +    _deserialize_double
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.serializers import NoOpSerializer
    +
    +class DecisionTreeModel(object):
    +    """
    +    A decision tree model for classification or regression.
    +
    +    WARNING: This is an experimental API.  It will probably be modified for Spark v1.2.
    +    """
    +
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def predict(self, x):
    +        """
    +        Predict the label of one or more examples.
    +        NOTE: This currently does NOT support batch prediction.
    +
    +        :param x:  Data point: feature vector, or a LabeledPoint (whose label is ignored).
    +        """
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if isinstance(x, RDD):
    +            # Bulk prediction
    +            if x.count() == 0:
    +                raise RuntimeError("DecisionTreeModel.predict(x) given empty RDD x.")
    +            elementType = type(x.take(1)[0])
    +            if elementType == LabeledPoint:
    +                x = x.map(lambda x: x.features)
    +            dataBytes = _get_unmangled_double_vector_rdd(x)
    --- End diff --
    
    We don't need to cache data for prediction because it only needs a single pass. If `_get_unmangled_double_vector_rdd` also does other special operations, we can add `cache=True` to its arguments.


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725151
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -19,6 +19,8 @@ package org.apache.spark.mllib.api.python
     
     import java.nio.{ByteBuffer, ByteOrder}
     
    +import scala.collection.JavaConversions._
    --- End diff --
    
    Importing `JavaConverters._` and using `.asScala` or `.asJava` explicitly is preferred.


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50972970
  
    LGTM. Merged into both master and branch-1.1. Thanks!!


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1727


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50956653
  
    QA results for PR 1727:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    class DecisionTreeModel(object):
    class DecisionTree(object):

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17758/consoleFull


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729989
  
    --- Diff: python/pyspark/mllib/tests.py ---
    @@ -256,9 +276,19 @@ def test_classification(self):
             self.assertTrue(nb_model.predict(features[2]) <= 0)
             self.assertTrue(nb_model.predict(features[3]) > 0)
     
    +        categoricalFeaturesInfo = {0: 3} # feature 0 has 3 categories
    +        dt_model = \
    +            DecisionTree.trainClassifier(rdd, numClasses=2,
    --- End diff --
    
    ditto: it may fit into the line above


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725329
  
    --- Diff: python/pyspark/mllib/tree.py ---
    @@ -0,0 +1,219 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from py4j.java_collections import MapConverter
    +
    +from pyspark import SparkContext, RDD
    +from pyspark.mllib._common import \
    +    _get_unmangled_rdd, _get_unmangled_double_vector_rdd, _serialize_double_vector, \
    +    _deserialize_labeled_point, _get_unmangled_labeled_point_rdd, \
    +    _deserialize_double
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.serializers import NoOpSerializer
    +
    +class DecisionTreeModel(object):
    +    """
    +    A decision tree model for classification or regression.
    +
    +    WARNING: This is an experimental API.  It will probably be modified for Spark v1.2.
    +    """
    +
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def predict(self, x):
    +        """
    +        Predict the label of one or more examples.
    +        NOTE: This currently does NOT support batch prediction.
    +
    +        :param x:  Data point: feature vector, or a LabeledPoint (whose label is ignored).
    +        """
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if isinstance(x, RDD):
    +            # Bulk prediction
    +            if x.count() == 0:
    +                raise RuntimeError("DecisionTreeModel.predict(x) given empty RDD x.")
    +            elementType = type(x.take(1)[0])
    +            if elementType == LabeledPoint:
    --- End diff --
    
    We don't support `predict(RDD[LabeledPoint])` in Scala/Python. This adds extra complexity to the API.


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729990
  
    --- Diff: python/run-tests ---
    @@ -71,6 +71,7 @@ run_test "pyspark/mllib/random.py"
     run_test "pyspark/mllib/recommendation.py"
     run_test "pyspark/mllib/regression.py"
     run_test "pyspark/mllib/tests.py"
    +run_test "pyspark/mllib/util.py"
    --- End diff --
    
    Thanks for adding it!


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50972891
  
    QA results for PR 1727:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    class DecisionTreeModel(object):
    class DecisionTree(object):

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17773/consoleFull


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729984
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +
    +def getMSE(dtModel, data):
    +    """
    +    Return mean squared error (MSE) of DecisionTreeModel on the given
    +    RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +
    +def reindexClassLabels(data):
    +    """
    +    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    +    If all labels in that range already appear at least once,
    +     then the returned RDD is the same one (without a mapping).
    +    Note: If a label simply does not appear in the data,
    +          the index will not include it.
    +          Be aware of this when reindexing subsampled data.
    +    :param data: RDD of LabeledPoint where labels are integer values
    +                 denoting labels for a classification problem.
    +    :return: Pair (reindexedData, origToNewLabels) where
    +             reindexedData is an RDD of LabeledPoint with labels in
    +              the range {0,...,numClasses-1}, and
    +             origToNewLabels is a dictionary mapping original labels
    +              to new labels.
    +    """
    +    # classCounts: class --> # examples in class
    +    classCounts = data.map(lambda x: x.label).countByValue()
    +    numExamples = sum(classCounts.values())
    +    sortedClasses = sorted(classCounts.keys())
    +    numClasses = len(classCounts)
    +    # origToNewLabels: class --> index in 0,...,numClasses-1
    +    if (numClasses < 2):
    +        print >> sys.stderr, \
    +            "Dataset for classification should have at least 2 classes." + \
    +            " The given dataset had only %d classes." % numClasses
    +        exit(-1)
    +    origToNewLabels = dict([(sortedClasses[i], i) for i in range(0,numClasses)])
    +
    +    print "numClasses = %d" % numClasses
    +    print "Per-class example fractions, counts:"
    +    print "Class\tFrac\tCount"
    +    for c in sortedClasses:
    +        frac = classCounts[c] / (numExamples + 0.0)
    +        print "%g\t%g\t%d" % (c, frac, classCounts[c])
    +
    +    if (sortedClasses[0] == 0 and sortedClasses[-1] == numClasses - 1):
    --- End diff --
    
    minor: Is it safe to assume that the input labels are all integers? It may be rare to have `{0, 0.5, 2}` but it may happen.
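
The concern in this comment can be reproduced on plain Python lists. The first function below mirrors the endpoint check from the quoted diff line; the second is one possible stricter check, offered only as a sketch and not as the change that was eventually made. Both function names are hypothetical.

```python
def looks_already_indexed(labels):
    """Endpoint check as in the diff: smallest label is 0, largest is numClasses - 1."""
    sorted_classes = sorted(set(labels))
    num_classes = len(sorted_classes)
    return sorted_classes[0] == 0 and sorted_classes[-1] == num_classes - 1


def is_already_indexed(labels):
    """Stricter check: the distinct labels are exactly 0, 1, ..., numClasses - 1."""
    sorted_classes = sorted(set(labels))
    return sorted_classes == list(range(len(sorted_classes)))


labels = [0, 0.5, 2]
print(looks_already_indexed(labels))  # True, although 0.5 is not a valid index
print(is_already_indexed(labels))     # False
```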


[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729986
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +
    +def getMSE(dtModel, data):
    +    """
    +    Return mean squared error (MSE) of DecisionTreeModel on the given
    +    RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +
    +def reindexClassLabels(data):
    +    """
    +    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    +    If all labels in that range already appear at least once,
    +     then the returned RDD is the same one (without a mapping).
    +    Note: If a label simply does not appear in the data,
    +          the index will not include it.
    +          Be aware of this when reindexing subsampled data.
    +    :param data: RDD of LabeledPoint where labels are integer values
    +                 denoting labels for a classification problem.
    +    :return: Pair (reindexedData, origToNewLabels) where
    +             reindexedData is an RDD of LabeledPoint with labels in
    +              the range {0,...,numClasses-1}, and
    +             origToNewLabels is a dictionary mapping original labels
    +              to new labels.
    +    """
    +    # classCounts: class --> # examples in class
    +    classCounts = data.map(lambda x: x.label).countByValue()
    +    numExamples = sum(classCounts.values())
    +    sortedClasses = sorted(classCounts.keys())
    +    numClasses = len(classCounts)
    +    # origToNewLabels: class --> index in 0,...,numClasses-1
    +    if (numClasses < 2):
    +        print >> sys.stderr, \
    +            "Dataset for classification should have at least 2 classes." + \
    +            " The given dataset had only %d classes." % numClasses
    +        exit(-1)
    +    origToNewLabels = dict([(sortedClasses[i], i) for i in range(0,numClasses)])
    +
    +    print "numClasses = %d" % numClasses
    +    print "Per-class example fractions, counts:"
    +    print "Class\tFrac\tCount"
    +    for c in sortedClasses:
    +        frac = classCounts[c] / (numExamples + 0.0)
    +        print "%g\t%g\t%d" % (c, frac, classCounts[c])
    +
    +    if (sortedClasses[0] == 0 and sortedClasses[-1] == numClasses - 1):
    +        return (data, origToNewLabels)
    +    else:
    +        reindexedData = \
    +            data.map(lambda x: LabeledPoint(origToNewLabels[x.label], x.features))
    +        return (reindexedData, origToNewLabels)
    +
    +
    +def usage():
    +    print >> sys.stderr, \
    +        "Usage: logistic_regression [libsvm format data filepath]\n" + \
    +        " Note: This only supports binary classification."
    +    exit(-1)
    --- End diff --
    
    ditto: `exit(1)`
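
A plausible reason for preferring `exit(1)`: on POSIX systems the exit status is truncated to 8 bits, so a script that calls `exit(-1)` is reported to its parent as status 255 rather than -1. The sketch below shows what a parent process sees; it assumes a POSIX platform (the value for -1 differs on Windows), and `exit_status_of` is a hypothetical helper.

```python
import subprocess
import sys


def exit_status_of(code):
    """Run a child Python process that exits with `code`; return the status the parent sees."""
    child_source = "import sys; sys.exit(%d)" % code
    return subprocess.call([sys.executable, "-c", child_source])


print(exit_status_of(1))   # the conventional "generic failure" status
print(exit_status_of(-1))  # on POSIX this is reported as 255, not -1
```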



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725584
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,92 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import sys, numpy
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +
    +
    +# Parse a line of text into an MLlib LabeledPoint object
    --- End diff --
    
    We can do that during QA. Let's focus on DT in this PR :)



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729981
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    --- End diff --
    
    minor: Do we want to check `data.count() == 0`?
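
One way such a guard could look, sketched in plain Python over a list of (prediction, label) pairs instead of an RDD. The name `accuracy_or_none` and the choice to return `None` for empty input are assumptions made for illustration; they are not part of the PR.

```python
def accuracy_or_none(pairs):
    """Fraction of (prediction, label) pairs that match, or None if there are no pairs."""
    pairs = list(pairs)
    if not pairs:
        # Without this guard, dividing by a count of 0.0 raises ZeroDivisionError.
        return None
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / float(len(pairs))
```

Returning `None` is only one option; raising a descriptive error or returning `float('nan')` would be equally reasonable, depending on what callers expect.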



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by jkbradley <gi...@git.apache.org>.

GitHub user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15730605
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +
    +def getMSE(dtModel, data):
    +    """
    +    Return mean squared error (MSE) of DecisionTreeModel on the given
    +    RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +
    +def reindexClassLabels(data):
    +    """
    +    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    +    If all labels in that range already appear at least once,
    +     then the returned RDD is the same one (without a mapping).
    +    Note: If a label simply does not appear in the data,
    +          the index will not include it.
    +          Be aware of this when reindexing subsampled data.
    +    :param data: RDD of LabeledPoint where labels are integer values
    +                 denoting labels for a classification problem.
    +    :return: Pair (reindexedData, origToNewLabels) where
    +             reindexedData is an RDD of LabeledPoint with labels in
    +              the range {0,...,numClasses-1}, and
    +             origToNewLabels is a dictionary mapping original labels
    +              to new labels.
    +    """
    +    # classCounts: class --> # examples in class
    +    classCounts = data.map(lambda x: x.label).countByValue()
    +    numExamples = sum(classCounts.values())
    +    sortedClasses = sorted(classCounts.keys())
    +    numClasses = len(classCounts)
    +    # origToNewLabels: class --> index in 0,...,numClasses-1
    +    if (numClasses < 2):
    +        print >> sys.stderr, \
    +            "Dataset for classification should have at least 2 classes." + \
    +            " The given dataset had only %d classes." % numClasses
    +        exit(-1)
    +    origToNewLabels = dict([(sortedClasses[i], i) for i in range(0,numClasses)])
    +
    +    print "numClasses = %d" % numClasses
    +    print "Per-class example fractions, counts:"
    +    print "Class\tFrac\tCount"
    +    for c in sortedClasses:
    +        frac = classCounts[c] / (numExamples + 0.0)
    +        print "%g\t%g\t%d" % (c, frac, classCounts[c])
    +
    +    if (sortedClasses[0] == 0 and sortedClasses[-1] == numClasses - 1):
    --- End diff --
    
    I don't think I do assume integer values.  This check is to see if we need to relabel for DecisionTree (which requires class labels to be in 0,...,numClasses-1).
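
If a stricter test for "already in 0,...,numClasses-1" were wanted, one option is to compare the sorted distinct labels against the full range instead of only the two endpoints. This is a sketch in plain Python; `needs_reindexing` is a hypothetical name, and it assumes numeric labels so that `1.0 == 1` holds.

```python
def needs_reindexing(labels):
    """True unless the distinct labels are exactly 0, 1, ..., numClasses-1."""
    sorted_classes = sorted(set(labels))
    return sorted_classes != list(range(len(sorted_classes)))
```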



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725504
  
    --- Diff: python/pyspark/mllib/tree.py ---
    @@ -0,0 +1,219 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from py4j.java_collections import MapConverter
    +
    +from pyspark import SparkContext, RDD
    +from pyspark.mllib._common import \
    +    _get_unmangled_rdd, _get_unmangled_double_vector_rdd, _serialize_double_vector, \
    +    _deserialize_labeled_point, _get_unmangled_labeled_point_rdd, \
    +    _deserialize_double
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.serializers import NoOpSerializer
    +
    +class DecisionTreeModel(object):
    +    """
    +    A decision tree model for classification or regression.
    +
    +    WARNING: This is an experimental API.  It will probably be modified for Spark v1.2.
    +    """
    +
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def predict(self, x):
    +        """
    +        Predict the label of one or more examples.
    +        NOTE: This currently does NOT support batch prediction.
    --- End diff --
    
    Is batch prediction working?



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by jkbradley <gi...@git.apache.org>.

GitHub user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725574
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,92 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import sys, numpy
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +
    +
    +# Parse a line of text into an MLlib LabeledPoint object
    --- End diff --
    
    I got this from logistic_regression.py. I'll update it everywhere I find it.



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725100
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,92 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import sys, numpy
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +
    +
    +# Parse a line of text into an MLlib LabeledPoint object
    +def parsePoint(line):
    +    values = [float(s) for s in line.split(',')]
    +    if values[0] == -1:   # Convert -1 labels to 0 for MLlib
    +        values[0] = 0
    +    return LabeledPoint(values[0], values[1:])
    +
    +# Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +def getAccuracy(dtModel, data):
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data)
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +# Return mean squared error (MSE) of DecisionTreeModel on the given RDD[LabeledPoint].
    +def getMSE(dtModel, data):
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data)
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +# Return a new LabeledPoint with the label and feature 0 swapped.
    +def swapLabelAndFeature0(labeledPoint):
    +    newLabel = labeledPoint.label
    +    newFeatures = labeledPoint.features
    +    (newLabel, newFeatures[0]) = (newFeatures[0], newLabel)
    +    return LabeledPoint(newLabel, newFeatures)
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) != 1:
    +        print >> sys.stderr, "Usage: logistic_regression"
    +        exit(-1)
    +    sc = SparkContext(appName="PythonDT")
    +
    +    # Load data.
    +    dataPath = 'data/mllib/sample_tree_data.csv'
    --- End diff --
    
    Shall we make this configurable?
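
A minimal sketch of one way to do this: take the path from the command line and fall back to the sample file when none is given. `parse_data_path` and `DEFAULT_DATA_PATH` are hypothetical names; the default value is the path hard-coded in the diff.

```python
DEFAULT_DATA_PATH = 'data/mllib/sample_tree_data.csv'


def parse_data_path(argv):
    """Return the data path given in argv, or the default sample path if none was given."""
    if len(argv) > 2:
        raise ValueError("Usage: tree [data filepath]")
    return argv[1] if len(argv) == 2 else DEFAULT_DATA_PATH
```

In the script's main block this would be called as `parse_data_path(sys.argv)`, with the usage message printed to stderr when `ValueError` is raised.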



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729985
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +
    +def getMSE(dtModel, data):
    +    """
    +    Return mean squared error (MSE) of DecisionTreeModel on the given
    +    RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +
    +def reindexClassLabels(data):
    +    """
    +    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    +    If all labels in that range already appear at least once,
    +     then the returned RDD is the same one (without a mapping).
    +    Note: If a label simply does not appear in the data,
    +          the index will not include it.
    +          Be aware of this when reindexing subsampled data.
    +    :param data: RDD of LabeledPoint where labels are integer values
    +                 denoting labels for a classification problem.
    +    :return: Pair (reindexedData, origToNewLabels) where
    +             reindexedData is an RDD of LabeledPoint with labels in
    +              the range {0,...,numClasses-1}, and
    +             origToNewLabels is a dictionary mapping original labels
    +              to new labels.
    +    """
    +    # classCounts: class --> # examples in class
    +    classCounts = data.map(lambda x: x.label).countByValue()
    +    numExamples = sum(classCounts.values())
    +    sortedClasses = sorted(classCounts.keys())
    +    numClasses = len(classCounts)
    +    # origToNewLabels: class --> index in 0,...,numClasses-1
    +    if (numClasses < 2):
    +        print >> sys.stderr, \
    +            "Dataset for classification should have at least 2 classes." + \
    +            " The given dataset had only %d classes." % numClasses
    +        exit(-1)
    +    origToNewLabels = dict([(sortedClasses[i], i) for i in range(0,numClasses)])
    +
    +    print "numClasses = %d" % numClasses
    +    print "Per-class example fractions, counts:"
    +    print "Class\tFrac\tCount"
    +    for c in sortedClasses:
    +        frac = classCounts[c] / (numExamples + 0.0)
    +        print "%g\t%g\t%d" % (c, frac, classCounts[c])
    +
    +    if (sortedClasses[0] == 0 and sortedClasses[-1] == numClasses - 1):
    +        return (data, origToNewLabels)
    +    else:
    +        reindexedData = \
    +            data.map(lambda x: LabeledPoint(origToNewLabels[x.label], x.features))
    +        return (reindexedData, origToNewLabels)
    +
    +
    +def usage():
    +    print >> sys.stderr, \
    +        "Usage: logistic_regression [libsvm format data filepath]\n" + \
    --- End diff --
    
    `logistic_regression` -> `tree` (or maybe we should change the name to `decision_tree_runner.py` to match Scala's).



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729983
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +
    +def getMSE(dtModel, data):
    +    """
    +    Return mean squared error (MSE) of DecisionTreeModel on the given
    +    RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +
    +def reindexClassLabels(data):
    +    """
    +    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    +    If all labels in that range already appear at least once,
    +     then the returned RDD is the same one (without a mapping).
    +    Note: If a label simply does not appear in the data,
    +          the index will not include it.
    +          Be aware of this when reindexing subsampled data.
    +    :param data: RDD of LabeledPoint where labels are integer values
    +                 denoting labels for a classification problem.
    +    :return: Pair (reindexedData, origToNewLabels) where
    +             reindexedData is an RDD of LabeledPoint with labels in
    +              the range {0,...,numClasses-1}, and
    +             origToNewLabels is a dictionary mapping original labels
    +              to new labels.
    +    """
    +    # classCounts: class --> # examples in class
    +    classCounts = data.map(lambda x: x.label).countByValue()
    +    numExamples = sum(classCounts.values())
    +    sortedClasses = sorted(classCounts.keys())
    +    numClasses = len(classCounts)
    +    # origToNewLabels: class --> index in 0,...,numClasses-1
    +    if (numClasses < 2):
    +        print >> sys.stderr, \
    +            "Dataset for classification should have at least 2 classes." + \
    +            " The given dataset had only %d classes." % numClasses
    +        exit(-1)
    +    origToNewLabels = dict([(sortedClasses[i], i) for i in range(0,numClasses)])
    --- End diff --
    
    Add a space after `,` (i.e. `range(0, numClasses)`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15732257
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +
    +def getMSE(dtModel, data):
    +    """
    +    Return mean squared error (MSE) of DecisionTreeModel on the given
    +    RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +
    +def reindexClassLabels(data):
    +    """
    +    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    +    If all labels in that range already appear at least once,
    +     then the returned RDD is the same one (without a mapping).
    +    Note: If a label simply does not appear in the data,
    +          the index will not include it.
    +          Be aware of this when reindexing subsampled data.
    +    :param data: RDD of LabeledPoint where labels are integer values
    +                 denoting labels for a classification problem.
    +    :return: Pair (reindexedData, origToNewLabels) where
    +             reindexedData is an RDD of LabeledPoint with labels in
    +              the range {0,...,numClasses-1}, and
    +             origToNewLabels is a dictionary mapping original labels
    +              to new labels.
    +    """
    +    # classCounts: class --> # examples in class
    +    classCounts = data.map(lambda x: x.label).countByValue()
    +    numExamples = sum(classCounts.values())
    +    sortedClasses = sorted(classCounts.keys())
    +    numClasses = len(classCounts)
    +    # origToNewLabels: class --> index in 0,...,numClasses-1
    +    if (numClasses < 2):
    +        print >> sys.stderr, \
    +            "Dataset for classification should have at least 2 classes." + \
    +            " The given dataset had only %d classes." % numClasses
    +        exit(-1)
    +    origToNewLabels = dict([(sortedClasses[i], i) for i in range(0,numClasses)])
    +
    +    print "numClasses = %d" % numClasses
    +    print "Per-class example fractions, counts:"
    +    print "Class\tFrac\tCount"
    +    for c in sortedClasses:
    +        frac = classCounts[c] / (numExamples + 0.0)
    +        print "%g\t%g\t%d" % (c, frac, classCounts[c])
    +
    +    if (sortedClasses[0] == 0 and sortedClasses[-1] == numClasses - 1):
    --- End diff --
    
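    Only the first and the last elements of `sortedClasses` are checked here. The values in between could still be something like `0.5`, so this does not guarantee that the labels are exactly `0, ..., numClasses-1`.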
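
The concern in this comment can be illustrated with a stricter check. This is a minimal sketch, assuming the input is the sorted list of distinct label values (as `sortedClasses` is in the diff); the helper name `labels_are_zero_based_index` is hypothetical and not part of the PR.

```python
def labels_are_zero_based_index(sorted_classes):
    """
    Return True only if the sorted distinct labels are exactly
    0, 1, ..., numClasses-1, checking every position rather than
    just the first and last entries.
    """
    return all(label == i for i, label in enumerate(sorted_classes))


# [0, 0.5, 2] passes an endpoints-only test (first == 0, last == 3 - 1),
# but the element-by-element check rejects it.
print(labels_are_zero_based_index([0.0, 1.0, 2.0]))  # True
print(labels_are_zero_based_index([0.0, 0.5, 2.0]))  # False
```

Comparing each label with its position also catches gaps and non-integer values in the middle of the list, which is the case the endpoints-only condition misses.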



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725411
  
    --- Diff: python/pyspark/mllib/tree.py ---
    @@ -0,0 +1,219 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from py4j.java_collections import MapConverter
    +
    +from pyspark import SparkContext, RDD
    +from pyspark.mllib._common import \
    +    _get_unmangled_rdd, _get_unmangled_double_vector_rdd, _serialize_double_vector, \
    +    _deserialize_labeled_point, _get_unmangled_labeled_point_rdd, \
    +    _deserialize_double
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.serializers import NoOpSerializer
    +
    +class DecisionTreeModel(object):
    +    """
    +    A decision tree model for classification or regression.
    +
    +    WARNING: This is an experimental API.  It will probably be modified for Spark v1.2.
    +    """
    +
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def predict(self, x):
    +        """
    +        Predict the label of one or more examples.
    +        NOTE: This currently does NOT support batch prediction.
    +
    +        :param x:  Data point: feature vector, or a LabeledPoint (whose label is ignored).
    +        """
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if isinstance(x, RDD):
    +            # Bulk prediction
    +            if x.count() == 0:
    +                raise RuntimeError("DecisionTreeModel.predict(x) given empty RDD x.")
    +            elementType = type(x.take(1)[0])
    +            if elementType == LabeledPoint:
    +                x = x.map(lambda x: x.features)
    +            dataBytes = _get_unmangled_double_vector_rdd(x)
    +            jSerializedPreds = pythonAPI.predictDecisionTreeModel(self._java_model, dataBytes._jrdd)
    +            dataBytes.unpersist()
    +            serializedPreds = RDD(jSerializedPreds, self._sc, NoOpSerializer())
    +            return serializedPreds.map(lambda bytes: _deserialize_double(bytearray(bytes)))
    +        else:
    +            if type(x) == LabeledPoint:
    +                x_ = _serialize_double_vector(x.features)
    +            else:
    +                # Assume x is a single data point.
    +                x_ = _serialize_double_vector(x)
    +            return pythonAPI.predictDecisionTreeModel(self._java_model, x_)
    +
    +    def numNodes(self):
    +        return self._java_model.numNodes()
    +
    +    def depth(self):
    +        return self._java_model.depth()
    +
    +    def __str__(self):
    +        return self._java_model.toString()
    +
    +
    +class DecisionTree(object):
    +    """
    +    Learning algorithm for a decision tree model for classification or regression.
    +
    +    WARNING: This is an experimental API.  It will probably be modified for Spark v1.2.
    +
    +    Example usage:
    +    >>> from numpy import array, ndarray
    +    >>> from pyspark.mllib.regression import LabeledPoint
    +    >>> from pyspark.mllib.tree import DecisionTree
    +    >>> from pyspark.mllib.linalg import SparseVector
    +    >>>
    +    >>> data = [
    +    ...     LabeledPoint(0.0, [0.0]),
    +    ...     LabeledPoint(1.0, [1.0]),
    +    ...     LabeledPoint(1.0, [2.0]),
    +    ...     LabeledPoint(1.0, [3.0])
    +    ... ]
    +    >>>
    +    >>> model = DecisionTree.trainClassifier(sc.parallelize(data), numClasses=2)
    +    >>> print(model)
    +    DecisionTreeModel classifier
    +      If (feature 0 <= 0.5)
    +       Predict: 0.0
    +      Else (feature 0 > 0.5)
    +       Predict: 1.0
    +
    +    >>> model.predict(array([1.0])) > 0
    +    True
    +    >>> model.predict(array([0.0])) == 0
    +    True
    +    >>> sparse_data = [
    +    ...     LabeledPoint(0.0, SparseVector(2, {0: 0.0})),
    +    ...     LabeledPoint(1.0, SparseVector(2, {1: 1.0})),
    +    ...     LabeledPoint(0.0, SparseVector(2, {0: 0.0})),
    +    ...     LabeledPoint(1.0, SparseVector(2, {1: 2.0}))
    +    ... ]
    +    >>>
    +    >>> model = DecisionTree.trainRegressor(sc.parallelize(sparse_data))
    +    >>> model.predict(array([0.0, 1.0])) == 1
    +    True
    +    >>> model.predict(array([0.0, 0.0])) == 0
    +    True
    +    >>> model.predict(SparseVector(2, {1: 1.0})) == 1
    +    True
    +    >>> model.predict(SparseVector(2, {1: 0.0})) == 0
    +    True
    +    """
    +
    +    @staticmethod
    +    def trainClassifier(data, numClasses, categoricalFeaturesInfo={},
    +                        impurity="gini", maxDepth=4, maxBins=100):
    +        """
    +        Train a DecisionTreeModel for classification.
    +
    +        :param data: RDD of NumPy vectors, one per element, where the first
    +                     coordinate is the label and the rest is the feature vector.
    +                     Labels are integers {0,1,...,numClasses}.
    +        :param numClasses: Number of classes for classification.
    +        :param categoricalFeaturesInfo: Map from categorical feature index to number of categories.
    --- End diff --
    
    A line in a Python docstring should not exceed 80 characters (or 78, to be safe). This is for people running Python's `help()` in traditional terminals (80 x 24?).
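
One way to find such lines mechanically is sketched below. It uses only the standard `ast` module; the function name `long_docstring_lines` and the default limit of 78 are assumptions taken from the comment above, not an existing Spark tool. Lengths are measured after the docstring's common indentation is stripped, so the width shown by `help()` (which adds its own indentation) may differ slightly.

```python
import ast


def long_docstring_lines(source, limit=78):
    """
    Yield (name, line) for every docstring line in `source` that is
    longer than `limit` characters.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef,
                             ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)  # None if there is no docstring
            if doc is None:
                continue
            name = getattr(node, "name", "<module>")
            for line in doc.splitlines():
                if len(line) > limit:
                    yield name, line


# Hypothetical input that reuses the long :param line from the diff.
sample = '''
def train(data):
    """
    Train a model.

    :param categoricalFeaturesInfo: Map from categorical feature index to number of categories.
    """
'''

for name, line in long_docstring_lines(sample):
    print(name, len(line))
```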



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725375
  
    --- Diff: python/pyspark/mllib/tree.py ---
    @@ -0,0 +1,219 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from py4j.java_collections import MapConverter
    +
    +from pyspark import SparkContext, RDD
    +from pyspark.mllib._common import \
    +    _get_unmangled_rdd, _get_unmangled_double_vector_rdd, _serialize_double_vector, \
    +    _deserialize_labeled_point, _get_unmangled_labeled_point_rdd, \
    +    _deserialize_double
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.serializers import NoOpSerializer
    +
    +class DecisionTreeModel(object):
    +    """
    +    A decision tree model for classification or regression.
    +
    +    WARNING: This is an experimental API.  It will probably be modified for Spark v1.2.
    +    """
    +
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def predict(self, x):
    +        """
    +        Predict the label of one or more examples.
    +        NOTE: This currently does NOT support batch prediction.
    +
    +        :param x:  Data point: feature vector, or a LabeledPoint (whose label is ignored).
    +        """
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if isinstance(x, RDD):
    +            # Bulk prediction
    +            if x.count() == 0:
    +                raise RuntimeError("DecisionTreeModel.predict(x) given empty RDD x.")
    +            elementType = type(x.take(1)[0])
    +            if elementType == LabeledPoint:
    +                x = x.map(lambda x: x.features)
    +            dataBytes = _get_unmangled_double_vector_rdd(x)
    +            jSerializedPreds = pythonAPI.predictDecisionTreeModel(self._java_model, dataBytes._jrdd)
    +            dataBytes.unpersist()
    +            serializedPreds = RDD(jSerializedPreds, self._sc, NoOpSerializer())
    +            return serializedPreds.map(lambda bytes: _deserialize_double(bytearray(bytes)))
    +        else:
    +            if type(x) == LabeledPoint:
    --- End diff --
    
    Ditto: maybe we should remove support for predicting on `LabeledPoint`.



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725310
  
    --- Diff: python/pyspark/mllib/tree.py ---
    @@ -0,0 +1,219 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from py4j.java_collections import MapConverter
    +
    +from pyspark import SparkContext, RDD
    +from pyspark.mllib._common import \
    +    _get_unmangled_rdd, _get_unmangled_double_vector_rdd, _serialize_double_vector, \
    +    _deserialize_labeled_point, _get_unmangled_labeled_point_rdd, \
    +    _deserialize_double
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.serializers import NoOpSerializer
    +
    +class DecisionTreeModel(object):
    +    """
    +    A decision tree model for classification or regression.
    +
    +    WARNING: This is an experimental API.  It will probably be modified for Spark v1.2.
    +    """
    +
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def predict(self, x):
    +        """
    +        Predict the label of one or more examples.
    +        NOTE: This currently does NOT support batch prediction.
    +
    +        :param x:  Data point: feature vector, or a LabeledPoint (whose label is ignored).
    +        """
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if isinstance(x, RDD):
    +            # Bulk prediction
    +            if x.count() == 0:
    +                raise RuntimeError("DecisionTreeModel.predict(x) given empty RDD x.")
    --- End diff --
    
    Return an empty RDD instead of raising an error?
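
One reading of this suggestion is that empty input should simply produce empty output. The sketch below uses plain Python iterables as a stand-in for RDDs (an assumption made so that it runs without Spark); `predict_all` and `predict_one` are hypothetical names and not part of the PR.

```python
from itertools import chain, islice


def predict_all(predict_one, points):
    """
    Apply `predict_one` to every element of `points`.
    An empty input produces an empty list instead of an error.
    """
    iterator = iter(points)
    first = list(islice(iterator, 1))  # peek at one element only
    if not first:
        return []                      # nothing to predict: empty result
    return [predict_one(p) for p in chain(first, iterator)]


# Hypothetical one-feature threshold model used only for illustration.
threshold_model = lambda features: 1.0 if features[0] > 0.5 else 0.0

print(predict_all(threshold_model, []))              # []
print(predict_all(threshold_model, [[0.0], [1.0]]))  # [0.0, 1.0]
```

Peeking at a single element also avoids a full counting pass over the data just to detect the empty case.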



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725482
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -459,6 +466,76 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib DecisionTree.train().
    +   * This stub returns a handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on exit;
    +   * see the Py4J documentation.
    +   * @param dataBytesJRDD  Training data
    +   * @param categoricalFeaturesInfoJMap  Categorical features info, as Java map
    +   */
    +  def trainDecisionTreeModel(
    +      dataBytesJRDD: JavaRDD[Array[Byte]],
    +      algoStr: String,
    +      numClasses: Int,
    +      categoricalFeaturesInfoJMap: java.util.Map[Int,Int],
    --- End diff --
    
    Add a space after `,` (i.e. `java.util.Map[Int, Int]`).



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50955507
  
    @mengxr Hopefully good to go if Jenkins agrees.



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50947405
  
    QA tests have started for PR 1727. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17721/consoleFull



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725093
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,92 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import sys, numpy
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +
    +
    +# Parse a line of text into an MLlib LabeledPoint object
    +def parsePoint(line):
    +    values = [float(s) for s in line.split(',')]
    +    if values[0] == -1:   # Convert -1 labels to 0 for MLlib
    +        values[0] = 0
    +    return LabeledPoint(values[0], values[1:])
    +
    +# Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +def getAccuracy(dtModel, data):
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data)
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +# Return mean squared error (MSE) of DecisionTreeModel on the given RDD[LabeledPoint].
    +def getMSE(dtModel, data):
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data)
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +# Return a new LabeledPoint with the label and feature 0 swapped.
    +def swapLabelAndFeature0(labeledPoint):
    --- End diff --
    
    It is really hard to guess what this method does until I read the code in main, so more documentation is needed. Another option is to remove it from the example and use it in a unit test instead: people will use this example as a template for their own applications, and swapping the first feature with the label is not a common operation.



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729987
  
    --- Diff: python/pyspark/mllib/tests.py ---
    @@ -127,9 +128,19 @@ def test_classification(self):
             self.assertTrue(nb_model.predict(features[2]) <= 0)
             self.assertTrue(nb_model.predict(features[3]) > 0)
     
    +        categoricalFeaturesInfo = {0: 3} # feature 0 has 3 categories
    +        dt_model = \
    +            DecisionTree.trainClassifier(rdd, numClasses=2,
    --- End diff --
    
    Does this call fit on the line above, after `dt_model =`?



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50971402
  
    QA tests have started for PR 1727. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17773/consoleFull



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729982
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,129 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import numpy, os, sys
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +
    +def getAccuracy(dtModel, data):
    +    """
    +    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainCorrect / (0.0 + data.count())
    +
    +
    +def getMSE(dtModel, data):
    +    """
    +    Return mean squared error (MSE) of DecisionTreeModel on the given
    +    RDD[LabeledPoint].
    +    """
    +    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
    +    predictions = dtModel.predict(data.map(lambda x: x.features))
    +    truth = data.map(lambda p: p.label)
    +    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
    +    return trainMSE / (0.0 + data.count())
    +
    +
    +def reindexClassLabels(data):
    +    """
    +    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
    +    If all labels in that range already appear at least once,
    +     then the returned RDD is the same one (without a mapping).
    +    Note: If a label simply does not appear in the data,
    +          the index will not include it.
    +          Be aware of this when reindexing subsampled data.
    +    :param data: RDD of LabeledPoint where labels are integer values
    +                 denoting labels for a classification problem.
    +    :return: Pair (reindexedData, origToNewLabels) where
    +             reindexedData is an RDD of LabeledPoint with labels in
    +              the range {0,...,numClasses-1}, and
    +             origToNewLabels is a dictionary mapping original labels
    +              to new labels.
    +    """
    +    # classCounts: class --> # examples in class
    +    classCounts = data.map(lambda x: x.label).countByValue()
    +    numExamples = sum(classCounts.values())
    +    sortedClasses = sorted(classCounts.keys())
    +    numClasses = len(classCounts)
    +    # origToNewLabels: class --> index in 0,...,numClasses-1
    +    if (numClasses < 2):
    +        print >> sys.stderr, \
    +            "Dataset for classification should have at least 2 classes." + \
    +            " The given dataset had only %d classes." % numClasses
    +        exit(-1)
    --- End diff --
    
    Exit codes should be in the range [0, 255], so `-1` maps to `255`, which has a special meaning (out of range?). `exit(1)` may be better here.
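
To see the wrap-around described above, here is a minimal sketch (not part of the PR) that runs a child Python process and reads back its exit status. The helper name `exit_status_of` is hypothetical. It assumes a POSIX system, where the status is truncated to 8 bits; other platforms may report a different value for `-1`.

```python
import subprocess
import sys


def exit_status_of(code):
    """
    Run a child Python process that calls sys.exit(code) and
    return the exit status observed by the parent.
    """
    child_source = "import sys; sys.exit(%d)" % code
    return subprocess.call([sys.executable, "-c", child_source])


# Compare the status seen by the parent for -1 and for 1.
status_minus_one = exit_status_of(-1)
status_one = exit_status_of(1)
print("sys.exit(-1) -> %d" % status_minus_one)
print("sys.exit(1)  -> %d" % status_one)
```

A small positive code such as `1` is reported unchanged, which is why it is the safer choice for signalling a generic failure.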


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15725089
  
    --- Diff: examples/src/main/python/mllib/tree.py ---
    @@ -0,0 +1,92 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Decision tree classification and regression using MLlib.
    +"""
    +
    +import sys, numpy
    +
    +from operator import add
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +
    +
    +# Parse a line of text into an MLlib LabeledPoint object
    --- End diff --
    
    Please use Python's docstring style for documentation:
    
    ~~~
    def parsePoint(line):
        """
        doc
        """
    ~~~
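
A slightly fuller sketch of the same convention, for illustration only. The function name `parse_point`, the space-separated input format, and the plain `(label, features)` return value are assumptions; the PR's actual `parsePoint` builds a `LabeledPoint`, which is left out here so the snippet runs without Spark.

```python
def parse_point(line):
    """
    Parse a line of space-separated numbers into a (label, features) pair.

    The first value is treated as the label and the rest as features.
    """
    values = [float(s) for s in line.split()]
    return values[0], values[1:]


# The docstring is attached to the function object, so help() and
# documentation tools can find it; a leading "#" comment is not.
label, features = parse_point("1 2.0 3.5")
print(label, features)
print(parse_point.__doc__.strip().splitlines()[0])
```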



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by manishamde <gi...@git.apache.org>.

Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/1727#issuecomment-50967302
  
    I'm not very familiar with the Python API, but the DT changes look good to me.



[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1727#discussion_r15729992
  
    --- Diff: python/pyspark/mllib/util.py ---
    @@ -29,9 +30,9 @@ class MLUtils:
         Helper methods to load, save and pre-process data used in MLlib.
         """
     
    -    @deprecated
         @staticmethod
         def _parse_libsvm_line(line, multiclass):
    +        warnings.warn("deprecated", DeprecationWarning)
    --- End diff --
    
    Thanks for fixing it! CC: @srowen
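
The diff above swaps the decorator for an explicit `warnings.warn` call inside the function body. A minimal sketch of that pattern, assuming a hypothetical helper `parse_legacy_line` (not a function from the PR):

```python
import warnings


def parse_legacy_line(line):
    """
    Hypothetical deprecated helper: split a line into float values.
    """
    # stacklevel=2 attributes the warning to the caller, not to this function.
    warnings.warn("parse_legacy_line is deprecated", DeprecationWarning,
                  stacklevel=2)
    return [float(s) for s in line.split()]


# Record warnings so the DeprecationWarning can be inspected; the
# "always" filter makes sure it is not suppressed by default filters.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    values = parse_legacy_line("1 2.5")

print(values, [w.category.__name__ for w in caught])
```

The function still returns its result; the warning only informs callers that the helper may be removed later.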

