Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/10/16 02:10:16 UTC

[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2819

    [SPARK-3961] Python API for mllib.feature

    Added a complete Python API for mllib.feature:

Normalizer
StandardScalerModel
StandardScaler
HashingTF
IDFModel
IDF


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark feature

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2819
    
----
commit 8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5
Author: Davies Liu <da...@gmail.com>
Date:   2014-10-16T00:02:16Z

    Python API for mllib.feature

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59313284
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21789/consoleFull) for   PR 2819 at commit [`7a1891a`](https://github.com/apache/spark/commit/7a1891abe6647a5f9dc82c21add907fe2d4b9aa8).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60686074
  
    @davies LGTM
    
    @davies @Ishiihara This debate about whether the Python API should be Pythonic or match the Scala/Java API is tough.  @mateiz has recommended the latter (match Scala/Java); it would certainly be good to converge on a community standard!




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19448628
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -267,4 +346,25 @@ val data1 = data.map(x => (x.label, normalizer1.transform(x.features)))
     val data2 = data.map(x => (x.label, normalizer2.transform(x.features)))
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark.mllib.util import MLUtils
    +from pyspark.mllib.linalg import Vectors
    +from pyspark.mllib.feature import Normalizer
    +
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +label = data.map(lambda x: x.label)
    +features = data.map(lambda x: x.features)
    +
    +normalizer1 = Normalizer()
    +normalizer2 = Normalizer(p=float("inf"))
    +
    +# Each sample in data1 will be normalized using $L^2$ norm.
    +data1 = label.zip(normalizer1.transform(features))
    +
    +# Each sample in data2 will be normalized using $L^\infty$ norm.
    +data2 = label.zip(normalizer2.transform(features))
    --- End diff --
    
    Fixed. Py4J does not support float("inf"), so I added a hack to support it.
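
    A minimal sketch of the hack being described, assuming Py4J's `smart_decode` hook and the spellings Java's `Double.valueOf` accepts (the same idea appears in the diff quoted later in this thread, here with underscore-prefixed names):
    ```
    import py4j.protocol

    _old_smart_decode = py4j.protocol.smart_decode

    # unicode(float('inf')) gives u'inf', but Java's Double.valueOf expects
    # 'Infinity' (and 'NaN', '-Infinity'), so remap the special values.
    _float_str_mapping = {
        u'nan': u'NaN',
        u'inf': u'Infinity',
        u'-inf': u'-Infinity',
    }

    def _new_smart_decode(obj):
        # Only special-case floats; defer everything else to the original decoder.
        if isinstance(obj, float):
            s = unicode(obj)
            return _float_str_mapping.get(s, s)
        return _old_smart_decode(obj)

    py4j.protocol.smart_decode = _new_smart_decode
    ```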




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454187
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    +
    +float_str_mapping = {
    +    u'nan': u'NaN',
    +    u'inf': u'Infinity',
    +    u'-inf': u'-Infinity',
    +}
    +
    +
    +def new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return float_str_mapping.get(s, s)
    +    return old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = new_smart_decode
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
     
    -class Word2VecModel(object):
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p <= float('inf'), normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +
    +    >>> nor2 = Normalizer(float("inf"))
    +    >>> nor2.transform(v)
    +    DenseVector([0.0, 0.5, 1.0])
    +    """
    +    def __init__(self, p=2):
    +        """
    +        :param p: Normalization in L^p^ space, p = 2 by default.
    +        """
    +        assert p >= 1.0, "p should be greater than 1.0"
    +        self.p = float(p)
    +
    +    def transform(self, vector):
    +        """
    +        Applies unit length normalization on a vector.
    +
    +        :param vector: vector to be normalized.
    +        :return: normalized vector. If the norm of the input is zero, it
    +                will return the input vector.
    +        """
    +        sc = SparkContext._active_spark_context
    +        assert sc is not None, "SparkContext should be initialized first"
    +        return _callAPI(sc, "normalizeVector", self.p, vector)
    +
    +
    +class JavaModelWrapper(VectorTransformer):
    +    """
    +    Wrapper for the model in JVM
    +    """
    +    def __init__(self, sc, java_model):
             self._sc = sc
             self._java_model = java_model
     
         def __del__(self):
             self._sc._gateway.detach(self._java_model)
     
    -    def transform(self, word):
    +    def transform(self, dataset):
    +        return _callJavaFunc(self._sc, self._java_model.transform, dataset)
    +
    +
    +class StandardScalerModel(JavaModelWrapper):
    +    """
    +    :: Experimental ::
    +    Represents a StandardScaler model that can transform vectors.
    +    """
    +    def transform(self, vector):
             """
    -        :param word: a word
    -        :return: vector representation of word
    +        Applies standardization transformation on a vector.
    +
    +        :param vector: Vector to be standardized.
    +        :return: Standardized vector. If the variance of a column is zero,
    +                it will return default `0.0` for the column with zero variance.
    +        """
    +        return JavaModelWrapper.transform(self, vector)
    +
     
    +class StandardScaler(object):
    +    """
    +    :: Experimental ::
    +    Standardizes features by removing the mean and scaling to unit
    +    variance using column summary statistics on the samples in the
    +    training set.
    +
    +    >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
    +    >>> dataset = sc.parallelize(vs)
    +    >>> standardizer = StandardScaler(True, True)
    +    >>> model = standardizer.fit(dataset)
    +    >>> result = model.transform(dataset)
    +    >>> for r in result.collect(): r
    +    DenseVector([-0.7071, 0.7071, -0.7071])
    +    DenseVector([0.7071, -0.7071, 0.7071])
    +    """
    +    def __init__(self, withMean=False, withStd=True):
    +        """
    +        :param withMean: False by default. Centers the data with mean
    +                 before scaling. It will build a dense output, so this
    +                 does not work on sparse input and will raise an exception.
    +        :param withStd: True by default. Scales the data to unit standard
    +                 deviation.
    +        """
    +        if not (withMean or withStd):
    +            warnings.warn("Both withMean and withStd are false. The model does nothing.")
    +        self.withMean = withMean
    +        self.withStd = withStd
    +
    +    def fit(self, dataset):
    +        """
    +        Computes the mean and variance and stores as a model to be used for later scaling.
    +
    +        :param data: The data used to compute the mean and variance to build
    +                    the transformation model.
    +        :return: a StandardScalarModel
    +        """
    +        sc = dataset.context
    +        jmodel = _callAPI(sc, "fitStandardScaler", self.withMean, self.withStd, dataset)
    +        return StandardScalerModel(sc, jmodel)
    +
    +
    +class HashingTF(object):
    +    """
    +    :: Experimental ::
    +    Maps a sequence of terms to their term frequencies using the hashing trick.
    +
    +    >>> htf = HashingTF(100)
    +    >>> doc = "a a b b c d".split(" ")
    +    >>> htf.transform(doc)
    +    SparseVector(100, {1: 1.0, 14: 1.0, 31: 2.0, 44: 2.0})
    +    """
    +    def __init__(self, numFeatures=1 << 20):
    +        """
    +        :param numFeatures: number of features (default: 2^20)
    +        """
    +        self.numFeatures = numFeatures
    +
    +    def indexOf(self, term):
    +        """ Returns the index of the input term. """
    +        return hash(term) % self.numFeatures
    --- End diff --
    
    minor: It would be nice if we can use the same hash function as in Scala.
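
    For illustration, a hypothetical sketch of what matching the Scala side could look like, assuming Scala's HashingTF hashes a string term with Java's `String.hashCode` and then takes a non-negative modulus (the names below are illustrative, not from this PR, and only BMP characters are handled):
    ```
    def _java_string_hashcode(s):
        # Java's String.hashCode: h = 31 * h + code unit, kept in 32-bit range.
        h = 0
        for c in s:
            h = (31 * h + ord(c)) & 0xFFFFFFFF
        # Reinterpret as a signed 32-bit integer, as the JVM would.
        return h - 0x100000000 if h >= 0x80000000 else h

    def index_of(term, num_features=1 << 20):
        # Python's % with a positive modulus is already non-negative, matching
        # the effect of nonNegativeMod on the Scala side.
        return _java_string_hashcode(term) % num_features

    print(index_of("spark", 100))
    ```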




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60722411
  
      [Test build #22346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22346/consoleFull) for   PR 2819 at commit [`4f48f48`](https://github.com/apache/spark/commit/4f48f48d0c013e50f1a96f1e6bb0af4d88bf366c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454983
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -95,33 +385,26 @@ class Word2Vec(object):
         >>> localDoc = [sentence, sentence]
         >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
         >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +
         >>> syms = model.findSynonyms("a", 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         >>> vec = model.transform("a")
    -    >>> len(vec)
    -    10
         >>> syms = model.findSynonyms(vec, 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         """
         def __init__(self):
             """
             Construct Word2Vec instance
             """
    +        import random  # this can't be on the top because of mllib.random
    +
             self.vectorSize = 100
             self.learningRate = 0.025
             self.numPartitions = 1
             self.numIterations = 1
    -        self.seed = 42L
    +        self.seed = random.randint(0, sys.maxint)
    --- End diff --
    
    This has nothing to do with numpy, so it will have problems.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60693309
  
    BTW we can also leave out the default args for now and add them later, if we want to take more time to decide this. But the Python API should definitely include all the methods in the Scala / Java one.




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59302533
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21785/consoleFull) for   PR 2819 at commit [`486795f`](https://github.com/apache/spark/commit/486795f1d8792c15c9f97b22b1015b23fb7c8d81).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454180
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    --- End diff --
    
    Shall we use underscore for those private vars and functions: `old_smart_decode`, `float_str_mapping`, `new_smart_decode`?




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60676934
  
      [Test build #479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/479/consoleFull) for   PR 2819 at commit [`3abb8c2`](https://github.com/apache/spark/commit/3abb8c2da68633d3312c2c8c3bf1680bb0ee8edf).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59319145
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21790/consoleFull) for   PR 2819 at commit [`a405ae7`](https://github.com/apache/spark/commit/a405ae7b967a1a9398e3cdbb812149be7314f29e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60666721
  
      [Test build #22304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22304/consoleFull) for   PR 2819 at commit [`3abb8c2`](https://github.com/apache/spark/commit/3abb8c2da68633d3312c2c8c3bf1680bb0ee8edf).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19429642
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -223,6 +279,29 @@ val data1 = data.map(x => (x.label, scaler1.transform(x.features)))
     val data2 = data.map(x => (x.label, scaler2.transform(Vectors.dense(x.features.toArray))))
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark.mllib.util import MLUtils
    +from pyspark.mllib.linalg import Vectors
    +from pyspark.mllib.feature import StandardScaler
    +
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +label = data.map(lambda x: x.label)
    +features = data.map(lambda x: x.features)
    +
    +scaler1 = StandardScaler().fit(features)
    +scaler2 = StandardScaler(withMean=True, withStd=True).fit(features)
    +
    +# data1 will be unit variance.
    +data1 = label.zip(scaler1.transform(features))
    +
    +# Without converting the features into dense vectors, transformation with zero mean will raise
    +# exception on sparse vector.
    +# data2 will be unit variance and zero mean.
    +data2 = label.zip(scaler1.transform(features.map(lambda x: Vectors.dense(x.toArray()))))
    --- End diff --
    
    Does this run for you?  It fails for me after calling data2.collect().  I think the bug is in linalg.py:426
    ```
    for i in xrange(self.indices.size):
    ```
    where self.indices does not have a `size` attribute.  I figure it should be len(self.indices).
    (This bug must have been from a previous PR.)
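
    A minimal, self-contained sketch of the suggested fix; the surrounding method is reconstructed here only for illustration, and the function name is hypothetical:
    ```
    import array
    import numpy as np

    def sparse_to_dense(size, indices, values):
        # The corrected loop: len(indices) instead of indices.size, since
        # array.array (unlike a numpy array) has no .size attribute.
        arr = np.zeros((size,), dtype=np.float64)
        for i in xrange(len(indices)):
            arr[indices[i]] = values[i]
        return arr

    # A 5-element sparse vector with nonzeros at positions 1 and 3.
    print(sparse_to_dense(5, array.array('i', [1, 3]), array.array('d', [2.0, 4.5])))
    ```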




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19425320
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -328,6 +364,16 @@ class PythonMLLibAPI extends Serializable {
           model.transform(word)
         }
     
    +    /**
    +     * TODO: model is not serializable
    --- End diff --
    
    Is this an outdated comment?




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19435144
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,324 @@
     """
     Python package for feature in MLlib.
     """
    +import warnings
    +
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    -class Word2VecModel(object):
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
    +
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p < Double.PositiveInfinity, normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = Double.PositiveInfinity, max(abs(vector)) will be used as
    +    norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +    """
    +    def __init__(self, p=2):
    +        """
    +        :param p: Normalization in L^p^ space, p = 2 by default.
    +        """
    +        assert p >= 1.0, "p should be greater than 1.0"
    +        self.p = float(p)
    +
    +    def transform(self, vector):
    +        """
    +        Applies unit length normalization on a vector.
    +
    +        :param vector: vector to be normalized.
    +        :return: normalized vector. If the norm of the input is zero, it
    +                will return the input vector.
    +        """
    +        sc = SparkContext._active_spark_context
    +        assert sc is not None, "SparkContext should be initialized first"
    +        return _callAPI(sc, "normalizeVector", self.p, vector)
    +
    +
    +class JavaModelWrapper(VectorTransformer):
    --- End diff --
    
    I would like to do this refactoring after merging this PR; other modules also need updates.




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59314990
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21790/consoleFull) for   PR 2819 at commit [`a405ae7`](https://github.com/apache/spark/commit/a405ae7b967a1a9398e3cdbb812149be7314f29e).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60701334
  
      [Test build #22316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22316/consoleFull) for   PR 2819 at commit [`b628693`](https://github.com/apache/spark/commit/b6286939304da666a8158c71a47b6c95af28b639).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60677536
  
      [Test build #22304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22304/consoleFull) for   PR 2819 at commit [`3abb8c2`](https://github.com/apache/spark/commit/3abb8c2da68633d3312c2c8c3bf1680bb0ee8edf).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by shaneknapp <gi...@git.apache.org>.
Github user shaneknapp commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60665966
  
    jenkins, test this please




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59296810
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21784/consoleFull) for   PR 2819 at commit [`8a50584`](https://github.com/apache/spark/commit/8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454164
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +204,20 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    --- End diff --
    
    Shall we skip the python examples for `Word2Vec` in this PR? #2952 added example code for it.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454181
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    +
    +float_str_mapping = {
    +    u'nan': u'NaN',
    +    u'inf': u'Infinity',
    +    u'-inf': u'-Infinity',
    +}
    +
    +
    +def new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return float_str_mapping.get(s, s)
    +    return old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = new_smart_decode
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
     
    -class Word2VecModel(object):
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    --- End diff --
    
    `L^p^` doesn't show up correctly in the generated doc. This is `L` with subscript `p`, so with Sphinx it should be
    
    ~~~
    L\ :sub:`p`\ norm
    ~~~




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19455162
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    +
    +float_str_mapping = {
    +    u'nan': u'NaN',
    +    u'inf': u'Infinity',
    +    u'-inf': u'-Infinity',
    +}
    +
    +
    +def new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return float_str_mapping.get(s, s)
    +    return old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = new_smart_decode
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
     
    -class Word2VecModel(object):
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p <= float('inf'), normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +
    +    >>> nor2 = Normalizer(float("inf"))
    +    >>> nor2.transform(v)
    +    DenseVector([0.0, 0.5, 1.0])
    +    """
    +    def __init__(self, p=2):
    +        """
    +        :param p: Normalization in L^p^ space, p = 2 by default.
    +        """
    +        assert p >= 1.0, "p should be greater than 1.0"
    +        self.p = float(p)
    +
    +    def transform(self, vector):
    +        """
    +        Applies unit length normalization on a vector.
    +
    +        :param vector: vector to be normalized.
    +        :return: normalized vector. If the norm of the input is zero, it
    +                will return the input vector.
    +        """
    +        sc = SparkContext._active_spark_context
    +        assert sc is not None, "SparkContext should be initialized first"
    +        return _callAPI(sc, "normalizeVector", self.p, vector)
    +
    +
    +class JavaModelWrapper(VectorTransformer):
    +    """
    +    Wrapper for the model in JVM
    +    """
    +    def __init__(self, sc, java_model):
             self._sc = sc
             self._java_model = java_model
     
         def __del__(self):
             self._sc._gateway.detach(self._java_model)
     
    -    def transform(self, word):
    +    def transform(self, dataset):
    +        return _callJavaFunc(self._sc, self._java_model.transform, dataset)
    +
    +
    +class StandardScalerModel(JavaModelWrapper):
    +    """
    +    :: Experimental ::
    +    Represents a StandardScaler model that can transform vectors.
    +    """
    +    def transform(self, vector):
             """
    -        :param word: a word
    -        :return: vector representation of word
    +        Applies standardization transformation on a vector.
    +
    +        :param vector: Vector to be standardized.
    +        :return: Standardized vector. If the variance of a column is zero,
    +                it will return default `0.0` for the column with zero variance.
    +        """
    +        return JavaModelWrapper.transform(self, vector)
    +
     
    +class StandardScaler(object):
    +    """
    +    :: Experimental ::
    +    Standardizes features by removing the mean and scaling to unit
    +    variance using column summary statistics on the samples in the
    +    training set.
    +
    +    >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
    +    >>> dataset = sc.parallelize(vs)
    +    >>> standardizer = StandardScaler(True, True)
    +    >>> model = standardizer.fit(dataset)
    +    >>> result = model.transform(dataset)
    +    >>> for r in result.collect(): r
    +    DenseVector([-0.7071, 0.7071, -0.7071])
    +    DenseVector([0.7071, -0.7071, 0.7071])
    +    """
    +    def __init__(self, withMean=False, withStd=True):
    +        """
    +        :param withMean: False by default. Centers the data with mean
    +                 before scaling. It will build a dense output, so this
    +                 does not work on sparse input and will raise an exception.
    +        :param withStd: True by default. Scales the data to unit standard
    +                 deviation.
    +        """
    +        if not (withMean or withStd):
    +            warnings.warn("Both withMean and withStd are false. The model does nothing.")
    +        self.withMean = withMean
    +        self.withStd = withStd
    +
    +    def fit(self, dataset):
    +        """
    +        Computes the mean and variance and stores as a model to be used for later scaling.
    +
    +        :param data: The data used to compute the mean and variance to build
    +                    the transformation model.
    +        :return: a StandardScalarModel
    +        """
    +        sc = dataset.context
    +        jmodel = _callAPI(sc, "fitStandardScaler", self.withMean, self.withStd, dataset)
    +        return StandardScalerModel(sc, jmodel)
    +
    +
    +class HashingTF(object):
    +    """
    +    :: Experimental ::
    +    Maps a sequence of terms to their term frequencies using the hashing trick.
    +
    +    >>> htf = HashingTF(100)
    +    >>> doc = "a a b b c d".split(" ")
    +    >>> htf.transform(doc)
    +    SparseVector(100, {1: 1.0, 14: 1.0, 31: 2.0, 44: 2.0})
    +    """
    +    def __init__(self, numFeatures=1 << 20):
    +        """
    +        :param numFeatures: number of features (default: 2^20)
    +        """
    +        self.numFeatures = numFeatures
    +
    +    def indexOf(self, term):
    +        """ Returns the index of the input term. """
    +        return hash(term) % self.numFeatures
    --- End diff --
    
    No. Let's put a note in the doc.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60664787
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22303/
    Test FAILed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19455090
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    +
    +float_str_mapping = {
    +    u'nan': u'NaN',
    +    u'inf': u'Infinity',
    +    u'-inf': u'-Infinity',
    +}
    +
    +
    +def new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return float_str_mapping.get(s, s)
    +    return old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = new_smart_decode
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
     
    -class Word2VecModel(object):
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p <= float('inf'), normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +
    +    >>> nor2 = Normalizer(float("inf"))
    +    >>> nor2.transform(v)
    +    DenseVector([0.0, 0.5, 1.0])
    +    """
    +    def __init__(self, p=2):
    --- End diff --
    
    It will be converted into a float, but having "2.0" here will be better for the docs.
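    
    A minimal sketch of that change, keeping everything else in the constructor exactly as in the diff above:
    
    ~~~
    class Normalizer(object):
        """Sketch: trimmed to the constructor to show the `2.0` default."""
        def __init__(self, p=2.0):
            """
            :param p: Normalization in L^p^ space, p = 2.0 by default.
            """
            assert p >= 1.0, "p should be greater than 1.0"
            self.p = float(p)
    ~~~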




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19430399
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,324 @@
     """
     Python package for feature in MLlib.
     """
    +import warnings
    +
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    -class Word2VecModel(object):
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
    +
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p < Double.PositiveInfinity, normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = Double.PositiveInfinity, max(abs(vector)) will be used as
    +    norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +    """
    +    def __init__(self, p=2):
    +        """
    +        :param p: Normalization in L^p^ space, p = 2 by default.
    +        """
    +        assert p >= 1.0, "p should be greater than 1.0"
    +        self.p = float(p)
    +
    +    def transform(self, vector):
    +        """
    +        Applies unit length normalization on a vector.
    +
    +        :param vector: vector to be normalized.
    +        :return: normalized vector. If the norm of the input is zero, it
    +                will return the input vector.
    +        """
    +        sc = SparkContext._active_spark_context
    +        assert sc is not None, "SparkContext should be initialized first"
    +        return _callAPI(sc, "normalizeVector", self.p, vector)
    +
    +
    +class JavaModelWrapper(VectorTransformer):
    --- End diff --
    
    This and several other general utilities would be useful elsewhere in the Python API.  Should they be moved to another .py file which other pyspark code could depend on?
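    
    One possible factoring, as a sketch only (the module name `pyspark/mllib/common.py` is a guess, not something decided here): move the helpers verbatim into a shared module and import them where needed, e.g.
    
    ~~~
    # pyspark/mllib/common.py  (hypothetical shared module)
    # _picklable_classes, _py2java, _java2py, _callJavaFunc and _callAPI would move
    # here unchanged, so feature.py and other mllib modules can share them.

    from pyspark import RDD
    from pyspark.serializers import PickleSerializer
    from pyspark.mllib.linalg import _to_java_object_rdd


    def _py2java(sc, a):
        """ Convert Python object into Java (copied unchanged from feature.py). """
        if isinstance(a, RDD):
            a = _to_java_object_rdd(a)
        elif not isinstance(a, (int, long, float, bool, basestring)):
            bytes = bytearray(PickleSerializer().dumps(a))
            a = sc._jvm.SerDe.loads(bytes)
        return a

    # ... _java2py, _callJavaFunc and _callAPI likewise ...

    # pyspark/mllib/feature.py would then only need:
    # from pyspark.mllib.common import _callAPI, _callJavaFunc
    ~~~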




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454184
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    +
    +float_str_mapping = {
    +    u'nan': u'NaN',
    +    u'inf': u'Infinity',
    +    u'-inf': u'-Infinity',
    +}
    +
    +
    +def new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return float_str_mapping.get(s, s)
    +    return old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = new_smart_decode
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
     
    -class Word2VecModel(object):
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p <= float('inf'), normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +
    +    >>> nor2 = Normalizer(float("inf"))
    +    >>> nor2.transform(v)
    +    DenseVector([0.0, 0.5, 1.0])
    +    """
    +    def __init__(self, p=2):
    --- End diff --
    
    `2` -> `2.0`?




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60722415
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22346/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454179
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    --- End diff --
    
    `HashTF` -> `HashingTF`
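    
    i.e. presumably just the renamed entry in `__all__`:
    
    ~~~
    __all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
               'HashingTF', 'IDFModel', 'IDF',
               'Word2Vec', 'Word2VecModel']
    ~~~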




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59313288
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21789/
    Test FAILed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60701340
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22316/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60696519
  
      [Test build #22316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22316/consoleFull) for   PR 2819 at commit [`b628693`](https://github.com/apache/spark/commit/b6286939304da666a8158c71a47b6c95af28b639).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59319148
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21790/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19429057
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -95,8 +95,50 @@ tf.cache()
     val idf = new IDF(minDocFreq = 2).fit(tf)
     val tfidf: RDD[Vector] = idf.transform(tf)
     {% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +
    +TF and IDF are implemented in [HashingTF](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF)
    +and [IDF](api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF).
    +`HashingTF` takes an RDD of list as the input.
    +Each record could be an iterable of strings or other types.
    +
    +{% highlight python %}
    +from pyspark import SparkContext
    +from pyspark.mllib.linalg import Vector
    +from pyspark.mllib.feature import HashingTF
    +
    +sc = SparkContext()
    +
    +# Load documents (one per line).
    +documents = sc.textFile("...").map(lambda line: line.split(" "))
    +
    +hashingTF = HashingTF()
    +tf = hashingTF.transform(documents)
    +{% endhighlight %}
    +
    +While applying `HashingTF` only needs a single pass to the data, applying `IDF` needs two passes: 
    +first to compute the IDF vector and second to scale the term frequencies by IDF.
    +
    +{% highlight python %}
    +from pyspark.mllib.feature import IDF
    +
    +# ... continue from the previous example
    +tf.cache()
    +idf = IDF().fit(tf)
    +tfidf = idf.transform(tf)
    +{% endhighlight %}
     
    +MLLib's IDF implementation provides an option for ignoring terms which occur in less than a
    +minimum number of documents.  In such cases, the IDF for these terms is set to 0.  This feature
    +can be used by passing the `minDocFreq` value to the IDF constructor.
     
    +{% highlight python %}
    +# ... continue from the previous example
    +tf.cache()
    +idf = IDF().fit(tf)
    --- End diff --
    
    IDF needs minDocFreq argument
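    
    i.e. the snippet should presumably mirror the Scala example above and pass the threshold explicitly:
    
    ~~~
    # ... continue from the previous example
    tf.cache()
    idf = IDF(minDocFreq=2).fit(tf)
    tfidf = idf.transform(tf)
    ~~~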




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60737839
  
    LGTM. Merged into master. Thanks!




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59474327
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21850/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60666677
  
      [Test build #479 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/479/consoleFull) for   PR 2819 at commit [`3abb8c2`](https://github.com/apache/spark/commit/3abb8c2da68633d3312c2c8c3bf1680bb0ee8edf).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454177
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
    @@ -20,6 +20,7 @@ package org.apache.spark.mllib.feature
     import org.apache.spark.annotation.DeveloperApi
     import org.apache.spark.mllib.linalg.Vector
     import org.apache.spark.rdd.RDD
    +import org.apache.spark.api.java.JavaRDD
    --- End diff --
    
    organize imports in alphabetical order




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59300862
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21784/consoleFull) for   PR 2819 at commit [`8a50584`](https://github.com/apache/spark/commit/8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19432437
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -223,6 +279,29 @@ val data1 = data.map(x => (x.label, scaler1.transform(x.features)))
     val data2 = data.map(x => (x.label, scaler2.transform(Vectors.dense(x.features.toArray))))
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark.mllib.util import MLUtils
    +from pyspark.mllib.linalg import Vectors
    +from pyspark.mllib.feature import StandardScaler
    +
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +label = data.map(lambda x: x.label)
    +features = data.map(lambda x: x.features)
    +
    +scaler1 = StandardScaler().fit(features)
    +scaler2 = StandardScaler(withMean=True, withStd=True).fit(features)
    +
    +# data1 will be unit variance.
    +data1 = label.zip(scaler1.transform(features))
    +
    +# Without converting the features into dense vectors, transformation with zero mean will raise
    +# exception on sparse vector.
    +# data2 will be unit variance and zero mean.
    +data2 = label.zip(scaler1.transform(features.map(lambda x: Vectors.dense(x.toArray()))))
    --- End diff --
    
    fixed, thx!
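    
    The fixed line is not shown in this hunk; presumably it just swaps `scaler1` for `scaler2`, since `data2` is described as zero mean:
    
    ~~~
    # data2 will be unit variance and zero mean.
    data2 = label.zip(scaler2.transform(features.map(lambda x: Vectors.dense(x.toArray()))))
    ~~~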




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19436179
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -94,90 +360,46 @@ class Word2Vec(object):
         >>> sentence = "a b " * 100 + "a c " * 10
         >>> localDoc = [sentence, sentence]
         >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    -    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> model = Word2Vec(vectorSize=10).fit(doc)
    +
         >>> syms = model.findSynonyms("a", 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         >>> vec = model.transform("a")
    -    >>> len(vec)
    -    10
         >>> syms = model.findSynonyms(vec, 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         """
    -    def __init__(self):
    +    def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
    +                 numIterations=1, seed=42L):
    --- End diff --
    
    Do we want the default seed to be random (as in the Scala API)?
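    
    For reference, a hedged sketch of what a random-by-default seed could look like on the Python side (the parameter handling and the `random.randint` bound are assumptions, not something settled in this thread):
    
    ~~~
    import random

    class Word2Vec(object):
        """Sketch only: shows one way the seed default could be made random."""
        def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
                     numIterations=1, seed=None):
            # Random seed by default (as in the Scala API); pass an explicit value
            # such as 42L to keep doctests deterministic.
            self.seed = long(seed) if seed is not None else long(random.randint(0, 2 ** 31 - 1))
            self.vectorSize = vectorSize
            self.learningRate = learningRate
            self.numPartitions = numPartitions
            self.numIterations = numIterations
    ~~~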




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60681125
  
      [Test build #22308 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22308/consoleFull) for   PR 2819 at commit [`806c7c2`](https://github.com/apache/spark/commit/806c7c24c7fbb1f10e8bbddcc804f4899d8f0b11).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2819




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454191
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    +
    +float_str_mapping = {
    +    u'nan': u'NaN',
    +    u'inf': u'Infinity',
    +    u'-inf': u'-Infinity',
    +}
    +
    +
    +def new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return float_str_mapping.get(s, s)
    +    return old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = new_smart_decode
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
     
    -class Word2VecModel(object):
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p <= float('inf'), normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +
    +    >>> nor2 = Normalizer(float("inf"))
    +    >>> nor2.transform(v)
    +    DenseVector([0.0, 0.5, 1.0])
    +    """
    +    def __init__(self, p=2):
    +        """
    +        :param p: Normalization in L^p^ space, p = 2 by default.
    +        """
    +        assert p >= 1.0, "p should be greater than 1.0"
    +        self.p = float(p)
    +
    +    def transform(self, vector):
    +        """
    +        Applies unit length normalization on a vector.
    +
    +        :param vector: vector to be normalized.
    +        :return: normalized vector. If the norm of the input is zero, it
    +                will return the input vector.
    +        """
    +        sc = SparkContext._active_spark_context
    +        assert sc is not None, "SparkContext should be initialized first"
    +        return _callAPI(sc, "normalizeVector", self.p, vector)
    +
    +
    +class JavaModelWrapper(VectorTransformer):
    +    """
    +    Wrapper for the model in JVM
    +    """
    +    def __init__(self, sc, java_model):
             self._sc = sc
             self._java_model = java_model
     
         def __del__(self):
             self._sc._gateway.detach(self._java_model)
     
    -    def transform(self, word):
    +    def transform(self, dataset):
    +        return _callJavaFunc(self._sc, self._java_model.transform, dataset)
    +
    +
    +class StandardScalerModel(JavaModelWrapper):
    +    """
    +    :: Experimental ::
    +    Represents a StandardScaler model that can transform vectors.
    +    """
    +    def transform(self, vector):
             """
    -        :param word: a word
    -        :return: vector representation of word
    +        Applies standardization transformation on a vector.
    +
    +        :param vector: Vector to be standardized.
    +        :return: Standardized vector. If the variance of a column is zero,
    +                it will return default `0.0` for the column with zero variance.
    +        """
    +        return JavaModelWrapper.transform(self, vector)
    +
     
    +class StandardScaler(object):
    +    """
    +    :: Experimental ::
    +    Standardizes features by removing the mean and scaling to unit
    +    variance using column summary statistics on the samples in the
    +    training set.
    +
    +    >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
    +    >>> dataset = sc.parallelize(vs)
    +    >>> standardizer = StandardScaler(True, True)
    +    >>> model = standardizer.fit(dataset)
    +    >>> result = model.transform(dataset)
    +    >>> for r in result.collect(): r
    +    DenseVector([-0.7071, 0.7071, -0.7071])
    +    DenseVector([0.7071, -0.7071, 0.7071])
    +    """
    +    def __init__(self, withMean=False, withStd=True):
    +        """
    +        :param withMean: False by default. Centers the data with mean
    +                 before scaling. It will build a dense output, so this
    +                 does not work on sparse input and will raise an exception.
    +        :param withStd: True by default. Scales the data to unit standard
    +                 deviation.
    +        """
    +        if not (withMean or withStd):
    +            warnings.warn("Both withMean and withStd are false. The model does nothing.")
    +        self.withMean = withMean
    +        self.withStd = withStd
    +
    +    def fit(self, dataset):
    +        """
    +        Computes the mean and variance and stores as a model to be used for later scaling.
    +
    +        :param data: The data used to compute the mean and variance to build
    +                    the transformation model.
    +        :return: a StandardScalarModel
    +        """
    +        sc = dataset.context
    +        jmodel = _callAPI(sc, "fitStandardScaler", self.withMean, self.withStd, dataset)
    +        return StandardScalerModel(sc, jmodel)
    +
    +
    +class HashingTF(object):
    +    """
    +    :: Experimental ::
    +    Maps a sequence of terms to their term frequencies using the hashing trick.
    +
    +    >>> htf = HashingTF(100)
    +    >>> doc = "a a b b c d".split(" ")
    +    >>> htf.transform(doc)
    +    SparseVector(100, {1: 1.0, 14: 1.0, 31: 2.0, 44: 2.0})
    +    """
    +    def __init__(self, numFeatures=1 << 20):
    +        """
    +        :param numFeatures: number of features (default: 2^20)
    +        """
    +        self.numFeatures = numFeatures
    +
    +    def indexOf(self, term):
    +        """ Returns the index of the input term. """
    +        return hash(term) % self.numFeatures
    +
    +    def transform(self, document):
    +        """
    +        Transforms the input document (list of terms) to term frequency vectors,
    +        or transform the RDD of document to RDD of term frequency vectors.
    +        """
    +        if isinstance(document, RDD):
    +            return document.map(self.transform)
    +
    +        freq = {}
    +        for term in document:
    +            i = self.indexOf(term)
    +            freq[i] = freq.get(i, 0) + 1.0
    +        return Vectors.sparse(self.numFeatures, freq.items())
    +
    +
    +class IDFModel(JavaModelWrapper):
    +    """
    +    Represents an IDF model that can transform term frequency vectors.
    +    """
    +    def transform(self, dataset):
    +        """
    +        Transforms term frequency (TF) vectors to TF-IDF vectors.
    +
    +        If `minDocFreq` was set for the IDF calculation,
    +        the terms which occur in fewer than `minDocFreq`
    +        documents will have an entry of 0.
    +
    +        :param dataset: an RDD of term frequency vectors
    +        :return: an RDD of TF-IDF vectors
    +        """
    +        return JavaModelWrapper.transform(self, dataset)
    +
    +
    +class IDF(object):
    +    """
    +    :: Experimental ::
    +    Inverse document frequency (IDF).
    +
    +    The standard formulation is used: `idf = log((m + 1) / (d(t) + 1))`,
    +    where `m` is the total number of documents and `d(t)` is the number
    +    of documents that contain term `t`.
    +
    +    This implementation supports filtering out terms which do not appear
    +    in a minimum number of documents (controlled by the variable `minDocFreq`).
    +    For terms that are not in at least `minDocFreq` documents, the IDF is
    +    found as 0, resulting in TF-IDFs of 0.
    +
    +    >>> n = 4
    +    >>> freqs = [Vectors.sparse(n, (1, 3), (1.0, 2.0)),
    +    ...          Vectors.dense([0.0, 1.0, 2.0, 3.0]),
    +    ...          Vectors.sparse(n, [1], [1.0])]
    +    >>> data = sc.parallelize(freqs)
    +    >>> idf = IDF()
    +    >>> model = idf.fit(data)
    +    >>> tfidf = model.transform(data)
    +    >>> for r in tfidf.collect(): r
    +    SparseVector(4, {1: 0.0, 3: 0.5754})
    +    DenseVector([0.0, 0.0, 1.3863, 0.863])
    +    SparseVector(4, {1: 0.0})
    +    """
    +    def __init__(self, minDocFreq=0):
    +        """
    +        :param minDocFreq: minimum of documents in which a term
    +                           should appear for filtering
    +        """
    +        self.minDocFreq = minDocFreq
    +
    +    def fit(self, dataset):
    +        """
    +        Computes the inverse document frequency.
    +
    +        :param dataset: an RDD of term frequency vectors
    +        """
    +        sc = dataset.context
    +        jmodel = _callAPI(sc, "fitIDF", self.minDocFreq, dataset)
    +        return IDFModel(sc, jmodel)
    +
    +
    +class Word2VecModel(JavaModelWrapper):
    +    """
    +    class for Word2Vec model
    +    """
    +    def transform(self, word):
    +        """
             Transforms a word to its vector representation
     
    -        Note: local use only
    --- End diff --
    
    By local use, we mean it doesn't work in a closure:
    
    ~~~
    rdd.map(model.transform)
    ~~~
    
    I think we should keep the note.
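    
    A short illustration of the distinction (background only, not part of the diff): the model holds a Py4J handle to a JVM object, and such handles cannot be pickled into a task closure, so the call only works on the driver:
    
    ~~~
    # Works: driver-side call, the Py4J handle stays in the driver JVM.
    vec = model.transform("a")

    # Fails: the closure would need to ship model._java_model to the executors,
    # and Py4J JavaObject handles cannot be pickled.
    # rdd.map(model.transform)

    # Driver-side alternative for a small list of words (illustrative only):
    vecs = dict((w, model.transform(w)) for w in ["a", "b", "c"])
    ~~~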




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60670564
  
      [Test build #22308 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22308/consoleFull) for   PR 2819 at commit [`806c7c2`](https://github.com/apache/spark/commit/806c7c24c7fbb1f10e8bbddcc804f4899d8f0b11).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454166
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -267,4 +346,25 @@ val data1 = data.map(x => (x.label, normalizer1.transform(x.features)))
     val data2 = data.map(x => (x.label, normalizer2.transform(x.features)))
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark.mllib.util import MLUtils
    +from pyspark.mllib.linalg import Vectors
    +from pyspark.mllib.feature import Normalizer
    +
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +label = data.map(lambda x: x.label)
    --- End diff --
    
    `label` is not used
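    
    Presumably the Python example should either drop that line or pair the labels with the normalized features the way the Scala snippet above does, e.g. (assuming the example goes on to define `normalizer1` and `normalizer2`):
    
    ~~~
    data1 = label.zip(normalizer1.transform(features))
    data2 = label.zip(normalizer2.transform(features))
    ~~~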




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59302540
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21785/
    Test FAILed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19455072
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,348 @@
     """
     Python package for feature in MLlib.
     """
    +import sys
    +import warnings
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# Hack for support float('inf') in Py4j
    +old_smart_decode = py4j.protocol.smart_decode
    +
    +float_str_mapping = {
    +    u'nan': u'NaN',
    +    u'inf': u'Infinity',
    +    u'-inf': u'-Infinity',
    +}
    +
    +
    +def new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return float_str_mapping.get(s, s)
    +    return old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = new_smart_decode
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
     
    -class Word2VecModel(object):
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p <= float('inf'), normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +
    +    >>> nor2 = Normalizer(float("inf"))
    +    >>> nor2.transform(v)
    +    DenseVector([0.0, 0.5, 1.0])
    +    """
    +    def __init__(self, p=2):
    +        """
    +        :param p: Normalization in L^p^ space, p = 2 by default.
    +        """
    +        assert p >= 1.0, "p should be greater than 1.0"
    +        self.p = float(p)
    +
    +    def transform(self, vector):
    +        """
    +        Applies unit length normalization on a vector.
    +
    +        :param vector: vector to be normalized.
    +        :return: normalized vector. If the norm of the input is zero, it
    +                will return the input vector.
    +        """
    +        sc = SparkContext._active_spark_context
    +        assert sc is not None, "SparkContext should be initialized first"
    +        return _callAPI(sc, "normalizeVector", self.p, vector)
    +
    +
    +class JavaModelWrapper(VectorTransformer):
    +    """
    +    Wrapper for the model in JVM
    +    """
    +    def __init__(self, sc, java_model):
             self._sc = sc
             self._java_model = java_model
     
         def __del__(self):
             self._sc._gateway.detach(self._java_model)
     
    -    def transform(self, word):
    +    def transform(self, dataset):
    +        return _callJavaFunc(self._sc, self._java_model.transform, dataset)
    +
    +
    +class StandardScalerModel(JavaModelWrapper):
    +    """
    +    :: Experimental ::
    +    Represents a StandardScaler model that can transform vectors.
    +    """
    +    def transform(self, vector):
             """
    -        :param word: a word
    -        :return: vector representation of word
    +        Applies standardization transformation on a vector.
    +
    +        :param vector: Vector to be standardized.
    +        :return: Standardized vector. If the variance of a column is zero,
    +                it will return default `0.0` for the column with zero variance.
    +        """
    +        return JavaModelWrapper.transform(self, vector)
    +
     
    +class StandardScaler(object):
    +    """
    +    :: Experimental ::
    +    Standardizes features by removing the mean and scaling to unit
    +    variance using column summary statistics on the samples in the
    +    training set.
    +
    +    >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
    +    >>> dataset = sc.parallelize(vs)
    +    >>> standardizer = StandardScaler(True, True)
    +    >>> model = standardizer.fit(dataset)
    +    >>> result = model.transform(dataset)
    +    >>> for r in result.collect(): r
    +    DenseVector([-0.7071, 0.7071, -0.7071])
    +    DenseVector([0.7071, -0.7071, 0.7071])
    +    """
    +    def __init__(self, withMean=False, withStd=True):
    +        """
    +        :param withMean: False by default. Centers the data with mean
    +                 before scaling. It will build a dense output, so this
    +                 does not work on sparse input and will raise an exception.
    +        :param withStd: True by default. Scales the data to unit standard
    +                 deviation.
    +        """
    +        if not (withMean or withStd):
    +            warnings.warn("Both withMean and withStd are false. The model does nothing.")
    +        self.withMean = withMean
    +        self.withStd = withStd
    +
    +    def fit(self, dataset):
    +        """
    +        Computes the mean and variance and stores as a model to be used for later scaling.
    +
    +        :param data: The data used to compute the mean and variance to build
    +                    the transformation model.
    +        :return: a StandardScalarModel
    +        """
    +        sc = dataset.context
    +        jmodel = _callAPI(sc, "fitStandardScaler", self.withMean, self.withStd, dataset)
    +        return StandardScalerModel(sc, jmodel)
    +
    +
    +class HashingTF(object):
    +    """
    +    :: Experimental ::
    +    Maps a sequence of terms to their term frequencies using the hashing trick.
    +
    +    >>> htf = HashingTF(100)
    +    >>> doc = "a a b b c d".split(" ")
    +    >>> htf.transform(doc)
    +    SparseVector(100, {1: 1.0, 14: 1.0, 31: 2.0, 44: 2.0})
    +    """
    +    def __init__(self, numFeatures=1 << 20):
    +        """
    +        :param numFeatures: number of features (default: 2^20)
    +        """
    +        self.numFeatures = numFeatures
    +
    +    def indexOf(self, term):
    +        """ Returns the index of the input term. """
    +        return hash(term) % self.numFeatures
    --- End diff --
    
    A term can be any type, so it's very hard to produce the same hash values as Scala.
    
    PS: In Python, mutable objects (such as dict, set, list) are not hashable; should we support those types for term?
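    A minimal sketch of the portability concern, outside this PR's code: Python's built-in hash() is interpreter-specific, so the indices it yields will not line up with what Scala's HashingTF computes for the same terms. The helper below (a hypothetical name, not the PR's implementation) only illustrates one way to get a deterministic index from a digest instead.

        import hashlib

        def stable_index_of(term, num_features=1 << 20):
            # Deterministic across runs and platforms, but still NOT the same
            # hashing scheme Scala uses; purely an illustration of a stable index.
            digest = hashlib.md5(str(term).encode("utf-8")).hexdigest()
            return int(digest, 16) % num_features

        print(stable_index_of("spark"))  # same value on every run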




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454172
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -267,4 +346,25 @@ val data1 = data.map(x => (x.label, normalizer1.transform(x.features)))
     val data2 = data.map(x => (x.label, normalizer2.transform(x.features)))
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark.mllib.util import MLUtils
    +from pyspark.mllib.linalg import Vectors
    +from pyspark.mllib.feature import Normalizer
    +
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +label = data.map(lambda x: x.label)
    --- End diff --
    
    `label` -> `labels`




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59300866
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21784/
    Test FAILed.




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r18932968
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -95,90 +360,46 @@ class Word2Vec(object):
         >>> sentence = "a b " * 100 + "a c " * 10
         >>> localDoc = [sentence, sentence]
         >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    -    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> model = Word2Vec(vectorSize=10).fit(doc)
    +
         >>> syms = model.findSynonyms("a", 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         >>> vec = model.transform("a")
    -    >>> len(vec)
    -    10
         >>> syms = model.findSynonyms(vec, 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         """
    -    def __init__(self):
    +    def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
    +                 numIterations=1, seed=42L):
             """
             Construct Word2Vec instance
    -        """
    -        self.vectorSize = 100
    -        self.learningRate = 0.025
    -        self.numPartitions = 1
    -        self.numIterations = 1
    -        self.seed = 42L
     
    -    def setVectorSize(self, vectorSize):
    -        """
    -        Sets vector size (default: 100).
    +        :param vectorSize: vector size (default: 100).
    +        :param learningRate:  initial learning rate (default: 0.025).
    +        :param numPartitions: number of partitions (default: 1). Use
    +                              a small number for accuracy.
    +        :param numIterations: number of iterations (default: 1), which should
    +                              be smaller than or equal to number of partitions.
             """
    --- End diff --
    
    It's good to have the same interface across languages, but sometimes it looks weird to have an API that was designed for Java.
    
    I'd like to simplify the Python API a little bit (without introducing confusion), so that Python programmers feel more at home (in this case). We can find several similar cases in the pyspark.RDD API.
    
    Does it make sense?




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60717352
  
      [Test build #22346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22346/consoleFull) for   PR 2819 at commit [`4f48f48`](https://github.com/apache/spark/commit/4f48f48d0c013e50f1a96f1e6bb0af4d88bf366c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454193
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -95,33 +385,26 @@ class Word2Vec(object):
         >>> localDoc = [sentence, sentence]
         >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
         >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +
         >>> syms = model.findSynonyms("a", 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         >>> vec = model.transform("a")
    -    >>> len(vec)
    -    10
         >>> syms = model.findSynonyms(vec, 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         """
         def __init__(self):
             """
             Construct Word2Vec instance
             """
    +        import random  # this can't be on the top because of mllib.random
    +
             self.vectorSize = 100
             self.learningRate = 0.025
             self.numPartitions = 1
             self.numIterations = 1
    -        self.seed = 42L
    +        self.seed = random.randint(0, sys.maxint)
    --- End diff --
    
    `sys.maxint` -> `2 ** 32 - 1` (see #2889)
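    For context, a one-line sketch of the suggested change: sys.maxint depends on the platform (2**31 - 1 on 32-bit builds, 2**63 - 1 on 64-bit ones), so drawing the seed from a fixed range keeps behaviour consistent.

        import random

        # instead of: seed = random.randint(0, sys.maxint)
        seed = random.randint(0, 2 ** 32 - 1)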




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59474321
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21850/consoleFull) for   PR 2819 at commit [`59781b9`](https://github.com/apache/spark/commit/59781b93f305b0321c0badb66983101b1ffee39d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorTransformer(object):`
      * `class Normalizer(VectorTransformer):`
      * `class JavaModelWrapper(VectorTransformer):`
      * `class StandardScalerModel(JavaModelWrapper):`
      * `class StandardScaler(object):`
      * `class HashingTF(object):`
      * `class IDFModel(JavaModelWrapper):`
      * `class IDF(object):`
      * `class Word2VecModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19436387
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -18,59 +18,324 @@
     """
     Python package for feature in MLlib.
     """
    +import warnings
    +
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +
    +from pyspark import RDD, SparkContext
     from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    -from pyspark.mllib.linalg import _convert_to_vector, _to_java_object_rdd
    +from pyspark.mllib.linalg import Vectors, _to_java_object_rdd
    +
    +__all__ = ['Normalizer', 'StandardScalerModel', 'StandardScaler',
    +           'HashTF', 'IDFModel', 'IDF',
    +           'Word2Vec', 'Word2VecModel']
    +
    +
    +# TODO: move these helper functions into utils
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +def _py2java(sc, a):
    +    """ Convert Python object into Java """
    +    if isinstance(a, RDD):
    +        a = _to_java_object_rdd(a)
    +    elif not isinstance(a, (int, long, float, bool, basestring)):
    +        bytes = bytearray(PickleSerializer().dumps(a))
    +        a = sc._jvm.SerDe.loads(bytes)
    +    return a
    +
     
    -__all__ = ['Word2Vec', 'Word2VecModel']
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        if clsName in ("RDD", "JavaRDD"):
    +            if clsName == "RDD":
    +                r = r.toJavaRDD()
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
     
    +        elif clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
     
    -class Word2VecModel(object):
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
    +
    +
    +def _callJavaFunc(sc, func, *args):
    +    """ Call Java Function
         """
    -    class for Word2Vec model
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def _callAPI(sc, name, *args):
    +    """ Call API in PythonMLLibAPI
         """
    -    def __init__(self, sc, java_model):
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return _callJavaFunc(sc, api, *args)
    +
    +
    +class VectorTransformer(object):
    +    """
    +    :: DeveloperApi ::
    +    Base class for transformation of a vector or RDD of vector
    +    """
    +    def transform(self, vector):
             """
    -        :param sc:  Spark context
    -        :param java_model:  Handle to Java model object
    +        Applies transformation on a vector.
    +
    +        :param vector: vector to be transformed.
             """
    +        raise NotImplementedError
    +
    +
    +class Normalizer(VectorTransformer):
    +    """
    +    :: Experimental ::
    +    Normalizes samples individually to unit L^p^ norm
    +
    +    For any 1 <= p < Double.PositiveInfinity, normalizes samples using
    +    sum(abs(vector).^p^)^(1/p)^ as norm.
    +
    +    For p = Double.PositiveInfinity, max(abs(vector)) will be used as
    +    norm for normalization.
    +
    +    >>> v = Vectors.dense(range(3))
    +    >>> nor = Normalizer(1)
    +    >>> nor.transform(v)
    +    DenseVector([0.0, 0.3333, 0.6667])
    +
    +    >>> rdd = sc.parallelize([v])
    +    >>> nor.transform(rdd).collect()
    +    [DenseVector([0.0, 0.3333, 0.6667])]
    +    """
    +    def __init__(self, p=2):
    +        """
    +        :param p: Normalization in L^p^ space, p = 2 by default.
    +        """
    +        assert p >= 1.0, "p should be greater than 1.0"
    +        self.p = float(p)
    +
    +    def transform(self, vector):
    +        """
    +        Applies unit length normalization on a vector.
    +
    +        :param vector: vector to be normalized.
    +        :return: normalized vector. If the norm of the input is zero, it
    +                will return the input vector.
    +        """
    +        sc = SparkContext._active_spark_context
    +        assert sc is not None, "SparkContext should be initialized first"
    +        return _callAPI(sc, "normalizeVector", self.p, vector)
    +
    +
    +class JavaModelWrapper(VectorTransformer):
    --- End diff --
    
    OK




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19425692
  
    --- Diff: python/pyspark/mllib/linalg.py ---
    @@ -111,6 +111,13 @@ def _vector_size(v):
             raise TypeError("Cannot treat type %s as a vector" % type(v))
     
     
    +def _format_float(f, digits=4):
    +    s = str(round(f, 4))
    --- End diff --
    
    Should the 2nd argument of round() be digits, not 4?
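    For clarity, a minimal sketch of the fix being suggested (the rest of the helper is elided here): pass `digits` through to round() instead of the hard-coded 4.

        def _format_float(f, digits=4):
            return str(round(f, digits))

        print(_format_float(0.6666666, 4))  # '0.6667'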




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59469759
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21850/consoleFull) for   PR 2819 at commit [`59781b9`](https://github.com/apache/spark/commit/59781b93f305b0321c0badb66983101b1ffee39d).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60668168
  
    @davies  It looks good to me, except for the small comments above.  Thanks for the PR!




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19436372
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -95,8 +95,50 @@ tf.cache()
     val idf = new IDF(minDocFreq = 2).fit(tf)
     val tfidf: RDD[Vector] = idf.transform(tf)
     {% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +
    +TF and IDF are implemented in [HashingTF](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF)
    +and [IDF](api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF).
    +`HashingTF` takes an RDD of list as the input.
    +Each record could be an iterable of strings or other types.
    +
    +{% highlight python %}
    +from pyspark import SparkContext
    +from pyspark.mllib.linalg import Vector
    +from pyspark.mllib.feature import HashingTF
    +
    +sc = SparkContext()
    +
    +# Load documents (one per line).
    +documents = sc.textFile("...").map(lambda line: line.split(" "))
    +
    +hashingTF = HashingTF()
    +tf = hashingTF.transform(documents)
    +{% endhighlight %}
    +
    +While applying `HashingTF` only needs a single pass to the data, applying `IDF` needs two passes: 
    +first to compute the IDF vector and second to scale the term frequencies by IDF.
    +
    +{% highlight python %}
    +from pyspark.mllib.feature import IDF
    +
    +# ... continue from the previous example
    +tf.cache()
    +idf = IDF().fit(tf)
    +tfidf = idf.transform(tf)
    +{% endhighlight %}
     
    +MLLib's IDF implementation provides an option for ignoring terms which occur in less than a
    +minimum number of documents.  In such cases, the IDF for these terms is set to 0.  This feature
    +can be used by passing the `minDocFreq` value to the IDF constructor.
     
    +{% highlight python %}
    +# ... continue from the previous example
    +tf.cache()
    +idf = IDF().fit(tf)
    --- End diff --
    
    This is supposed to be an example of explicitly setting minDocFreq.  Please compare with the Scala example.
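    For reference, a sketch of what the corrected snippet might look like, continuing from the `tf` built earlier in the example and assuming the new IDF constructor accepts a minDocFreq keyword, as the Scala API does:

        from pyspark.mllib.feature import IDF

        # ... continue from the previous example
        tf.cache()
        idf = IDF(minDocFreq=2).fit(tf)
        tfidf = idf.transform(tf)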




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60693263
  
    So regarding the interface, as I mentioned to Joseph, I would like the interfaces to be the same so that people can easily copy code between the languages. Many people will see a Spark example in one language on a slide and then try to do the same thing in their own program, so we want that to be super simple. So don't remove the getters and setters. In this particular case, it may be okay to support keyword args *in addition* to the getters / setters, since it will be obvious that there's another way to do that. But we should only do this if we're absolutely certain that these methods will have no required args in the future, because otherwise default and named arguments can mess things up.
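    For illustration, a minimal sketch of that compromise (hypothetical class name, not the final API): keep chained setters so Scala examples translate line for line, and optionally route keyword arguments through those same setters.

        class Word2VecSketch(object):
            def __init__(self, **kwargs):
                self.vectorSize = 100
                self.seed = 42
                for name, value in kwargs.items():
                    # vectorSize=10 becomes a call to setVectorSize(10)
                    getattr(self, "set" + name[0].upper() + name[1:])(value)

            def setVectorSize(self, vectorSize):
                self.vectorSize = vectorSize
                return self

            def setSeed(self, seed):
                self.seed = seed
                return self

        # Both spellings configure the same object:
        m1 = Word2VecSketch().setVectorSize(10).setSeed(42)
        m2 = Word2VecSketch(vectorSize=10, seed=42)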




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19432289
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -95,8 +95,50 @@ tf.cache()
     val idf = new IDF(minDocFreq = 2).fit(tf)
     val tfidf: RDD[Vector] = idf.transform(tf)
     {% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +
    +TF and IDF are implemented in [HashingTF](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF)
    +and [IDF](api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF).
    +`HashingTF` takes an RDD of list as the input.
    +Each record could be an iterable of strings or other types.
    +
    +{% highlight python %}
    +from pyspark import SparkContext
    +from pyspark.mllib.linalg import Vector
    +from pyspark.mllib.feature import HashingTF
    +
    +sc = SparkContext()
    +
    +# Load documents (one per line).
    +documents = sc.textFile("...").map(lambda line: line.split(" "))
    +
    +hashingTF = HashingTF()
    +tf = hashingTF.transform(documents)
    +{% endhighlight %}
    +
    +While applying `HashingTF` only needs a single pass to the data, applying `IDF` needs two passes: 
    +first to compute the IDF vector and second to scale the term frequencies by IDF.
    +
    +{% highlight python %}
    +from pyspark.mllib.feature import IDF
    +
    +# ... continue from the previous example
    +tf.cache()
    +idf = IDF().fit(tf)
    +tfidf = idf.transform(tf)
    +{% endhighlight %}
     
    +MLLib's IDF implementation provides an option for ignoring terms which occur in less than a
    +minimum number of documents.  In such cases, the IDF for these terms is set to 0.  This feature
    +can be used by passing the `minDocFreq` value to the IDF constructor.
     
    +{% highlight python %}
    +# ... continue from the previous example
    +tf.cache()
    +idf = IDF().fit(tf)
    --- End diff --
    
    The default minDocFreq is zero.




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by Ishiihara <gi...@git.apache.org>.
Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r18932327
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -95,90 +360,46 @@ class Word2Vec(object):
         >>> sentence = "a b " * 100 + "a c " * 10
         >>> localDoc = [sentence, sentence]
         >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    -    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> model = Word2Vec(vectorSize=10).fit(doc)
    +
         >>> syms = model.findSynonyms("a", 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         >>> vec = model.transform("a")
    -    >>> len(vec)
    -    10
         >>> syms = model.findSynonyms(vec, 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         """
    -    def __init__(self):
    +    def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
    +                 numIterations=1, seed=42L):
             """
             Construct Word2Vec instance
    -        """
    -        self.vectorSize = 100
    -        self.learningRate = 0.025
    -        self.numPartitions = 1
    -        self.numIterations = 1
    -        self.seed = 42L
     
    -    def setVectorSize(self, vectorSize):
    -        """
    -        Sets vector size (default: 100).
    +        :param vectorSize: vector size (default: 100).
    +        :param learningRate:  initial learning rate (default: 0.025).
    +        :param numPartitions: number of partitions (default: 1). Use
    +                              a small number for accuracy.
    +        :param numIterations: number of iterations (default: 1), which should
    +                              be smaller than or equal to number of partitions.
             """
    --- End diff --
    
    In the Scala/Java Word2Vec implementation, we use setters to set parameters; should we keep the same interface on the Python side?




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59310073
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21789/consoleFull) for   PR 2819 at commit [`7a1891a`](https://github.com/apache/spark/commit/7a1891abe6647a5f9dc82c21add907fe2d4b9aa8).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60677547
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22304/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19455165
  
    --- Diff: python/pyspark/mllib/feature.py ---
    @@ -95,33 +385,26 @@ class Word2Vec(object):
         >>> localDoc = [sentence, sentence]
         >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
         >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +
         >>> syms = model.findSynonyms("a", 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         >>> vec = model.transform("a")
    -    >>> len(vec)
    -    10
         >>> syms = model.findSynonyms(vec, 2)
    -    >>> str(syms[0][0])
    -    'b'
    -    >>> str(syms[1][0])
    -    'c'
    -    >>> len(syms)
    -    2
    +    >>> [s[0] for s in syms]
    +    [u'b', u'c']
         """
         def __init__(self):
             """
             Construct Word2Vec instance
             """
    +        import random  # this can't be on the top because of mllib.random
    +
             self.vectorSize = 100
             self.learningRate = 0.025
             self.numPartitions = 1
             self.numIterations = 1
    -        self.seed = 42L
    +        self.seed = random.randint(0, sys.maxint)
    --- End diff --
    
    ok




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19425317
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -328,6 +364,16 @@ class PythonMLLibAPI extends Serializable {
           model.transform(word)
         }
     
    +    /**
    +     * TODO: model is not serializable
    +     * Transforms an RDD of words to its vector representation
    +     * @param rdd an RDD of words
    +     * @return an RDD of vector representations of words
    +     */
    +    def transform(rdd: JavaRDD[String]): JavaRDD[Vector] = {
    +      rdd.rdd.map(model.transform(_))
    --- End diff --
    
    could remove "(_)"




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60696435
  
    @mateiz @jkbradley @Ishiihara I have reverted the API changes in Word2Vec and also removed the keyword arguments.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-60681141
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22308/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2819#issuecomment-59298672
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21785/consoleFull) for   PR 2819 at commit [`486795f`](https://github.com/apache/spark/commit/486795f1d8792c15c9f97b22b1015b23fb7c8d81).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19429959
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -267,4 +346,25 @@ val data1 = data.map(x => (x.label, normalizer1.transform(x.features)))
     val data2 = data.map(x => (x.label, normalizer2.transform(x.features)))
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark.mllib.util import MLUtils
    +from pyspark.mllib.linalg import Vectors
    +from pyspark.mllib.feature import Normalizer
    +
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +label = data.map(lambda x: x.label)
    +features = data.map(lambda x: x.features)
    +
    +normalizer1 = Normalizer()
    +normalizer2 = Normalizer(p=float("inf"))
    +
    +# Each sample in data1 will be normalized using $L^2$ norm.
    +data1 = label.zip(normalizer1.transform(features))
    +
    +# Each sample in data2 will be normalized using $L^\infty$ norm.
    +data2 = label.zip(normalizer2.transform(features))
    --- End diff --
    
    This line fails because p = infinity is not handled correctly somewhere.
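    A small sketch of the underlying issue, matching the smart_decode patch quoted earlier in the thread (helper name is hypothetical): Py4J hands Python floats to the JVM as strings, and Java's Double parsing understands "Infinity"/"NaN" but not Python's "inf"/"nan" spellings, so the value has to be re-spelled before it crosses the bridge.

        float_str_mapping = {u'nan': u'NaN', u'inf': u'Infinity', u'-inf': u'-Infinity'}

        def to_java_float_literal(x):
            # unicode(float('inf')) is u'inf' in Python 2; Java expects 'Infinity'
            s = unicode(x)
            return float_str_mapping.get(s, s)

        print(to_java_float_literal(float("inf")))  # prints Infinity
        print(to_java_float_literal(2.0))           # prints 2.0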




[GitHub] spark pull request: [SPARK-3961] [MLlib] [PySpark] Python API for ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2819#discussion_r19454162
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -95,8 +95,50 @@ tf.cache()
     val idf = new IDF(minDocFreq = 2).fit(tf)
     val tfidf: RDD[Vector] = idf.transform(tf)
     {% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +
    +TF and IDF are implemented in [HashingTF](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF)
    +and [IDF](api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF).
    +`HashingTF` takes an RDD of list as the input.
    +Each record could be an iterable of strings or other types.
    +
    +{% highlight python %}
    +from pyspark import SparkContext
    +from pyspark.mllib.linalg import Vector
    --- End diff --
    
    `Vector` is not used

