
Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2018/05/08 05:31:23 UTC

[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/21265

    [SPARK-24146][PySpark][ML] spark.ml parity for sequential pattern mining - PrefixSpan: Python API

    ## What changes were proposed in this pull request?
    
    spark.ml parity for sequential pattern mining - PrefixSpan: Python API
    
    ## How was this patch tested?
    
    doctests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark prefix_span_py

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21265.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21265
    
----
commit 83eeea1c539d59a4d8496437dcf06d82b43b0ca2
Author: WeichenXu <we...@...>
Date:   2018-05-08T05:29:24Z

    init pr

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test FAILed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r192000596
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(JavaParams):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
    +    method to run the PrefixSpan algorithm.
    +
    +    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    +    (Wikipedia)</a>
    +    .. versionadded:: 2.4.0
    +
    +    """
    +
    +    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
    +                       "sequential pattern. Sequential pattern that appears more than " +
    +                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
    +                       typeConverter=TypeConverters.toFloat)
    +
    +    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
    +                             "The maximal length of the sequential pattern. Must be > 0.",
    +                             typeConverter=TypeConverters.toInt)
    +
    +    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
    +                               "The maximum number of items (including delimiters used in the " +
    +                               "internal storage format) allowed in a projected database before " +
    +                               "local processing. If a projected database exceeds this size, " +
    +                               "another iteration of distributed prefix growth is run. " +
    +                               "Must be > 0.",
    +                               typeConverter=TypeConverters.toInt)
    --- End diff --
    
    Just tested: the range of the Python 'int' type is the same as that of the Java 'long' type.
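
As a rough illustration of the range question above: Java's `long` is a signed 64-bit integer, while Python 3's `int` has arbitrary precision (the two ranges coincide only for Python 2's `int` on typical 64-bit Unix builds). A minimal sketch of an explicit bounds check follows. `fits_java_long` and the constant names are hypothetical helpers introduced here for illustration; they are not part of PySpark.

```python
# Bounds of a signed 64-bit integer, i.e. a Java `long`.
JAVA_LONG_MIN = -(2 ** 63)
JAVA_LONG_MAX = 2 ** 63 - 1


def fits_java_long(value):
    """Return True if the Python integer `value` is representable as a Java long.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    return JAVA_LONG_MIN <= value <= JAVA_LONG_MAX


# The default maxLocalProjDBSize from the diff fits comfortably.
print(fits_java_long(32000000))
# A value beyond the 32-bit int range still fits in a long.
print(fits_java_long(2 ** 31))
# One past the largest long does not fit.
print(fits_java_long(2 ** 63))
```

A check like this only matters for values near the 64-bit limit; the default of `32000000` is far below it.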


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91330/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    LGTM. Merged into master. Thanks!


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r192000950
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -53,7 +53,7 @@ final class PrefixSpan(@Since("2.4.0") override val uid: String) extends Params
       @Since("2.4.0")
       val minSupport = new DoubleParam(this, "minSupport", "The minimal support level of the " +
         "sequential pattern. Sequential pattern that appears more than " +
    -    "(minSupport * size-of-the-dataset)." +
    +    "(minSupport * size-of-the-dataset)" +
    --- End diff --
    
    Need a space at the end before "times".
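
To make the point concrete, here is a small sketch of how adjacent string fragments join when the trailing space is missing. It mirrors the Scala concatenation in Python; the continuation text "times will be output." is taken from the Python version of the same doc string quoted elsewhere in this thread, and the variable names are made up for illustration.

```python
# Fragments joined without a trailing space after the closing parenthesis.
doc_without_space = ("Sequential pattern that appears more than " +
                     "(minSupport * size-of-the-dataset)" +
                     "times will be output.")

# Same fragments with the trailing space restored.
doc_with_space = ("Sequential pattern that appears more than " +
                  "(minSupport * size-of-the-dataset) " +
                  "times will be output.")

print(doc_without_space)
print(doc_with_space)
```

The first string runs "size-of-the-dataset)" straight into "times", which is the defect the comment points out.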


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91328/
    Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #90361 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90361/testReport)** for PR 21265 at commit [`72abab0`](https://github.com/apache/spark/commit/72abab055ef5576b0c55e1a4c1dafcb3ac36f46f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Jenkins, test this please.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91328/testReport)** for PR 21265 at commit [`1248897`](https://github.com/apache/spark/commit/12488976debc51480ccf59bb9695575c739684ab).


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3027/
    Test PASSed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r191995667
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(object):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +
    +    .. versionadded:: 2.4.0
    +
    +    """
    +    @staticmethod
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(dataset,
    +                                       sequenceCol,
    +                                       minSupport,
    +                                       maxPatternLength,
    +                                       maxLocalProjDBSize):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    +                        `Seq[Seq[_]]` type.
    +        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
    +                            column are ignored.
    +        :param minSupport: The minimal support level of the sequential pattern, any pattern that
    +                           appears more than (minSupport * size-of-the-dataset) times will be
    +                           output (recommended value: `0.1`).
    +        :param maxPatternLength: The maximal length of the sequential pattern
    +                                 (recommended value: `10`).
    +        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
    +                                   internal storage format) allowed in a projected database before
    +                                   local processing. If a projected database exceeds this size,
    +                                   another iteration of distributed prefix growth is run
    +                                   (recommended value: `32000000`).
    +        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
    +                 The schema of it will be:
    +                  - `sequence: Seq[Seq[T]]` (T is the item type)
    +                  - `freq: Long`
    +
    +        >>> from pyspark.ml.fpm import PrefixSpan
    +        >>> from pyspark.sql import Row
    +        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
    --- End diff --
    
    I think it would be better to put this in an example. @mengxr What do you think?


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3724/
    Test PASSed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r192001923
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(JavaParams):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
    +    method to run the PrefixSpan algorithm.
    +
    +    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    +    (Wikipedia)</a>
    +    .. versionadded:: 2.4.0
    +
    +    """
    +
    +    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
    +                       "sequential pattern. Sequential pattern that appears more than " +
    +                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
    +                       typeConverter=TypeConverters.toFloat)
    +
    +    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
    +                             "The maximal length of the sequential pattern. Must be > 0.",
    +                             typeConverter=TypeConverters.toInt)
    +
    +    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
    +                               "The maximum number of items (including delimiters used in the " +
    +                               "internal storage format) allowed in a projected database before " +
    +                               "local processing. If a projected database exceeds this size, " +
    +                               "another iteration of distributed prefix growth is run. " +
    +                               "Must be > 0.",
    +                               typeConverter=TypeConverters.toInt)
    +
    +    sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
    +                        "dataset, rows with nulls in this column are ignored.",
    +                        typeConverter=TypeConverters.toString)
    +
    +    @keyword_only
    +    def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                 sequenceCol="sequence"):
    +        """
    +        __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                 sequenceCol="sequence")
    +        """
    +        super(PrefixSpan, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
    +        self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                         sequenceCol="sequence")
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.4.0")
    +    def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                  sequenceCol="sequence"):
    +        """
    +        setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                  sequenceCol="sequence")
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(self, dataset):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    +                        `Seq[Seq[_]]` type.
    +        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
    +                 The schema of it will be:
    +                  - `sequence: Seq[Seq[T]]` (T is the item type)
    --- End diff --
    
    ditto


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91325/
    Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Let's wait on this until we make the decision in the last thread in https://github.com/apache/spark/pull/20973


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r192002416
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(JavaParams):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
    +    method to run the PrefixSpan algorithm.
    +
    +    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    +    (Wikipedia)</a>
    +    .. versionadded:: 2.4.0
    +
    +    """
    +
    +    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
    +                       "sequential pattern. Sequential pattern that appears more than " +
    +                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
    +                       typeConverter=TypeConverters.toFloat)
    +
    +    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
    +                             "The maximal length of the sequential pattern. Must be > 0.",
    +                             typeConverter=TypeConverters.toInt)
    +
    +    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
    +                               "The maximum number of items (including delimiters used in the " +
    +                               "internal storage format) allowed in a projected database before " +
    +                               "local processing. If a projected database exceeds this size, " +
    +                               "another iteration of distributed prefix growth is run. " +
    +                               "Must be > 0.",
    +                               typeConverter=TypeConverters.toInt)
    +
    +    sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
    +                        "dataset, rows with nulls in this column are ignored.",
    +                        typeConverter=TypeConverters.toString)
    +
    +    @keyword_only
    +    def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                 sequenceCol="sequence"):
    +        """
    +        __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                 sequenceCol="sequence")
    +        """
    +        super(PrefixSpan, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
    +        self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                         sequenceCol="sequence")
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.4.0")
    +    def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                  sequenceCol="sequence"):
    +        """
    +        setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                  sequenceCol="sequence")
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(self, dataset):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    +                        `Seq[Seq[_]]` type.
    +        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
    +                 The schema of it will be:
    +                  - `sequence: Seq[Seq[T]]` (T is the item type)
    +                  - `freq: Long`
    +
    +        >>> from pyspark.ml.fpm import PrefixSpan
    +        >>> from pyspark.sql import Row
    +        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
    +        ...                      Row(sequence=[[1], [3, 2], [1, 2]]),
    +        ...                      Row(sequence=[[1, 2], [5]]),
    +        ...                      Row(sequence=[[6]])]).toDF()
    +        >>> prefixSpan = PrefixSpan(minSupport=0.5, maxPatternLength=5,
    +        ...                         maxLocalProjDBSize=32000000)
    --- End diff --
    
    Remove this param from the example.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91325/testReport)** for PR 21265 at commit [`0be3a94`](https://github.com/apache/spark/commit/0be3a94d27f4203608ef82d2ef197b37606c53b3).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class PrefixSpan(JavaParams):`


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91330/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3033/
    Test PASSed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r187187330
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(object):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +
    +    .. versionadded:: 2.4.0
    +
    +    """
    +    @staticmethod
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(dataset,
    +                                       sequenceCol,
    +                                       minSupport,
    +                                       maxPatternLength,
    +                                       maxLocalProjDBSize):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    +                        `Seq[Seq[_]]` type.
    +        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
    +                            column are ignored.
    +        :param minSupport: The minimal support level of the sequential pattern, any pattern that
    +                           appears more than (minSupport * size-of-the-dataset) times will be
    +                           output (recommended value: `0.1`).
    +        :param maxPatternLength: The maximal length of the sequential pattern
    +                                 (recommended value: `10`).
    +        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
    +                                   internal storage format) allowed in a projected database before
    +                                   local processing. If a projected database exceeds this size,
    +                                   another iteration of distributed prefix growth is run
    +                                   (recommended value: `32000000`).
    +        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
    +                 The schema of it will be:
    +                  - `sequence: Seq[Seq[T]]` (T is the item type)
    +                  - `freq: Long`
    +
    +        >>> from pyspark.ml.fpm import PrefixSpan
    +        >>> from pyspark.sql import Row
    +        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
    --- End diff --
    
    My 2 cents: that sounds like a judgment call. If the goal is to explain the behavior more clearly, then it sounds reasonable. I feel the doc already makes it pretty clear how nulls are treated. `maxPatternLength` might benefit from an example.
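
As a rough illustration of the `maxPatternLength` point above, here is a minimal plain-Python sketch (not PySpark code). It assumes that the length of a pattern is the total number of items across its itemsets; that reading should be checked against the Scala implementation. The helper `pattern_length` and the candidate patterns are hypothetical.

```python
# Plain-Python sketch, not the PySpark API.
# Assumption: pattern length = total number of items across all itemsets.

def pattern_length(pattern):
    """Count the items in a pattern given as a list of itemsets."""
    return sum(len(itemset) for itemset in pattern)

# Hypothetical candidate patterns, each a list of itemsets.
candidates = [[[1]], [[1, 2]], [[1], [3]], [[1, 2], [3]]]
max_pattern_length = 2

# Keep only patterns whose length does not exceed the cap.
kept = [p for p in candidates if pattern_length(p) <= max_pattern_length]
print(kept)
```

Under that assumption, `[[1, 2], [3]]` has length 3 and is dropped when the cap is 2, while the three shorter patterns are kept.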


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91332 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91332/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #90356 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)** for PR 21265 at commit [`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class PrefixSpan(object):`


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21265


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3717/
    Test PASSed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r192001886
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(JavaParams):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
    +    method to run the PrefixSpan algorithm.
    +
    +    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    +    (Wikipedia)</a>
    +    .. versionadded:: 2.4.0
    +
    +    """
    +
    +    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
    +                       "sequential pattern. Sequential pattern that appears more than " +
    +                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
    +                       typeConverter=TypeConverters.toFloat)
    +
    +    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
    +                             "The maximal length of the sequential pattern. Must be > 0.",
    +                             typeConverter=TypeConverters.toInt)
    +
    +    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
    +                               "The maximum number of items (including delimiters used in the " +
    +                               "internal storage format) allowed in a projected database before " +
    +                               "local processing. If a projected database exceeds this size, " +
    +                               "another iteration of distributed prefix growth is run. " +
    +                               "Must be > 0.",
    +                               typeConverter=TypeConverters.toInt)
    +
    +    sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
    +                        "dataset, rows with nulls in this column are ignored.",
    +                        typeConverter=TypeConverters.toString)
    +
    +    @keyword_only
    +    def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                 sequenceCol="sequence"):
    +        """
    +        __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                 sequenceCol="sequence")
    +        """
    +        super(PrefixSpan, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
    +        self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                         sequenceCol="sequence")
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.4.0")
    +    def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                  sequenceCol="sequence"):
    +        """
    +        setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                  sequenceCol="sequence")
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(self, dataset):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    +                        `Seq[Seq[_]]` type.
    --- End diff --
    
    We should use a SQL type here.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3722/
    Test PASSed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r192002383
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(object):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +
    +    .. versionadded:: 2.4.0
    +
    +    """
    +    @staticmethod
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(dataset,
    +                                       sequenceCol,
    +                                       minSupport,
    +                                       maxPatternLength,
    +                                       maxLocalProjDBSize):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    +                        `Seq[Seq[_]]` type.
    +        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
    +                            column are ignored.
    +        :param minSupport: The minimal support level of the sequential pattern, any pattern that
    +                           appears more than (minSupport * size-of-the-dataset) times will be
    +                           output (recommended value: `0.1`).
    +        :param maxPatternLength: The maximal length of the sequential pattern
    +                                 (recommended value: `10`).
    +        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
    +                                   internal storage format) allowed in a projected database before
    +                                   local processing. If a projected database exceeds this size,
    +                                   another iteration of distributed prefix growth is run
    +                                   (recommended value: `32000000`).
    +        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
    +                 The schema of it will be:
    +                  - `sequence: Seq[Seq[T]]` (T is the item type)
    +                  - `freq: Long`
    +
    +        >>> from pyspark.ml.fpm import PrefixSpan
    +        >>> from pyspark.sql import Row
    +        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
    --- End diff --
    
    We should keep doctest examples simple to read. For example, including `maxLocalProjDBSize` is not useful because we don't expect users to tune this param often.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91332 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91332/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90361/
    Test PASSed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90356/
    Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3720/
    Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91332/
    Test PASSed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by ludatabricks <gi...@git.apache.org>.

Github user ludatabricks commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r187144226
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(object):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +
    +    .. versionadded:: 2.4.0
    +
    +    """
    +    @staticmethod
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(dataset,
    +                                       sequenceCol,
    +                                       minSupport,
    +                                       maxPatternLength,
    +                                       maxLocalProjDBSize):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    +                        `Seq[Seq[_]]` type.
    +        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
    +                            column are ignored.
    +        :param minSupport: The minimal support level of the sequential pattern, any pattern that
    +                           appears more than (minSupport * size-of-the-dataset) times will be
    +                           output (recommended value: `0.1`).
    +        :param maxPatternLength: The maximal length of the sequential pattern
    +                                 (recommended value: `10`).
    +        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
    +                                   internal storage format) allowed in a projected database before
    +                                   local processing. If a projected database exceeds this size,
    +                                   another iteration of distributed prefix growth is run
    +                                   (recommended value: `32000000`).
    +        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
    +                 The schema of it will be:
    +                  - `sequence: Seq[Seq[T]]` (T is the item type)
    +                  - `freq: Long`
    +
    +        >>> from pyspark.ml.fpm import PrefixSpan
    +        >>> from pyspark.sql import Row
    +        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
    --- End diff --
    
    One question: should we add something to the example to show special cases or how these parameters work?
    For example:
    - add a pattern that is longer than ``maxPatternLength``
    - add nulls in the column
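
The two special cases raised above can be sketched in plain Python (not PySpark code). The sketch counts, for a given pattern, how many non-null rows contain it as an ordered subsequence of itemsets. The rows, the helper names `contains` and `support`, and the use of `ceil(minSupport * n)` over non-null rows as the threshold are all assumptions made for illustration; the docstring under review says "more than", so the exact boundary should be checked against the Scala implementation.

```python
import math

# Plain-Python sketch, not the PySpark API.

def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence of itemsets."""
    pos = 0
    for itemset in pattern:
        needed = set(itemset)
        # Advance to the earliest remaining itemset that covers `needed`.
        while pos < len(sequence) and not needed <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True

def support(rows, pattern):
    """Number of non-null rows that contain the pattern."""
    return sum(1 for seq in rows if seq is not None and contains(seq, pattern))

# Hypothetical input: the third row is null and is skipped when counting.
rows = [[[1, 2], [3]], [[1], [2, 3]], None, [[2], [3]]]
non_null = [seq for seq in rows if seq is not None]

min_support = 0.5
# Assumed threshold: ceil(0.5 * 3) = 2 supporting rows.
min_count = math.ceil(min_support * len(non_null))

for pattern in ([[1]], [[2], [3]], [[1, 2], [3]]):
    print(pattern, support(rows, pattern), support(rows, pattern) >= min_count)
```

With these rows, `[[1]]` and `[[2], [3]]` are each supported by two rows, while `[[1, 2], [3]]` is supported by only one, so it would fall below the assumed threshold.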


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #90361 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90361/testReport)** for PR 21265 at commit [`72abab0`](https://github.com/apache/spark/commit/72abab055ef5576b0c55e1a4c1dafcb3ac36f46f).


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91330/
    Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #90356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)** for PR 21265 at commit [`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2).


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r192001814
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(JavaParams):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
    +    method to run the PrefixSpan algorithm.
    +
    +    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    +    (Wikipedia)</a>
    +    .. versionadded:: 2.4.0
    +
    +    """
    +
    +    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
    +                       "sequential pattern. Sequential pattern that appears more than " +
    +                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
    +                       typeConverter=TypeConverters.toFloat)
    +
    +    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
    +                             "The maximal length of the sequential pattern. Must be > 0.",
    +                             typeConverter=TypeConverters.toInt)
    +
    +    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
    +                               "The maximum number of items (including delimiters used in the " +
    +                               "internal storage format) allowed in a projected database before " +
    +                               "local processing. If a projected database exceeds this size, " +
    +                               "another iteration of distributed prefix growth is run. " +
    +                               "Must be > 0.",
    +                               typeConverter=TypeConverters.toInt)
    +
    +    sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
    +                        "dataset, rows with nulls in this column are ignored.",
    +                        typeConverter=TypeConverters.toString)
    +
    +    @keyword_only
    +    def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                 sequenceCol="sequence"):
    +        """
    +        __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                 sequenceCol="sequence")
    +        """
    +        super(PrefixSpan, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
    +        self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                         sequenceCol="sequence")
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.4.0")
    +    def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
    +                  sequenceCol="sequence"):
    +        """
    +        setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
    +                  sequenceCol="sequence")
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.4.0")
    +    def findFrequentSequentialPatterns(self, dataset):
    +        """
    +        .. note:: Experimental
    +        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +
    +        :param dataset: A dataset or a dataframe containing a sequence column which is
    --- End diff --
    
    There is no `Dataset` in PySpark.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91328/testReport)** for PR 21265 at commit [`1248897`](https://github.com/apache/spark/commit/12488976debc51480ccf59bb9695575c739684ab).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    **[Test build #91325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91325/testReport)** for PR 21265 at commit [`0be3a94`](https://github.com/apache/spark/commit/0be3a94d27f4203608ef82d2ef197b37606c53b3).


---


[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r191996249
  
    --- Diff: python/pyspark/ml/fpm.py ---
    @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     
         def _create_model(self, java_model):
             return FPGrowthModel(java_model)
    +
    +
    +class PrefixSpan(JavaParams):
    +    """
    +    .. note:: Experimental
    +
    +    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    +    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    +    Efficiently by Prefix-Projected Pattern Growth
    +    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    +    This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
    +    method to run the PrefixSpan algorithm.
    +
    +    @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    +    (Wikipedia)</a>
    +    .. versionadded:: 2.4.0
    +
    +    """
    +
    +    minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
    +                       "sequential pattern. Sequential pattern that appears more than " +
    +                       "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
    +                       typeConverter=TypeConverters.toFloat)
    +
    +    maxPatternLength = Param(Params._dummy(), "maxPatternLength",
    +                             "The maximal length of the sequential pattern. Must be > 0.",
    +                             typeConverter=TypeConverters.toInt)
    +
    +    maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
    +                               "The maximum number of items (including delimiters used in the " +
    +                               "internal storage format) allowed in a projected database before " +
    +                               "local processing. If a projected database exceeds this size, " +
    +                               "another iteration of distributed prefix growth is run. " +
    +                               "Must be > 0.",
    +                               typeConverter=TypeConverters.toInt)
    --- End diff --
    
    There is no `TypeConverters.toLong`; do I need to add it?
    My thinking is that `TypeConverters.toInt` also covers Long values on the Python side, so I have not added it for now.
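
The point about Python integers can be illustrated in plain Python 3, where there is no separate long type and `int` has arbitrary precision. The sketch below only compares values against the Java `Int` and `Long` ranges; `large_size` is a hypothetical value, and nothing here shows how `TypeConverters.toInt` or the Py4J bridge actually handle such values, which would need to be checked separately.

```python
# Plain Python 3 sketch; it does not call any PySpark API.
JAVA_INT_MAX = 2**31 - 1   # largest value of a Java/Scala Int
JAVA_LONG_MAX = 2**63 - 1  # largest value of a Java/Scala Long

default_size = 32000000    # default maxLocalProjDBSize from the diff
large_size = 5000000000    # hypothetical value beyond the Java Int range

for value in (default_size, large_size):
    # Both values are plain `int` objects in Python 3.
    print(value, type(value).__name__,
          value <= JAVA_INT_MAX, value <= JAVA_LONG_MAX)
```

The default of 32000000 fits in a Java `Int`; only values above 2147483647 would need the `Long` range on the JVM side.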


---


[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21265
  
    Merged build finished. Test PASSed.


---
