Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2018/05/08 05:31:23 UTC
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/21265
[SPARK-24146][PySpark][ML] spark.ml parity for sequential pattern mining - PrefixSpan: Python API
## What changes were proposed in this pull request?
spark.ml parity for sequential pattern mining - PrefixSpan: Python API
## How was this patch tested?
doctests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark prefix_span_py
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21265.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21265
----
commit 83eeea1c539d59a4d8496437dcf06d82b43b0ca2
Author: WeichenXu <we...@...>
Date: 2018-05-08T05:29:24Z
init pr
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192000596
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+ method to run the PrefixSpan algorithm.
+
+ @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ (Wikipedia)</a>
+ .. versionadded:: 2.4.0
+
+ """
+
+ minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+ "sequential pattern. Sequential pattern that appears more than " +
+ "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+ typeConverter=TypeConverters.toFloat)
+
+ maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+ "The maximal length of the sequential pattern. Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+ "The maximum number of items (including delimiters used in the " +
+ "internal storage format) allowed in a projected database before " +
+ "local processing. If a projected database exceeds this size, " +
+ "another iteration of distributed prefix growth is run. " +
+ "Must be > 0.",
+ typeConverter=TypeConverters.toInt)
--- End diff --
I just tested that the range of Python's 'int' type is the same as that of Java's 'long' type.
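A minimal sketch of the range in question, assuming the concern is whether a Python integer passed for `maxLocalProjDBSize` fits in a signed 64-bit Java `long`. The helper name `fits_java_long` is hypothetical and is not part of PySpark. Note that Python 3 integers are unbounded, so an explicit bounds check like this is one way to guard the conversion:

```python
# Bounds of a signed 64-bit Java long.
JAVA_LONG_MIN = -(2 ** 63)
JAVA_LONG_MAX = 2 ** 63 - 1


def fits_java_long(value):
    """Return True if `value` is an integer representable as a Java long.

    Hypothetical helper for illustration only; bool is excluded because
    it is a subclass of int in Python.
    """
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return JAVA_LONG_MIN <= value <= JAVA_LONG_MAX


if __name__ == "__main__":
    # The default used in this pull request is far inside the range.
    print(fits_java_long(32000000))
    # One past the largest Java long does not fit.
    print(fits_java_long(2 ** 63))
```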
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91330/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the issue:
https://github.com/apache/spark/pull/21265
LGTM. Merged into master. Thanks!
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192000950
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -53,7 +53,7 @@ final class PrefixSpan(@Since("2.4.0") override val uid: String) extends Params
@Since("2.4.0")
val minSupport = new DoubleParam(this, "minSupport", "The minimal support level of the " +
"sequential pattern. Sequential pattern that appears more than " +
- "(minSupport * size-of-the-dataset)." +
+ "(minSupport * size-of-the-dataset)" +
--- End diff --
Need a space at the end before "times".
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91328/
Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #90361 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90361/testReport)** for PR 21265 at commit [`72abab0`](https://github.com/apache/spark/commit/72abab055ef5576b0c55e1a4c1dafcb3ac36f46f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/21265
Jenkins, test this please.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91328/testReport)** for PR 21265 at commit [`1248897`](https://github.com/apache/spark/commit/12488976debc51480ccf59bb9695575c739684ab).
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3027/
Test PASSed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r191995667
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+ .. versionadded:: 2.4.0
+
+ """
+ @staticmethod
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(dataset,
+ sequenceCol,
+ minSupport,
+ maxPatternLength,
+ maxLocalProjDBSize):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+ column are ignored.
+ :param minSupport: The minimal support level of the sequential pattern, any pattern that
+ appears more than (minSupport * size-of-the-dataset) times will be
+ output (recommended value: `0.1`).
+ :param maxPatternLength: The maximal length of the sequential pattern
+ (recommended value: `10`).
+ :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+ internal storage format) allowed in a projected database before
+ local processing. If a projected database exceeds this size,
+ another iteration of distributed prefix growth is run
+ (recommended value: `32000000`).
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
+ - `freq: Long`
+
+ >>> from pyspark.ml.fpm import PrefixSpan
+ >>> from pyspark.sql import Row
+ >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --
I think it would be better to put this in an example. @mengxr What do you think?
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3724/
Test PASSed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192001923
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+ method to run the PrefixSpan algorithm.
+
+ @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ (Wikipedia)</a>
+ .. versionadded:: 2.4.0
+
+ """
+
+ minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+ "sequential pattern. Sequential pattern that appears more than " +
+ "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+ typeConverter=TypeConverters.toFloat)
+
+ maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+ "The maximal length of the sequential pattern. Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+ "The maximum number of items (including delimiters used in the " +
+ "internal storage format) allowed in a projected database before " +
+ "local processing. If a projected database exceeds this size, " +
+ "another iteration of distributed prefix growth is run. " +
+ "Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
+ "dataset, rows with nulls in this column are ignored.",
+ typeConverter=TypeConverters.toString)
+
+ @keyword_only
+ def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ super(PrefixSpan, self).__init__()
+ self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
+ self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence")
+ kwargs = self._input_kwargs
+ self.setParams(**kwargs)
+
+ @keyword_only
+ @since("2.4.0")
+ def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ kwargs = self._input_kwargs
+ return self._set(**kwargs)
+
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(self, dataset):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
--- End diff --
ditto
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91325/
Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/21265
Let's wait on this until we make the decision in the last thread in https://github.com/apache/spark/pull/20973
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192002416
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+ method to run the PrefixSpan algorithm.
+
+ @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ (Wikipedia)</a>
+ .. versionadded:: 2.4.0
+
+ """
+
+ minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+ "sequential pattern. Sequential pattern that appears more than " +
+ "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+ typeConverter=TypeConverters.toFloat)
+
+ maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+ "The maximal length of the sequential pattern. Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+ "The maximum number of items (including delimiters used in the " +
+ "internal storage format) allowed in a projected database before " +
+ "local processing. If a projected database exceeds this size, " +
+ "another iteration of distributed prefix growth is run. " +
+ "Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
+ "dataset, rows with nulls in this column are ignored.",
+ typeConverter=TypeConverters.toString)
+
+ @keyword_only
+ def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ super(PrefixSpan, self).__init__()
+ self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
+ self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence")
+ kwargs = self._input_kwargs
+ self.setParams(**kwargs)
+
+ @keyword_only
+ @since("2.4.0")
+ def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ kwargs = self._input_kwargs
+ return self._set(**kwargs)
+
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(self, dataset):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
+ - `freq: Long`
+
+ >>> from pyspark.ml.fpm import PrefixSpan
+ >>> from pyspark.sql import Row
+ >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
+ ... Row(sequence=[[1], [3, 2], [1, 2]]),
+ ... Row(sequence=[[1, 2], [5]]),
+ ... Row(sequence=[[6]])]).toDF()
+ >>> prefixSpan = PrefixSpan(minSupport=0.5, maxPatternLength=5,
+ ... maxLocalProjDBSize=32000000)
--- End diff --
remove this param from example
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91325/testReport)** for PR 21265 at commit [`0be3a94`](https://github.com/apache/spark/commit/0be3a94d27f4203608ef82d2ef197b37606c53b3).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class PrefixSpan(JavaParams):`
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91330/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3033/
Test PASSed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r187187330
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+ .. versionadded:: 2.4.0
+
+ """
+ @staticmethod
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(dataset,
+ sequenceCol,
+ minSupport,
+ maxPatternLength,
+ maxLocalProjDBSize):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+ column are ignored.
+ :param minSupport: The minimal support level of the sequential pattern, any pattern that
+ appears more than (minSupport * size-of-the-dataset) times will be
+ output (recommended value: `0.1`).
+ :param maxPatternLength: The maximal length of the sequential pattern
+ (recommended value: `10`).
+ :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+ internal storage format) allowed in a projected database before
+ local processing. If a projected database exceeds this size,
+ another iteration of distributed prefix growth is run
+ (recommended value: `32000000`).
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
+ - `freq: Long`
+
+ >>> from pyspark.ml.fpm import PrefixSpan
+ >>> from pyspark.sql import Row
+ >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --
My 2 cents: That sounds like a judgement call: If it's to explain the behavior more clearly, then that sounds reasonable. I feel like it's pretty clear how nulls are treated from the doc. maxPatternLength might benefit from an example.
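To make the effect of `maxPatternLength` concrete, here is a small pure-Python sketch that runs without Spark. It is a brute-force enumeration, not the prefix-projection algorithm that PrefixSpan uses, and it rests on two assumptions that this thread does not confirm: a pattern's length is the total number of items across its itemsets, and a pattern is kept when it occurs in at least ceil(minSupport * number-of-sequences) sequences. The function names are hypothetical; the input is the four-row dataset from the doctest in the diff.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def sub_patterns(sequence, max_len):
    """Return every distinct pattern contained in `sequence` that has at
    most `max_len` items in total (assumed definition of pattern length)."""
    results = set()

    def extend(start, prefix, used):
        if prefix:
            results.add(prefix)
        for i in range(start, len(sequence)):
            items = sorted(set(sequence[i]))
            # Pick a non-empty subset of this itemset without exceeding max_len.
            for size in range(1, min(len(items), max_len - used) + 1):
                for subset in combinations(items, size):
                    extend(i + 1, prefix + (subset,), used + size)

    extend(0, (), 0)
    return results


def frequent_patterns(sequences, min_support, max_pattern_length):
    """Brute-force frequent sequential patterns (illustrative sketch only)."""
    # Assumed threshold: at least ceil(min_support * n) supporting sequences.
    min_count = ceil(min_support * len(sequences))
    counts = Counter()
    for seq in sequences:
        # Each sequence contributes at most 1 to the count of any pattern.
        counts.update(sub_patterns(seq, max_pattern_length))
    return {p: c for p, c in counts.items() if c >= min_count}


# Same rows as the doctest in the diff above.
DATA = [[[1, 2], [3]], [[1], [3, 2], [1, 2]], [[1, 2], [5]], [[6]]]

if __name__ == "__main__":
    for max_len in (5, 1):
        print(max_len, sorted(frequent_patterns(DATA, 0.5, max_len).items()))
```

With a cap of 1 only single-item patterns can survive, so the two-item patterns found with a cap of 5 disappear while the counts of the remaining patterns stay the same.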
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91332 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91332/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #90356 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)** for PR 21265 at commit [`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class PrefixSpan(object):`
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/21265
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3717/
Test PASSed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192001886
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+ method to run the PrefixSpan algorithm.
+
+ @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ (Wikipedia)</a>
+ .. versionadded:: 2.4.0
+
+ """
+
+ minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+ "sequential pattern. Sequential pattern that appears more than " +
+ "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+ typeConverter=TypeConverters.toFloat)
+
+ maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+ "The maximal length of the sequential pattern. Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+ "The maximum number of items (including delimiters used in the " +
+ "internal storage format) allowed in a projected database before " +
+ "local processing. If a projected database exceeds this size, " +
+ "another iteration of distributed prefix growth is run. " +
+ "Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
+ "dataset, rows with nulls in this column are ignored.",
+ typeConverter=TypeConverters.toString)
+
+ @keyword_only
+ def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ super(PrefixSpan, self).__init__()
+ self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
+ self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence")
+ kwargs = self._input_kwargs
+ self.setParams(**kwargs)
+
+ @keyword_only
+ @since("2.4.0")
+ def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ kwargs = self._input_kwargs
+ return self._set(**kwargs)
+
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(self, dataset):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
--- End diff --
We should use a SQL type here.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3722/
Test PASSed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192002383
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+ .. versionadded:: 2.4.0
+
+ """
+ @staticmethod
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(dataset,
+ sequenceCol,
+ minSupport,
+ maxPatternLength,
+ maxLocalProjDBSize):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+ column are ignored.
+ :param minSupport: The minimal support level of the sequential pattern, any pattern that
+ appears more than (minSupport * size-of-the-dataset) times will be
+ output (recommended value: `0.1`).
+ :param maxPatternLength: The maximal length of the sequential pattern
+ (recommended value: `10`).
+ :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+ internal storage format) allowed in a projected database before
+ local processing. If a projected database exceeds this size,
+ another iteration of distributed prefix growth is run
+ (recommended value: `32000000`).
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
+ - `freq: Long`
+
+ >>> from pyspark.ml.fpm import PrefixSpan
+ >>> from pyspark.sql import Row
+ >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --
We should keep doctest examples simple to read. For example, including `maxLocalProjDBSize` is not useful because we don't expect users to tune this param often.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91332 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91332/testReport)** for PR 21265 at commit [`6f40474`](https://github.com/apache/spark/commit/6f404747e13c500289920d15ee79b0f0509984f8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90361/
Test PASSed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90356/
Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3720/
Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91332/
Test PASSed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by ludatabricks <gi...@git.apache.org>.
Github user ludatabricks commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r187144226
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+ .. versionadded:: 2.4.0
+
+ """
+ @staticmethod
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(dataset,
+ sequenceCol,
+ minSupport,
+ maxPatternLength,
+ maxLocalProjDBSize):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+ column are ignored.
+ :param minSupport: The minimal support level of the sequential pattern, any pattern that
+ appears more than (minSupport * size-of-the-dataset) times will be
+ output (recommended value: `0.1`).
+ :param maxPatternLength: The maximal length of the sequential pattern
+ (recommended value: `10`).
+ :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+ internal storage format) allowed in a projected database before
+ local processing. If a projected database exceeds this size,
+ another iteration of distributed prefix growth is run
+ (recommended value: `32000000`).
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
+ - `freq: Long`
+
+ >>> from pyspark.ml.fpm import PrefixSpan
+ >>> from pyspark.sql import Row
+ >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --
One question: should we add something to the example to show some special cases or how these parameters work?
For example:
- add a pattern that is longer than ``maxPatternLength``
- add nulls in the column
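
To make the null-row case concrete without running Spark, here is a minimal pure-Python sketch. It is not the PrefixSpan implementation: `is_subsequence` and `support` are hypothetical helpers written for illustration, and the `None` row stands in for a null in the sequence column, which the `sequenceCol` description says is ignored. It only counts how many sequences contain a given pattern.

```python
def is_subsequence(pattern, sequence):
    """Return True if every itemset of `pattern` is a subset of some itemset
    of `sequence`, with the matched itemsets appearing in the same order."""
    pos = 0
    for itemset in pattern:
        # Advance to the first remaining itemset that contains this one.
        while pos < len(sequence) and not set(itemset) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a later itemset
    return True


def support(pattern, rows):
    """Count the rows that contain `pattern`, skipping null (None) rows."""
    return sum(is_subsequence(pattern, row) for row in rows if row is not None)


# Illustrative data only; the last row models a null in the sequence column.
rows = [
    [[1, 2], [3]],
    [[1], [3, 2], [1, 2]],
    [[1, 2], [5]],
    [[6]],
    None,
]

print(support([[1]], rows))       # sequences containing item 1
print(support([[1], [3]], rows))  # item 1 followed later by item 3
```

A pattern would then be reported when its count clears the threshold derived from `minSupport`; how the real algorithm rounds that threshold is not shown here.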
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #90361 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90361/testReport)** for PR 21265 at commit [`72abab0`](https://github.com/apache/spark/commit/72abab055ef5576b0c55e1a4c1dafcb3ac36f46f).
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91330/
Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #90356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)** for PR 21265 at commit [`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2).
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192001814
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+ method to run the PrefixSpan algorithm.
+
+ @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ (Wikipedia)</a>
+ .. versionadded:: 2.4.0
+
+ """
+
+ minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+ "sequential pattern. Sequential pattern that appears more than " +
+ "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+ typeConverter=TypeConverters.toFloat)
+
+ maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+ "The maximal length of the sequential pattern. Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+ "The maximum number of items (including delimiters used in the " +
+ "internal storage format) allowed in a projected database before " +
+ "local processing. If a projected database exceeds this size, " +
+ "another iteration of distributed prefix growth is run. " +
+ "Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ sequenceCol = Param(Params._dummy(), "sequenceCol", "The name of the sequence column in " +
+ "dataset, rows with nulls in this column are ignored.",
+ typeConverter=TypeConverters.toString)
+
+ @keyword_only
+ def __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ __init__(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ super(PrefixSpan, self).__init__()
+ self._java_obj = self._new_java_obj("org.apache.spark.ml.fpm.PrefixSpan", self.uid)
+ self._setDefault(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence")
+ kwargs = self._input_kwargs
+ self.setParams(**kwargs)
+
+ @keyword_only
+ @since("2.4.0")
+ def setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000,
+ sequenceCol="sequence"):
+ """
+ setParams(self, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000, \
+ sequenceCol="sequence")
+ """
+ kwargs = self._input_kwargs
+ return self._set(**kwargs)
+
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(self, dataset):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
--- End diff --
There is no `Dataset` in PySpark.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91328/testReport)** for PR 21265 at commit [`1248897`](https://github.com/apache/spark/commit/12488976debc51480ccf59bb9695575c739684ab).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21265
**[Test build #91325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91325/testReport)** for PR 21265 at commit [`0be3a94`](https://github.com/apache/spark/commit/0be3a94d27f4203608ef82d2ef197b37606c53b3).
---
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r191996249
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(JavaParams):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns`
+ method to run the PrefixSpan algorithm.
+
+ @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ (Wikipedia)</a>
+ .. versionadded:: 2.4.0
+
+ """
+
+ minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " +
+ "sequential pattern. Sequential pattern that appears more than " +
+ "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.",
+ typeConverter=TypeConverters.toFloat)
+
+ maxPatternLength = Param(Params._dummy(), "maxPatternLength",
+ "The maximal length of the sequential pattern. Must be > 0.",
+ typeConverter=TypeConverters.toInt)
+
+ maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize",
+ "The maximum number of items (including delimiters used in the " +
+ "internal storage format) allowed in a projected database before " +
+ "local processing. If a projected database exceeds this size, " +
+ "another iteration of distributed prefix growth is run. " +
+ "Must be > 0.",
+ typeConverter=TypeConverters.toInt)
--- End diff --
There is no `TypeConverters.toLong`. Do I need to add it?
My thinking is that `TypeConverters.toInt` also fits the Long type on the Python side, so I have not added it for now.
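
The reasoning can be checked in plain Python 3, where `int` has arbitrary precision and there is no separate long type. The sketch below uses a hypothetical stand-in, `to_int_like`; it is an assumption for illustration and not the actual `TypeConverters.toInt` code. It shows that an int-returning converter passes through values beyond the 32-bit range unchanged.

```python
JAVA_INT_MAX = 2**31 - 1   # largest value of a 32-bit signed int
JAVA_LONG_MAX = 2**63 - 1  # largest value of a 64-bit signed long


def to_int_like(value):
    """Hypothetical stand-in for an int converter: accept ints and
    integral floats, reject everything else."""
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError("Could not convert %r to int" % (value,))


# The default maxLocalProjDBSize from the diff fits easily.
print(to_int_like(32000000))

# Values past the 32-bit limit are still ordinary Python ints.
print(to_int_like(JAVA_INT_MAX + 1))
print(to_int_like(JAVA_LONG_MAX))
```

Whether the value is then passed to the JVM as a Long depends on the Python-to-Java bridge, which this sketch does not cover.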
---
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21265
Merged build finished. Test PASSed.
---