Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2018/04/04 04:44:48 UTC

[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/20973

    [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan

    ## What changes were proposed in this pull request?
    
    Adds a PrefixSpan API to spark.ml. This is a new implementation, submitted in place of #20810.
    
    ## How was this patch tested?
    
    N/A


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark prefixSpan2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20973.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20973
    
----
commit d563c8fab0cb718b511ac78bc38e712a65148d17
Author: WeichenXu <we...@...>
Date:   2018-04-04T04:42:05Z

    init pr

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #4158 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4158/testReport)** for PR 20973 at commit [`bd0ce07`](https://github.com/apache/spark/commit/bd0ce07827cd038ddf2e63ebb5a6027d73a3c5a2).


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183865387
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (default: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    +   *                         less than maxPatternLength will be output (default: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run (default: `32000000`).
    +   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    +      minSupport: Double = 0.1,
    +      maxPatternLength: Int = 10,
    +      maxLocalProjDBSize: Long = 32000000L): DataFrame = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
    +    val data = dataset.select(sequenceCol)
    --- End diff --
    
    Let's check the input schema and throw a clear exception if it's not OK.
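
One way the requested check might look, as a minimal sketch rather than the code that was eventually merged: look up the column's `DataType` in the schema and reject anything that is not an array of arrays. The object name `SequenceSchemaCheck`, the helper name `validateSequenceCol`, and the error message wording are assumptions made for illustration; `StructType`, `ArrayType`, and `require` are the standard Spark SQL and Scala APIs.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

object SequenceSchemaCheck {
  // Hypothetical helper (not part of the PR): fails fast with a readable
  // message when `sequenceCol` is not an array-of-arrays column.
  def validateSequenceCol(schema: StructType, sequenceCol: String): DataType = {
    // StructType.apply(name) itself throws IllegalArgumentException
    // when the column does not exist.
    val inputType = schema(sequenceCol).dataType
    val isNestedArray = inputType match {
      case ArrayType(_: ArrayType, _) => true
      case _ => false
    }
    require(isNestedArray,
      s"Column '$sequenceCol' must be an array of arrays (Seq[Seq[_]]), but got $inputType.")
    inputType
  }
}
```

Inside the method under review, this could be called as `SequenceSchemaCheck.validateSequenceCol(dataset.schema, sequenceCol)` before `dataset.select(sequenceCol)`.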


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183865609
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    --- End diff --
    
    It'd be nice to document that rows with nulls in this column are ignored.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1943/
    Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #4158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4158/testReport)** for PR 20973 at commit [`bd0ce07`](https://github.com/apache/spark/commit/bd0ce07827cd038ddf2e63ebb5a6027d73a3c5a2).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `class PrefixSpanSuite extends MLTest `


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #90116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90116/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183864721
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (default: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    +   *                         less than maxPatternLength will be output (default: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run (default: `32000000`).
    +   * @return A dataframe that contains columns of sequence and corresponding frequency.
    --- End diff --
    
    Be very explicit about the output schema please: For each column, provide the name and DataType.
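
To make the request concrete, the result schema could be written down as a `StructType`. This is a sketch under assumptions: the column names `sequence` and `freq` are taken from the later revision of the Scaladoc quoted elsewhere in this thread, the names `PrefixSpanOutputSchema` and `outputSchema` are hypothetical, and declaring both fields non-nullable is a choice made here for illustration.

```scala
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

object PrefixSpanOutputSchema {
  // Hypothetical helper: `sequenceType` is the DataType of the input
  // sequence column, reused so the item type T is preserved in the output.
  def outputSchema(sequenceType: DataType): StructType = StructType(Seq(
    StructField("sequence", sequenceType, nullable = false), // Seq[Seq[T]]
    StructField("freq", LongType, nullable = false)          // occurrence count
  ))
}
```

Documenting the schema in this form gives each column an exact name and `DataType`, which is what the comment asks the Scaladoc to spell out.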


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20973


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #88873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88873/testReport)** for PR 20973 at commit [`d563c8f`](https://github.com/apache/spark/commit/d563c8fab0cb718b511ac78bc38e712a65148d17).


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #88885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88885/testReport)** for PR 20973 at commit [`bd0ce07`](https://github.com/apache/spark/commit/bd0ce07827cd038ddf2e63ebb5a6027d73a3c5a2).


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90040/
    Test FAILed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2803/
    Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #4162 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4162/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183864177
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (default: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    --- End diff --
    
    Let's fix this phrasing by just saying "the maximal length of the sequential pattern". (The other part, "any pattern that appears...", does not make sense.) Feel free to fix that in the old API doc too.


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r187216080
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    I agree in general, but I don’t think it’s a big deal for PrefixSpan.  I think of our current static method as a temporary workaround until we do the work to build a Model which can make meaningful predictions.  This will mean that further PrefixSpan improvements may be blocked on this Model work, but I think that’s OK since predictions should be the next priority for PrefixSpan.  Once we have a Model, I recommend we deprecate the current static method.
    
    I'm also OK with changing this to use setters, but then we should name it something else so that we can replace it with an Estimator + Model pair later on.  I'd suggest "PrefixSpanBuilder."
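
A setter-based variant could look roughly like the following. This is a hypothetical sketch of the "PrefixSpanBuilder" idea only: the class body, the setter and getter names, and the validation bounds are assumptions, the defaults are copied from the first revision of the method signature quoted elsewhere in this thread, and the actual mining call is left out.

```scala
// Hypothetical sketch of a setter-based builder; not the API in this PR.
final class PrefixSpanBuilder {
  // Defaults copied from the static method's first revision.
  private var minSupport: Double = 0.1
  private var maxPatternLength: Int = 10
  private var maxLocalProjDBSize: Long = 32000000L

  def setMinSupport(value: Double): this.type = {
    // Assumed bound: support is a fraction of the dataset size.
    require(value >= 0.0 && value <= 1.0, s"minSupport must be in [0, 1] but got $value")
    minSupport = value
    this
  }

  def setMaxPatternLength(value: Int): this.type = {
    require(value >= 1, s"maxPatternLength must be positive but got $value")
    maxPatternLength = value
    this
  }

  def setMaxLocalProjDBSize(value: Long): this.type = {
    require(value >= 0L, s"maxLocalProjDBSize must be non-negative but got $value")
    maxLocalProjDBSize = value
    this
  }

  def getMinSupport: Double = minSupport
  def getMaxPatternLength: Int = maxPatternLength
  def getMaxLocalProjDBSize: Long = maxLocalProjDBSize
}
```

Returning `this.type` lets calls chain, for example `new PrefixSpanBuilder().setMinSupport(0.5).setMaxPatternLength(5)`, and a separate builder name leaves `PrefixSpan` free for a later Estimator + Model pair.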


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r188853310
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    Sure. Will update soon!


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88885/


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r186931806
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    @WeichenXu123 @jkbradley The static method doesn't scale with parameters. If we add a new param, we have to keep the old signature for binary compatibility. Why not use setters? I think we only need to avoid using the `fit` and `transform` names.
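
    A minimal sketch of the setter-based shape suggested above, in plain Scala with no Spark dependency so that it stands alone. The class name `PrefixSpanSketch`, the method `describe`, and the choice of which setter counts as "added later" are illustrative assumptions, not the API that was eventually merged.

    ```scala
    // Hypothetical stand-in for a setter-based PrefixSpan wrapper (not Spark's API).
    final class PrefixSpanSketch {
      // Initial values mirror the recommended values from the docstring under review.
      private var minSupport: Double = 0.1
      private var maxPatternLength: Int = 10
      private var maxLocalProjDBSize: Long = 32000000L

      def setMinSupport(value: Double): this.type = {
        require(value >= 0.0 && value <= 1.0, s"minSupport must be in [0, 1], got $value")
        minSupport = value
        this
      }

      def setMaxPatternLength(value: Int): this.type = {
        require(value > 0, s"maxPatternLength must be positive, got $value")
        maxPatternLength = value
        this
      }

      // Imagine this parameter arriving in a later release: it only adds a method,
      // so existing call sites and existing method signatures stay untouched.
      def setMaxLocalProjDBSize(value: Long): this.type = {
        require(value > 0L, s"maxLocalProjDBSize must be positive, got $value")
        maxLocalProjDBSize = value
        this
      }

      def getMinSupport: Double = minSupport
      def getMaxPatternLength: Int = maxPatternLength
      def getMaxLocalProjDBSize: Long = maxLocalProjDBSize

      // Deliberately not named `fit` or `transform`; the mining itself is elided here.
      def describe: String =
        s"minSupport=$minSupport, maxPatternLength=$maxPatternLength, " +
          s"maxLocalProjDBSize=$maxLocalProjDBSize"
    }
    ```

    A caller chains only the setters it cares about, e.g. `new PrefixSpanSketch().setMinSupport(0.5).describe`, and leaves the rest at their initial values.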


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r185058005
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -44,26 +43,37 @@ object PrefixSpan {
        *
        * @param dataset A dataset or a dataframe containing a sequence column which is
        *                {{{Seq[Seq[_]]}}} type
    -   * @param sequenceCol the name of the sequence column in dataset
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
        * @param minSupport the minimal support level of the sequential pattern, any pattern that
        *                   appears more than (minSupport * size-of-the-dataset) times will be output
    -   *                  (default: `0.1`).
    -   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    -   *                         less than maxPatternLength will be output (default: `10`).
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
        * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
        *                           internal storage format) allowed in a projected database before
        *                           local processing. If a projected database exceeds this size, another
    -   *                           iteration of distributed prefix growth is run (default: `32000000`).
    -   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `frequency: Long`
    --- End diff --
    
    I had asked for this change to "frequency" from "freq," but I belatedly realized that this conflicts with the existing FPGrowth API, which uses "freq."  It would be best to maintain consistency.  Would you mind reverting to "freq?"


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1954/


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183863701
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (default: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    +   *                         less than maxPatternLength will be output (default: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run (default: `32000000`).
    +   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    +      minSupport: Double = 0.1,
    --- End diff --
    
    We never want to use default arguments in Scala APIs since they are not Java-friendly.  Let's just state recommended values in the docstrings.  We can add defaults when we create an Estimator.
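
    A small illustration of the Java-friendliness concern, in plain Scala with no Spark dependency. `MinerWithDefaults`, `MinerExplicit`, and their `minCount` methods are hypothetical names, and the computation is only a placeholder. The Scala compiler turns each default argument into a synthetic method (named along the lines of `minCount$default$2`) rather than an overload, so a Java caller cannot simply omit the argument.

    ```scala
    object MinerWithDefaults {
      // Convenient from Scala, but Java callers only see the full two-argument method.
      def minCount(numSequences: Long, minSupport: Double = 0.1): Long =
        math.ceil(minSupport * numSequences).toLong
    }

    object MinerExplicit {
      /**
       * Same placeholder computation with no default argument.
       *
       * @param minSupport minimal support level (recommended value: `0.1`)
       */
      def minCount(numSequences: Long, minSupport: Double): Long =
        math.ceil(minSupport * numSequences).toLong
    }
    ```

    From Scala, `MinerWithDefaults.minCount(100L)` compiles, whereas the explicit variant requires `MinerExplicit.minCount(100L, 0.1)` from both Scala and Java, with the recommended value documented rather than encoded in the signature.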


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r188813405
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    @WeichenXu123 Do you have time to send a PR to update this API?


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Jenkins, test this please.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Rerunning tests in case the R CRAN failure was from flakiness


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r188731937
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    Adding `extends Estimator` later should only introduce new methods to the class but no breaking changes.


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183866393
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (default: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    +   *                         less than maxPatternLength will be output (default: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run (default: `32000000`).
    +   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    +      minSupport: Double = 0.1,
    +      maxPatternLength: Int = 10,
    +      maxLocalProjDBSize: Long = 32000000L): DataFrame = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
    +    val data = dataset.select(sequenceCol)
    +    val sequences = data.where(col(sequenceCol).isNotNull).rdd
    +      .map(r => r.getAs[Seq[Seq[Any]]](0).map(_.toArray).toArray)
    +
    +    val mllibPrefixSpan = new mllibPrefixSpan()
    +      .setMinSupport(minSupport)
    +      .setMaxPatternLength(maxPatternLength)
    +      .setMaxLocalProjDBSize(maxLocalProjDBSize)
    +    if (handlePersistence) {
    +      sequences.persist(StorageLevel.MEMORY_AND_DISK)
    +    }
    +    val rows = mllibPrefixSpan.run(sequences).freqSequences.map(f => Row(f.sequence, f.freq))
    +    val schema = StructType(Seq(
    +      StructField("sequence", dataset.schema(sequenceCol).dataType, nullable = false),
    +      StructField("freq", LongType, nullable = false)))
    --- End diff --
    
    nit: I'd prefer to call the column "frequency"


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90116/


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r188491670
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    Would defining it this way, as `final class PrefixSpan(override val uid: String) extends Params`, break binary compatibility if we later change it into an estimator?



---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183865745
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    --- End diff --
    
    You could add a unit test for that too.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89837/


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #4162 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4162/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #90040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90040/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #89836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89836/testReport)** for PR 20973 at commit [`dc7d779`](https://github.com/apache/spark/commit/dc7d779ce5e33c94acd87843db540e6fa6ff5a80).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merging with master


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #89836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89836/testReport)** for PR 20973 at commit [`dc7d779`](https://github.com/apache/spark/commit/dc7d779ce5e33c94acd87843db540e6fa6ff5a80).


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r188813297
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    Oh, I think you're right, @mengxr. That approach sounds good.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89836/


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88873/


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r186994754
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    I agree with using setters. @jkbradley, what do you think?
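
For readers comparing the two API shapes discussed in this thread, here is a minimal, Spark-free sketch. `Settings`, `DefaultArgsStyle`, and `SetterStyle` are hypothetical names used only for illustration (they are not the PR's code), and the `"sequence"` default column name is an assumption; the other default values are taken from the diff quoted above. The sketch only models how parameters are supplied, not the mining itself.

```scala
// Hypothetical model of the two configuration styles; not the PR's actual code.
final case class Settings(
    sequenceCol: String,
    minSupport: Double,
    maxPatternLength: Int,
    maxLocalProjDBSize: Long)

// Style 1: a single method with default arguments, as in the quoted diff.
object DefaultArgsStyle {
  def settings(
      sequenceCol: String,
      minSupport: Double = 0.1,
      maxPatternLength: Int = 10,
      maxLocalProjDBSize: Long = 32000000L): Settings =
    Settings(sequenceCol, minSupport, maxPatternLength, maxLocalProjDBSize)
}

// Style 2: a configurable class with one chainable setter per parameter.
final class SetterStyle {
  private var sequenceCol: String = "sequence" // assumed default, for illustration
  private var minSupport: Double = 0.1
  private var maxPatternLength: Int = 10
  private var maxLocalProjDBSize: Long = 32000000L

  def setSequenceCol(value: String): this.type = { sequenceCol = value; this }
  def setMinSupport(value: Double): this.type = { minSupport = value; this }
  def setMaxPatternLength(value: Int): this.type = { maxPatternLength = value; this }
  def setMaxLocalProjDBSize(value: Long): this.type = { maxLocalProjDBSize = value; this }

  // Snapshot of the current configuration.
  def settings: Settings =
    Settings(sequenceCol, minSupport, maxPatternLength, maxLocalProjDBSize)
}
```

One reason to prefer the second style: in Scala, adding a parameter to a method with default arguments changes the method's compiled signature, so already-compiled callers no longer link against it, whereas adding a new setter only adds a method and leaves existing ones untouched.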


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183865224
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (default: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    +   *                         less than maxPatternLength will be output (default: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run (default: `32000000`).
    +   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    +      minSupport: Double = 0.1,
    +      maxPatternLength: Int = 10,
    +      maxLocalProjDBSize: Long = 32000000L): DataFrame = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    --- End diff --
    
    We don't really need this handlePersistence logic here since it's handled by the spark.mllib implementation.
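
For context, the `handlePersistence` line at the end of the quoted diff is the first step of a caller-side caching pattern. Below is a minimal sketch of how such a pattern is commonly completed, assuming Spark's `Dataset.storageLevel`, `persist`, and `unpersist`; the helper name `withHandlePersistence` is hypothetical and does not appear in the PR.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

object HandlePersistenceSketch {
  // Hypothetical helper: cache `dataset` only if the caller has not cached it already,
  // run `body`, then release the cache again if this helper was the one that set it.
  def withHandlePersistence[T, R](dataset: Dataset[T])(body: Dataset[T] => R): R = {
    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try body(dataset)
    finally if (handlePersistence) dataset.unpersist()
  }
}
```

The review comment's point is that this wrapper is redundant in the method under review, because the spark.mllib implementation it delegates to already handles persistence.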


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #4165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4165/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #88873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88873/testReport)** for PR 20973 at commit [`d563c8f`](https://github.com/apache/spark/commit/d563c8fab0cb718b511ac78bc38e712a65148d17).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #88885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88885/testReport)** for PR 20973 at commit [`bd0ce07`](https://github.com/apache/spark/commit/bd0ce07827cd038ddf2e63ebb5a6027d73a3c5a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class PrefixSpanSuite extends MLTest `


---



[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r183864852
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{LongType, StructField, StructType}
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (default: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    +   *                         less than maxPatternLength will be output (default: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run (default: `32000000`).
    +   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentPatterns(
    --- End diff --
    
    rename: findFrequentSequentPatterns -> findFrequentSequent**ial**Patterns


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #90116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90116/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #89837 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89837/testReport)** for PR 20973 at commit [`acbf9e4`](https://github.com/apache/spark/commit/acbf9e4d116fbabdc768dbab578f38bdeb343a29).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2859/


---



[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #4165 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4165/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2667/


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #90040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90040/testReport)** for PR 20973 at commit [`76d4119`](https://github.com/apache/spark/commit/76d411998205a3920ee8d1e353c8422658b2e330).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r185149879
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -44,26 +43,37 @@ object PrefixSpan {
        *
        * @param dataset A dataset or a dataframe containing a sequence column which is
        *                {{{Seq[Seq[_]]}}} type
    -   * @param sequenceCol the name of the sequence column in dataset
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
        * @param minSupport the minimal support level of the sequential pattern, any pattern that
        *                   appears more than (minSupport * size-of-the-dataset) times will be output
    -   *                  (default: `0.1`).
    -   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    -   *                         less than maxPatternLength will be output (default: `10`).
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
        * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
        *                           internal storage format) allowed in a projected database before
        *                           local processing. If a projected database exceeds this size, another
    -   *                           iteration of distributed prefix growth is run (default: `32000000`).
    -   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `frequency: Long`
    --- End diff --
    
    sure!
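
To make the `minSupport` and `freq` wording in the Scaladoc above concrete, here is a minimal, Spark-free sketch in plain Scala. It is not the PrefixSpan algorithm and not code from this PR: it only counts how many sequences contain a given pattern and compares that count against a threshold. The names `SupportSketch`, `containsPattern`, `frequency`, `minCount`, and `isFrequent` are hypothetical. Treating the threshold as inclusive, with a minimum count of `ceil(minSupport * n)`, is an assumption; the Scaladoc itself says "more than".

```scala
object SupportSketch {
  // A sequence is an ordered list of itemsets; Int items are used only for illustration.
  type Itemset = Set[Int]
  type Sequence = Seq[Itemset]

  // True if `pattern` occurs in `seq`: every pattern itemset must be a subset of
  // some itemset of `seq`, and the matched itemsets must appear in the same order.
  // Matching each pattern itemset at the earliest possible position is sufficient.
  def containsPattern(seq: Sequence, pattern: Sequence): Boolean = {
    val matched = seq.foldLeft(0) { (i, itemset) =>
      if (i < pattern.length && pattern(i).subsetOf(itemset)) i + 1 else i
    }
    matched == pattern.length
  }

  // Number of sequences in `db` that contain `pattern` (the "freq" of the pattern).
  def frequency(db: Seq[Sequence], pattern: Sequence): Long =
    db.count(s => containsPattern(s, pattern)).toLong

  // Assumed reading of minSupport: a fraction of the dataset size, rounded up.
  def minCount(minSupport: Double, dbSize: Long): Long =
    math.ceil(minSupport * dbSize).toLong

  // Assumed inclusive comparison against the minimum count.
  def isFrequent(db: Seq[Sequence], pattern: Sequence, minSupport: Double): Boolean =
    frequency(db, pattern) >= minCount(minSupport, db.size.toLong)
}
```

For example, take the four sequences `<(1 2)(3)>`, `<(1)(3 2)(1 2)>`, `<(1 2)(5)>`, and `<(6)>`. The pattern `<(1)(3)>` is contained in the first two, so its frequency is 2. With `minSupport = 0.5` the minimum count is `ceil(0.5 * 4) = 2`, so the pattern counts as frequent under the inclusive assumption.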


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    **[Test build #89837 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89837/testReport)** for PR 20973 at commit [`acbf9e4`](https://github.com/apache/spark/commit/acbf9e4d116fbabdc768dbab578f38bdeb343a29).


---


[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r188464083
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -0,0 +1,96 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.fpm
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
    +
    +/**
    + * :: Experimental ::
    + * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
    + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
    + * Efficiently by Prefix-Projected Pattern Growth
    + * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
    + *
    + * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
    + * (Wikipedia)</a>
    + */
    +@Since("2.4.0")
    +@Experimental
    +object PrefixSpan {
    +
    +  /**
    +   * :: Experimental ::
    +   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
    +   *
    +   * @param dataset A dataset or a dataframe containing a sequence column which is
    +   *                {{{Seq[Seq[_]]}}} type
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
    +   * @param minSupport the minimal support level of the sequential pattern, any pattern that
    +   *                   appears more than (minSupport * size-of-the-dataset) times will be output
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
    +   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
    +   *                           internal storage format) allowed in a projected database before
    +   *                           local processing. If a projected database exceeds this size, another
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `freq: Long`
    +   */
    +  @Since("2.4.0")
    +  def findFrequentSequentialPatterns(
    +      dataset: Dataset[_],
    +      sequenceCol: String,
    --- End diff --
    
    It should be easier to keep the `PrefixSpan` name and make it an `Estimator` later. For example:
    
    ~~~scala
    final class PrefixSpan(override val uid: String) extends Params {
      // param, setters, getters
      def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
    }
    ~~~
    
    Later we can add `Estimator.fit` and `PrefixSpanModel.transform`. Any issue with this approach?
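
The difference in call-site shape that this suggestion implies can be illustrated with a dependency-free sketch. The class below is a hypothetical stand-in, not Spark's `Params` trait and not the code that was merged: `PrefixSpanSketch`, its fields, and `describe` are invented for illustration. The numeric defaults are the recommended values from the Scaladoc in the diff; the `"sequence"` default column name and the range checks are assumptions.

```scala
// Hypothetical parameter holder with fluent setters, loosely mirroring the
// "param, setters, getters" shape in the comment above. It does not mine anything.
final class PrefixSpanSketch {
  private var minSupport: Double = 0.1             // recommended value in the Scaladoc
  private var maxPatternLength: Int = 10           // recommended value in the Scaladoc
  private var maxLocalProjDBSize: Long = 32000000L // recommended value in the Scaladoc
  private var sequenceCol: String = "sequence"     // assumed default column name

  def setMinSupport(value: Double): this.type = {
    require(value >= 0.0, s"minSupport must be non-negative but got $value")
    minSupport = value
    this
  }

  def setMaxPatternLength(value: Int): this.type = {
    require(value > 0, s"maxPatternLength must be positive but got $value")
    maxPatternLength = value
    this
  }

  def setMaxLocalProjDBSize(value: Long): this.type = {
    require(value > 0L, s"maxLocalProjDBSize must be positive but got $value")
    maxLocalProjDBSize = value
    this
  }

  def setSequenceCol(value: String): this.type = {
    sequenceCol = value
    this
  }

  def getMinSupport: Double = minSupport
  def getMaxPatternLength: Int = maxPatternLength
  def getMaxLocalProjDBSize: Long = maxLocalProjDBSize
  def getSequenceCol: String = sequenceCol

  // Placeholder for the mining entry point: it only reports the configuration
  // that a real findFrequentSequentialPatterns(dataset) would read.
  def describe: String =
    s"sequenceCol=$sequenceCol, minSupport=$minSupport, " +
      s"maxPatternLength=$maxPatternLength, maxLocalProjDBSize=$maxLocalProjDBSize"
}
```

With this shape a caller overrides only what differs from the defaults, for example `new PrefixSpanSketch().setMinSupport(0.5).setMaxPatternLength(5)`. An entry method that takes only the dataset can later sit next to `fit` without changing the signature.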


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    LGTM pending jenkins tests


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/20973
  
    Jenkins, test this please.


---
