Posted to reviews@spark.apache.org by tashoyan <gi...@git.apache.org> on 2018/02/11 15:28:29 UTC

[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

GitHub user tashoyan opened a pull request:

    https://github.com/apache/spark/pull/20578

    [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data is not cached

    ## What changes were proposed in this pull request?
    
    Cache the RDD of items in ml.FPGrowth before passing it to mllib.FPGrowth, but only when the user did not cache the input dataset of transactions. This fixes the warning about uncached input data emitted by mllib.FPGrowth.
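
    As a simplified sketch of the mechanism, wrapped in a hypothetical helper (the `fitWithOptionalCaching` name, the storage level, and the exact persist/unpersist placement are illustrative, not the literal patch):

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Dataset
    import org.apache.spark.storage.StorageLevel

    // Hypothetical helper illustrating the pattern; `run` stands for the call into mllib.FPGrowth.
    def fitWithOptionalCaching[M](dataset: Dataset[_], items: RDD[Array[Any]])(run: RDD[Array[Any]] => M): M = {
      // Cache the derived RDD only when the caller has not cached the input dataset.
      val handlePersistence = dataset.storageLevel == StorageLevel.NONE
      if (handlePersistence) {
        items.persist(StorageLevel.MEMORY_AND_DISK)
      }
      val model = run(items) // mllib.FPGrowth now sees cached input and does not warn
      if (handlePersistence) {
        items.unpersist()
      }
      model
    }
    ```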
    
    ## How was this patch tested?
    
    Manually:
    1. Run ml.FPGrowthExample - the warning appears.
    2. Apply the fix.
    3. Run ml.FPGrowthExample again - the warning is gone (see the sketch below).
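
    A sketch of such a manual check in spark-shell (the toy data below is made up, not the data used by ml.FPGrowthExample):

    ```scala
    import org.apache.spark.ml.fpm.FPGrowth
    import spark.implicits._ // `spark` is the SparkSession provided by spark-shell

    // A few toy transactions; the input DataFrame is deliberately left uncached.
    val df = Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b")),
      (2, Array("a", "d"))
    ).toDF("id", "items")

    // With the fix, "WARN FPGrowth: Input data is not cached" should no longer appear in the log.
    val model = new FPGrowth().setItemsCol("items").setMinSupport(0.5).fit(df)
    model.freqItemsets.show()
    ```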

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tashoyan/spark SPARK-23318

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20578.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20578
    
----
commit d17d3fbee84fcb0072d3030f3118ca18ce783e0c
Author: Arseniy Tashoyan <ta...@...>
Date:   2018-02-10T21:16:51Z

    [SPARK-23318][ML] Workaround for 'ArrayStoreException: [Ljava.lang.Object' when trying to cache the RDD of items.

commit e0eb8519bf09db12f5d5bc426eaf17d6488e05c1
Author: Arseniy Tashoyan <ta...@...>
Date:   2018-02-11T15:21:39Z

    [SPARK-23318][ML] Cache the RDD of items if the user did not cache the input dataset of transactions. This should eliminate the warning about uncached data in mllib.FPGrowth.

commit 374a49c2bf447f3ddfed655f6eda9c8cd5f45285
Author: Arseniy Tashoyan <ta...@...>
Date:   2018-02-11T15:23:58Z

    Merge remote-tracking branch 'upstream/master' into SPARK-23318

----


---



[GitHub] spark issue #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20578
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20578#discussion_r167439487
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
    @@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
       }
     
       private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
         val data = dataset.select($(itemsCol))
    -    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
    +    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
    --- End diff --
    
    Hm, I wonder how it worked before, then? Maybe it is related to the caching.
    But this changes the type of `items`, and therefore of `parentModel`. Maybe it doesn't matter; it's just surprising.
    
    T has a ClassTag here, and `.toArray` wouldn't compile otherwise. A small example of the same thing:
    
    ```scala
    scala> def foo[T:scala.reflect.ClassTag](s: Seq[T]) = s.toArray
    foo: [T](s: Seq[T])(implicit evidence$1: scala.reflect.ClassTag[T])Array[T]

    scala> def foo[T](s: Seq[T]) = s.toArray
    <console>:11: error: No ClassTag available for T
           def foo[T](s: Seq[T]) = s.toArray
                                     ^

    scala> foo(Seq("foo", "bar"))
    res3: Array[String] = Array(foo, bar)
    ```
    
    Oh well, if it works with this change and not without it, that seems OK; just an interesting curiosity.


---



[GitHub] spark issue #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data i...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20578
  
    Merged to master


---



[GitHub] spark issue #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20578
  
    **[Test build #4092 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4092/testReport)** for PR 20578 at commit [`374a49c`](https://github.com/apache/spark/commit/374a49c2bf447f3ddfed655f6eda9c8cd5f45285).


---



[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

Posted by tashoyan <gi...@git.apache.org>.
Github user tashoyan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20578#discussion_r167441224
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
    @@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
       }
     
       private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
         val data = dataset.select($(itemsCol))
    -    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
    +    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
    --- End diff --
    
    It is not only about caching: the same ArrayStoreException occurs if one calls collect() on the items RDD. There is no exception when a concrete type like String is used instead of T. The latter probably explains how it worked before: people invoked dataset.cache() in their own code, where the type parameter of the Dataset is known.


---



[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20578#discussion_r167438904
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
    @@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
       }
     
       private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
         val data = dataset.select($(itemsCol))
    -    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
    +    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
    --- End diff --
    
    Is the change from T to Any necessary? I'd presume it does in fact need to be a Seq of the item type.


---



[GitHub] spark issue #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20578
  
    **[Test build #4092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4092/testReport)** for PR 20578 at commit [`374a49c`](https://github.com/apache/spark/commit/374a49c2bf447f3ddfed655f6eda9c8cd5f45285).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

Posted by tashoyan <gi...@git.apache.org>.
Github user tashoyan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20578#discussion_r167439260
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
    @@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
       }
     
       private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
         val data = dataset.select($(itemsCol))
    -    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
    +    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
    --- End diff --
    
    Yes, it is necessary. Otherwise cache() on the items RDD leads to `ArrayStoreException`. It seems that, due to type erasure, instances of `Array[Nothing]` are created, but `toArray` attempts to add instances of `java.lang.Object` to them.
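
    To make the erasure effect concrete, here is a minimal REPL-style sketch (not the actual Spark code path; the `toArr` helper is hypothetical):

    ```scala
    import scala.reflect.ClassTag

    // Hypothetical helper: T appears only in the result type and the context bound,
    // so the compiler infers T = Nothing when no type argument is given.
    def toArr[T: ClassTag](s: Seq[Any]): Array[T] = s.asInstanceOf[Seq[T]].toArray

    toArr[String](Seq("a", "b")) // fine: Array[String] = Array(a, b)
    toArr(Seq("a", "b"))         // T = Nothing: toArray allocates an Array[Nothing],
                                 // and storing the String elements into it throws
                                 // java.lang.ArrayStoreException at runtime
    ```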


---



[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20578


---



[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

Posted by tashoyan <gi...@git.apache.org>.
Github user tashoyan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20578#discussion_r167442376
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
    @@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
       }
     
       private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
         val data = dataset.select($(itemsCol))
    -    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
    +    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
    --- End diff --
    
    An interesting curiosity for me: why does the FPGrowth contract require an `Array` of items rather than a `Seq`? First, it is strange for a contract to require a specific implementation rather than an interface. Second, it leads to redundant `toArray` and back-`toSeq` conversions. `Seq` would be more convenient, since the `Row` class has a `getSeq` method but no `getArray`.
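
    For illustration, a schematic sketch of that round trip (not the actual Spark internals, just the conversions described above):

    ```scala
    import org.apache.spark.sql.Row

    // A row whose single column holds the items of one transaction.
    val row = Row(Seq("a", "b", "c"))

    // ml side: Row exposes getSeq but no getArray, so the Seq is copied into an Array
    // to satisfy an Array-based contract...
    val itemsAsArray: Array[String] = row.getSeq[String](0).toArray

    // ...and if the receiving side then works with a Scala collection again,
    // the Array is converted straight back.
    val itemsAsSeq: Seq[String] = itemsAsArray.toSeq
    ```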


---



[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20578#discussion_r167563685
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
    @@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
       }
     
       private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
    +    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    +
         val data = dataset.select($(itemsCol))
    -    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
    +    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
    --- End diff --
    
    Yeah, probably a quirk of history. If it works it's fine.


---



[GitHub] spark issue #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20578
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org