Posted to reviews@spark.apache.org by shahidki31 <gi...@git.apache.org> on 2018/05/08 20:34:34 UTC

[GitHub] spark pull request #21270: Power Iteration Clustering in SparkML throws exce...

GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/21270

    Power Iteration Clustering in SparkML throws an exception when the ID is IntType

    Running the following code causes PIC to throw an exception.
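    ```
    import org.apache.spark.ml.clustering.PowerIterationClustering
    
    // Input rows are (id, neighbors, similarities); note that "id" is an Int here.
    val data = spark.createDataFrame(Seq(
          (0, Array(1), Array(0.9)),
          (1, Array(2), Array(0.9)),
          (2, Array(3), Array(0.9)),
          (3, Array(4), Array(0.1)),
          (4, Array(5), Array(0.9))
        )).toDF("id", "neighbors", "similarities")
    
    // Selecting "prediction" from the transformed data triggers the exception.
    val result = new PowerIterationClustering()
          .setK(2)
          .setMaxIter(10)
          .setInitMode("random")
          .transform(data)
          .select("id", "prediction")
    ```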
    
    **Result**
    `org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given input columns: [id, neighbors, similarities];;
    'Project [id#215, 'prediction]
    +- AnalysisBarrier
          +- Project [id#215, neighbors#216, similarities#217]
             +- Join Inner, (id#215 = id#234)
                :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS similarities#217]
                :  +- LocalRelation [_1#209, _2#210, _3#211]
                +- Project [cast(id#230L as int) AS id#234]
                   +- LogicalRDD [id#230L, prediction#231], false
    
    	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    
    `
    
    
    ## What changes were proposed in this pull request?
    
    1) PIC should return only "id" and "prediction". Currently it returns the entire input data, including the neighbors array and the similarities array.
    2) MLlib PIC returns "id" as Long and "prediction" as Int, so in ML we don't need to cast the ID back to the user's input ID type. We can return the output of MLlib PIC directly.
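
The two points above can be illustrated with a small model built from plain Scala collections. This is only a sketch of what the analyzed plan in the report appears to show (the right side of the join keeps just the cast id, so "prediction" is never carried through); it is not the Spark implementation. The names `InputRow`, `joinOnCastIdOnly`, and `idAndPrediction`, as well as the cluster values, are made up for illustration.

```scala
// Plain-collections model of the analyzed plan in the report; not Spark code.
object PicOutputSketch {
  // Input row as the user supplies it, with an Int id (hypothetical type name).
  final case class InputRow(id: Int, neighbors: Seq[Int], similarities: Seq[Double])

  // Stand-in for the MLlib PIC result: id as Long, cluster index as Int.
  // The cluster values are invented for illustration.
  val assignments: Seq[(Long, Int)] = Seq(0L -> 0, 1L -> 0, 2L -> 1)

  val input: Seq[InputRow] = Seq(
    InputRow(0, Seq(1), Seq(0.9)),
    InputRow(1, Seq(2), Seq(0.9)),
    InputRow(2, Seq(3), Seq(0.9))
  )

  // Mirrors the failing plan: the right side keeps only the cast id, so the
  // inner join can only ever yield the original input columns.
  def joinOnCastIdOnly: Seq[InputRow] = {
    val castIds: Set[Int] = assignments.map { case (id, _) => id.toInt }.toSet
    input.filter(row => castIds.contains(row.id))
  }

  // Mirrors the proposal: hand back (id, prediction) pairs without any cast.
  def idAndPrediction: Seq[(Long, Int)] = assignments

  def main(args: Array[String]): Unit = {
    println(joinOnCastIdOnly) // rows carry no prediction field
    println(idAndPrediction)  // (id, prediction) pairs
  }
}
```

In this model the first function has no way to expose a prediction, which matches the "cannot resolve 'prediction'" message, while the second returns exactly the two columns the proposal asks for.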
    
    ## How was this patch tested?
    Added a unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark sparkSim

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21270.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21270
    
----
commit f7bb93a1e84821d9777229eb72f06f150c741729
Author: Shahid <sh...@...>
Date:   2018-05-08T17:08:50Z

    Example code for Power Iteration Clustering

commit ff9e0795dbdcd6f3548ef8e6e73d805bb9b7584e
Author: Shahid <sh...@...>
Date:   2018-05-08T20:02:15Z

    Example code for Power Iteration Clustering

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21270: Power Iteration Clustering in SparkML throws exception, ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21270
  
    Can one of the admins verify this patch?


---

[GitHub] spark issue #21270: [SPARK-24213][ML]Power Iteration Clustering in SparkML t...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/21270
  
    @shahidki31 It seems that what you described above is another issue? You can create another Jira for that. :)


---

[GitHub] spark issue #21270: [SPARK-24213][ML]Power Iteration Clustering in SparkML t...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/21270
  
    Thank you @jkbradley. Actually, there is one more issue. Currently we skip nodes that do not appear in the ID column but do appear in the neighbors column. Spark MLlib displays cluster indices for all the nodes.
    
    So, is the join operation necessary? Shall I open a new PR addressing this issue? Kindly reply.
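
To make the skipped-node case concrete, here is a minimal sketch using the example data from the pull request description, modelled with plain Scala collections rather than Spark. The object and value names are hypothetical; the point is only that a node listed solely in the neighbors column has no matching row in the ID column, so an inner join on the ID would leave it out.

```scala
object SkippedNodesSketch {
  // (id, neighbors) pairs copied from the example data in the pull request.
  val rows: Seq[(Int, Seq[Int])] = Seq(
    0 -> Seq(1), 1 -> Seq(2), 2 -> Seq(3), 3 -> Seq(4), 4 -> Seq(5)
  )

  // Nodes that have their own row in the ID column.
  val idColumn: Set[Int] = rows.map(_._1).toSet

  // Every node in the graph: ids plus everything named as a neighbor.
  val allNodes: Set[Int] = idColumn ++ rows.flatMap(_._2)

  // Nodes that appear only as neighbors; a join on the ID column drops these.
  val neighborOnly: Set[Int] = allNodes -- idColumn

  def main(args: Array[String]): Unit =
    println(neighborOnly) // Set(5)
}
```

With this data, node 5 is the only one that would be missing from a result joined back onto the ID column.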


---

[GitHub] spark issue #21270: Power Iteration Clustering in SparkML throws exception, ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21270
  
    Can one of the admins verify this patch?


---

[GitHub] spark issue #21270: [SPARK-24213][ML]Power Iteration Clustering in SparkML t...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/21270
  
    @WeichenXu123 Thanks for the comment. I have created another Jira and I have raised a PR for that. That PR will fix this issue as well. Can you please review the PR?
    
    Jira : https://issues.apache.org/jira/browse/SPARK-24217
    PR: https://github.com/apache/spark/pull/21277


---

[GitHub] spark issue #21270: [SPARK-24213][ML]Power Iteration Clustering in SparkML t...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/21270
  
    Thanks for the patch!  I just commented on https://issues.apache.org/jira/browse/SPARK-24213 though and would like to replace this with https://github.com/apache/spark/pull/21274
    Could you please close this issue and help with reviewing the other PR?  Thanks!


---

[GitHub] spark pull request #21270: [SPARK-24213][ML]Power Iteration Clustering in Sp...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 closed the pull request at:

    https://github.com/apache/spark/pull/21270


---