Posted to reviews@spark.apache.org by imatiach-msft <gi...@git.apache.org> on 2017/01/07 01:05:59 UTC

[GitHub] spark pull request #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing w...

GitHub user imatiach-msft opened a pull request:

    https://github.com/apache/spark/pull/16494

    [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with ClassCastException

    ## What changes were proposed in this pull request?
    
    LDA fails with a ClassCastException when run on a dataset with at least one row that contains an empty sparse vector.  The error occurs in method fromEdges where one of the edges may already be an EdgeRDDImpl and it does not need to be converted.
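
A minimal sketch of the guard described above, written as a plain-Python model rather than the actual GraphX code. The names `Edge`, `PartitionedEdges`, and `from_edges` are hypothetical stand-ins (`PartitionedEdges` plays the role of EdgeRDDImpl), and the modulo partitioning is an assumption made only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Iterable, List, Union


@dataclass(frozen=True)
class Edge:
    """A plain edge record: source id, destination id, attribute."""
    src: int
    dst: int
    attr: float


class PartitionedEdges:
    """Hypothetical stand-in for an already-built edge collection."""

    def __init__(self, partitions: List[List[Edge]]) -> None:
        self.partitions = partitions


def from_edges(
    edges: Union[PartitionedEdges, Iterable[Edge]],
    num_partitions: int = 2,
) -> PartitionedEdges:
    """Build a partitioned edge collection, skipping the conversion
    when the input is already in partitioned form."""
    # Guard: an already-converted input is returned unchanged instead of
    # being iterated as if it contained raw Edge records.
    if isinstance(edges, PartitionedEdges):
        return edges
    # Otherwise, group raw edges into partitions (assumed: by source id).
    partitions: List[List[Edge]] = [[] for _ in range(num_partitions)]
    for edge in edges:
        partitions[edge.src % num_partitions].append(edge)
    return PartitionedEdges(partitions)
```

Calling `from_edges` on its own result returns the same object, which is the behaviour the patch aims for when the input is already an EdgeRDDImpl.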
    
    ## How was this patch tested?
    
    I first ran LDA on the dataset provided by the JIRA submitter and I was able to reproduce the issue.  I then fixed the issue based on the submitter's suggestion and simplified the test case so that we wouldn't need to read in a file.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/imatiach-msft/spark ilmat/fix-EMLDA

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16494.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16494
    
----
commit 0a201b713276b92a20db330ccd20b9a562694f5a
Author: Ilya Matiach <il...@microsoft.com>
Date:   2017-01-06T21:36:19Z

    adding test case to reproduce the error

commit 66dbfea60fec23fb8b39e23adf1861cfa02d7d42
Author: Ilya Matiach <il...@microsoft.com>
Date:   2017-01-07T00:42:49Z

    [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with ClassCastException

commit 0d1c475c80a6fd0373108610ca8e41f7af0e6d01
Author: Ilya Matiach <il...@microsoft.com>
Date:   2017-01-07T01:04:40Z

    Optimizing test case

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing w...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft closed the pull request at:

    https://github.com/apache/spark/pull/16494




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    @QQshu1 1.) It seems that, when a checkpoint is set, getting the edges creates the wrong type of object: an EdgeRDDImpl instead of edge partitions. 2.) I was able to reproduce the issue this way and apply the recommended fix, but now that the underlying issue in the graph APIs is fixed, EMLDA works fine.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    **[Test build #72030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72030/testReport)** for PR 16494 at commit [`e350400`](https://github.com/apache/spark/commit/e3504000cf22446bcaafaccedfae6ce21d653a50).




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72030/
    Test PASSed.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    @tdas @ankurdave would you be able to verify that this is the same issue that you are working on or if you have any additional thoughts on this bug?  Thank you!




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    **[Test build #71002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71002/testReport)** for PR 16494 at commit [`0d1c475`](https://github.com/apache/spark/commit/0d1c475c80a6fd0373108610ca8e41f7af0e6d01).




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    Thank you for the deep investigation and the better root cause analysis.  I will refrain from working on this pull request until either: 1.) the issue is fixed and I can update the branch and verify that the test passes, or 2.) one of the people you mentioned above replies and clarifies whether this is an issue that won't be fixed by their changes.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    **[Test build #71002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71002/testReport)** for PR 16494 at commit [`0d1c475`](https://github.com/apache/spark/commit/0d1c475c80a6fd0373108610ca8e41f7af0e6d01).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    @jkbradley @vanzin @skyluc @luluorta @uncleGen @kanzhang Could you please take a look at this pull request to fix the method fromEdges in EdgeRDD class used by LDA?  Thank you!




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    @jkbradley I verified that this bug is fixed after rebasing to latest.  Should I go ahead and close this PR?




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    I'll go ahead and close this review.  We can always reopen later, but it really does seem that it is fixed now.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    Thanks for the patch.  This sounds like it may be the same bug being addressed in https://issues.apache.org/jira/browse/SPARK-14804 so I'll CC @tdas
    
    If so, then I believe the bug is really with how checkpointing is being done for Graphs, not with construction in fromEdges.  The checkpointing bug makes GraphX sort of forget the type of the EdgeRDD when it is recovered, making it handle it incorrectly in subsequent operations like fromEdges.
    
    However, I don't know if this can arise elsewhere, so let me ping @ankurdave : any thoughts on this?
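
A toy model of the failure mode described above, in plain Python rather than GraphX. It assumes a wrapper (`EdgeCollection`, hypothetical) that stores `(partition_id, edges)` pairs but iterates as individual edges, and a lossy round trip (`checkpoint_and_recover`, also hypothetical) that returns only the raw pairs. It is a sketch of how losing the wrapper type can break later operations, not a description of how Spark checkpointing is implemented.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Edge(NamedTuple):
    src: int
    dst: int
    attr: float


class EdgeCollection:
    """Stores (partition_id, edges) pairs but iterates as single edges."""

    def __init__(self, partitioned: List[Tuple[int, List[Edge]]]) -> None:
        self.partitioned = partitioned

    def __iter__(self) -> Iterator[Edge]:
        for _partition_id, edges in self.partitioned:
            yield from edges


def checkpoint_and_recover(collection: EdgeCollection) -> List[Tuple[int, List[Edge]]]:
    """Hypothetical lossy round trip: only the raw pairs survive,
    so the caller gets them back without the wrapper."""
    return list(collection.partitioned)


def total_weight(edges: Iterable[Edge]) -> float:
    """A downstream operation that expects to iterate over Edge records."""
    return sum(edge.attr for edge in edges)
```

With the wrapper intact, `total_weight` sees edges. On the recovered value it sees `(partition_id, edges)` tuples instead and raises `AttributeError`, the Python analogue of the ClassCastException in this thread. Re-wrapping the recovered pairs in `EdgeCollection` restores the expected behaviour.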




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by QQshu1 <gi...@git.apache.org>.
Github user QQshu1 commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    @imatiach-msft  Excuse me, I have two questions.
        1. Why does this issue only happen if we use "sc.setCheckpointDir(path)"?
        2. You say "LDA fails with a ClassCastException when run on a dataset with at least one row that contains an empty sparse vector."  Why does "a dataset with at least one row that contains an empty sparse vector" trigger the issue?
      Thank you!




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    **[Test build #72030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72030/testReport)** for PR 16494 at commit [`e350400`](https://github.com/apache/spark/commit/e3504000cf22446bcaafaccedfae6ce21d653a50).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16494: [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with Cla...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71002/
    Test PASSed.

