Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/12/15 23:46:30 UTC

[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/3706

    [SPARK-4841] fix zip with textFile()

    UTF8Deserializer cannot be used inside BatchedSerializer, so always use PickleSerializer() when changing the batch size in zip().
    
    Also, if the two RDDs already have the same batch size, they no longer need to be re-serialized.
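
To illustrate the point about batch sizes, here is a plain-Python stand-in, not PySpark code. It assumes each partition reaches the zip step as a list of serialized objects, where one object is one batch. The helpers `to_batches` and `zip_partitions` are hypothetical names made up for this sketch.

```python
from itertools import islice


def to_batches(items, batch_size):
    """Split items into lists of at most batch_size elements (one list = one batch)."""
    it = iter(items)
    batches = []
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return batches
        batches.append(chunk)


def zip_partitions(left_batches, right_batches):
    """Pair batches positionally, then pair the elements inside each batch."""
    if [len(b) for b in left_batches] != [len(b) for b in right_batches]:
        raise ValueError("batch layouts differ; re-batch both sides first")
    return [pair
            for lb, rb in zip(left_batches, right_batches)
            for pair in zip(lb, rb)]


lines = ["a", "b", "c", "d", "e"]
numbers = [0, 1, 2, 3, 4]

# One object per line on the text side, batches of 2 on the other side:
text_objects = to_batches(lines, 1)
number_objects = to_batches(numbers, 2)
print(len(text_objects), len(number_objects))  # 5 3

# Re-batching both sides to the same size makes the layouts line up.
aligned = zip_partitions(to_batches(lines, 2), to_batches(numbers, 2))
print(aligned)  # [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)]
```

When both sides already share a batch size, the layouts match from the start, so no re-batching step is needed in this model.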

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_4841

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3706.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3706
    
----
commit 379d2c8670a33d33efa2214558360ac138b40f7f
Author: Davies Liu <da...@databricks.com>
Date:   2014-12-15T22:34:16Z

    fix zip with textFile()

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67098770
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24473/
    Test PASSed.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3706#discussion_r21865402
  
    --- Diff: python/pyspark/tests.py ---
    @@ -533,6 +533,11 @@ def test_zip_with_different_serializers(self):
             a = a._reserialize(BatchedSerializer(PickleSerializer(), 2))
             b = b._reserialize(MarshalSerializer())
             self.assertEqual(a.zip(b).collect(), [(0, 100), (1, 101), (2, 102), (3, 103), (4, 104)])
    +        path = os.path.join(SPARK_HOME, "python/test_support/hello.txt")
    --- End diff --
    
    We can't generate an RDD with UTF8Deserializer right now; it's only used to read data from the JVM.
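
A rough sketch of what a read-only deserializer looks like, in plain Python rather than PySpark. The framing used here (a 4-byte big-endian length followed by UTF-8 bytes) is an assumption for illustration, and `write_utf8_frames` and `Utf8FrameReader` are hypothetical names, not PySpark classes.

```python
import io
import struct


def write_utf8_frames(strings):
    """Stand-in for the JVM side: emit each string as <4-byte length><UTF-8 bytes>."""
    out = io.BytesIO()
    for s in strings:
        data = s.encode("utf-8")
        out.write(struct.pack("!i", len(data)))
        out.write(data)
    return out.getvalue()


class Utf8FrameReader:
    """Hypothetical read-only deserializer: it can load a stream but not dump one."""

    def load_stream(self, stream):
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return  # end of stream
            (length,) = struct.unpack("!i", header)
            yield stream.read(length).decode("utf-8")

    def dumps(self, obj):
        # There is no Python-side writer, so nothing can be re-serialized with it.
        raise NotImplementedError("read-only deserializer")


payload = write_utf8_frames(["hello", "w\u00f6rld"])
decoded = list(Utf8FrameReader().load_stream(io.BytesIO(payload)))
print(decoded)
```

Because such a reader has no working `dumps`, it cannot produce data from Python objects, which is consistent with the remark above that it only reads what the JVM wrote.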



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67091306
  
      [Test build #24473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24473/consoleFull) for   PR 3706 at commit [`20ce3a3`](https://github.com/apache/spark/commit/20ce3a3e2a327e0fd490ee9231d9a9912a8b28ee).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67083867
  
      [Test build #24472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24472/consoleFull) for   PR 3706 at commit [`e3ebf7c`](https://github.com/apache/spark/commit/e3ebf7cf1929fe06d30754193291a11c1dec14d1).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67098764
  
      [Test build #24473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24473/consoleFull) for   PR 3706 at commit [`20ce3a3`](https://github.com/apache/spark/commit/20ce3a3e2a327e0fd490ee9231d9a9912a8b28ee).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67090989
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24471/
    Test PASSed.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67120024
  
    Thanks for the update.
    
    I've looked over this again and tried it out with a few more hand-written test cases and I've been unable to find any more bugs, so this looks good to me.
    
    I'm going to merge this into `master` and add a `backport-needed` label in JIRA targeted for 1.2.1.  Thanks!



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3706



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67092987
  
      [Test build #24472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24472/consoleFull) for   PR 3706 at commit [`e3ebf7c`](https://github.com/apache/spark/commit/e3ebf7cf1929fe06d30754193291a11c1dec14d1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Boolean)`




[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67092996
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24472/
    Test PASSed.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67386822
  
    I've merged this into `branch-1.2`, so this fix will be included in Spark 1.2.1.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67081216
  
      [Test build #24471 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24471/consoleFull) for   PR 3706 at commit [`379d2c8`](https://github.com/apache/spark/commit/379d2c8670a33d33efa2214558360ac138b40f7f).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3706#discussion_r21865214
  
    --- Diff: python/pyspark/tests.py ---
    @@ -533,6 +533,11 @@ def test_zip_with_different_serializers(self):
             a = a._reserialize(BatchedSerializer(PickleSerializer(), 2))
             b = b._reserialize(MarshalSerializer())
             self.assertEqual(a.zip(b).collect(), [(0, 100), (1, 101), (2, 102), (3, 103), (4, 104)])
    +        path = os.path.join(SPARK_HOME, "python/test_support/hello.txt")
    --- End diff --
    
    Can we reproduce the bug by creating an RDD of strings directly? That is simpler than touching disk. It would also help to put the JIRA number in a comment.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67090982
  
      [Test build #24471 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24471/consoleFull) for   PR 3706 at commit [`379d2c8`](https://github.com/apache/spark/commit/379d2c8670a33d33efa2214558360ac138b40f7f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67091017
  
    @JoshRosen Good catch. It's a bug in `_reserialize()`, introduced in #2920. Thanks a lot!



[GitHub] spark pull request: [SPARK-4841] fix zip with textFile()

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3706#issuecomment-67088229
  
    It looks like there are still a few lingering bugs related to `zip`.  For example, the following program crashes:
    
    ```python
    text = sc.textFile("README.md")
    numbers = text.map(lambda x: 1)
    text.zip(numbers).count()  # Works fine
    text.zip(numbers).count()  # A second time, this throws an error:
    ```
    
    The error is
    
    ```
    py4j.protocol.Py4JJavaError: An error occurred while calling o185.collect.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 16, localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
    	at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:727)
    	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:351)
    	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:411)
    	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:241)
    	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
    	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
    	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1459)
    	at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:202)
    
    Driver stacktrace:
    	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    	at scala.Option.foreach(Option.scala:236)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    ```
    
    Not sure if this is a new bug, but it is potentially an indicator that there are other lingering problems in `zip`.
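
For readers unfamiliar with the error message in the trace, here is a minimal plain-Python sketch of the kind of check it describes: a zip that fails as soon as one side runs out before the other. `strict_zip` is a hypothetical helper written for this illustration, not Spark's implementation.

```python
_DONE = object()  # sentinel marking an exhausted iterator


def strict_zip(left, right):
    """Yield pairs from both iterables, raising if their lengths differ."""
    left, right = iter(left), iter(right)
    while True:
        a = next(left, _DONE)
        b = next(right, _DONE)
        if a is _DONE and b is _DONE:
            return  # both sides ended together
        if a is _DONE or b is _DONE:
            raise ValueError(
                "Can only zip RDDs with same number of elements in each partition")
        yield a, b


print(list(strict_zip("abc", [1, 2, 3])))  # [('a', 1), ('b', 2), ('c', 3)]
```

Under this model, the error says only that the two sides delivered different numbers of objects for the same partition. It does not by itself say which side is wrong or why.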

