You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by witgo <gi...@git.apache.org> on 2014/04/16 07:00:58 UTC

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/423

    SPARK-1509: add zipWithIndex zipWithUniqueId methods to java api

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark zipWithIndex

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/423.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #423
    
----
commit 11e2e7f523668e042a1847fa97d92f896a5bc75b
Author: witgo <wi...@qq.com>
Date:   2014-04-16T04:55:57Z

    add zipWithIndex zipWithUniqueId methods to java api

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r11920691
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -182,13 +182,30 @@ public void call(String s) {
         Assert.assertEquals(2, foreachCalls);
       }
     
    -    @Test
    -    public void toLocalIterator() {
    -        List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    -        JavaRDD<Integer> rdd = sc.parallelize(correct);
    -        List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    -        Assert.assertTrue(correct.equals(result));
    -    }
    +  @Test
    +  public void toLocalIterator() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaRDD<Integer> rdd = sc.parallelize(correct);
    +    List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    +    Assert.assertTrue(correct.equals(result));
    +  }
    +
    +  @Test
    +  public void zipWithUniqueId() {
    +    List<Integer> dataArray = Arrays.asList(1, 2, 3, 4);
    +    JavaPairRDD<Integer, Long> zip = sc.parallelize(dataArray).zipWithUniqueId();
    +    JavaRDD<Long> indexes = zip.values();
    +    Assert.assertTrue(new HashSet<Long>(indexes.collect()).size() == 4);
    +  }
    +
    +  @Test
    +  public void zipWithIndex() {
    +    List<Integer> dataArray = Arrays.asList(1, 2, 3, 4);
    +    JavaPairRDD<Integer, Long> zip = sc.parallelize(dataArray).zipWithIndex();
    +    JavaRDD<Long> indexes = zip.values();
    +    HashSet<Long> correctIndexes = new HashSet<Long>(Arrays.asList(0l, 1l, 2l, 3l));
    --- End diff --
    
    You should use a list instead of a set here, because you want to assert on the exact order.
    
    Also, use `L` instead of `l`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41244978
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12081698
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    basically what you created here is a type parameter named "Long" (surprisingly not a keyword in Scala), and you got the compiler to infer the type when you were calling it from Java.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41649291
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12081245
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    ~~~
    def zipWithUniqueId(): JavaPairRDD[T, Long]
    ~~~
    
    would return `JavaPairRDD<T, Object>`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r11870666
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -182,13 +182,39 @@ public void call(String s) {
         Assert.assertEquals(2, foreachCalls);
       }
     
    -    @Test
    -    public void toLocalIterator() {
    -        List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    -        JavaRDD<Integer> rdd = sc.parallelize(correct);
    -        List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    -        Assert.assertTrue(correct.equals(result));
    -    }
    +  @Test
    +  public void toLocalIterator() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaRDD<Integer> rdd = sc.parallelize(correct);
    +    List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    +    Assert.assertTrue(correct.equals(result));
    +  }
    +
    +  @Test
    +  public void zipWithUniqueId() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaPairRDD<Integer, Long> zip = sc.parallelize(correct).zipWithUniqueId();
    +    JavaRDD<Long> indexes = zip.map(new Function<Tuple2<Integer, Long>, Long>() {
    --- End diff --
    
    You can use `JavaPairRDD#values` to get the indices.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12081946
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    let's just put java.lang.Long. It is not that "long" anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-40563251
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12081268
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    Yes,in my test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41244971
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41651538
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14561/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12081715
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    Try:
    
    ~~~
      def zipWithUniqueId(): JavaPairRDD[T, java.lang.Long] = {
        JavaPairRDD.fromRDD(rdd.zipWithUniqueId().map(x => (x._1, new java.lang.Long(x._2))))
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12081688
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    @mengxr already found this out - but the reason is you'd want to declare the type as java.lang.Double instead of Long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41246766
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14428/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12080480
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    When remove the `[Long]`. The type of return value is JavaPairRDD<Integer,Object>


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/423


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41649034
  
    LGTM if Jenkins is happy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r11870613
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -182,13 +182,39 @@ public void call(String s) {
         Assert.assertEquals(2, foreachCalls);
       }
     
    -    @Test
    -    public void toLocalIterator() {
    -        List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    -        JavaRDD<Integer> rdd = sc.parallelize(correct);
    -        List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    -        Assert.assertTrue(correct.equals(result));
    -    }
    +  @Test
    +  public void toLocalIterator() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaRDD<Integer> rdd = sc.parallelize(correct);
    +    List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    +    Assert.assertTrue(correct.equals(result));
    +  }
    +
    +  @Test
    +  public void zipWithUniqueId() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    --- End diff --
    
    `correct` is not a correct name for this variable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41651537
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12082137
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    @rxin  You're right, has been modified.
    
    @mengxr 
    ```scala
     def zipWithUniqueId(): JavaPairRDD[T, java.lang.Long] = {
        JavaPairRDD.fromRDD(rdd.zipWithUniqueId().map(x => (x._1, new java.lang.Long(x._2)))
    ```create too many objects.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41714523
  
    Thanks. I've merged this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r11870757
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -182,13 +182,39 @@ public void call(String s) {
         Assert.assertEquals(2, foreachCalls);
       }
     
    -    @Test
    -    public void toLocalIterator() {
    -        List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    -        JavaRDD<Integer> rdd = sc.parallelize(correct);
    -        List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    -        Assert.assertTrue(correct.equals(result));
    -    }
    +  @Test
    +  public void toLocalIterator() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaRDD<Integer> rdd = sc.parallelize(correct);
    +    List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    +    Assert.assertTrue(correct.equals(result));
    +  }
    +
    +  @Test
    +  public void zipWithUniqueId() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaPairRDD<Integer, Long> zip = sc.parallelize(correct).zipWithUniqueId();
    +    JavaRDD<Long> indexes = zip.map(new Function<Tuple2<Integer, Long>, Long>() {
    +      @Override
    +      public Long call(Tuple2<Integer, Long> t) throws Exception {
    +        return t._2();
    +      }
    +    });
    +    Assert.assertTrue(new HashSet<Long>(indexes.collect()).size() == 4);
    +  }
    +
    +  @Test
    +  public void zipWithIndex() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaPairRDD<Integer, Long> zip = sc.parallelize(correct).zipWithIndex();
    +    JavaRDD<Long> indexes = zip.map(new Function<Tuple2<Integer, Long>, Long>() {
    +      @Override
    +      public Long call(Tuple2<Integer, Long> t) throws Exception {
    +        return t._2();
    +      }
    +    });
    +    Assert.assertTrue(new HashSet<Long>(indexes.collect()).size() == 4);
    --- End diff --
    
    Should assert the indices are exactly (0, 1, 2, 3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41244779
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41649188
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41246765
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r11870601
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -182,13 +182,39 @@ public void call(String s) {
         Assert.assertEquals(2, foreachCalls);
       }
     
    -    @Test
    -    public void toLocalIterator() {
    -        List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    -        JavaRDD<Integer> rdd = sc.parallelize(correct);
    -        List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    -        Assert.assertTrue(correct.equals(result));
    -    }
    +  @Test
    +  public void toLocalIterator() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaRDD<Integer> rdd = sc.parallelize(correct);
    +    List<Integer> result = Lists.newArrayList(rdd.toLocalIterator());
    +    Assert.assertTrue(correct.equals(result));
    +  }
    +
    +  @Test
    +  public void zipWithUniqueId() {
    +    List<Integer> correct = Arrays.asList(1, 2, 3, 4);
    +    JavaPairRDD<Integer, Long> zip = sc.parallelize(correct).zipWithUniqueId();
    --- End diff --
    
    Should test with more than one partitions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12081885
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    ```scala
      def zipWithUniqueId(): JavaPairRDD[T, JLong] = {
        JavaPairRDD.fromRDD(rdd.zipWithUniqueId()).asInstanceOf[JavaPairRDD[T, JLong]]
      } 
    ```
     is better?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/423#issuecomment-41649299
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12080026
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    +    JavaPairRDD.fromRDD(rdd.zipWithUniqueId()).asInstanceOf[JavaPairRDD[T, Long]]
    +  }
    +
    +  /**
    +   * Zips this RDD with its element indices. The ordering is first based on the partition index
    +   * and then the ordering of items within each partition. So the first item in the first
    +   * partition gets index 0, and the last item in the last partition receives the largest index.
    +   * This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
    +   * This method needs to trigger a spark job when this RDD contains more than one partitions.
    +   */
    +  def zipWithIndex[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    Ditto.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1509: add zipWithIndex zipWithUniqueId m...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/423#discussion_r12080023
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
    @@ -263,6 +263,26 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
           rdd.zipPartitions(other.rdd)(fn)(other.classTag, fakeClassTag[V]))(fakeClassTag[V])
       }
     
    +  /**
    +   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
    +   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
    +   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
    +   */
    +  def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
    --- End diff --
    
    Just saw this. Why do you need `[Long]` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---