Posted to reviews@spark.apache.org by hotou <gi...@git.apache.org> on 2015/02/20 03:01:13 UTC

[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

GitHub user hotou opened a pull request:

    https://github.com/apache/spark/pull/4701

    [SPARK-5860][CORE] JdbcRDD: overflow on large range with high number of partitions

    Fix an overflow bug in JdbcRDD when calculating partitions for large BIGINT ids
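
    For context, a minimal sketch of the overflow (illustrative only; the number
    comes from the test case discussed in the review below):

    // JdbcRDD computes partition bounds as lowerBound + (i * length) / numPartitions.
    // The products i * length and (i + 1) * length are evaluated in Long, so for a
    // large BIGINT range they wrap around:
    val length = 567279357766147899L        // 1 + upperBound - lowerBound
    val wrapped = 20L * length              // -7101156918386593636: overflowed Long
    // Promoting the product to BigInt before dividing keeps it exact:
    val exact = (BigInt(20) * length) / 20  // 567279357766147899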

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hotou/spark SPARK-5860

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4701.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4701
    
----
commit 4e9ff4f34a56bbbdf65a7e79e546ecb7eb5729e9
Author: Evan Yu <eh...@gmail.com>
Date:   2015-02-19T14:57:55Z

    [SPARK-5860][CORE] JdbcRDD overflow on large range with high number of partitions

----




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4701#issuecomment-75177658
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25076778
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    I think there is a problem here:
    length = 1 + upperBound - lowerBound
    can overflow as well, e.g. if lowerBound = 0 and upperBound = Long.MAX_VALUE.
    I should fix this too; it actually makes the change smaller.
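
    A quick sketch of that corner case (illustrative only, not the committed fix):

    val lowerBound = 0L
    val upperBound = Long.MaxValue
    val length = 1 + upperBound - lowerBound            // wraps to Long.MinValue
    val bigLength = BigInt(1) + upperBound - lowerBound // exact: 2^63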




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25054004
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    maybe this would be more robust if we just increment by a fixed width every iteration?
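
    A hypothetical sketch of that alternative (variable names assumed from the
    surrounding diff, not code from the PR):

    // Fixed stride per partition; the last partition absorbs the remainder.
    // Still assumes length = 1 + upperBound - lowerBound fits in a Long.
    val width = length / numPartitions
    (0 until numPartitions).map { i =>
      val start = lowerBound + i * width
      val end = if (i == numPartitions - 1) upperBound else start + width - 1
      new JdbcPartition(i, start, end)
    }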




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4701




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25123052
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ---
    @@ -29,22 +29,42 @@ class JdbcRDDSuite extends FunSuite with BeforeAndAfter with LocalSparkContext {
         Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
         val conn = DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb;create=true")
         try {
    -      val create = conn.createStatement
    -      create.execute("""
    -        CREATE TABLE FOO(
    -          ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    -          DATA INTEGER
    -        )""")
    -      create.close()
    -      val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    -      (1 to 100).foreach { i =>
    -        insert.setInt(1, i * 2)
    -        insert.executeUpdate
    +
    +      try {
    +        val create = conn.createStatement
    +        create.execute("""
    +          CREATE TABLE FOO(
    +            ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    +            DATA INTEGER
    +          )""")
    +        create.close()
    +        val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    +        (1 to 100).foreach { i =>
    +          insert.setInt(1, i * 2)
    +          insert.executeUpdate
    +        }
    +        insert.close()
    +      } catch {
    +        case e: SQLException if e.getSQLState == "X0Y32" =>
    +        // table exists
           }
    -      insert.close()
    -    } catch {
    -      case e: SQLException if e.getSQLState == "X0Y32" =>
    +
    +      try {
    +        val create = conn.createStatement
    +        create.execute("CREATE TABLE BIGINT_TEST(ID BIGINT NOT NULL, DATA INTEGER)")
    +        create.close()
    +        val insert = conn.prepareStatement("INSERT INTO BIGINT_TEST VALUES(?,?)")
    +        (1 to 100).foreach { i =>
    +          insert.setLong(1, 100000000000000000L +  4000000000000000L * i)
    +          insert.setInt(2, i)
    +          insert.executeUpdate
    +        }
    +        insert.close()
    +      } catch {
    --- End diff --
    
    Oh right I get it, never mind. LGTM. I can merge in a bit if there are no further comments.




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25071707
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    @srowen No, that won't work. The problem is that length is always a Long, but (i * length) can overflow Long. In my test case, for example:

    567279357766147899L * 20 = -7101156918386593636

    So we need a type that can represent larger numbers.
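
    Verifying in the REPL (illustrative):

    scala> 567279357766147899L * 20
    res0: Long = -7101156918386593636

    scala> BigInt(567279357766147899L) * 20
    res1: scala.math.BigInt = 11345587155322957980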




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25123040
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ---
    @@ -29,22 +29,42 @@ class JdbcRDDSuite extends FunSuite with BeforeAndAfter with LocalSparkContext {
         Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
         val conn = DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb;create=true")
         try {
    -      val create = conn.createStatement
    -      create.execute("""
    -        CREATE TABLE FOO(
    -          ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    -          DATA INTEGER
    -        )""")
    -      create.close()
    -      val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    -      (1 to 100).foreach { i =>
    -        insert.setInt(1, i * 2)
    -        insert.executeUpdate
    +
    +      try {
    +        val create = conn.createStatement
    +        create.execute("""
    +          CREATE TABLE FOO(
    +            ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    +            DATA INTEGER
    +          )""")
    +        create.close()
    +        val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    +        (1 to 100).foreach { i =>
    +          insert.setInt(1, i * 2)
    +          insert.executeUpdate
    +        }
    +        insert.close()
    +      } catch {
    +        case e: SQLException if e.getSQLState == "X0Y32" =>
    +        // table exists
           }
    -      insert.close()
    -    } catch {
    -      case e: SQLException if e.getSQLState == "X0Y32" =>
    +
    +      try {
    +        val create = conn.createStatement
    +        create.execute("CREATE TABLE BIGINT_TEST(ID BIGINT NOT NULL, DATA INTEGER)")
    +        create.close()
    +        val insert = conn.prepareStatement("INSERT INTO BIGINT_TEST VALUES(?,?)")
    +        (1 to 100).foreach { i =>
    +          insert.setLong(1, 100000000000000000L +  4000000000000000L * i)
    +          insert.setInt(2, i)
    +          insert.executeUpdate
    +        }
    +        insert.close()
    +      } catch {
    --- End diff --
    
    There was a problem when I tried to do that. The original author uses the inner catch block to avoid re-creating the table:

    catch {
      case e: SQLException if e.getSQLState == "X0Y32" =>
      // table exists
    }

    which means it has to exist once per table being created. I was simply following that pattern.

    An alternative would be to drop and re-create each table every time, which would produce cleaner code but might slow down the test suite a little.
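
    A sketch of that drop-and-recreate alternative (hypothetical helper; "42Y55"
    is assumed to be Derby's SQLState for dropping a nonexistent table):

    def recreate(conn: Connection, name: String, ddl: String): Unit = {
      val stmt = conn.createStatement
      // Drop the table if it survived a previous run.
      try stmt.execute("DROP TABLE " + name) catch {
        case e: SQLException if e.getSQLState == "42Y55" => // table doesn't exist yet
      }
      stmt.execute(ddl)
      stmt.close()
    }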





[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25077261
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    Done




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25076552
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    Yep that's sound, good explanation. This is also why it can't overflow when `toLong` is called. I think this is the right fix with `BigInt`.




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25074390
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    @srowen 
    
    1. BigInt in Scala actually works; I should change to that.
    2. The overflow does not necessarily hit only the last partition; in my test case it was the last 4 partitions.
    3. I think the current algorithm gives us better partitions than the fixed-interval algorithm, for the reason in the comment above. Would you agree?
    4. The current algorithm does +1 on the length and -1 on the end of each interval; I think that works fine.





[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25075949
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    @srowen Sorry, I don't understand how the last partition can get >= upperBound.

    With this code:

    val length = 1 + upperBound - lowerBound
    (0 until numPartitions).map(i => {
      val start = lowerBound + ((BigInt(i) * length) / numPartitions).toLong
      val end = lowerBound + ((BigInt(i + 1) * length) / numPartitions).toLong - 1
      new JdbcPartition(i, start, end)
    })

    on the last iteration, i = numPartitions - 1, so:

    end = lowerBound + (numPartitions - 1 + 1) * (1 + upperBound - lowerBound) / numPartitions - 1
        = lowerBound + (numPartitions / numPartitions) * (1 + upperBound - lowerBound) - 1
        = lowerBound + 1 + upperBound - lowerBound - 1
        = upperBound
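
    A quick numeric check of that derivation (illustrative values):

    val lowerBound = 1L
    val upperBound = 100L
    val numPartitions = 8
    val length = 1 + upperBound - lowerBound
    val lastEnd = lowerBound + ((BigInt(numPartitions) * length) / numPartitions).toLong - 1
    assert(lastEnd == upperBound)  // 1 + 100 - 1 == 100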
    





[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25068816
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ---
    @@ -29,22 +29,42 @@ class JdbcRDDSuite extends FunSuite with BeforeAndAfter with LocalSparkContext {
         Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
         val conn = DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb;create=true")
         try {
    -      val create = conn.createStatement
    -      create.execute("""
    -        CREATE TABLE FOO(
    -          ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    -          DATA INTEGER
    -        )""")
    -      create.close()
    -      val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    -      (1 to 100).foreach { i =>
    -        insert.setInt(1, i * 2)
    -        insert.executeUpdate
    +
    --- End diff --
    
    Were these changes accidental?




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25074705
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    It does, although the difference is +/-1 for all partitions but the last. Yeah, I get your point about the final partition, but isn't it possible that its end is >= upperBound? I haven't thought it through, I admit. A version with `BigInteger` here is definitely an improvement, so I'd support that. I suppose I'm still concerned this fails in exactly the corner case you're trying to fix -- very large `length`.




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25073572
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ---
    @@ -29,22 +29,42 @@ class JdbcRDDSuite extends FunSuite with BeforeAndAfter with LocalSparkContext {
         Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
         val conn = DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb;create=true")
         try {
    -      val create = conn.createStatement
    -      create.execute("""
    -        CREATE TABLE FOO(
    -          ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    -          DATA INTEGER
    -        )""")
    -      create.close()
    -      val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    -      (1 to 100).foreach { i =>
    -        insert.setInt(1, i * 2)
    -        insert.executeUpdate
    +
    --- End diff --
    
    Yes but here it looks like you just indented existing code or something. The new test below looks right.




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25073470
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    Ah right, I see: since the bounds are used for ID-like values, it's not unrealistic to think that `length` is near 2^63, and it's a `long`. Hm, does this overflow anyway in the final partition? If `upperBound` is near 2^63, then this can compute an `end` that doesn't fit back into a `long` even if the intermediate result does. PS: why not `BigInteger`?

    While we're fixing it, though, I tend to think that @rxin is right and that this can just map to fixed intervals rather than perform this computation.

    BTW, since the range is inclusive, I feel like the last partition should subtract 1 from its end? And this should never be larger than `upperBound`? There might be a few reasons to revise this logic.




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25073802
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ---
    @@ -29,22 +29,42 @@ class JdbcRDDSuite extends FunSuite with BeforeAndAfter with LocalSparkContext {
         Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
         val conn = DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb;create=true")
         try {
    -      val create = conn.createStatement
    -      create.execute("""
    -        CREATE TABLE FOO(
    -          ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    -          DATA INTEGER
    -        )""")
    -      create.close()
    -      val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    -      (1 to 100).foreach { i =>
    -        insert.setInt(1, i * 2)
    -        insert.executeUpdate
    +
    --- End diff --
    
    Oh... that's because I need to create a new table, and I put that and the old table creation inside one single
    try {
     ...
    } finally {
       conn.close()
    }




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25071164
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ---
    @@ -29,22 +29,42 @@ class JdbcRDDSuite extends FunSuite with BeforeAndAfter with LocalSparkContext {
         Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
         val conn = DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb;create=true")
         try {
    -      val create = conn.createStatement
    -      create.execute("""
    -        CREATE TABLE FOO(
    -          ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    -          DATA INTEGER
    -        )""")
    -      create.close()
    -      val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    -      (1 to 100).foreach { i =>
    -        insert.setInt(1, i * 2)
    -        insert.executeUpdate
    +
    --- End diff --
    
    I just want to add a new test for this bug; is this not the right place to put it?




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by hotou <gi...@git.apache.org>.
Github user hotou commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25072880
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    @rxin I actually favor the current partitioning algorithm; it's pretty neat in a way. For example:

    lowerBound = 1
    upperBound = 100
    numPartitions = 8

    With fixed-length increments you get
    [1,13],[14,26],[27,39],[40,52],[53,65],[66,78],[79,91],[92,100]
    in which you always end up with one small partition at the end.

    With the current algorithm you get
    [1,12],[13,25],[26,37],[38,50],[51,62],[63,75],[76,87],[88,100]
    i.e. more evenly distributed partitions.
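
    Reproducing the current algorithm's partitioning for those inputs (a sketch
    using the BigInt version from the diff):

    val lowerBound = 1L; val upperBound = 100L; val numPartitions = 8
    val length = 1 + upperBound - lowerBound
    (0 until numPartitions).map { i =>
      val start = lowerBound + ((BigInt(i) * length) / numPartitions).toLong
      val end = lowerBound + ((BigInt(i + 1) * length) / numPartitions).toLong - 1
      (start, end)
    }
    // Vector((1,12), (13,25), (26,37), (38,50), (51,62), (63,75), (76,87), (88,100))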




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25122197
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ---
    @@ -29,22 +29,42 @@ class JdbcRDDSuite extends FunSuite with BeforeAndAfter with LocalSparkContext {
         Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
         val conn = DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb;create=true")
         try {
    -      val create = conn.createStatement
    -      create.execute("""
    -        CREATE TABLE FOO(
    -          ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    -          DATA INTEGER
    -        )""")
    -      create.close()
    -      val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    -      (1 to 100).foreach { i =>
    -        insert.setInt(1, i * 2)
    -        insert.executeUpdate
    +
    +      try {
    +        val create = conn.createStatement
    +        create.execute("""
    +          CREATE TABLE FOO(
    +            ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    +            DATA INTEGER
    +          )""")
    +        create.close()
    +        val insert = conn.prepareStatement("INSERT INTO FOO(DATA) VALUES(?)")
    +        (1 to 100).foreach { i =>
    +          insert.setInt(1, i * 2)
    +          insert.executeUpdate
    +        }
    +        insert.close()
    +      } catch {
    +        case e: SQLException if e.getSQLState == "X0Y32" =>
    +        // table exists
           }
    -      insert.close()
    -    } catch {
    -      case e: SQLException if e.getSQLState == "X0Y32" =>
    +
    +      try {
    +        val create = conn.createStatement
    +        create.execute("CREATE TABLE BIGINT_TEST(ID BIGINT NOT NULL, DATA INTEGER)")
    +        create.close()
    +        val insert = conn.prepareStatement("INSERT INTO BIGINT_TEST VALUES(?,?)")
    +        (1 to 100).foreach { i =>
    +          insert.setLong(1, 100000000000000000L +  4000000000000000L * i)
    +          insert.setInt(2, i)
    +          insert.executeUpdate
    +        }
    +        insert.close()
    +      } catch {
    --- End diff --
    
    LGTM, except couldn't we keep this catch block in the outer try block, shared by both? I realize it would mean changing some of the variable names in the second block.




[GitHub] spark pull request: [SPARK-5860][CORE] JdbcRDD: overflow on large ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4701#discussion_r25061779
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
    @@ -64,8 +64,8 @@ class JdbcRDD[T: ClassTag](
         // bounds are inclusive, hence the + 1 here and - 1 on end
         val length = 1 + upperBound - lowerBound
         (0 until numPartitions).map(i => {
    -      val start = lowerBound + ((i * length) / numPartitions).toLong
    -      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    +      val start = lowerBound + ((BigDecimal(i) * length) / numPartitions).toLong
    --- End diff --
    
    `BigDecimal` is overkill, and it feels funny to be introducing floating point. `i` and `length` are `int` already, so using `long`s here is sufficient. I really think this is solved by simply changing `i` to `i.toLong` and removing the superfluous `.toLong` at the end.
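
    A sketch of that suggestion (hypothetical, not what was merged):

    val start = lowerBound + (i.toLong * length) / numPartitions
    val end = lowerBound + ((i + 1).toLong * length) / numPartitions - 1

    As the reply at discussion_r25071707 earlier in this digest points out, though, `i.toLong * length` is still a Long product and can itself overflow for a large `length`, which is why the fix discussed above promotes the product to `BigInt` instead.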

