Posted to issues@spark.apache.org by "Micael Capitão (JIRA)" <ji...@apache.org> on 2015/04/09 13:28:12 UTC

[jira] [Updated] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

     [ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micael Capitão updated SPARK-6800:
----------------------------------
    Description: 
Having a Derby table with people info (id, name, age) defined like this:

{code}
val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
{code}

If I try to read that table from Spark SQL with lower/upper bounds, like this:

{code}
val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
      columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
people.show()
{code}

I get this result:
{noformat}
PERSON_ID NAME             AGE
3         Ana Rita Costa   12 
5         Miguel Costa     15 
6         Anabela Sintra   13 
2         Lurdes Pereira   23 
4         Armando Pereira  32 
1         Armando Carvalho 50 
{noformat}

This is wrong: the defined upper bound has been ignored (I get a person with age 50!).
Digging into the code, I've found that {{JDBCRelation.columnPartition}} generates the following WHERE clauses:
{code}
(0) age < 4,0
(1) age >= 4  AND age < 8,1
(2) age >= 8  AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9
{code}

The last condition ignores the upper bound, and the first one ignores the lower bound, so rows outside the requested range are read anyway.
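To illustrate where those clauses come from, here is a minimal standalone sketch (an approximation of the {{JDBCRelation.columnPartition}} logic, not the actual Spark source; the helper name is just for illustration) that reproduces them from the column name, the bounds and the partition count:
{code}
// Sketch of the observed behaviour: the bounds only determine the stride;
// they are never emitted as filters, so the first and last partitions are open-ended.
def columnPartitionClauses(column: String, lowerBound: Long, upperBound: Long,
                           numPartitions: Int): Seq[String] = {
  val stride = upperBound / numPartitions - lowerBound / numPartitions
  (0 until numPartitions).map { i =>
    val lo = if (i == 0) None else Some(s"$column >= ${lowerBound + i * stride}")
    val hi = if (i == numPartitions - 1) None else Some(s"$column < ${lowerBound + (i + 1) * stride}")
    (lo ++ hi).mkString(" AND ")
  }
}

// columnPartitionClauses("age", 0, 40, 10) yields:
//   "age < 4", "age >= 4 AND age < 8", ..., "age >= 36"
{code}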

Using the JdbcRDD (and converting it to a DataFrame) instead, I would do something like this:
{code}
import org.apache.spark.rdd.JdbcRDD
import sqlContext.implicits._  // for .toDF on the resulting RDD

val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
      "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
      rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
people.show()
{code}

Resulting in:
{noformat}
PERSON_ID NAME            AGE
3         Ana Rita Costa  12 
5         Miguel Costa    15 
6         Anabela Sintra  13 
2         Lurdes Pereira  23 
4         Armando Pereira 32 
{noformat}

Which is correct!

Checking the bounds the JdbcRDD computes in {{getPartitions}} (the values substituted into the {{age >= ? and age <= ?}} placeholders), I've found the effective conditions are the following:
{code}
(0) age >= 0  AND age <= 3
(1) age >= 4  AND age <= 7
(2) age >= 8  AND age <= 11
(3) age >= 12 AND age <= 15
(4) age >= 16 AND age <= 19
(5) age >= 20 AND age <= 23
(6) age >= 24 AND age <= 27
(7) age >= 28 AND age <= 31
(8) age >= 32 AND age <= 35
(9) age >= 36 AND age <= 40
{code}
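For comparison, here is a sketch of how the JdbcRDD derives those inclusive ranges in {{getPartitions}} (again an approximation of the logic, not the exact source; each resulting pair is bound to the {{age >= ? and age <= ?}} placeholders):
{code}
// The whole [lowerBound, upperBound] range is split across the partitions,
// with both ends kept inclusive, so the requested bounds are respected.
def jdbcRddRanges(lowerBound: Long, upperBound: Long,
                  numPartitions: Int): Seq[(Long, Long)] = {
  val length = BigInt(1) + upperBound - lowerBound
  (0 until numPartitions).map { i =>
    val start = lowerBound + ((i * length) / numPartitions).toLong
    val end   = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
    (start, end)
  }
}

// jdbcRddRanges(0, 40, 10) yields (0,3), (4,7), ..., (36,40)
{code}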

This is the behaviour I was expecting from the Spark SQL version. Is the Spark SQL version buggy, or is this some weird expected behaviour?
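Until this is clarified or fixed, a possible workaround (just a sketch, assuming the bounds are only meant to drive partitioning) is to re-apply the bounds as an explicit filter after the read:
{code}
val bounded = sqlContext.jdbc(url = jdbcUrl, table = "Person",
      columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
// re-apply the bounds, since they are not pushed down as a WHERE clause
val people = bounded.filter("AGE >= 0 AND AGE <= 40")
people.show()
{code}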

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6800
>                 URL: https://issues.apache.org/jira/browse/SPARK-6800
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: Windows 8.1, Derby, Spark 1.3.0 CDH5.4.0, Scala 2.10
>            Reporter: Micael Capitão
>