Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2016/07/22 03:35:37 UTC

[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/14313

    [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading

    ## What changes were proposed in this pull request?
    
    Currently, `JDBCRDD.compute` performs type dispatch for each row in order to read the appropriate values.
    This should not be necessary, because the schema is already kept in `JDBCRDD`.
    
    So, the appropriate converters can be created once from the schema and then applied to each row, as in the sketch below.
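    A minimal, hypothetical sketch of the idea (simplified names and only a few types; not the actual patch):
    
        import java.sql.ResultSet
        import org.apache.spark.sql.types._
    
        object ConverterSketch {
          // One converter function per field, built once from the schema.
          type FieldConverter = (ResultSet, Int) => Any
    
          def makeConverter(dt: DataType): FieldConverter = dt match {
            case IntegerType => (rs, pos) => rs.getInt(pos)
            case StringType  => (rs, pos) => rs.getString(pos)
            case _           => (rs, pos) => rs.getObject(pos)
          }
    
          def makeConverters(schema: StructType): Array[FieldConverter] =
            schema.fields.map(f => makeConverter(f.dataType))
    
          // Per row there is no per-field type dispatch: just index into the prebuilt array.
          def readRow(rs: ResultSet, converters: Array[FieldConverter]): Array[Any] = {
            val row = new Array[Any](converters.length)
            var i = 0
            while (i < converters.length) {
              row(i) = converters(i)(rs, i + 1)  // JDBC columns are 1-based
              i += 1
            }
            row
          }
        }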
    
    ## How was this patch tested?
    
    Existing tests should cover this.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-16674

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14313
    
----
commit ad64483e2263485f843a469310fb8d252824d09e
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-07-22T03:32:44Z

    Avoid per-record type dispatch in JDBC when reading

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71980099
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -407,84 +495,8 @@ private[sql] class JDBCRDD(
             var i = 0
             while (i < conversions.length) {
    --- End diff --
    
    Using functional transformations is generally slower due to virtual function calls. This part is executed once per record, so that overhead adds up quickly. See https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
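    To illustrate the point, a rough sketch (using the names from this diff, not the exact code):
    
        // Hot path: runs once per record, so keep the index-based while loop and
        // avoid the closure allocation / virtual calls of map or zipWithIndex.
        var i = 0
        while (i < conversions.length) {
          val pos = i + 1                               // JDBC columns are 1-based
          mutableRow.update(i, conversions(i)(rs, pos)) // apply the prebuilt converter
          i += 1
        }
    
        // Functional equivalent (clearer, but each element goes through extra
        // function objects, which this code path deliberately avoids):
        //   conversions.zipWithIndex.foreach { case (c, i) => mutableRow.update(i, c(rs, i + 1)) }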




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71977329
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
    +
    +    case DateType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos)
    +        if (dateVal != null) {
    +          DateTimeUtils.fromJavaDate(dateVal)
    +        } else {
    +          null
    +        }
    +
    +    case DecimalType.Fixed(p, s) =>
    +      (rs: ResultSet, pos: Int) =>
    +        val decimalVal = rs.getBigDecimal(pos)
    +        if (decimalVal == null) {
    --- End diff --
    
    Same as above (plus the `null` check here is written the opposite way from the one above -- inconsistent)




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72005085
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,135 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    --- End diff --
    
    Could you also explain in the comment what the two arguments are?




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Thank you for your review @cloud-fan. Sorry for so many nits. I will try to be more careful next time.




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72013464
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,43 +322,130 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCValueSetter` is responsible for converting and setting a value from `ResultSet`
    +  // into a field for `MutableRow`. The last argument `Int` means the index for the
    +  // value to be set in the row and also used for the value to retrieve from `ResultSet`.
    +  private type ValueSetter = (ResultSet, MutableRow, Int) => Unit
    --- End diff --
    
    `JDBCValueSetter`?




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72005232
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -407,84 +496,8 @@ private[sql] class JDBCRDD(
             var i = 0
             while (i < conversions.length) {
               val pos = i + 1
    -          conversions(i) match {
    -            case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos))
    -            case DateConversion =>
    -              // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    -              val dateVal = rs.getDate(pos)
    -              if (dateVal != null) {
    -                mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal))
    -              } else {
    -                mutableRow.update(i, null)
    -              }
    -            // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
    -            // object returned by ResultSet.getBigDecimal is not correctly matched to the table
    -            // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
    -            // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
    -            // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
    -            // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
    -            // retrieve it, you will get wrong result 199.99.
    -            // So it is needed to set precision and scale for Decimal based on JDBC metadata.
    --- End diff --
    
    Oh, my mistake. I will add this.




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62792/
    Test PASSed.




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71977344
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
    +
    +    case DateType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos)
    +        if (dateVal != null) {
    +          DateTimeUtils.fromJavaDate(dateVal)
    +        } else {
    +          null
    +        }
    +
    +    case DecimalType.Fixed(p, s) =>
    +      (rs: ResultSet, pos: Int) =>
    +        val decimalVal = rs.getBigDecimal(pos)
    +        if (decimalVal == null) {
    +          null
    +        } else {
    +          Decimal(decimalVal, p, s)
    +        }
    +
    +    case DoubleType =>
    +      (rs: ResultSet, pos: Int) => rs.getDouble(pos)
    +
    +    case FloatType =>
    +      (rs: ResultSet, pos: Int) => rs.getFloat(pos)
    +
    +    case IntegerType =>
    +      (rs: ResultSet, pos: Int) => rs.getInt(pos)
    +
    +    case LongType if metadata.contains("binarylong") =>
    +      (rs: ResultSet, pos: Int) =>
    +        val bytes = rs.getBytes(pos)
    +        var ans = 0L
    +        var j = 0
    +        while (j < bytes.size) {
    +          ans = 256 * ans + (255 & bytes(j))
    +          j = j + 1
    +        }
    +        ans
    +
    +    case LongType =>
    +      (rs: ResultSet, pos: Int) => rs.getLong(pos)
    +
    +    case StringType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
    +        UTF8String.fromString(rs.getString(pos))
    +
    +    case TimestampType =>
    +      (rs: ResultSet, pos: Int) =>
    +        val t = rs.getTimestamp(pos)
    +        if (t != null) {
    +          DateTimeUtils.fromJavaTimestamp(t)
    +        } else {
    +          null
    +        }
    +
    +    case BinaryType =>
    +      (rs: ResultSet, pos: Int) => rs.getBytes(pos)
    +
    +    case ArrayType(et, _) =>
    +      val elementConversion: ArrayElementConversion = getArrayElementConversion(et, metadata)
    +      (rs: ResultSet, pos: Int) =>
    +        val array = rs.getArray(pos).getArray
    +        if (array != null) {
    +          val data = elementConversion.apply(array)
    +          new GenericArrayData(data)
    +        } else {
    +          null
    +        }
    +
         case _ => throw new IllegalArgumentException(s"Unsupported type ${dt.simpleString}")
       }
     
    +  private def getArrayElementConversion(
    +      dt: DataType,
    +      metadata: Metadata): ArrayElementConversion = {
    +    dt match {
    +      case TimestampType =>
    +        (array: Object) =>
    +          array.asInstanceOf[Array[java.sql.Timestamp]].map { timestamp =>
    +            nullSafeConvert(timestamp, DateTimeUtils.fromJavaTimestamp)
    +          }
    +
    +      case StringType =>
    +        (array: Object) =>
    +          array.asInstanceOf[Array[java.lang.String]]
    +            .map(UTF8String.fromString)
    +
    +      case DateType =>
    +        (array: Object) =>
    +          array.asInstanceOf[Array[java.sql.Date]].map { date =>
    +            nullSafeConvert(date, DateTimeUtils.fromJavaDate)
    +          }
    +
    +      case dt: DecimalType =>
    +        (array: Object) =>
    +          array.asInstanceOf[Array[java.math.BigDecimal]].map { decimal =>
    +            nullSafeConvert[java.math.BigDecimal](decimal, d => Decimal(d, dt.precision, dt.scale))
    +          }
    +
    +      case LongType if metadata.contains("binarylong") =>
    +        throw new IllegalArgumentException(s"Unsupported array element conversion.")
    --- End diff --
    
    Can you give more context in the exception message?
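    For example (illustrative only), something along these lines:
    
        throw new IllegalArgumentException(
          s"Unsupported array element conversion for ${dt.simpleString} with 'binarylong' metadata")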




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    LGTM except some style comments, thanks for working on it!




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71980089
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
    +
    +    case DateType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos)
    +        if (dateVal != null) {
    --- End diff --
    
    I guess this is a critical path. I think we don't need to introduce virtual function calls here.




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62789/consoleFull)** for PR 14313 at commit [`8ac66b1`](https://github.com/apache/spark/commit/8ac66b17f81a2fdc0866df26889b5e2fcc634c51).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62789/
    Test PASSed.




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72005487
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,135 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
    +
    +    case DateType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos)
    +        if (dateVal != null) {
    +          DateTimeUtils.fromJavaDate(dateVal)
    +        } else {
    +          null
    +        }
    +
    +    case DecimalType.Fixed(p, s) =>
    +      (rs: ResultSet, pos: Int) =>
    +        val decimalVal = rs.getBigDecimal(pos)
    +        if (decimalVal != null) {
    +          Decimal(decimalVal, p, s)
    +        } else {
    +          null
    +        }
    +
    +    case DoubleType =>
    +      (rs: ResultSet, pos: Int) => rs.getDouble(pos)
    +
    +    case FloatType =>
    +      (rs: ResultSet, pos: Int) => rs.getFloat(pos)
    +
    +    case IntegerType =>
    +      (rs: ResultSet, pos: Int) => rs.getInt(pos)
    +
    +    case LongType if metadata.contains("binarylong") =>
    +      (rs: ResultSet, pos: Int) =>
    +        val bytes = rs.getBytes(pos)
    +        var ans = 0L
    +        var j = 0
    +        while (j < bytes.size) {
    +          ans = 256 * ans + (255 & bytes(j))
    +          j = j + 1
    +        }
    +        ans
    +
    +    case LongType =>
    +      (rs: ResultSet, pos: Int) => rs.getLong(pos)
    +
    +    case StringType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
    +        UTF8String.fromString(rs.getString(pos))
    +
    +    case TimestampType =>
    +      (rs: ResultSet, pos: Int) =>
    +        val t = rs.getTimestamp(pos)
    +        if (t != null) {
    +          DateTimeUtils.fromJavaTimestamp(t)
    +        } else {
    +          null
    +        }
    +
    +    case BinaryType =>
    +      (rs: ResultSet, pos: Int) => rs.getBytes(pos)
    +
    +    case ArrayType(et, _) =>
    +      val elementConversion: ArrayElementConversion = getArrayElementConversion(et, metadata)
    +      (rs: ResultSet, pos: Int) =>
    +        val array = rs.getArray(pos).getArray
    +        if (array != null) {
    +          val data = elementConversion.apply(array)
    +          new GenericArrayData(data)
    +        } else {
    +          null
    +        }
    +
         case _ => throw new IllegalArgumentException(s"Unsupported type ${dt.simpleString}")
       }
     
    +  private def getArrayElementConversion(
    --- End diff --
    
    I'd like to inline this method, as the previous code does.




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    @cloud-fan I just addressed your comments. I added another argument to `JDBCConversion` for `MutableRow` so that we can avoid boxing.
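    Roughly, the setter now has this shape (a simplified sketch; the index handling in the actual change may differ slightly). Passing the row into the function lets primitive columns use `setInt`/`setLong` directly instead of boxing through `update(i, value)`:
    
        import java.sql.ResultSet
        import org.apache.spark.sql.catalyst.expressions.MutableRow
        import org.apache.spark.sql.types._
    
        object SetterSketch {
          type JDBCValueSetter = (ResultSet, MutableRow, Int) => Unit
    
          def makeSetter(dt: DataType): JDBCValueSetter = dt match {
            case IntegerType =>
              (rs, row, i) => row.setInt(i, rs.getInt(i + 1))     // primitive setter, no boxing
            case LongType =>
              (rs, row, i) => row.setLong(i, rs.getLong(i + 1))
            case _ =>
              (rs, row, i) => row.update(i, rs.getObject(i + 1))  // reference types still use update
          }
        }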




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62705/
    Test PASSed.




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72022899
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,43 +322,133 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCValueSetter` is responsible for converting and setting a value from `ResultSet`
    +  // into a field for `MutableRow`. The last argument `Int` means the index for the
    +  // value to be set in the row and also used for the value to retrieve from `ResultSet`.
    +  private type JDBCValueSetter = (ResultSet, MutableRow, Int) => Unit
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Creates a StructType to setters for each type.
    --- End diff --
    
    update this doc?
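    e.g. (just a suggestion for the wording):
    
        /**
         * Maps a `StructType` to a `JDBCValueSetter` for each field of the schema.
         */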




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14313




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62804 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62804/consoleFull)** for PR 14313 at commit [`486dabd`](https://github.com/apache/spark/commit/486dabd1fa508712e77d5e67f08939d2408d047a).




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72005471
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -407,84 +496,8 @@ private[sql] class JDBCRDD(
             var i = 0
             while (i < conversions.length) {
               val pos = i + 1
    -          conversions(i) match {
    -            case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos))
    -            case DateConversion =>
    -              // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    -              val dateVal = rs.getDate(pos)
    -              if (dateVal != null) {
    -                mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal))
    -              } else {
    -                mutableRow.update(i, null)
    -              }
    -            // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
    -            // object returned by ResultSet.getBigDecimal is not correctly matched to the table
    -            // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
    -            // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
    -            // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
    -            // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
    -            // retrieve it, you will get wrong result 199.99.
    -            // So it is needed to set precision and scale for Decimal based on JDBC metadata.
    -            case DecimalConversion(p, s) =>
    -              val decimalVal = rs.getBigDecimal(pos)
    -              if (decimalVal == null) {
    -                mutableRow.update(i, null)
    -              } else {
    -                mutableRow.update(i, Decimal(decimalVal, p, s))
    -              }
    -            case DoubleConversion => mutableRow.setDouble(i, rs.getDouble(pos))
    -            case FloatConversion => mutableRow.setFloat(i, rs.getFloat(pos))
    -            case IntegerConversion => mutableRow.setInt(i, rs.getInt(pos))
    -            case LongConversion => mutableRow.setLong(i, rs.getLong(pos))
    -            // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
    -            case StringConversion => mutableRow.update(i, UTF8String.fromString(rs.getString(pos)))
    -            case TimestampConversion =>
    -              val t = rs.getTimestamp(pos)
    -              if (t != null) {
    -                mutableRow.setLong(i, DateTimeUtils.fromJavaTimestamp(t))
    -              } else {
    -                mutableRow.update(i, null)
    -              }
    -            case BinaryConversion => mutableRow.update(i, rs.getBytes(pos))
    -            case BinaryLongConversion =>
    -              val bytes = rs.getBytes(pos)
    -              var ans = 0L
    -              var j = 0
    -              while (j < bytes.size) {
    -                ans = 256 * ans + (255 & bytes(j))
    -                j = j + 1
    -              }
    -              mutableRow.setLong(i, ans)
    -            case ArrayConversion(elementConversion) =>
    -              val array = rs.getArray(pos).getArray
    -              if (array != null) {
    -                val data = elementConversion match {
    -                  case TimestampConversion =>
    -                    array.asInstanceOf[Array[java.sql.Timestamp]].map { timestamp =>
    -                      nullSafeConvert(timestamp, DateTimeUtils.fromJavaTimestamp)
    -                    }
    -                  case StringConversion =>
    -                    array.asInstanceOf[Array[java.lang.String]]
    -                      .map(UTF8String.fromString)
    -                  case DateConversion =>
    -                    array.asInstanceOf[Array[java.sql.Date]].map { date =>
    -                      nullSafeConvert(date, DateTimeUtils.fromJavaDate)
    -                    }
    -                  case DecimalConversion(p, s) =>
    -                    array.asInstanceOf[Array[java.math.BigDecimal]].map { decimal =>
    -                      nullSafeConvert[java.math.BigDecimal](decimal, d => Decimal(d, p, s))
    -                    }
    -                  case BinaryLongConversion =>
    -                    throw new IllegalArgumentException(s"Unsupported array element conversion $i")
    -                  case _: ArrayConversion =>
    -                    throw new IllegalArgumentException("Nested arrays unsupported")
    -                  case _ => array.asInstanceOf[Array[Any]]
    -                }
    -                mutableRow.update(i, new GenericArrayData(data))
    -              } else {
    -                mutableRow.update(i, null)
    -              }
    -          }
    +          val value = conversions(i).apply(rs, pos)
    +          mutableRow.update(i, value)
    --- End diff --
    
    Previously we used `mutableRow.setXXX` for primitive types, but now we always use `update`. Is there a way we can still avoid boxing here?




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71977310
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
    +
    +    case DateType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos)
    +        if (dateVal != null) {
    --- End diff --
    
    `Option(dateVal).map(...).orNull`?
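    Spelled out, roughly (illustrative; since `DateTimeUtils.fromJavaDate` returns an `Int`, `.orNull` only compiles if the mapped value is widened to a reference/`Any` type):
    
        Option(rs.getDate(pos)).map(d => DateTimeUtils.fromJavaDate(d): Any).orNull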




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62801 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62801/consoleFull)** for PR 14313 at commit [`c336382`](https://github.com/apache/spark/commit/c336382e3df40d1eddf29b462a13b5933596d8ce).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62705/consoleFull)** for PR 14313 at commit [`ad64483`](https://github.com/apache/spark/commit/ad64483e2263485f843a469310fb8d252824d09e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62760/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62760/consoleFull)** for PR 14313 at commit [`5335093`](https://github.com/apache/spark/commit/53350935b476e2a30dfd03f7fbfe857e6c4316d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71977337
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
    +
    +    case DateType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos)
    +        if (dateVal != null) {
    +          DateTimeUtils.fromJavaDate(dateVal)
    +        } else {
    +          null
    +        }
    +
    +    case DecimalType.Fixed(p, s) =>
    +      (rs: ResultSet, pos: Int) =>
    +        val decimalVal = rs.getBigDecimal(pos)
    +        if (decimalVal == null) {
    +          null
    +        } else {
    +          Decimal(decimalVal, p, s)
    +        }
    +
    +    case DoubleType =>
    +      (rs: ResultSet, pos: Int) => rs.getDouble(pos)
    +
    +    case FloatType =>
    +      (rs: ResultSet, pos: Int) => rs.getFloat(pos)
    +
    +    case IntegerType =>
    +      (rs: ResultSet, pos: Int) => rs.getInt(pos)
    +
    +    case LongType if metadata.contains("binarylong") =>
    +      (rs: ResultSet, pos: Int) =>
    +        val bytes = rs.getBytes(pos)
    +        var ans = 0L
    +        var j = 0
    +        while (j < bytes.size) {
    +          ans = 256 * ans + (255 & bytes(j))
    +          j = j + 1
    +        }
    +        ans
    +
    +    case LongType =>
    +      (rs: ResultSet, pos: Int) => rs.getLong(pos)
    +
    +    case StringType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
    +        UTF8String.fromString(rs.getString(pos))
    +
    +    case TimestampType =>
    +      (rs: ResultSet, pos: Int) =>
    +        val t = rs.getTimestamp(pos)
    +        if (t != null) {
    --- End diff --
    
    same as above


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62811/consoleFull)** for PR 14313 at commit [`25894f1`](https://github.com/apache/spark/commit/25894f124658f3077f419051a52c33cc5d36306c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72008403
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,133 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting and setting a value from `ResultSet`
    --- End diff --
    
    `JDBCConversion` seems not to be a good name now, do you have any better ideas?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62811/consoleFull)** for PR 14313 at commit [`25894f1`](https://github.com/apache/spark/commit/25894f124658f3077f419051a52c33cc5d36306c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62804 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62804/consoleFull)** for PR 14313 at commit [`486dabd`](https://github.com/apache/spark/commit/486dabd1fa508712e77d5e67f08939d2408d047a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62792/consoleFull)** for PR 14313 at commit [`a385318`](https://github.com/apache/spark/commit/a3853182b375539bd329fb10464c1a746d4eaa47).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62708/consoleFull)** for PR 14313 at commit [`5eae0e6`](https://github.com/apache/spark/commit/5eae0e6b7d68a7781b1849441a000cf3ff7fe804).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62705/consoleFull)** for PR 14313 at commit [`ad64483`](https://github.com/apache/spark/commit/ad64483e2263485f843a469310fb8d252824d09e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62760/consoleFull)** for PR 14313 at commit [`5335093`](https://github.com/apache/spark/commit/53350935b476e2a30dfd03f7fbfe857e6c4316d0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62804/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62792/consoleFull)** for PR 14313 at commit [`a385318`](https://github.com/apache/spark/commit/a3853182b375539bd329fb10464c1a746d4eaa47).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62801/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71977368
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -407,84 +495,8 @@ private[sql] class JDBCRDD(
             var i = 0
             while (i < conversions.length) {
    --- End diff --
    
    Why `while` not `foreach` or similar?
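
    For readers comparing the two styles, a minimal standalone sketch of what the question contrasts; the setter array and its body below are placeholders, not code from the patch:

        // while-based traversal, as used in the patch
        val setters: Array[Int => Unit] = Array.fill(3)(i => println(s"set field $i"))
        var i = 0
        while (i < setters.length) {
          setters(i)(i)
          i += 1
        }

        // foreach/zipWithIndex alternative being asked about
        setters.zipWithIndex.foreach { case (set, idx) => set(idx) }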


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Could you please take a look here @cloud-fan and @yhuai? This is happening for writing too. I will open a new one for writing as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62708/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72005172
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -407,84 +496,8 @@ private[sql] class JDBCRDD(
             var i = 0
             while (i < conversions.length) {
               val pos = i + 1
    -          conversions(i) match {
    -            case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos))
    -            case DateConversion =>
    -              // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    -              val dateVal = rs.getDate(pos)
    -              if (dateVal != null) {
    -                mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal))
    -              } else {
    -                mutableRow.update(i, null)
    -              }
    -            // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
    -            // object returned by ResultSet.getBigDecimal is not correctly matched to the table
    -            // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
    -            // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
    -            // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
    -            // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
    -            // retrieve it, you will get wrong result 199.99.
    -            // So it is needed to set precision and scale for Decimal based on JDBC metadata.
    --- End diff --
    
    why do we remove these comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72005398
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,135 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
    --- End diff --
    
    looks like this type alias is not very useful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62811/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72013658
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,43 +322,130 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCValueSetter` is responsible for converting and setting a value from `ResultSet`
    +  // into a field for `MutableRow`. The last argument `Int` means the index for the
    +  // value to be set in the row and also used for the value to retrieve from `ResultSet`.
    +  private type ValueSetter = (ResultSet, MutableRow, Int) => Unit
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Creates a StructType to setters for each type.
        */
    -  def getConversions(schema: StructType): Array[JDBCConversion] =
    -    schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
    -
    -  private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +  def makeSetters(schema: StructType): Array[ValueSetter] =
    +    schema.fields.map(sf => makeSetters(sf.dataType, sf.metadata))
    +
    +  private def makeSetters(dt: DataType, metadata: Metadata): ValueSetter = dt match {
    +    case BooleanType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setBoolean(pos, rs.getBoolean(pos + 1))
    +
    +    case DateType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos + 1)
    +        if (dateVal != null) {
    +          row.setInt(pos, DateTimeUtils.fromJavaDate(dateVal))
    +        } else {
    +          row.update(pos, null)
    +        }
    +
    +    // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
    +    // object returned by ResultSet.getBigDecimal is not correctly matched to the table
    +    // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
    +    // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
    +    // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
    +    // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
    +    // retrieve it, you will get wrong result 199.99.
    +    // So it is needed to set precision and scale for Decimal based on JDBC metadata.
    +    case DecimalType.Fixed(p, s) =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        val decimal =
    +          nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s))
    +        row.update(pos, decimal)
    +
    +    case DoubleType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setDouble(pos, rs.getDouble(pos + 1))
    +
    +    case FloatType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setFloat(pos, rs.getFloat(pos + 1))
    +
    +    case IntegerType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setInt(pos, rs.getInt(pos + 1))
    +
    +    case LongType if metadata.contains("binarylong") =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        val bytes = rs.getBytes(pos + 1)
    +        var ans = 0L
    +        var j = 0
    +        while (j < bytes.size) {
    +          ans = 256 * ans + (255 & bytes(j))
    +          j = j + 1
    +        }
    +        row.setLong(pos, ans)
    +
    +    case LongType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setLong(pos, rs.getLong(pos + 1))
    +
    +    case StringType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
    +        row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))
    +
    +    case TimestampType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        val t = rs.getTimestamp(pos + 1)
    +        if (t != null) {
    +          row.setLong(pos, DateTimeUtils.fromJavaTimestamp(t))
    +        } else {
    +          row.update(pos, null)
    +        }
    +
    +    case BinaryType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.update(pos, rs.getBytes(pos + 1))
    +
    +    case ArrayType(et, _) =>
    +      val elementConversion = et match {
    +        case TimestampType =>
    +          (array: Object) =>
    +            array.asInstanceOf[Array[java.sql.Timestamp]].map { timestamp =>
    +              nullSafeConvert(timestamp, DateTimeUtils.fromJavaTimestamp)
    +            }
    +
    +        case StringType =>
    +          (array: Object) =>
    +            array.asInstanceOf[Array[java.lang.String]]
    +              .map(UTF8String.fromString)
    +
    +        case DateType =>
    +          (array: Object) =>
    +            array.asInstanceOf[Array[java.sql.Date]].map { date =>
    +              nullSafeConvert(date, DateTimeUtils.fromJavaDate)
    +            }
    +
    +        case dt: DecimalType =>
    +          (array: Object) =>
    +            array.asInstanceOf[Array[java.math.BigDecimal]].map { decimal =>
    +              nullSafeConvert[java.math.BigDecimal](
    +                decimal, d => Decimal(d, dt.precision, dt.scale))
    +            }
    +
    +        case LongType if metadata.contains("binarylong") =>
    +          throw new IllegalArgumentException(s"Unsupported array element " +
    +            s"type ${dt.simpleString} based on binary")
    +
    +        case ArrayType(_, _) =>
    +          throw new IllegalArgumentException("Nested arrays unsupported")
    +
    +        case _ => (array: Object) => array.asInstanceOf[Array[Any]]
    +      }
    +
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.update(pos, nullSafeConvert(rs.getArray(pos + 1).getArray, elementConversion))
    --- End diff --
    
    `new GenericArrayData`
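
    A hedged sketch of the wrapping being suggested; the element array below is a placeholder, and how the branch ends up is determined by the patch, not this example:

        import org.apache.spark.sql.catalyst.util.GenericArrayData

        // Instead of storing the raw JVM array in the row, wrap the converted
        // elements in GenericArrayData, Spark's ArrayData implementation.
        val converted: Array[Any] = Array(1L, 2L, 3L)
        val catalystArray = new GenericArrayData(converted)
        // row.update(pos, catalystArray)  // as it would be used in the setter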


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62789/consoleFull)** for PR 14313 at commit [`8ac66b1`](https://github.com/apache/spark/commit/8ac66b17f81a2fdc0866df26889b5e2fcc634c51).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72005309
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,135 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
    +  // to a value in a field for `InternalRow`.
    +  private type JDBCConversion = (ResultSet, Int) => Any
    +
    +  // This `ArrayElementConversion` is responsible for converting elements in
    +  // an array from `ResultSet`.
    +  private type ArrayElementConversion = (Object) => Any
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
    +
    +    case DateType =>
    +      (rs: ResultSet, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos)
    +        if (dateVal != null) {
    +          DateTimeUtils.fromJavaDate(dateVal)
    +        } else {
    +          null
    +        }
    +
    +    case DecimalType.Fixed(p, s) =>
    +      (rs: ResultSet, pos: Int) =>
    +        val decimalVal = rs.getBigDecimal(pos)
    +        if (decimalVal != null) {
    --- End diff --
    
    use `nullSafeConvert`?
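
    A minimal sketch of the helper being referred to, matching how `nullSafeConvert` is used elsewhere in this thread; the exact signature in the patch may differ:

        // Apply the conversion only when the JDBC value is non-null.
        def nullSafeConvert[T](input: T, f: T => Any): Any = {
          if (input == null) null else f(input)
        }

        // The decimal branch would then read along the lines of:
        //   nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos), d => Decimal(d, p, s))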


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62801 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62801/consoleFull)** for PR 14313 at commit [`c336382`](https://github.com/apache/spark/commit/c336382e3df40d1eddf29b462a13b5933596d8ce).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72007301
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -407,84 +496,8 @@ private[sql] class JDBCRDD(
             var i = 0
             while (i < conversions.length) {
               val pos = i + 1
    -          conversions(i) match {
    -            case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos))
    -            case DateConversion =>
    -              // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    -              val dateVal = rs.getDate(pos)
    -              if (dateVal != null) {
    -                mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal))
    -              } else {
    -                mutableRow.update(i, null)
    -              }
    -            // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
    -            // object returned by ResultSet.getBigDecimal is not correctly matched to the table
    -            // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
    -            // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
    -            // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
    -            // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
    -            // retrieve it, you will get wrong result 199.99.
    -            // So it is needed to set precision and scale for Decimal based on JDBC metadata.
    --- End diff --
    
    Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    **[Test build #62708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62708/consoleFull)** for PR 14313 at commit [`5eae0e6`](https://github.com/apache/spark/commit/5eae0e6b7d68a7781b1849441a000cf3ff7fe804).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14313
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72008447
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,46 +322,133 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCConversion` is responsible for converting and setting a value from `ResultSet`
    +  // into a field for `MutableRow`. The last argument `Int` means the index for the
    +  // value to be set in the row and also used for the value to retrieve from `ResultSet`.
    +  private type JDBCConversion = (ResultSet, MutableRow, Int) => Unit
     
       /**
    -   * Maps a StructType to a type tag list.
    +   * Maps a StructType to conversions for each type.
        */
       def getConversions(schema: StructType): Array[JDBCConversion] =
         schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
     
       private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
    -    case BooleanType => BooleanConversion
    -    case DateType => DateConversion
    -    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
    -    case DoubleType => DoubleConversion
    -    case FloatType => FloatConversion
    -    case IntegerType => IntegerConversion
    -    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
    -    case StringType => StringConversion
    -    case TimestampType => TimestampConversion
    -    case BinaryType => BinaryConversion
    -    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
    +    case BooleanType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setBoolean(pos, rs.getBoolean(pos + 1))
    +
    +    case DateType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
    +        val dateVal = rs.getDate(pos + 1)
    +        if (dateVal != null) {
    +          row.setInt(pos, DateTimeUtils.fromJavaDate(dateVal))
    +        } else {
    +          row.update(pos, null)
    +        }
    +
    +    // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
    +    // object returned by ResultSet.getBigDecimal is not correctly matched to the table
    +    // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
    +    // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
    +    // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
    +    // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
    +    // retrieve it, you will get wrong result 199.99.
    +    // So it is needed to set precision and scale for Decimal based on JDBC metadata.
    +    case DecimalType.Fixed(p, s) =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        val decimal =
    +          nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s))
    +        row.update(pos, decimal)
    +
    +    case DoubleType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setDouble(pos, rs.getDouble(pos + 1))
    +
    +    case FloatType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setFloat(pos, rs.getFloat(pos + 1))
    +
    +    case IntegerType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setInt(pos, rs.getInt(pos + 1))
    +
    +    case LongType if metadata.contains("binarylong") =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        val bytes = rs.getBytes(pos + 1)
    +        var ans = 0L
    +        var j = 0
    +        while (j < bytes.size) {
    +          ans = 256 * ans + (255 & bytes(j))
    +          j = j + 1
    +        }
    +        row.setLong(pos, ans)
    +
    +    case LongType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.setLong(pos, rs.getLong(pos + 1))
    +
    +    case StringType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
    +        row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))
    +
    +    case TimestampType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        val t = rs.getTimestamp(pos + 1)
    +        if (t != null) {
    +          row.setLong(pos, DateTimeUtils.fromJavaTimestamp(t))
    +        } else {
    +          row.update(pos, null)
    +        }
    +
    +    case BinaryType =>
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.update(pos, rs.getBytes(pos + 1))
    +
    +    case ArrayType(et, _) =>
    +      val elementConversion = getArrayElementConversion(et, metadata)
    +      (rs: ResultSet, row: MutableRow, pos: Int) =>
    +        row.update(pos, nullSafeConvert(rs.getArray(pos + 1).getArray, elementConversion))
    +
         case _ => throw new IllegalArgumentException(s"Unsupported type ${dt.simpleString}")
       }
     
    +  private def getArrayElementConversion(dt: DataType, metadata: Metadata) = dt match {
    --- End diff --
    
    inline this method?
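
For context, a minimal sketch of what such an inlining could look like, assuming the element-level conversion is still resolved once per column (outside the per-record closure) and that converted elements are wrapped in a `GenericArrayData`. The element cases shown are illustrative only, `nullSafeConvert` is the null-guarding helper already used elsewhere in the diff, and this is not the PR's actual body for `getArrayElementConversion`, which is cut off above:

    // Hypothetical inlining of getArrayElementConversion into the ArrayType case of
    // getConversions. Assumes GenericArrayData, DateTimeUtils and UTF8String are imported.
    case ArrayType(et, _) =>
      // Resolve the element-level conversion once, outside the per-record closure.
      val elementConversion: Object => Any = et match {
        case StringType =>
          (array: Object) =>
            new GenericArrayData(
              array.asInstanceOf[Array[java.lang.String]].map(UTF8String.fromString).toSeq)
        case TimestampType =>
          (array: Object) =>
            new GenericArrayData(
              array.asInstanceOf[Array[java.sql.Timestamp]].map { t =>
                if (t != null) DateTimeUtils.fromJavaTimestamp(t) else null
              })
        case _ =>
          throw new IllegalArgumentException(s"Unsupported array element type ${et.simpleString}")
      }
      (rs: ResultSet, row: MutableRow, pos: Int) =>
        row.update(pos, nullSafeConvert(rs.getArray(pos + 1).getArray, elementConversion))

Whether this reads better than a separate private method is a readability call: the inlined form keeps the element dispatch next to the array setter, at the cost of a longer match arm.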




[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r72013547
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
    @@ -322,43 +322,130 @@ private[sql] class JDBCRDD(
         }
       }
     
    -  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
    -  // we don't have to potentially poke around in the Metadata once for every
    -  // row.
    -  // Is there a better way to do this?  I'd rather be using a type that
    -  // contains only the tags I define.
    -  abstract class JDBCConversion
    -  case object BooleanConversion extends JDBCConversion
    -  case object DateConversion extends JDBCConversion
    -  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
    -  case object DoubleConversion extends JDBCConversion
    -  case object FloatConversion extends JDBCConversion
    -  case object IntegerConversion extends JDBCConversion
    -  case object LongConversion extends JDBCConversion
    -  case object BinaryLongConversion extends JDBCConversion
    -  case object StringConversion extends JDBCConversion
    -  case object TimestampConversion extends JDBCConversion
    -  case object BinaryConversion extends JDBCConversion
    -  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
    +  // A `JDBCValueSetter` is responsible for converting a value from `ResultSet` and setting
    +  // it into a field of `MutableRow`. The last argument `Int` is the index of the field to
    +  // set in the row; shifted by one, it is also the column index to read from `ResultSet`.
    +  private type ValueSetter = (ResultSet, MutableRow, Int) => Unit
    --- End diff --
    
    Sure!
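
For illustration, a minimal sketch of how the renamed alias and its consumers could line up, assuming the alias is renamed to `JDBCValueSetter` to match the comment. The `fillRow` helper, its name, and the `wasNull()` handling are assumptions added here for context, not the PR's exact code:

    import java.sql.ResultSet

    import org.apache.spark.sql.catalyst.expressions.MutableRow

    // Hypothetical: the alias renamed so that the comment and the type agree.
    private type JDBCValueSetter = (ResultSet, MutableRow, Int) => Unit

    // Illustrative only: build the setters once from the schema, then apply them
    // to every row with a plain while loop, so no per-record type dispatch remains.
    private def fillRow(
        rs: ResultSet,
        setters: Array[JDBCValueSetter],
        row: MutableRow): MutableRow = {
      var i = 0
      while (i < setters.length) {
        setters(i)(rs, row, i)
        // Columns that came back as SQL NULL are nulled out explicitly (assumed behavior).
        if (rs.wasNull()) row.setNullAt(i)
        i = i + 1
      }
      row
    }

The rename is purely cosmetic; the gain over the old tag objects is that each setter is a plain function resolved once from the schema rather than pattern-matched per record.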

