Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/10 16:14:40 UTC

[GitHub] [spark] Eugene-Mark opened a new pull request, #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Eugene-Mark opened a new pull request, #36499:
URL: https://github.com/apache/spark/pull/36499

   ### What changes were proposed in this pull request?
    - Implemented getCatalystType method in TeradataDialect
    - Handled Types.NUMERIC explicitly
   
   ### Why are the changes needed?
   When a table is loaded from Teradata and a column's type is `Number`, it is converted to `DecimalType(38,0)`, which loses the fractional part of the original data.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the `Number` type is now converted to `DecimalType(38,18)` when the reported scale is 0, so the fractional part is preserved.
   
   ### How was this patch tested?
   A unit test was added to JDBCSuite.scala.
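   
   For reference, a minimal sketch of the kind of assertion such a test can make (the exact test body in JDBCSuite.scala may differ; the dialect lookup and metadata here are assumptions based on the mapping described above):
   ```
   import java.sql.Types
   import org.apache.spark.sql.jdbc.JdbcDialects
   import org.apache.spark.sql.types.{DecimalType, MetadataBuilder}

   val teradata = JdbcDialects.get("jdbc:teradata://host")
   // NUMBER / NUMBER(*) is reported by the driver as precision 40, scale 0
   val md = new MetadataBuilder().putLong("scale", 0)
   assert(teradata.getCatalystType(Types.NUMERIC, "NUMBER", 40, md)
     .contains(DecimalType.SYSTEM_DEFAULT))
   ```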
   
   




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1146638887

   @srowen Thanks for your response. On the first part, `indicate NUMBER with the system limits for precision and scale`, we couldn't find further explanation. It sounds like the precision and scale are flexible depending on the user's input but cannot exceed the system limit. Since they are flexible, maybe Teradata simply returns a scale of `0` to signal this case. (I actually think it would be better to return an invalid value, like `-1`, so a downstream caller like Spark could handle the case explicitly.)
   Until this is fixed on the Teradata side (if ever), the question is which behavior is more tolerable in most cases:
   1. A number like 1234.5678 is rounded to 1234 (current behavior)
   2. A number like 1234 is widened to 1234.0000
   IMHO, the second option seems more reasonable.
   
   As for "Can a caller work around it in this case with a cast or does that not work?": yes, a cast can be a workaround (one concrete form is sketched below). However, it forces the user to track the precision and scale of every Number column, which gets tedious in a complex query with many columns, and it goes against the flexibility of the original Number(*) definition.
   
   Anyway, I agree that there is hardly a single "correct" answer here; it's more of a tradeoff, and it needs documentation to call it out.
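   
   For illustration, the per-column workaround could look roughly like this with Spark's JDBC reader, using the `customSchema` option rather than a literal SQL cast (a sketch; the URL, table, and column name `amount` are placeholders, not from the PR):
   ```
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().getOrCreate()
   // Declare the decimal type up front so the fractional part survives the read;
   // casting after load() would be too late, because the driver-reported scale
   // of 0 has already truncated the values.
   val df = spark.read.format("jdbc")
     .option("url", "jdbc:teradata://host")
     .option("dbtable", "test_db.test_table")
     .option("customSchema", "amount DECIMAL(38, 18)")
     .load()
   ```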
   
   
   
   
   
   




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1129569337

   @srowen It should be a Teradata-specific issue. I tried reading data with the Teradata driver (`terajdbc4` and `tdgssconfig`), and the data read that way keeps the fractional part. The code is something like:
     ```
         import java.sql.DriverManager
         import org.apache.commons.dbutils.QueryRunner
         import org.apache.commons.dbutils.handlers.MapListHandler

         Class.forName("com.teradata.jdbc.TeraDriver")
         val connection = DriverManager.getConnection(s"jdbc:teradata://$host")
         new QueryRunner().query(connection, "select * from test_db.test_table", new MapListHandler())
     ```
   The `Number` column type is converted to `java.math.BigDecimal` with the scale preserved, which conforms to the mapping in the guide [here](https://docs.teradata.com/r/Teradata-VantageTM-SQL-Data-Definition-Language-Detailed-Topics/March-2019/CREATE-PROCEDURE-and-REPLACE-PROCEDURE-External-Form/Data-Type-Mapping-Between-SQL-and-Java).
   
   I found that Spark doesn't provide an explicit Teradata connection provider the way it does with `DB2ConnectionProvider`, `MariaDBConnectionProvider`, `MSSQLConnectionProvider`, etc.
   
   In conclusion, we're using the plain built-in JDBC path to invoke Teradata, which might not be the suggested way and can produce unexpected results like the issue above. For a quick fix of the issue I met, maybe we can just merge this PR. Meanwhile, I will spend some effort introducing a TeradataConnectionProvider to Spark if there is no license-related issue.
   




[GitHub] [spark] Eugene-Mark commented on a diff in pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on code in PR #36499:
URL: https://github.com/apache/spark/pull/36499#discussion_r891588142


##########
sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala:
##########
@@ -96,4 +97,29 @@ private case object TeradataDialect extends JdbcDialect {
   override def getLimitClause(limit: Integer): String = {
     ""
   }
+
+  override def getCatalystType(
+    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
+    if (sqlType == Types.NUMERIC) {
+      if (md == null) {
+        Option(DecimalType.SYSTEM_DEFAULT)
+      } else {
+        val scale = md.build().getLong("scale")
+        // In Teradata, Number or Number(*) means precision and scale is flexible.
+        // However, the scale returned from JDBC is 0, which leads to fractional part loss.
+        // Handle this special case by adding explicit conversion to system default decimal type.
+        // Note, even if the NUMBER is defined with explicit precision and scale like NUMBER(20, 0),
+        // Spark will treat it as DecimalType.SYSTEM_DEFAULT, which is NUMBER(38,18)
+        if (scale == 0) {
+          Option(DecimalType.SYSTEM_DEFAULT)
+        } else {
+          // In Teradata, Number(*, scale) will return size, namely precision, as 40,
+          // which conflicts to DecimalType.MAX_PRECISION
+          Option(DecimalType(Math.min(size, DecimalType.MAX_PRECISION), scale.toInt))

Review Comment:
   Thanks for this comment! It points to something more reasonable than my current approach. In Teradata, only Number/Number(*), Number(*, scale), and Number(precision, scale) are valid forms, so when the scale is flexible, the precision returned must be 40. That means we don't need to convert every scale = 0 column to the default decimal type; we only need to do so when precision = 40 is detected. This way we respect the user's explicit scale = 0 settings: Number(20,0), for example, will be converted to DecimalType(20,0).
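   
   A sketch of that refined mapping (illustrative only, assuming the precision-40 behavior described above; not necessarily the exact code that was merged):
   ```
   import java.sql.Types
   import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

   def getCatalystType(
       sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
     sqlType match {
       case Types.NUMERIC =>
         if (md == null) {
           Some(DecimalType.SYSTEM_DEFAULT)
         } else {
           val scale = md.build().getLong("scale")
           if (size == 40) {
             // NUMBER / NUMBER(*): flexible precision and scale, reported as (40, 0)
             if (scale == 0) Some(DecimalType.SYSTEM_DEFAULT)
             // NUMBER(*, scale): precision is still reported as 40, so cap it
             else Some(DecimalType(DecimalType.MAX_PRECISION, scale.toInt))
           } else {
             // Explicit NUMBER(precision, scale): respect the user's declaration
             Some(DecimalType(Math.min(size, DecimalType.MAX_PRECISION), scale.toInt))
           }
         }
       case _ => None
     }
   }
   ```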





[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1146822523

   For NUMBER(*) on Teradata, the scale is not fixed but adapts to each value; as the documentation says, it is constrained only by the `system limit`. So the issue on the Teradata side is how to denote that scale over JDBC; maybe they consider the value `0` the best way to denote the `flexible` property.
   The key question is how Spark should reflect this `flexible` scale from Teradata: round values to integers (the current practice) or preserve the fractional part with a specified scale.




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1146838926

   I see, so we should interpret this as "maximum scale" or something in Spark? That seems OK, and since we're only confident about Teradata, scoping the change to the Teradata dialect seems OK too. Let's add a note in the release notes for the behavior change.




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1150774845

   The test failure seems to have no relationship to the committed code; several recent PRs failed with the same error, like [this one](https://github.com/JoshRosen/spark/runs/6804854541?check_suite_focus=true).




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1147448291

   Agreed with you that it's better not to modify the Oracle-related part; I've removed it from the commit.
   Yes, I suggest we use scale = 18.
   As for precision: when `Number(*)` or `Number` is used in Teradata, the precision returned from JDBC is `40`, which is larger than Spark's max precision, so I also handle that case explicitly.




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1160937525

   Merged to master




[GitHub] [spark] srowen commented on a diff in pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36499:
URL: https://github.com/apache/spark/pull/36499#discussion_r891287048


##########
sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala:
##########
@@ -96,4 +97,29 @@ private case object TeradataDialect extends JdbcDialect {
   override def getLimitClause(limit: Integer): String = {
     ""
   }
+
+  override def getCatalystType(
+    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
+    if (sqlType == Types.NUMERIC) {

Review Comment:
   OK, now down to nits. I would use `sqlType match {` for consistency. Also, return `Some(...)` when the argument is definitely non-null, like a constant, as in all cases here.



##########
sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala:
##########
@@ -96,4 +97,29 @@ private case object TeradataDialect extends JdbcDialect {
   override def getLimitClause(limit: Integer): String = {
     ""
   }
+
+  override def getCatalystType(
+    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
+    if (sqlType == Types.NUMERIC) {
+      if (md == null) {
+        Option(DecimalType.SYSTEM_DEFAULT)
+      } else {
+        val scale = md.build().getLong("scale")
+        // In Teradata, Number or Number(*) means precision and scale is flexible.
+        // However, the scale returned from JDBC is 0, which leads to fractional part loss.
+        // Handle this special case by adding explicit conversion to system default decimal type.
+        // Note, even if the NUMBER is defined with explicit precision and scale like NUMBER(20, 0),
+        // Spark will treat it as DecimalType.SYSTEM_DEFAULT, which is NUMBER(38,18)
+        if (scale == 0) {
+          Option(DecimalType.SYSTEM_DEFAULT)
+        } else {
+          // In Teradata, Number(*, scale) will return size, namely precision, as 40,
+          // which conflicts to DecimalType.MAX_PRECISION
+          Option(DecimalType(Math.min(size, DecimalType.MAX_PRECISION), scale.toInt))

Review Comment:
   What if precision = 40 and scale = 0? Do we need to entertain that possibility, or does scale = 0 always mean default precision too?





[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1146640596

   It sounds like the scale is just 'unknown' even on the Teradata side? That doesn't sound right. But then this isn't a Spark issue, or rather, no assumption we make in Spark is any more or less correct, no?




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1152425534

   Jenkins retest this please
   
   
   




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1147065740

   We have "maximum scale" [defined in Spark](https://github.com/apache/spark/blob/bab70b1ef24a2461395b32f609a9274269cb000e/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L128), however, it's not suitable in our case. The current `MAX_SCALE` is `38` and is used in sth like boundary protecting in Decimal's [divide operator](https://github.com/apache/spark/blob/bab70b1ef24a2461395b32f609a9274269cb000e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala#L138). 
   We need to have a relatively universal decimal type ( Can cover almost "all" types of Number value), which I think current [SYSTEM_DEFAULT](https://github.com/apache/spark/blob/bab70b1ef24a2461395b32f609a9274269cb000e/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L129) is right candidate upon the usage of it in `JavaTypeInference` and `ScalaReflection`, it's used as default type of Decimal/Number.  
   `val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18)`
   
   I suggest we also modify the OracleDialects [default decimal](https://github.com/apache/spark/blob/cc0bf563b8caea21da5692f05e34b5f77e002ab9/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L88).  The `DecimalType(DecimalType.MAX_PRECISION, 10)` is more like a magic type since there is no clue about why the scale should be set to 10. For the sake of the consistency, maybe it's better to replaced it with `DecimalType.SYSTEM_DEFAULT`.
   
   




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1156448341

   Hm, I think the doc build error is unrelated




[GitHub] [spark] srowen commented on a diff in pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36499:
URL: https://github.com/apache/spark/pull/36499#discussion_r901825209


##########
docs/sql-migration-guide.md:
##########
@@ -22,6 +22,11 @@ license: |
 * Table of contents
 {:toc}
 
+## Upgrading from Spark SQL 3.3 to 3.4
+  

Review Comment:
   Is this note related to this change? The second one is.





[GitHub] [spark] Eugene-Mark commented on a diff in pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on code in PR #36499:
URL: https://github.com/apache/spark/pull/36499#discussion_r891528661


##########
sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala:
##########
@@ -96,4 +97,29 @@ private case object TeradataDialect extends JdbcDialect {
   override def getLimitClause(limit: Integer): String = {
     ""
   }
+
+  override def getCatalystType(
+    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
+    if (sqlType == Types.NUMERIC) {

Review Comment:
   Good point! Will modify accordingly.





[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1160577823

   Document updated, thanks for previous valuable comments!




[GitHub] [spark] srowen closed pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType
URL: https://github.com/apache/spark/pull/36499




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1159351619

   It's interesting that the previous commit passed the test, and some other PRs pass it as well. I will try reverting some changes to see whether it passes.




[GitHub] [spark] AmplabJenkins commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1122966623

   Can one of the admins verify this patch?




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1129099918

   @srowen I'm also not a Teradata guy; I just invoke Teradata's API from Spark and ran into the issue. I didn't find documentation explaining the issue on the Teradata side. I tried printing the metadata in [JdbcUtils.scala -> getSchema](https://github.com/apache/spark/blob/191e535b975e5813719d3143797c9fcf86321368/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277), which shows that `fieldScale` is `0` before it is passed downstream. The metadata is fetched from the `ResultSet` generated right after Spark executes `statement.executeQuery` with the query `s"SELECT * FROM $table WHERE 1=0"` (see the sketch below).
   Maybe it's good enough to give the user a default decimal instead of a rounded int until we find a better explanation of what happens on the Teradata side.
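   
   A minimal way to reproduce that inspection outside Spark (a sketch; the host and table are placeholders):
   ```
   import java.sql.DriverManager

   // Mirrors the probe query Spark issues when resolving a JDBC schema.
   val conn = DriverManager.getConnection(s"jdbc:teradata://$host")
   val rs = conn.createStatement().executeQuery("SELECT * FROM test_db.test_table WHERE 1=0")
   val meta = rs.getMetaData
   for (i <- 1 to meta.getColumnCount) {
     println(s"${meta.getColumnName(i)}: type=${meta.getColumnTypeName(i)}, " +
       s"precision=${meta.getPrecision(i)}, scale=${meta.getScale(i)}")
   }
   conn.close()
   ```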




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1154024801

   I think you have to retrigger on your end - can you try re-running the jobs? or push a dummy empty commit?




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1127758808

   I don't know anything about Teradata - is it documented that this should be the result, and is it specific to Teradata?




[GitHub] [spark] HyukjinKwon commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1128272888

   Yeah, I don't quite follow the change. Why do we need to change precision and scale?




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1129130839

   OK, I just wonder whether this is specific to Teradata, or whether it can be changed higher up in the abstraction layers.
   
   But you're saying the scale/precision info is lost in the JDBC connector? Then I wonder what we can do, really; we need to know this to get it right.




[GitHub] [spark] Eugene-Mark commented on a diff in pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on code in PR #36499:
URL: https://github.com/apache/spark/pull/36499#discussion_r901863572


##########
docs/sql-migration-guide.md:
##########
@@ -22,6 +22,11 @@ license: |
 * Table of contents
 {:toc}
 
+## Upgrading from Spark SQL 3.3 to 3.4
+  

Review Comment:
   Thanks for pointing it out! I just removed the first bullet, which had been merged by mistake when resolving the doc conflicts.





[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1147421904

   You're saying, basically, assume scale = 18? That seems reasonable.
   Or are you saying there needs to be an arbitrary-precision type? I don't see how a DB would support that.
   I'm hesitant to modify Oracle without knowing why it is the way it is and why it should change.




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1126221985

   @HyukjinKwon @srowen It would be appreciated if this PR could be reviewed soon. Thanks!




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1129110003

   @HyukjinKwon [SPARK-38846](https://issues.apache.org/jira/browse/SPARK-38846) shows that Teradata's Number type loses its fractional part after being loaded into Spark. We found that the JDBC ResultSetMetaData reports `scaleField == 0` no matter what the type looks like on the Teradata side. Inspired by how the [Oracle ResultSetMetaData inconsistency](https://github.com/apache/spark/blob/191e535b975e5813719d3143797c9fcf86321368/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L80) is handled, I explicitly override the scale in `TeradataDialect` so that at least a default DecimalType with scale 18 is returned instead of a rounded int.





[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1150603723

   retest this please




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1152503470

   Please help relaunch the test once the CI issue is fixed, thanks!




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1159353408

   I think it's spurious, we can ignore it, but let's see one more time




[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1140281363

   So if I create a NUMBER in Teradata without a scale, it uses a system default scale. Do we know what that is?
   I'm confused if Teradata doesn't record and return the actual scale used in the driver, because otherwise we have to guess here.
   What if I have a NUMBER that actually is scale = 0? This would be wrong.
   I imagine your change is more correct, but I'm also aware this is a behavior change in a case where there seems to be no correct answer.
   
   Can a caller work around it in this case with a cast, or does that not work?




[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

Posted by GitBox <gi...@apache.org>.
Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1137093715

   @HyukjinKwon @srowen I just updated the latest comment with some findings about the root cause of the issue and the current solution. Any comments are welcome, thanks!

